Batch Normalization, Fully Convolutional Training & Gradient Accumulation

Etienne Perot

Jan 11, 2016, 12:41:05 PM
to Caffe Users
Hello!

I'm currently trying to train models from scratch using a segmentation map & Batch Normalization, to see whether it could accelerate training.

However, since fully convolutional training is memory intensive, I wonder if gradient accumulation could help. In that case, would a Batch Norm layer still be appropriate, given that the training batch size becomes decoupled from the computational batch size?


Evan Shelhamer

Apr 17, 2016, 2:42:53 AM
to Etienne Perot, Caffe Users
The Caffe batch norm only normalizes for the computational batch, that is, the number of instances input to a single forward pass.

Gradient accumulation decouples the computational batch size from the learning batch size *for the gradient*, but it does not decouple batch normalization in the same way. The solver can simply accumulate gradients without further computation, whereas batch norm would need to first compute the normalization statistics over the full accumulated batch and then recompute the data with them, which accumulation does not do.
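
As a concrete sketch of that decoupling (the values below are only illustrative), gradient accumulation is set with iter_size in the solver, while the net keeps its small computational batch:

  # solver.prototxt -- sketch, illustrative values
  net: "train_val.prototxt"
  # run 4 forward/backward passes and sum their gradients before each update
  iter_size: 4
  base_lr: 0.001
  lr_policy: "fixed"
  max_iter: 20000
  solver_mode: GPU

  # train_val.prototxt fragment -- the computational batch the BatchNorm layer sees
  layer {
    name: "data"
    type: "Data"
    top: "data"
    top: "label"
    data_param {
      source: "train_lmdb"   # hypothetical LMDB path
      backend: LMDB
      batch_size: 2          # BatchNorm statistics are computed over these 2 instances only
    }
  }

The solver then updates with an effective gradient batch of iter_size * batch_size = 4 * 2 = 8, but each BatchNorm forward still normalizes over just its 2 instances.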

If you want to experiment with batch norm and FCNs, you can crop inputs to reduce their dimension without resizing. By taking random crops you can still cover all of the pixels, and by making crops larger than the receptive field you can still take partial advantage of the FCN forward/backward efficiency.
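
For segmentation the crop has to be applied to the image and the label map together, and the stock crop_size in transform_param does not coordinate crops across two separate data layers, so a small Python data layer is one way to do it. A rough sketch, where the module and layer names are hypothetical (something you would write yourself):

  layer {
    name: "data"
    type: "Python"
    top: "data"
    top: "label"
    python_param {
      # hypothetical module/class: it loads an image and its label map,
      # then takes the same random crop from both
      module: "crop_seg_data_layer"
      layer: "CropSegDataLayer"
      # pick a crop larger than the receptive field to keep some of the FCN efficiency
      param_str: "{\"crop_size\": 384, \"batch_size\": 4}"
    }
  }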



Evan Shelhamer






Etienne Perot

Apr 18, 2016, 5:24:17 AM
to Caffe Users, et.p...@gmail.com
Thanks! That's pretty interesting. I was resizing the label map instead, which is dumb since a lot of the deconvolution layers are not trained.

However, I fear there is a "theoretical" issue with spatial Batch Normalization end-to-end: lots of correlated inputs (neighboring pixels from the same image) lead to distorted means & variances unless the batch size is pretty big. Do you think something like additional dropout, or something else that breaks the correlation, could help?

What I see so far is that training from scratch with Batch Normalization, even with a small batch size, works well for densely labeled datasets like Cityscapes or CamVid. Does that make sense?
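
For reference, this is roughly how I wire it (a sketch with made-up layer names and sizes): since Caffe's BatchNorm layer only normalizes, I pair it with a Scale layer for the learned scale/shift.

  layer {
    name: "conv1"
    type: "Convolution"
    bottom: "data"
    top: "conv1"
    convolution_param { num_output: 64 kernel_size: 3 pad: 1 }
  }
  layer {
    name: "conv1_bn"
    type: "BatchNorm"
    bottom: "conv1"
    top: "conv1"
    # statistics come from the (small) computational batch during TRAIN;
    # the running averages are used when use_global_stats is true (TEST)
  }
  layer {
    name: "conv1_scale"
    type: "Scale"
    bottom: "conv1"
    top: "conv1"
    scale_param { bias_term: true }   # learned gamma/beta
  }
  layer {
    name: "conv1_relu"
    type: "ReLU"
    bottom: "conv1"
    top: "conv1"
  }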

Anyway, thanks again for your generous support.

Etienne

PS: Saw your talk at GTC, took a snapshot with my shitty webcam ^^
