batch size and overfitting


Alex Orloff

Feb 16, 2016, 5:09:11 PM
to Caffe Users
Hi,

Imagine you have batch size = 256 and a total training set of 1024 images.
So you actually have only 4 distinct mini-batches, because mini-batch(i) = mini-batch(i+4).
If I throw out 1 picture, then I'll have 1023 different mini-batches.
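
To make the setup concrete, here is a rough counting sketch in plain Python (distinct_batches is just a made-up helper, and I'm assuming a sequential, non-shuffled reader that wraps around at the end of the data):

def distinct_batches(n_samples, batch_size, n_iters):
    # Count how many different mini-batch compositions a sequential,
    # wrap-around reader produces over n_iters solver iterations.
    seen, pos = set(), 0
    for _ in range(n_iters):
        seen.add(tuple((pos + i) % n_samples for i in range(batch_size)))
        pos = (pos + batch_size) % n_samples
    return len(seen)

print(distinct_batches(1024, 256, 5000))  # -> 4: the same 4 mini-batches repeat forever
print(distinct_batches(1023, 256, 5000))  # -> 1023: the batch boundaries drift every pass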

Would that help to avoid overfitting?

Jan C Peters

Feb 17, 2016, 5:27:15 AM
to Caffe Users
The mini-batch size does not need to evenly divide the size of the training set in Caffe. If the data layer reaches the end of the data source while filling the current batch, it simply rewinds to the beginning to provide more samples. So in your case the last batch would just contain the first sample of your training set at position [255] of the batch. Other than that, I am not entirely clear on what you are asking; I'll just go with the following:
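
A tiny sketch of that rewind behaviour in plain Python (just modulo arithmetic on indices, not actual Caffe code):

# With 1023 samples and batch_size=256, the fourth batch wraps around,
# so sample 0 ends up at position [255] of that batch.
n_samples, batch_size = 1023, 256
fourth_batch = [(3 * batch_size + i) % n_samples for i in range(batch_size)]
print(fourth_batch[:2], '...', fourth_batch[-2:])   # [768, 769] ... [1022, 0]
print(fourth_batch.index(0))                        # 255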

Theoretically, you get the best gradient estimate if you use ALL training samples in a single batch. Since this is usually not technically feasible, one uses smaller batches. These batches should ideally contain examples from all classes in equal proportions, so the gradient is not steered in a direction that favors one predominant class while ignoring all the others. In practice, you get close to that scenario by shuffling your training set. But using a batch size of 1 is really not a great idea; it is the most "stochastic" that gradient descent can get. And how all of this influences generalization is not at all clear: it will influence it somehow, but how exactly is very hard to predict and will probably differ between training sets, networks and training settings.
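
For illustration, a minimal shuffling sketch in plain Python (toy labels and a made-up shuffled_batches generator, not Caffe's actual data layer): reshuffling the index list every epoch makes each mini-batch mix all classes in roughly equal proportions, in expectation.

import random
from collections import Counter

def shuffled_batches(labels, batch_size, seed=0):
    # Yield index lists for one mini-batch at a time, reshuffling each epoch.
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    while True:                          # each pass through this loop is one epoch
        rng.shuffle(idx)
        for start in range(0, len(idx) - batch_size + 1, batch_size):
            yield idx[start:start + batch_size]

labels = [i % 4 for i in range(1024)]    # toy training set: 4 classes, 256 samples each
first_batch = next(shuffled_batches(labels, 256))
print(Counter(labels[i] for i in first_batch))   # roughly 64 examples of each class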

Jan

ath...@ualberta.ca

Feb 17, 2016, 12:45:49 PM
to Caffe Users
More on batch size...

Not considering hardware, "pure SGD" with the optimal batch size of 1 leads to the fastest training; batch sizes greater than 1 only slow training down. However, on today's parallel hardware, larger batch sizes train faster in actual wall-clock time, which is why it is better in practice to have batch sizes like, say, 256.

Wikipedia has many falsehoods on this, so on the morning of Mon., Dec. 7, 2015, at the NIPS Deep Learning workshop in Montreal, I asked Yann LeCun about this after a talk, and he said "optimal batch size is 1" and that higher batch sizes just slow down training (not considering hardware). The best mini-batch size is a completely hardware-specific thing that you can tune as a hyperparameter.

My explanation is that although it may seem better to have the gradient over all training examples before updating, the rather large downside is that no updates are made until all training examples have been considered. With a batch size of 1 there are certainly more locally erroneous steps along the way, but the huge advantage, and this is what wins out, is that in expectation the weights start moving toward a better solution earlier, during the first pass through the training set rather than only after it.
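
A toy illustration of that point (plain Python/NumPy on a made-up least-squares problem, nothing Caffe-specific):

import numpy as np

# With batch_size=1 the weight starts moving after the very first sample,
# whereas full-batch gradient descent makes its first (and only) update of
# the pass after seeing all N samples.
rng = np.random.default_rng(0)
N = 1024
x = rng.normal(size=N)
y = 3.0 * x + 0.1 * rng.normal(size=N)   # target weight is 3.0

def one_pass(batch_size, lr=0.1):
    w, updates = 0.0, 0
    for start in range(0, N, batch_size):
        xb, yb = x[start:start + batch_size], y[start:start + batch_size]
        grad = np.mean(2 * (w * xb - yb) * xb)   # gradient of the mean squared error
        w -= lr * grad
        updates += 1
    return w, updates

print(one_pass(batch_size=1))   # ~1024 tiny updates: w ends the pass close to 3.0
print(one_pass(batch_size=N))   # 1 big averaged update: w has only moved to roughly 0.6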

Andy

Evan Zamir

Feb 17, 2016, 7:51:43 PM
to Caffe Users
It would seem that the amount of learning in a mini-batch should be larger when it contains more examples. Is that handled automatically by Caffe? If not, by how much should one decrease the learning rate when going from, say, 256 images per mini-batch down to 1?
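
For what it's worth, one rule of thumb I have seen (just a heuristic, not something Caffe is guaranteed to do for you, and the scaled_lr helper below is made up for illustration) is to scale the learning rate roughly in proportion to the batch size, assuming the loss is averaged over the batch:

def scaled_lr(base_lr, base_batch_size, new_batch_size):
    # Linear-scaling rule of thumb: keep lr / batch_size roughly constant.
    return base_lr * new_batch_size / base_batch_size

print(scaled_lr(0.01, 256, 1))    # ~3.9e-05 when dropping to single-sample batches
print(scaled_lr(0.01, 256, 64))   # 0.0025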

Jan C Peters

Feb 18, 2016, 8:08:39 AM
to Caffe Users
I guess it depends on what exactly you mean by "fast training". Having lots of possibly bad small updates instead of fewer, larger, "better" updates (good/bad meant with respect to finding a better local minimum; whether that helps generalization is a different question) does not seem generally "optimal" to me. On the other hand, saying that pure GD (maximal batch size) is always better than pure SGD (minimal batch size) is probably not generally true either: since GD is an iterative method that starts from a pretty much random location on the error hypersurface, it is very hard to make general assertions about the best approach. But I think we can agree that for practical purposes a batch size somewhere in between is a good choice considering both the theoretical AND the practical implications (such as hardware utilization efficiency), maybe between 10 and 1000 depending on the size of the samples. And this is what you usually see in papers about deep learning.

Jan

P.S. The paper http://www.ics.uci.edu/~smyth/courses/cs274/readings/optimization/bottou.pdf supports my previous claim that general GD should converge faster than SGD, but of course that is not the complete story.

Eelco Hoogendoorn

Dec 14, 2017, 4:49:15 AM
to Caffe Users

It strikes me that none of the answers so far have actually addressed your question.

Indeed, a batched optimization will never find the global optimum that the network would reach if it were fed the entire dataset at once. What happens with batching instead is that after the generalizing features have been fitted, which usually happens first, the solution jumps around trying to overfit the peculiarities of each batch in turn, never quite converging to any of them.

The question then is: is the 'average' of trying to fit many different, more pronounced non-generalizing optima closer to the generalizing properties of the entire dataset of interest?

I've come to this thread asking myself the very same question, and haven't found any real answers yet...

I think it is inevitable that batching provides some limit on overfitting: the large jumps in the gradient from batch to batch make it hard for the optimizer to ever get around to obsessing over the much weaker gradients required to push each individual batch to its fully overfitted optimum, so it will act most effectively on those features of the solution that are common to all batches. However, if the features to be overfit in the different batches tend not to compete with each other, they may still be overfitted eventually, just likely less efficiently, since the network is busy chasing the gradient jumps from batch to batch.

So I think the bottom line is: yes, batching will tend to limit overfitting somewhat, but not in a very controllable or dependable manner.
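
A rough numerical sketch of those batch-to-batch jumps (toy linear-regression gradients in NumPy, not a real network): the smaller the batch, the more each batch's gradient disagrees with the full-dataset gradient, and that disagreement is exactly the noise that keeps the optimizer from settling into any single batch's overfitted optimum.

import numpy as np

rng = np.random.default_rng(0)
N = 4096
x = rng.normal(size=N)
y = 3.0 * x + rng.normal(size=N)          # noisy toy data
w = 1.0                                   # some current weight value

per_sample_grads = 2 * (w * x - y) * x    # gradient of the squared error per sample
for batch_size in (1, 16, 256):
    batch_grads = per_sample_grads.reshape(-1, batch_size).mean(axis=1)
    print(batch_size, batch_grads.std())  # spread shrinks roughly as 1/sqrt(batch size)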

Note that I am shooting from the hip here though; my experience with limiting overfitting on large optimisation problems comes mostly from outside of a neural network context.

Eelco Hoogendoorn

Dec 20, 2017, 5:29:42 AM
to Caffe Users

To add to that: having gotten my hands dirty with respect to the question asked, I have to strongly disagree with all of the above responses.

For my application (a SELU-activated FFN), batching matters A LOT; not just with respect to overfitting, but with respect to whether the network is usable at all.

Without any batching (i.e., using the full dataset at once), validation error is and remains abysmal. It seems that the extra randomness introduced by batching is essential for avoiding local minima in the fitting process.

I can imagine that other people's experience with different applications may vary, however. For instance, I can see this being much less the case for CNNs, for much the same reason that dropout is either a good or a bad idea depending on whether you are using a CNN or an FFN.

So the best answer I suppose is: give it a try and see what happens.

Ahmet Selman Bozkır

Dec 25, 2017, 9:25:17 AM
to Caffe Users
I have read your posts. Thank you.

But the comments you have made got me thinking about the failures I have experienced. In my case, I have an image classification task with 15 classes, 300 training examples, and a varying number of validation examples. I have fine-tuned Inception v1 and AlexNet on my 1060 6GB GPU with this dataset and achieved 89% and 80% accuracy, respectively. These models were trained with the default batch sizes from their prototxts. However, when I tried ResNet-50 or ResNet-101, I had to reduce the batch size to 12 or 8 because of memory constraints. Is this why I achieved below 80% (74% and 69%, respectively)? The ResNet architectures are much deeper than Inception v1 and AlexNet. Although I have tried very different learning rates, I have never gotten even close to AlexNet. So my question is very simple: could the batch size be the reason these models were unsuccessful, given that I had to reduce it because my hardware could not handle batch sizes of 32 or 64?

Your answers will be very helpful for those of us who are novices.

On Wednesday, 17 February 2016 at 00:09:11 UTC+2, Alex Orloff wrote: