batch sizes and iterations

Soda Heng

Jan 5, 2015, 12:00:41 AM
to caffe...@googlegroups.com
Hey guys,

I'm a little confused about how I should set my training and test batch sizes relative to test_iter and the maximum number of iterations.

In the MNIST example, the maximum iteration count is only 10,000 even though the training set has 60,000 examples. Did they just stop early to avoid overfitting?

If I have 600,000 training examples and 20,000 validation examples, should the batch sizes for each be different? Something like 256 for training and 100 for testing?

And how should I determine test_iter and max_iter?

Bartosz Ludwiczuk

Jan 5, 2015, 3:20:39 AM
to caffe...@googlegroups.com
Hi,
you're asking about a few issues at once. But first, some terminology:
epoch: one forward and backward pass over all training examples (not used directly in Caffe)
batch: the number of images processed in one pass
iteration: one pass over one batch

1. Batch size
Batch size mainly depends on how much GPU/CPU memory you have. Powers of two (64, 128, 256) are the usual choice. I try 256 first because it works well with SGD, but for bigger networks I use 64.
2. Number of iterations
The number of iterations determines how many epochs of learning you get. I will use the MNIST example to explain:
Training: 60k images, batch size 64, max_iter = 10k. So 10k * 64 = 640k images are processed, which is about 10.6 epochs. (The right number of epochs is hard to set in advance; you should stop when the net no longer learns anything, or when it starts overfitting.)
Validation: 10k images, batch size 100, test_iter = 100. So 100 * 100 = 10k, exactly all the images in the validation set.

So if you would like to test 20k images, you could set e.g. batch_size = 100 and test_iter = 200. That way each testing procedure covers the whole validation set.
To sum up, test_iter and the test batch size depend on the number of images in the test database,
while max_iter and the training batch size depend on the number of epochs you would like to train your net.
I hope this example makes it clear.

Regards,
Bartosz

Soda Heng

Jan 5, 2015, 10:36:38 AM
to caffe...@googlegroups.com
That was exactly the information I was looking for, and everything makes a lot more sense. Thank you!

One follow-up question: is the testing performed on all the validation data each time?

So back to MNIST: if my validation batch size is 100, test_iter is 100, and test_interval is 1000, then on the 1000th iteration it tests all 10,000 validation samples, and does so again on the 2000th iteration?

Bartosz Ludwiczuk

Jan 5, 2015, 11:00:25 AM
to caffe...@googlegroups.com
In the normal case, testing is only done on the validation data.
How often testing runs depends on the "test_interval" value in the solver.
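In solver terms, a minimal sketch of the relevant fields (the numeric values here are the ones shipped with the MNIST example's solver; check your own copy):

```
test_iter: 100       # batches per TEST phase: 100 x test batch size 100 = all 10k validation images
test_interval: 500   # run the TEST phase every 500 training iterations
max_iter: 10000      # ~10.7 epochs with training batch size 64 on 60k images
```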

Soda Heng

Jan 5, 2015, 11:09:20 AM
to caffe...@googlegroups.com
Ok great, thanks again!

Antonio Paes

May 18, 2015, 4:36:21 PM
to caffe...@googlegroups.com
very good explanation!

alameen...@gmail.com

Sep 4, 2015, 9:38:10 AM
to Caffe Users
Thank you very much for the detailed explanation! This helped me a lot!

Prophecies

Sep 7, 2015, 2:18:21 PM
to Caffe Users
Can I hijack this and ask my own question? What happens when you have a really odd number of data points? Say 14,623 (I forget the real number). The problem is that even when I choose a small batch size like 10, I'm torn between using 1462 or 1463 iterations. What happens when I choose 1463? It doesn't seem to throw an error. Is it "circular"? Also, since I'm either missing a few test images or repeating some, I'm not getting the exact accuracy on the test set. Can I write my solver file in a way that addresses this?

robbyl...@gmail.com

Sep 7, 2015, 2:54:01 PM
to Caffe Users
It is circular.
I would not worry about getting the exact error. Going over most of the data will give you a pretty accurate estimate of the error, and it is insignificant whether you do 1462 or 1463 iterations. To be honest, I think you can use a much lower number (say 200) and still get a pretty precise estimate. Keep in mind that your data sample is itself random, so the error you get (even when going over the whole validation set) is just an estimate of the true error.

Prophecies

Sep 7, 2015, 3:38:51 PM
to Caffe Users
Thank you for confirming. I agree that accuracy is pretty uniform beyond a certain number of iterations, especially on larger datasets. However, I am worried that when posting results in papers, not having the exact accuracy might be construed as dishonesty. For everything I have done so far, the accuracy has not changed by more than 0.001 when varying the iteration count a little either way. Also, I could always set a batch size of 1 and run however many iterations I want in my final deployable network. I just wanted to clarify how things work under the hood. Thanks again for clarifying.

anigma

Sep 7, 2015, 6:23:49 PM
to Caffe Users
Thank you to everyone for asking and explaining. Please find my comments below ...

I am quite frustrated with Caffe from a documentation point of view. It seems that most people are only guessing at what is going on.
I also don't find the C++ code sufficiently commented to be easily understandable. Only a small number of people seem to really understand the logging messages displayed during network learning. I don't believe any serious results can be obtained without such a deep understanding, which makes it hard to know how far to trust papers that use Caffe.

To my understanding:
1. SGD with mini-batches implies that a single forward/backward pass in train_net is done on a single mini-batch of training data [this corresponds to one update of the net's parameters].
2. I guess the mini-batches are processed sequentially [BTW, why, theoretically?]
3. I guess this elementary computation (one parameter-update step) corresponds to one iteration, so 'max_iter' should be the number of such update steps.
That means the number of times we cycle over the entire training set is max_iter / number_of_batches_per_epoch, i.e. max_iter * batch_size / training_set_size.

However, I am not sure I really understand the loss messages logged during training. Can someone help me understand them, with reference to the C++ code?
Do we see the loss per mini-batch, or an average over some number of mini-batches? BTW, what is the way to dump these messages to a log file?

Can someone also provide a reference to a rigorous mathematical paper on the choice of the batch_size parameter?
The reference given in the Caffe tutorial, [1] L. Bottou, "Stochastic Gradient Descent Tricks," Neural Networks: Tricks of the Trade, Springer, 2012, did not seem a serious one to me.

Thank you a lot in advance for any new piece of information, 
Anigma.

Axel Straminsky

Sep 7, 2015, 9:18:03 PM
to Caffe Users
When I was taking my first steps with Caffe, I found it very confusing and unintuitive to think in terms of iterations, especially because the concept itself was not well documented (if it was documented at all). Personally, I use DIGITS for training my nets, and DIGITS works with epochs; as Bartosz said, 1 epoch is one pass over the entire training set, so now I think only in epochs and don't bother with iterations. Nevertheless, for anyone interested, the relationship between epochs and iterations is:

iters = epochs x (training_set_size / batch_size)
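The relationship above, as a tiny Python helper (the function name is made up, not a DIGITS or Caffe API):

```python
def epochs_to_iters(epochs, training_set_size, batch_size):
    """Convert a number of epochs into the equivalent iteration count."""
    return int(epochs * training_set_size / batch_size)

# 10 epochs over MNIST's 60k training images with batch size 100:
print(epochs_to_iters(10, 60_000, 100))  # 6000
```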

ath...@ualberta.ca

Sep 8, 2015, 1:07:22 PM
to Caffe Users
Frustration often arises when expectations fail to match reality so I would recommend some expectation management.

1. Caffe has been/is being developed by students/researchers often while researching/writing papers/course deadlines/graduation deadlines so we should be so lucky that this non-trivial piece of code even exists at all. No one releasing this code has made any claim that the code is perfect or is perfectly documented. Since it's open source, please feel free to contribute and make it better.

2. Caffe is software that efficiently implements deep learning algorithms. No one has made any claim that somehow by downloading/using the software and reading documentation (code) that you will understand anything about deep learning (nevermind "easily understandable" as you say). For example, one cannot assume that using MS Word will teach you a specific Language (like English), keyboard skills, what A4 paper size is etc. In the same way, Caffe generally assumes (at least some) working knowledge of deep learning meaning that you have taken a course like (http://cs231n.stanford.edu/index.html) and read deep learning papers. Caffe makes no claim to teach you anything about dropout, it assumes you have read (http://arxiv.org/abs/1207.0580). It would be inappropriate (impossible) to include in the comments/documentation all the background required to understand all the algorithms implemented in Caffe.

3. Deep learning is progressing at a very rapid pace. You're impressed with DeepFace (http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6909616) until you read FaceNet (http://arxiv.org/abs/1503.03832). There is R-CNN (http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6909475) followed by Fast R-CNN (http://arxiv.org/abs/1504.08083) followed by Faster R-CNN (http://arxiv.org/abs/1506.01497). One year is an eternity, and arXiv is essential, since by the time the conference comes around everything has changed. There are some papers on theoretical aspects of deep nets, but nothing near what you're expecting, since things are moving so fast (a moving target) and the analysis is very difficult due to their highly nonlinear nature. Bottou is a leading researcher in the theory of deep nets, so saying the reference above is "not a serious one" speaks volumes about your expectation mismatch. If you have deeper insights then please feel free to publish them; we would all be grateful.

In summary, you are frustrated about numerous things that don't/can't exist so manage your expectations and you will find happiness.

Look at the bright side, deep learning is at a very early and exciting time and nobody knows what is ahead. I've heard that particle physicists dream of the early days when there were so many things to discover. Today, in particle physics, teams of hundreds work for decades to make a single discovery. Be careful what you wish for.

If you are looking for the warm comfort of a mathematics tome with theorems, lemmas and many proofs, refined over a hundred years, then you are in the wrong area. If you are looking for Caffe to teach you deep learning then you are, well, in for a rough ride. If you are looking for a piece of open source software that implements some deep learning algorithm efficiently, then Caffe is for you. We look forward to your upcoming improvements either to code, documentation or deep learning theory.

li kai

Sep 8, 2015, 11:01:48 PM
to Caffe Users
Great answer.

Tambet Matiisen

Sep 9, 2015, 4:53:10 AM
to Caffe Users
I've learned to appreciate many choices Caffe has made:
1. If your training set is 500,000 images and all the numbers in the solver file were expressed in epochs, then display=1 would show you intermediate results only after an hour. You wouldn't be happy to learn only after an hour that your network has exploded. That's why I can't use Neon for anything serious.
2. Why do they use some obscure LMDB database instead of simple .mat or .npz files? If your dataset (those 500,000 images) doesn't fit into memory, then .mat or .npz is not an option. You could implement your own batching strategy (like Neon), but using a simple and fast key-value database becomes the reasonable option.

So basically Caffe is tuned for big datasets, which is what deep learning is meant for. On small datasets deep learning is going to overfit, and shallow methods are more reliable.

  Tambet

anigma

Sep 9, 2015, 5:21:18 AM
to Caffe Users
I greatly appreciate the work done by all the students and researchers, as well as the existence and public availability of Caffe; many thanks to the developers!
However, compared to other open-source software such as PCL and OpenCV, the documentation is not at the same level, and it is rather difficult to control and design experiments.
Nobody is requiring papers to be documented, and it is clear that people should learn the theory of deep learning on their own or by attending classes.
I don't think that expressing frustration should provoke such a negative response.
There is a lot of experimental work in deep learning, and experiments should be well controlled, like in experimental physics; that is why understanding Caffe's control parameters is so important, and here the documentation can probably be improved to assist the people using it, since it has already been released to the public.

Kind Regards.

Benedetta Savelli

Feb 26, 2016, 4:52:48 PM
to Caffe Users
I'm a bit confused about the division between training set, test set, and validation set. In creating the solver.prototxt file that defines the training parameters, I have to define the test_iter parameter (which seems to be the number of iterations, i.e. how many batches of test images are sent into the network). Why do I have to define this parameter for training? Is the test set used during training as a validation set? Or does Caffe do training and testing at the same time?

Jan C Peters

Feb 29, 2016, 3:00:43 AM
to Caffe Users
Internally Caffe does not distinguish between test and validation sets. Think of it this way: Caffe enables you to run a so-called TEST phase after every test_iter training iterations. You can give a slightly different net for these TEST phases by specifying include { phase: TEST } in some layers of your network. Usually people use this to specify another data source, to monitor the generalization capabilities of the network. It seems to me that terminology is always a bit fuzzy there, so sometimes this other data source is called the "validation" set, sometimes "test" set (especially if there is no validation set). The point is, that it is not used for training, only for inference and accuracy estimation.

Jan

Fábio Ferreira

Mar 30, 2016, 7:57:02 AM
to Caffe Users
I think you meant "test_interval" instead of "test_iter" according to the MNIST documentation:

# Carry out testing every 500 training iterations.
test_interval: 500

Jan

Mar 31, 2016, 2:57:23 AM
to Caffe Users
Oh, yes you are absolutely right. Mixed that up somehow. Happens with all these very similar variable names ;-).

Jan

Ashutosh Singla

Apr 22, 2016, 10:14:20 AM
to Caffe Users
Hi,

I have a question about the batch size.

If I have a training set of 60 images and choose a batch size of 32, how does Caffe handle this?
Does it train on images 0-31 for the first batch, then 32-59 for the second batch, and then back to 0-31 for the third batch, etc.?
Or does it wrap around, so the second batch is images 32-59 plus a few from the start of the set, with the following batches continuing from there?

I tried to work it out from https://github.com/BVLC/caffe/blob/master/src/caffe/layers/data_layer.cpp#L181-L203 but couldn't figure it out.

It would be a great help if someone could explain this to me.

Jan

Apr 25, 2016, 10:21:40 AM
to Caffe Users
The data layers just rewind when necessary, the batch is always filled completely. So in your example the second batch contains images 32-59 and 0-3, in that order. The third batch then contains the images 4-35, and so on (all zero-based indices, all range endpoints inclusive). Think of the training data as a data stream, created by (infinitely many) subsequent copies of the given data source. The data layers always just return the next <batch_size> data points from that stream.
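That data stream can be sketched in a few lines of Python (the generator is made up for illustration; Caffe's data layers do this in C++):

```python
def batches(num_images, batch_size):
    """Yield batches of image indices, rewinding like Caffe's data layers."""
    i = 0
    while True:
        batch = [(i + k) % num_images for k in range(batch_size)]
        i = (i + batch_size) % num_images
        yield batch

gen = batches(num_images=60, batch_size=32)
print(next(gen))  # images 0-31
print(next(gen))  # images 32-59 followed by 0-3
print(next(gen))  # images 4-35
```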

Jan

Jan

Apr 25, 2016, 10:26:01 AM
to Caffe Users
I wanted to add: the actual loading of the data is somewhat difficult to understand from the code, but it basically works this way: there is a prefetching thread running in the background that actually loads the data. When Forward() is called on the data layer, it queries the prefetch thread for the data and puts it into the blobs. This is done to compensate for the slowness of disk access, making data loading and the actual network computation concurrent and thus more parallel.
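A toy Python sketch of that producer/consumer scheme (all names here are made up for illustration; Caffe's actual prefetching is C++):

```python
import queue
import threading

def prefetcher(load_batch, q, num_batches):
    # Background thread: load batches from (slow) disk ahead of time.
    for i in range(num_batches):
        q.put(load_batch(i))

def train(load_batch, num_batches):
    q = queue.Queue(maxsize=2)  # small buffer of already-loaded batches
    t = threading.Thread(target=prefetcher, args=(load_batch, q, num_batches))
    t.start()
    results = []
    for _ in range(num_batches):
        batch = q.get()            # Forward() just takes the next ready batch
        results.append(sum(batch)) # stand-in for the network computation
    t.join()
    return results

# Fake "disk" loader: batch i is the list [i, i, i].
print(train(lambda i: [i] * 3, num_batches=4))  # [0, 3, 6, 9]
```

While the "network" works on one batch, the prefetcher is already loading the next, which is the whole point of the design.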

Jan

fengyunxiaozi

Jan 9, 2017, 2:36:55 AM
to Caffe Users
So in the test phase, is it OK as long as test_iter * TEST batch size >= the total number of test images? Referring to the example you mentioned: if I set batch size 100 and test_iter 120, then 12k images are tested. Are settings like this also OK? Thanks!
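(For reference, the wrap-around arithmetic behind this question can be checked with a bit of plain Python; the helper name is made up, not a Caffe API:)

```python
def coverage(num_images, batch_size, test_iter):
    """How many samples a TEST phase draws, and how the stream wraps around."""
    seen = batch_size * test_iter
    full_passes, extra = divmod(seen, num_images)
    return seen, full_passes, extra

# 12k samples drawn from a 10k validation set: 2k images are seen twice.
print(coverage(10_000, 100, 120))   # (12000, 1, 2000)

# The odd-sized case from earlier in the thread: 1463 iterations of 10
# cover all 14,623 images, then repeat the first 7.
print(coverage(14_623, 10, 1463))   # (14630, 1, 7)
```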
