Hello,
With nv-caffe we do "strong" scaling, i.e. if you have a mini-batch size of 8 and train over 2 GPUs, each GPU processes 4 samples on every iteration. Mini-batch training allows more of the work to be parallelized, so it generally leads to faster processing on GPUs. There is also merit in mini-batch training in that the network is less likely to diverge, since there is less variance across a mini batch than in a single sample.

Consequently, even when you don't have the GPU resources (memory) to train on large mini batches, you may still want to perform the parameter updates only after processing a certain number of samples. This is where batch accumulation comes in: suppose you want to train on mini batches of 10 samples but your GPU can only process 2 samples at a time. You can reach a numerically identical solution by using a mini-batch size of 2 samples and batch accumulation over 5 iterations. I hope this helps.
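To make the equivalence concrete, here is a minimal sketch in plain Python/NumPy (not nv-caffe code; the toy least-squares model and the names grad, w_full, w_acc are just illustrative assumptions). It shows that accumulating the gradients of 5 micro batches of 2 samples at fixed weights and then applying one update matches a single update on the full batch of 10. In Caffe itself this pattern is typically configured with the solver's iter_size together with the data layer's batch_size.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy least-squares model: loss = mean of 0.5 * (x.w - y)^2 over the batch
    w = rng.normal(size=3)
    X = rng.normal(size=(10, 3))   # the "large" mini batch of 10 samples
    y = rng.normal(size=10)
    lr = 0.1

    def grad(w, X, y):
        # Gradient of the mean squared error over the given samples
        return X.T @ (X @ w - y) / len(y)

    # (a) one update on the full mini batch of 10
    w_full = w - lr * grad(w, X, y)

    # (b) batch accumulation: 5 micro batches of 2 samples, gradients are
    #     accumulated at the *same* w and averaged before the single update
    acc = np.zeros_like(w)
    for i in range(5):
        acc += grad(w, X[2*i:2*i + 2], y[2*i:2*i + 2])
    w_acc = w - lr * acc / 5

    print(np.allclose(w_full, w_acc))   # True: the two updates coincide

The only requirement is that the parameters stay fixed while the gradients are being accumulated; the update is applied once per accumulation cycle.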
A good paper to read on the subject is "Efficient back-prop":
http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf

Regards,
Greg.