Multiple GPUs, GPU memory, batch size, and batch accumulation


Ken M Erney

Sep 14, 2016, 3:56:47 PM
to DIGITS Users
I have a system set up with DIGITS installed from source: DIGITS 4.1-dev and Caffe 0.15.13 with NCCL. I have two cards, both smaller Quadro K2200 cards with 4 GB of RAM each. I can run the KITTI example by setting the batch size to 2 and the batch accumulation to 5. I am trying to figure out what this actually means and what I need to set these to for other networks so that they will work with my cards. I imagine the answer to this question could be involved and have some complex nuances. Does anybody have advice on resources I could read that would help me understand how these parameters relate to GPU resources?

Thanks,
Ken



Greg Heinrich

Sep 14, 2016, 6:26:49 PM
to DIGITS Users
Hello,
with nv-caffe we do "strong" scaling, i.e. if you have a mini-batch size of 8 and train over 2 GPUs, then each GPU processes 4 samples on every iteration. Mini-batch training allows a greater amount of the work to be parallelized, so it generally leads to faster processing on GPUs. Additionally, there is merit in mini-batch training in that the network is less likely to diverge, since there is less variance in a mini batch than in a single sample.

Consequently, even when you don't have the GPU resources (memory) to train on large mini batches, you may still want to perform the parameter update only after processing a certain number of samples. This is where batch accumulation comes in: suppose you want to train on mini batches of 10 samples but your GPU can only process 2 samples at a time. You can reach a numerically identical solution if you use a mini-batch size of 2 samples and batch accumulation over 5 iterations. I hope this helps.
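To see why the two settings give the same update for a plain SGD step, here is a minimal NumPy sketch (not DIGITS or Caffe code; the linear model, data, and loss are made up for the demonstration): averaging the gradients of 5 micro-batches of 2 samples reproduces the gradient of one mini batch of 10.

```python
# Sketch: batch accumulation vs. one large mini batch, for a mean-squared-error
# loss on a toy linear model. Values here are random and purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))   # 10 samples, 3 features
y = rng.normal(size=10)        # targets
w = rng.normal(size=3)         # weights of a toy linear model

def grad(w, Xb, yb):
    """Gradient of the mean squared error over one batch."""
    err = Xb @ w - yb
    return 2.0 * Xb.T @ err / len(yb)

# (a) one mini batch of 10 samples
g_full = grad(w, X, y)

# (b) 5 accumulation steps of 2 samples each, averaged before the update
g_accum = np.zeros_like(w)
for i in range(0, 10, 2):
    g_accum += grad(w, X[i:i+2], y[i:i+2])
g_accum /= 5

print(np.allclose(g_full, g_accum))  # True: the accumulated gradient matches
```

(The equivalence holds for losses averaged over the batch and a plain SGD update; layers whose statistics depend on the per-iteration batch would behave slightly differently.)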

A good paper to read on the subject is "Efficient BackProp": http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf

Regards,
Greg.

Ken M Erney

Sep 15, 2016, 7:53:11 AM
to DIGITS Users
Thanks Greg, now I understand batch size vs. batch accumulation. The paper is also a good read; from it I was able to find some additional resources that talk about batch size and noise. Thanks again.