Hello,
With nv-caffe we do "strong" scaling, i.e. if you have a mini-batch size of 8 and train over 2 GPUs, each GPU processes 4 samples on every iteration. Mini-batch training allows more of the work to be parallelized, so it generally leads to faster processing on GPUs. There is also merit in mini-batch training in that the network is less likely to diverge, since there is less variance across a mini batch than in a single sample.

Consequently, even when you don't have the GPU resources (memory) to train on large mini batches, you may still want to perform the parameter updates only after processing a certain number of samples. This is where batch accumulation comes in: suppose you want to train on mini batches of 10 samples but your GPU can only process 2 samples at a time. You can reach a numerically identical solution by using a mini-batch size of 2 samples and batch accumulation over 5 iterations. I hope this helps.
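To make the equivalence concrete, here is a minimal sketch in plain Python/NumPy (not nv-caffe code; the toy least-squares model and the names grad, w_full, w_acc are just illustrative assumptions). It shows that accumulating the gradients of 5 micro batches of 2 samples at fixed weights and then applying one update matches a single update on the full batch of 10. In Caffe itself this pattern is typically configured with the solver's iter_size together with the data layer's batch_size.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy least-squares model: loss = mean of 0.5 * (x.w - y)^2 over the batch
    w = rng.normal(size=3)
    X = rng.normal(size=(10, 3))   # the "large" mini batch of 10 samples
    y = rng.normal(size=10)
    lr = 0.1

    def grad(w, X, y):
        # Gradient of the mean squared error over the given samples
        return X.T @ (X @ w - y) / len(y)

    # (a) one update on the full mini batch of 10
    w_full = w - lr * grad(w, X, y)

    # (b) batch accumulation: 5 micro batches of 2 samples, gradients are
    #     accumulated at the *same* w and averaged before the single update
    acc = np.zeros_like(w)
    for i in range(5):
        acc += grad(w, X[2*i:2*i + 2], y[2*i:2*i + 2])
    w_acc = w - lr * acc / 5

    print(np.allclose(w_full, w_acc))   # True: the two updates coincide

The only requirement is that the parameters stay fixed while the gradients are being accumulated; the update is applied once per accumulation cycle.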
A good paper to read on the subject is "Efficient back-prop":
http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf

Regards,
Greg.