Large Batch-Size, Delaying Backprop for Nonseparable Loss Function

S Bald

Feb 14, 2017, 2:25:59 PM
to Caffe Users
I am using a non-separable loss function with my network, i.e., the loss cannot be expressed as individual losses over each training example which are then aggregated. I have set up the network so that each mini-batch is a group of "related" training examples, which the non-separable loss function jointly takes as input. The issue is that grouping the related training examples into one mini-batch forces the batch size to be large, say on the order of 1000, which causes memory issues for larger networks. I am wondering if it is possible to split the related batch into smaller batches (say, 5 batches of 200), feed each one forward, and only compute the loss and backprop after all 5 smaller batches have been fed forward. If anyone has other suggestions for how to deal with this, that would be appreciated as well. Thanks.

Patrick McNeil

Feb 14, 2017, 2:53:27 PM
to Caffe Users
Have you looked into the iter_size parameter in the solver.prototxt file? I think this would do what you want.

You can change the batch_size in the model.prototxt to a smaller value (200 in your example) and then set the iter_size in the solver.prototxt to the number of sub-batches (5 in your example); the effective batch size is then 1000 (200 * 5). I used this to train bigger models on a smaller card.
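
For example, roughly like this (just a sketch; where exactly batch_size goes depends on your data layer):

# model.prototxt, inside the data layer:
batch_size: 200

# solver.prototxt:
iter_size: 5   # gradients accumulated over 5 forward/backward passes,
               # effective batch size = 200 * 5 = 1000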

Patrick


S Bald

Feb 14, 2017, 3:38:32 PM
to Caffe Users
I am aware of `iter_size`. As I understand it, this still computes the loss for each small batch during each iteration, and after `iter_size` iterations it averages those per-batch losses (with smoothing) as the final loss; see https://github.com/BVLC/caffe/blob/master/src/caffe/solver.cpp#L209-L213

But I need to delay the computation of the *loss itself* until after `iter_size` iterations, since the loss is a function of the 1000 examples in the original batch.
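
In other words, with `iter_size` = k the solver effectively optimizes (1/k) * [L(B_1) + ... + L(B_k)] over the k small batches B_i, whereas what I need is L(B_1 ∪ ... ∪ B_k), and the two are not the same precisely because L is non-separable.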

Patrick McNeil

Feb 14, 2017, 7:38:13 PM
to Caffe Users
Sorry, I missed that part of the question.

I am not sure of any way of doing that short of iterating through the process and manually calculating the loss yourself. Just thinking out loud: maybe you could run an iteration, do a calculation, store the results (without updating the weights), and then run the next iteration. After you have gone through the whole related batch, do an update from the full batch.

Just a thought.

Patrick

S Bald

Feb 14, 2017, 10:18:04 PM
to Caffe Users
Thanks; I figured this wouldn't be possible without engaging in a more manual process. That being said, would you mind illustrating how one would implement such a scheme, i.e., "storing the results" after running iterations and then "doing an update" with all the stored results?

Patrick McNeil

Feb 16, 2017, 12:31:43 PM
to Caffe Users
I have not tried to store the results in the past, so I am not sure how exactly that process would look in code. 
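
That said, a rough, untested pycaffe sketch of what I was imagining might look something like the code below. I'm assuming the net prototxt ("net.prototxt" is just a placeholder name) exposes an Input blob called "data" sized for one sub-batch, ends at a single output blob called "feat" with no loss layer inside the net (so the joint loss lives outside Caffe), and sets force_backward: true so gradients still flow without a loss layer. joint_loss_grad() stands in for whatever code computes your non-separable loss and its gradient over all 1000 outputs.

import numpy as np
import caffe

caffe.set_mode_gpu()

# Placeholder prototxt; it should end at the "feat" blob and set
# force_backward: true, since there is no loss layer to trigger backprop.
net = caffe.Net('net.prototxt', caffe.TRAIN)

def joint_loss_grad(feats):
    """You supply this: return (loss, d_loss/d_feats) over the whole group."""
    raise NotImplementedError

def train_step(full_batch, lr, sub_batch_size=200):
    """One weight update from a group of related examples (e.g. 1000)."""
    n = full_batch.shape[0]
    sub_batches = [full_batch[i:i + sub_batch_size]
                   for i in range(0, n, sub_batch_size)]

    # Pass 1: forward each sub-batch and store its outputs.
    feats = []
    for sb in sub_batches:
        net.blobs['data'].data[...] = sb
        net.forward()
        feats.append(net.blobs['feat'].data.copy())
    feats = np.concatenate(feats, axis=0)

    # Joint, non-separable loss over all stored outputs, plus its gradient.
    loss, dfeat = joint_loss_grad(feats)

    # Clear accumulated parameter gradients; Caffe adds param diffs across
    # backward calls, which is what the loop below relies on.
    for name in net.params:
        for blob in net.params[name]:
            blob.diff[...] = 0

    # Pass 2: re-forward each sub-batch to restore its activations, then
    # backprop its slice of the joint-loss gradient.
    for i, sb in enumerate(sub_batches):
        net.blobs['data'].data[...] = sb
        net.forward()
        grad = dfeat[i * sub_batch_size:(i + 1) * sub_batch_size]
        net.backward(**{'feat': grad})

    # Plain SGD update from the accumulated full-batch gradient
    # (no momentum or weight decay here).
    for name in net.params:
        for blob in net.params[name]:
            blob.data[...] -= lr * blob.diff

    return loss

The re-forward in the second loop is needed because the net only holds the activations of the most recent sub-batch, so you pay two forward passes per sub-batch per update. Also note this bypasses the solver entirely and does a plain SGD step, so momentum, weight decay, and the learning rate policy would have to be handled by hand.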

If you get it figured out, I would be interested in seeing how that would look.

Patrick