Mini batch learning and gradient averaging


Riddhiman Dasgupta

Jun 21, 2015, 2:56:51 AM
to tor...@googlegroups.com

In some of the examples, like CIFAR, learning is usually done in a pseudo-batch fashion, as follows:

         -- evaluate function for complete mini batch
         for i = 1,#inputs do
            -- estimate f
            local output = model:forward(inputs[i])
            local err = criterion:forward(output, targets[i])
            f = f + err

            -- estimate df/dW
            local df_do = criterion:backward(output, targets[i])
            model:backward(inputs[i], df_do)

            -- update confusion
            confusion:add(output, targets[i])
         end

         -- normalize gradients and f(X)
         gradParameters:div(#inputs)
         f = f/#inputs

However, most of the above can be done in true batched fashion, instead of iterating over each sample, as follows:

         -- evaluate function for complete mini batch
         local outputs = model:forward(inputs)
         local f = criterion:forward(outputs, targets)

         -- estimate df/dW
         local df_do = criterion:backward(outputs, targets)
         model:backward(inputs, df_do)

         -- update confusion
         confusion:batchAdd(outputs, targets)

My question is, do we still need to normalize the gradients:

         -- normalize gradients and f(X)
         gradParameters:div(#inputs)
         f = f/#inputs

Criterion modules usually have a sizeAverage field, so they will provide the averaged value of f, but what about gradParameters? Do we just scale the learning rate according to the batch size to compensate for this?
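
To make the question concrete, this is the kind of sanity check I have in mind (toy model, arbitrary sizes, and I am assuming ClassNLLCriterion's default sizeAverage = true behaviour):

         require 'nn'

         -- toy setup, purely for illustration
         local model = nn.Sequential():add(nn.Linear(10, 3)):add(nn.LogSoftMax())
         local criterion = nn.ClassNLLCriterion()   -- sizeAverage defaults to true
         local parameters, gradParameters = model:getParameters()

         local inputs  = torch.randn(8, 10)
         local targets = torch.LongTensor(8):random(1, 3)

         -- (a) per-sample loop, then divide by the batch size
         gradParameters:zero()
         for i = 1, inputs:size(1) do
            local output = model:forward(inputs[i])
            local err    = criterion:forward(output, targets[i])
            local df_do  = criterion:backward(output, targets[i])
            model:backward(inputs[i], df_do)
         end
         gradParameters:div(inputs:size(1))
         local gradLoop = gradParameters:clone()

         -- (b) single batched call, no explicit division
         gradParameters:zero()
         local outputs = model:forward(inputs)
         local f = criterion:forward(outputs, targets)
         local df_do = criterion:backward(outputs, targets)
         model:backward(inputs, df_do)

         -- if sizeAverage already averages df_do over the batch, this should be ~0
         print((gradLoop - gradParameters):abs():max())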

Additionally, what would be the difference between the following:

         model:backward(inputs, df_do)
         model:backward(inputs, df_do, 1/batchsize)
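
My reading of nn.Module is that the optional third argument is a scale applied while accumulating into gradParameters (gradInput itself being unaffected). If that is right then, continuing from the batched snippet above, I would expect the two calls to relate like this, but that is an assumption on my part:

         -- assumption: the scale argument multiplies the contributions
         -- accumulated into gradParameters
         gradParameters:zero()
         model:forward(inputs)
         model:backward(inputs, df_do)
         gradParameters:div(batchsize)
         local scaledByHand = gradParameters:clone()

         gradParameters:zero()
         model:forward(inputs)
         model:backward(inputs, df_do, 1/batchsize)

         -- should be ~0 if the assumption holds
         print((scaledByHand - gradParameters):abs():max())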


soumith

Jun 21, 2015, 2:58:42 AM
to torch7 on behalf of Riddhiman Dasgupta
You do not need to normalize the gradients when directly learning mini-batches. Just scale the learning rate to compensate for the mini-batch.
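
For example, with plain SGD via optim, something along these lines (a rough sketch; the 0.01 base rate is just a placeholder):

         require 'optim'

         local batchSize = 128
         local optimState = {
            -- for vanilla SGD, dividing the learning rate by the batch size has
            -- the same effect as dividing the summed gradParameters by it
            learningRate = 0.01 / batchSize,
         }

         -- in the training loop, feval returns f and the unnormalized gradParameters:
         -- optim.sgd(feval, parameters, optimState)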


jonathan

Jun 22, 2015, 6:42:20 PM
to tor...@googlegroups.com
smth chntla,

What do you mean by directly (as opposed to indirectly) learning mini-batches, and what do you mean by scaling "the learning rate to compensate for the mini-batch"?

Best.


On Sunday, June 21, 2015 at 2:58:42 AM UTC-4, smth chntla wrote:
You do not need to normalize the gradients when directly learning mini-batches. Just scale the learning rate to compensate for the mini-batch.

Felix

Nov 28, 2015, 5:53:55 AM
to torch7
Hi Bob

I did an experiment processing batches of 128 images in one go. When I scaled the gradient and the error, learning was quite poor. When I did not scale them and did not change the learning rate, I got results comparable to processing one image after the other.

Regards, Felix

PS: Others also seem not to scale the gradients or adjust the learning rate:

dimkas

Apr 27, 2017, 3:39:18 AM
to torch7
Hi, take a look at this:
Does anyone know if there is any official documentation for the necessity of scaling the LRate?