In some of the examples, like the CIFAR demo, learning is done in a pseudo-batch fashion, iterating over the mini batch one sample at a time:
-- reset gradients and the accumulated loss
gradParameters:zero()
local f = 0
-- evaluate function for complete mini batch
for i = 1,#inputs do
   -- estimate f
   local output = model:forward(inputs[i])
   local err = criterion:forward(output, targets[i])
   f = f + err
   -- estimate df/dW
   local df_do = criterion:backward(output, targets[i])
   model:backward(inputs[i], df_do)
   -- update confusion
   confusion:add(output, targets[i])
end
-- normalize gradients and f(X)
gradParameters:div(#inputs)
f = f/#inputs
However, most of the above can be done in a truly batched fashion, instead of iterating over each sample, as follows:
-- evaluate function for complete mini batch
-- (here inputs and targets are batched tensors rather than tables of samples)
local outputs = model:forward(inputs)
local f = criterion:forward(outputs, targets)
-- estimate df/dW
local df_do = criterion:backward(outputs, targets)
model:backward(inputs, df_do)
-- update confusion
confusion:batchAdd(outputs, targets)
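For context, here is a minimal sketch of how such batch tensors could be assembled; the trainData and shuffle names are my own placeholders, not part of the demo above:

-- hypothetical sketch: pack CIFAR-sized samples into batch tensors so the
-- whole mini batch goes through forward/backward in a single call
local batchSize = 32
local inputs = torch.Tensor(batchSize, 3, 32, 32)  -- batch of 3x32x32 images
local targets = torch.Tensor(batchSize)
for i = 1, batchSize do
   inputs[i]:copy(trainData.data[shuffle[i]])   -- assumed dataset layout
   targets[i] = trainData.labels[shuffle[i]]
end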
My question is whether we still need to normalize the gradients:
-- normalize gradients and f(X)
gradParameters:div(#inputs)
f = f/#inputs
Criterion modules usually have a sizeAverage field, so they will provide the averaged value of f, but what about gradParameters? Do we just scale the learning rate according to the batch size to compensate?
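For reference, this is the sizeAverage behavior I mean, using nn.MSECriterion as an arbitrary example:

require 'nn'

-- most criteria average by default
local criterion = nn.MSECriterion()
print(criterion.sizeAverage)  -- true

-- with sizeAverage enabled, both the loss from forward and the gradInput
-- from backward are normalized (by batch size or element count, depending
-- on the criterion), so the gradients fed into model:backward are already
-- scaled down; setting it to false yields summed loss/gradients instead
criterion.sizeAverage = false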
Additionally, what would be the difference between the following two calls?
model:backward(inputs, df_do)
model:backward(inputs, df_do, 1/batchsize)
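As far as I can tell from nn/Module.lua, the optional third argument is a scale factor that is forwarded to accGradParameters, roughly along these lines (paraphrased sketch, not verbatim):

-- the scale argument multiplies the parameter gradients accumulated into
-- gradParameters, but leaves the gradInput propagated backwards untouched
function Module:backward(input, gradOutput, scale)
   scale = scale or 1
   self:updateGradInput(input, gradOutput)
   self:accGradParameters(input, gradOutput, scale)
   return self.gradInput
end

So, if I understand correctly, model:backward(inputs, df_do, 1/batchsize) would accumulate parameter gradients already divided by the batch size, which for a single accumulation looks equivalent to calling gradParameters:div(batchsize) afterwards, except that the intermediate gradInputs are not scaled.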