Relation between batch size and learning rate/momentum


Sid

Jun 28, 2015, 11:51:35 PM
to lasagn...@googlegroups.com
Sorry, this isn't Lasagne-specific. Is there a rule of thumb for adjusting the learning rate and momentum when changing the batch size, with all other conditions kept the same? I understand that the thorough approach is to search for the hyperparameters all over again.

Sander Dieleman

Jun 29, 2015, 7:52:32 AM
to lasagn...@googlegroups.com, spra...@umbc.edu
It strongly depends on whether you are averaging or summing the loss over the examples in each minibatch. Both conventions are used.

If you're summing, the magnitude of the updates scales directly with the batch size, so you definitely have to adjust at least the learning rate accordingly.
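To make the difference between the two conventions concrete, here is a minimal sketch on a one-parameter least-squares problem. The functions `grad_sum` and `grad_mean` and the toy data are made up for illustration; they are not part of Lasagne.

```python
def grad_sum(w, xs, ys):
    """Gradient w.r.t. w of the SUMMED squared error over a minibatch."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


def grad_mean(w, xs, ys):
    """Gradient w.r.t. w of the AVERAGED squared error over a minibatch."""
    return grad_sum(w, xs, ys) / len(xs)


# Toy minibatch (illustrative values only).
xs, ys = [1.0, 2.0], [2.0, 4.0]

# Doubling the batch by repeating the same examples doubles the summed
# gradient, while the averaged gradient stays the same.
small_sum = grad_sum(0.0, xs, ys)
large_sum = grad_sum(0.0, xs * 2, ys * 2)
small_mean = grad_mean(0.0, xs, ys)
large_mean = grad_mean(0.0, xs * 2, ys * 2)
```

With a summed loss, keeping the step size comparable after multiplying the batch size by g would therefore mean dividing the learning rate by roughly g.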

If you're averaging, you might still need to adapt the hyperparameters. Larger minibatches reduce the variance of the gradient estimate, which makes it possible to use a slightly larger learning rate. If I recall correctly, a rule of thumb is: if the minibatch size goes up by a factor of g, the learning rate can be raised by a factor of sqrt(g). In practice, though, it's usually better to try a few values and pick the best one, because many interacting factors influence the optimization process.
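As a rough sketch of that square-root rule for an averaged loss, assuming a hypothetical helper `scale_learning_rate` and example values chosen only for illustration:

```python
import math


def scale_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Square-root rule of thumb: multiply the learning rate by sqrt(g),
    where g is the factor by which the minibatch size changed."""
    g = new_batch_size / base_batch_size
    return base_lr * math.sqrt(g)


# Going from batches of 64 to batches of 256 gives g = 4, so sqrt(g) = 2.
new_lr = scale_learning_rate(0.01, 64, 256)
```

The result is best treated as a starting point for a small search around it rather than a final value.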

Momentum can usually stay the same, in my experience. In fact, I rarely tune this parameter at all anymore; I usually leave it at 0.9.

Sander

vinita...@gmail.com

Jun 29, 2015, 8:01:14 AM
to lasagn...@googlegroups.com
Great! Thanks for this. Also, should one use Nesterov momentum over classical momentum? Why or why not?

Sander Dieleman

Jun 29, 2015, 4:58:34 PM
to lasagn...@googlegroups.com, vinita...@gmail.com
I would say that for all practical purposes Nesterov momentum is strictly better than classical momentum, although the truth is probably a little more nuanced. Section 2 of this paper goes into more detail: http://www.cs.toronto.edu/~fritz/absps/momentum.pdf
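For reference, the two update rules differ only in where the gradient is evaluated. The sketch below uses the common formulation in which classical momentum takes the gradient at the current parameters and Nesterov momentum takes it at the look-ahead point theta + mu * v. The scalar quadratic objective and the function names are assumptions made purely for illustration.

```python
def grad(theta):
    """Gradient of the toy objective f(theta) = 0.5 * theta**2."""
    return theta


def classical_step(theta, v, lr, mu):
    """Classical momentum: gradient evaluated at the current parameters."""
    v = mu * v - lr * grad(theta)
    return theta + v, v


def nesterov_step(theta, v, lr, mu):
    """Nesterov momentum: gradient evaluated at the look-ahead point."""
    v = mu * v - lr * grad(theta + mu * v)
    return theta + v, v


# Two steps of each method from the same starting point.
lr, mu = 0.1, 0.9
cm_theta, cm_v = 1.0, 0.0
nag_theta, nag_v = 1.0, 0.0
for _ in range(2):
    cm_theta, cm_v = classical_step(cm_theta, cm_v, lr, mu)
    nag_theta, nag_v = nesterov_step(nag_theta, nag_v, lr, mu)
```

The first step is identical for both methods because the velocity starts at zero; from the second step on, the look-ahead gradient already accounts for the accumulated velocity, so the Nesterov velocity is corrected slightly earlier.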

Sander