Learning Rate Dynamics


Alex Rothberg

Sep 16, 2014, 11:46:46 AM
to pylear...@googlegroups.com
A number of papers keep the learning rate constant until the validation error plateaus, at which point they cut the rate by some constant factor. They repeat this process a small number of times (2, 3, etc.).

For example:

Krizhevsky 2012:
We used an equal learning rate for all layers, which we adjusted manually throughout training. The heuristic which we followed was to divide the learning rate by 10 when the validation error rate stopped improving with the current learning rate. The learning rate was initialized at 0.01 and reduced three times prior to termination.

and Simonyan 2014:
The learning rate was initially set to 10^-2, and then decreased by a factor of 10 when the validation set accuracy stopped improving. In total, the learning rate was decreased 3 times, and the learning was stopped after 370K iterations (74 epochs).

and Zeiler 2013: 
We anneal the learning rate throughout training manually when the validation error plateaus.
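
To make the heuristic concrete, here is a minimal sketch in plain Python (not pylearn2 code). The papers quoted above adjusted the rate manually, so the `patience` rule used here to decide that the error has "stopped improving" is an assumption, as are the function and parameter names.

```python
def plateau_schedule(val_errors, initial_lr=0.01, factor=10.0,
                     patience=3, max_cuts=3):
    """Return the learning rate used for each epoch.

    The rate is divided by `factor` whenever the validation error has not
    improved for `patience` consecutive epochs, at most `max_cuts` times.
    `patience` is an assumed stand-in for a manual "it has plateaued" call.
    """
    lr = initial_lr
    best = float("inf")   # best validation error seen so far
    stale = 0             # epochs since the last improvement
    cuts = 0              # number of reductions applied so far
    lrs = []
    for err in val_errors:
        lrs.append(lr)    # rate that was in effect for this epoch
        if err < best:
            best = err
            stale = 0
        else:
            stale += 1
        if stale >= patience and cuts < max_cuts:
            lr /= factor
            cuts += 1
            stale = 0
    return lrs
```

Calling `plateau_schedule([0.5, 0.4, 0.4, 0.4, 0.4, 0.3])` keeps the rate at 0.01 for the first five epochs and uses 0.001 for the sixth, since the error stalled for three epochs.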

My questions are:
1) Is MonitorBasedLRAdjuster the right way to do this in pylearn2?
2) Once the learning rate is cut, should learning continue from the most recent set of weights or the set of weights corresponding to the best validation error seen so far?
3) How does the efficacy of this technique compare to AdaDelta?
-- Is there a good example of using AdaDelta in pylearn2?

Mehdi Mirza

Sep 16, 2014, 2:27:39 PM
to pylear...@googlegroups.com
It should be done with the weights corresponding to the best validation score, or at least that's how some of us did it.
To do so, we started a new experiment with a new YAML file and loaded the model corresponding to the best validation score. That is how it is done in scripts/papers/svhn.yaml and svhn2.yaml.
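
As a rough illustration of restarting from the best weights, here is a toy sketch in plain Python. It does everything in a single loop rather than a second experiment with a new YAML file, and it does not use pylearn2. The quadratic loss, the `patience` rule, and all names are assumptions made for the example.

```python
def train_with_restarts(initial_w=0.0, lr=1.1, factor=10.0, patience=2,
                        max_cuts=2, epochs=30, restore_best=True):
    """Gradient descent on the toy loss (w - 3)^2 with plateau-based cuts.

    When the loss has not improved for `patience` epochs, the learning rate
    is divided by `factor` and, if `restore_best` is set, training resumes
    from the best weights seen so far instead of the most recent ones.
    """
    def loss(w):
        return (w - 3.0) ** 2

    def grad(w):
        return 2.0 * (w - 3.0)

    w = initial_w
    best_w, best_loss = w, loss(w)
    stale = 0
    cuts = 0
    for _ in range(epochs):
        w -= lr * grad(w)              # one "epoch" of training
        current = loss(w)              # stands in for validation error
        if current < best_loss:
            best_w, best_loss = w, current
            stale = 0
        else:
            stale += 1
        if stale >= patience and cuts < max_cuts:
            lr /= factor
            cuts += 1
            stale = 0
            if restore_best:
                w = best_w             # roll back to the best checkpoint
    return w, best_loss
```

With the default rate of 1.1 the first two updates move away from the minimum, so the rollback discards them before training continues at the smaller rate.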

arot...@4combinator.com

Sep 17, 2014, 10:29:26 AM
to pylear...@googlegroups.com
Does anyone use either of MonitorBasedLRAdjuster or AdaDelta?

Let's say we get a new GPU with twice the video RAM, so I plan to double the batch size. How should I change the learning rate or momentum in that case?

Madison May

Sep 17, 2014, 11:56:13 AM
to pylear...@googlegroups.com
AdaDelta is definitely useful and typically provides an easy way to get decent results with minimal hyperparameter tweaking. It's worth noting that AdaDelta does level off near the end of learning, though, and can be beaten by momentum with well-tuned hyperparameters. I would highly recommend reading Matthew Zeiler's paper if you're interested in the details: http://arxiv.org/pdf/1212.5701v1.pdf
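
For reference, the update rule from that paper can be sketched for a single scalar parameter as follows. This is plain Python for illustration, not the pylearn2 implementation. The function name and signature are made up, and the defaults `rho=0.95` and `eps=1e-6` are just commonly used values.

```python
import math


def adadelta_minimize(grad, x0, rho=0.95, eps=1e-6, steps=100):
    """Run `steps` AdaDelta updates on a scalar parameter and return it.

    `grad` is a function returning the gradient at x. There is no global
    learning rate: each step is scaled by the ratio of two running RMS
    values, one of past updates and one of past gradients.
    """
    x = x0
    avg_sq_grad = 0.0   # running average of squared gradients, E[g^2]
    avg_sq_step = 0.0   # running average of squared updates, E[dx^2]
    for _ in range(steps):
        g = grad(x)
        avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g * g
        # The numerator uses the update average from the previous step.
        step = -math.sqrt(avg_sq_step + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_step = rho * avg_sq_step + (1.0 - rho) * step * step
        x += step
    return x
```

For example, `adadelta_minimize(lambda x: 2.0 * (x - 3.0), 0.0)` moves x from 0 toward the minimum at 3. The first steps are small because both running averages start at zero and the numerator is initially just sqrt(eps).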



--
You received this message because you are subscribed to the Google Groups "pylearn-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pylearn-user...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Alex Rothberg

Sep 17, 2014, 11:57:36 AM
to pylear...@googlegroups.com
Do you use AdaDelta with pylearn2? Do you have an example yaml file?
