A number of papers keep the learning rate constant until the validation error plateaus, at which point they cut the rate by a constant factor. They repeat this process a small number of times (two or three, for example).
For example, Krizhevsky 2012:
We used an equal learning rate for all layers, which we adjusted manually throughout training. The heuristic which we followed was to divide the learning rate by 10 when the validation error rate stopped improving with the current learning rate. The learning rate was initialized at 0.01 and reduced three times prior to termination.
and Simonyan 2014:
The learning rate was initially set to 10^-2, and then decreased by a factor of 10 when the validation set accuracy stopped improving. In total, the learning rate was decreased 3 times, and the learning was stopped after 370K iterations (74 epochs).
and Zeiler 2013:
We anneal the learning rate throughout training manually when the validation error plateaus.
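To make the heuristic concrete, here is a rough Python sketch of the schedule as I understand it from these excerpts. The papers adjusted the rate by hand, so the `patience` parameter (how many epochs without improvement count as a plateau) is my own assumption, as are the function and argument names. The defaults for the initial rate, the factor of 10, and the three cuts follow the quotes above.

```python
def plateau_schedule(val_errors, initial_lr=0.01, factor=10.0,
                     patience=3, max_cuts=3):
    """Return the learning rate used at each epoch.

    val_errors: validation error observed after each epoch.
    patience:   assumed number of consecutive non-improving epochs
                that counts as a plateau (not specified in the papers).
    """
    lr = initial_lr
    best = float("inf")   # best validation error seen so far
    stale = 0             # epochs since the last improvement
    cuts = 0              # how many times the rate has been cut
    rates = []
    for err in val_errors:
        rates.append(lr)
        if err < best:
            best = err
            stale = 0
        else:
            stale += 1
        if stale >= patience:
            if cuts == max_cuts:
                break     # plateau after the final cut: stop training
            lr /= factor
            cuts += 1
            stale = 0
    return rates


if __name__ == "__main__":
    errors = [0.9, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7]
    print(plateau_schedule(errors, patience=2))
```

In this sketch the rate is cut after the fourth epoch, because the error has stayed at 0.8 for two epochs, so the remaining epochs run at a rate ten times smaller.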
My questions are:
2) Once the learning rate is cut, should learning continue from the most recent set of weights or the set of weights corresponding to the best validation error seen so far?
3) How does the efficacy of this technique compare to AdaDelta?
-- Is there a good example of using AdaDelta in pylearn2?
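
To make question 2 concrete, the two options differ only in which weights are carried forward at each cut. The sketch below is hypothetical bookkeeping, not pylearn2 code: the `Checkpointer` class and the `restore_best` flag are names I made up for illustration, and the weights are assumed to be any object that can be deep-copied.

```python
import copy


class Checkpointer:
    """Tracks the latest weights and the weights with the best validation error."""

    def __init__(self):
        self.best_error = float("inf")
        self.best_weights = None
        self.latest_weights = None

    def update(self, weights, val_error):
        # Always remember the most recent weights.
        self.latest_weights = copy.deepcopy(weights)
        # Remember the best-scoring weights separately.
        if val_error < self.best_error:
            self.best_error = val_error
            self.best_weights = copy.deepcopy(weights)

    def resume_point(self, restore_best):
        # restore_best=True:  roll back to the best validation error so far.
        # restore_best=False: continue from the most recent weights.
        return self.best_weights if restore_best else self.latest_weights
```

Calling `resume_point(True)` right after a cut corresponds to the second option in question 2, and `resume_point(False)` to the first.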