Why learning rate in AdaDelta?


Jan C Peters

Oct 29, 2015, 6:29:30 AM
to Caffe Users
This is addressed to the caffe devs:

The description of the AdaDelta solver on http://caffe.berkeleyvision.org/tutorial/solver.html and the caffe code itself suggest that the computed update is multiplied by a learning rate \alpha, which changes according to the learning rate policy, if I see it correctly. But the original paper on AdaDelta, which is also referenced on the solver page, makes no mention of an additional learning rate; on the contrary, the title and abstract say this method should actually free you from "manual" learning rate adjustments. So I have the following questions:
  1. Why is this multiplication implemented in the caffe code, although there is no such multiplication in the original paper's pseudocode? (The two update rules are written out below for reference.)
  2. If I want to achieve the behavior that is described by the pseudocode in the paper, should I set the lr policy to "fixed" and base_lr to 1.0?
  3. Could it actually make sense to use other lr policies, or would that interfere with AdaDelta's adaptation mechanism?
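
For reference, here is the discrepancy question 1 refers to, as I read the paper and the solver code (please correct me if I misread either). In the paper's notation, with RMS[z]_t = \sqrt{E[z^2]_t + \epsilon}:

    \Delta x_t = -\frac{\mathrm{RMS}[\Delta x]_{t-1}}{\mathrm{RMS}[g]_t} \, g_t

    x_{t+1} = x_t + \Delta x_t                % paper: no learning rate anywhere
    x_{t+1} = x_t + \alpha_t \, \Delta x_t    % caffe: extra factor \alpha_t from the lr policy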

I have had a look at the respective PRs, but could not find anything related to these questions.


On a side note: the docs (and also the caffe.proto) could reflect the independence between the learning rate policy (with its parameters) and the solver type (with its parameters) a bit better. These parameters are somewhat mixed up in caffe.proto, and looking at the code only helps marginally. On the solver page the explanation of the solver types is quite nice, but the possible lr policies are treated rather poorly. Don't misunderstand me: caffe is a great tool, probably the greatest there is for deep learning today. Sadly I don't have much time to help improve it myself.


Jan

Evan Shelhamer

Oct 29, 2015, 5:00:43 PM
to Jan C Peters, Caffe Users
1. Why is this multiplication implemented in the caffe code, although there is no such multiplication in the original paper's pseudocode?

Although adaptive solvers strive to do away with learning rate tuning, in practice the issue isn't completely solved, and setting the learning rate can still help. Of course, the adaptation can effectively counter the learning rate with its own scaling if the optimization pushes it in that direction.

2. If I want to achieve the behavior that is described by the pseudocode in the paper, should I set the lr policy to "fixed" and base_lr to 1.0?

Right: the "fixed" lr policy with a `base_lr` of 1.0 is equivalent to no learning rate at all.
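
An untested sketch of what that could look like in the solver prototxt — as I read the current caffe.proto, `momentum` plays the role of the paper's decay rate rho and `delta` of its conditioning epsilon, but check these against your version:

    # paper-faithful AdaDelta: make the learning rate a no-op
    solver_type: ADADELTA
    lr_policy: "fixed"   # alpha never changes...
    base_lr: 1.0         # ...and multiplying by 1.0 does nothing
    momentum: 0.95       # the paper's decay rate rho
    delta: 1e-6          # the paper's epsilon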

3. Could it actually make sense to use other lr policies, or would that interfere with AdaDelta's adaptation mechanism?

I don't have any experience with this, but I would expect the lr policy to interfere with the solver's adaptation, since it will throw off the accumulated statistics. I have seen effects like this in SGD + momentum, since lr changes are not immediately incorporated into the scale of the momentum update.
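
Concretely, Caffe's SGD + momentum update is roughly (v is the accumulated history, mu the momentum):

    v_{t+1} = \mu \, v_t + \alpha_t \, g_t
    x_{t+1} = x_t - v_{t+1}

If \alpha drops at some step, v still carries contributions scaled by the old, larger \alpha for on the order of 1/(1 - \mu) further steps, so the effective step size lags behind the policy change.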

The docs (and also the caffe.proto) could reflect the independence between the learning rate policy (with its parameters) and the solver type (with its parameters) a bit better

Yeah, a clear mapping between the general solver fields and the solver-type-specific fields would be helpful documentation. Part of the confusion is that solver types do not have their own message types the way the different layer types do.
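
Hypothetically, something along these lines in caffe.proto would mirror how layers carry their own parameter messages (e.g. ConvolutionParameter); the message and field names here are invented purely for illustration, nothing like this exists today:

    // hypothetical per-solver messages, mirroring the LayerParameter pattern
    message AdaDeltaParameter {
      optional float rho = 1 [default = 0.95];     // the paper's decay rate
      optional float epsilon = 2 [default = 1e-6]; // the paper's conditioning constant
    }
    // SolverParameter would then hold an optional adadelta_param field,
    // just as LayerParameter holds an optional convolution_param field.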

Evan Shelhamer




Jan C Peters

Oct 30, 2015, 4:06:24 AM
to Caffe Users, jcpet...@gmail.com
Thanks for the answers, Evan!