lr_rate remains constant while using Adadelta in Caffe

ayush....@tonboimaging.com

Dec 6, 2017, 1:35:17 AM
to Caffe Users

This is the solver.prototxt I used for training.

# The train/test net protocol buffer definition
net: "train_val.prototxt"

# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.0001
lr_policy: "fixed"
momentum: 0.95
weight_decay: 0.0005
gamma: 0.99
stepsize: 40

# Display every 100 iterations
display: 100

# The maximum number of iterations
max_iter: 65000

# snapshot intermediate results
snapshot: 10000
snapshot_prefix: "./train/save_"

# solver mode: CPU or GPU
solver_mode: GPU
type: "AdaDelta"
delta: 1e-8

The issue I am facing is that the learning rate remains constant throughout all the iterations. I realize this may be because I have set lr_policy to "fixed", but I cannot remove the lr_policy (isn't the point of AdaDelta that the learning rate is calculated during the learning process?). I have also noticed that despite setting type to "AdaDelta", Caffe calls sgd_solver.cpp. Please suggest modifications to my solver so that AdaDelta is actually used for training.

Przemek D

Dec 6, 2017, 3:11:58 AM
to Caffe Users
You keep seeing calls to sgd_solver.cpp because all solvers inherit from the base SGDSolver (see Sean Bell's answer for details). No worries: if you set type: "AdaDelta", you are training with it. As for the LR, AdaDelta (which extends AdaGrad) does not modify the global LR value, so you will never see it change in your output. What it does is calculate a per-weight learning rate, an individual one for each weight. A good explanation is on Sebastian Ruder's blog; read about AdaGrad first, then AdaDelta below that.
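To make "per-weight learning rate" concrete, here is a minimal NumPy sketch of the vanilla AdaDelta update (following Zeiler's paper and the notation on Ruder's blog, not Caffe's actual C++; the names adadelta_step, acc_grad, acc_update and the rho/eps defaults are purely illustrative). Each weight's step is scaled by its own running statistics, which is why the single LR value printed in the log never moves.

import numpy as np

def adadelta_step(w, grad, acc_grad, acc_update, rho=0.95, eps=1e-8):
    # One AdaDelta step on a parameter array w (Zeiler 2012).
    # acc_grad   : running average of squared gradients, E[g^2]
    # acc_update : running average of squared updates,   E[dx^2]
    # Both accumulators have the same shape as w, so every weight
    # effectively gets its own step size; no single global LR changes.
    acc_grad = rho * acc_grad + (1.0 - rho) * grad ** 2
    step = -np.sqrt(acc_update + eps) / np.sqrt(acc_grad + eps) * grad
    acc_update = rho * acc_update + (1.0 - rho) * step ** 2
    return w + step, acc_grad, acc_update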

ayush....@tonboimaging.com

Dec 6, 2017, 4:19:34 AM
to Caffe Users
A follow-up question after reading your answer (and the blogs you mentioned): since the parameter updates do not depend on the learning rate at all, shouldn't the learning rate be irrelevant when using AdaDelta? Yet with AdaDelta I see two different results (in terms of reduction in loss) for two different learning rates.

Przemek D

Dec 7, 2017, 3:17:37 AM
to Caffe Users
Yes, because the individual, per-parameter LR depends on the global LR (termed η on Sebastian's blog). The relevant code: SGDSolver::ApplyUpdate retrieves the global LR and calls the virtual function ComputeUpdateValue; see the overloaded AdaDeltaSolver::ComputeUpdateValue.
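A rough sketch of that flow in NumPy pseudocode (not the real Caffe C++; the function and dictionary names are made up, and I'm assuming Caffe's momentum and delta parameters play the roles of AdaDelta's rho and epsilon): the global rate from the fixed lr_policy multiplies the adaptive per-parameter step at the very end, so two different base_lr values give two different training curves.

import numpy as np

def apply_update(w, grad, state, base_lr=1e-4, momentum=0.95, delta=1e-8):
    # Sketch of SGDSolver::ApplyUpdate handing off to AdaDeltaSolver::ComputeUpdateValue.
    # With lr_policy "fixed", the global rate is just base_lr.
    rate = base_lr
    # AdaDelta bookkeeping, one value per weight.
    state['acc_grad'] = momentum * state['acc_grad'] + (1 - momentum) * grad ** 2
    step = np.sqrt(state['acc_update'] + delta) / np.sqrt(state['acc_grad'] + delta) * grad
    state['acc_update'] = momentum * state['acc_update'] + (1 - momentum) * step ** 2
    # The global LR scales the adaptive step, so base_lr still matters.
    return w - rate * step

# Tiny usage example:
w = np.zeros(3)
state = {'acc_grad': np.zeros(3), 'acc_update': np.zeros(3)}
w = apply_update(w, np.array([0.1, -0.2, 0.3]), state)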

ayush....@tonboimaging.com

Dec 7, 2017, 7:20:32 AM
to Caffe Users
Thanks Przemek D for the reply.