Loss weight vs Learning rate


Nam Vo

Feb 20, 2016, 6:03:38 PM
to Caffe Users
Does anybody know how the loss weight is used by Caffe? I was under the impression that increasing a loss weight by 10x is equivalent to increasing the learning rate by 10x, but it seems that's not the case.

Hossein Hasanpour

Feb 21, 2016, 1:29:23 AM
to Caffe Users
Did you read the Caffe layer catalogue? http://caffe.berkeleyvision.org/tutorial/layers.html
Also have a look here: https://github.com/BVLC/caffe/blob/master/src/caffe/proto/caffe.proto

Read the comments and you'll get an idea of how things work in Caffe. For example, for the learning rate policy, check this out:

// The learning rate decay policy. The currently implemented learning rate
// policies are as follows:
//    - fixed: always return base_lr.
//    - step: return base_lr * gamma ^ (floor(iter / step))
//    - exp: return base_lr * gamma ^ iter
//    - inv: return base_lr * (1 + gamma * iter) ^ (- power)
//    - multistep: similar to step but it allows non uniform steps defined by
//      stepvalue
//    - poly: the effective learning rate follows a polynomial decay, to be
//      zero by the max_iter. return base_lr (1 - iter/max_iter) ^ (power)
//    - sigmoid: the effective learning rate follows a sigmod decay
//      return base_lr ( 1/(1 + exp(-gamma * (iter - stepsize))))
//
// where base_lr, max_iter, gamma, step, stepvalue and power are defined
// in the solver parameter protocol buffer, and iter is the current iteration.
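These policies are simple enough to re-derive in a few lines. A hedged sketch (the function name and signature here are illustrative, not Caffe's actual API; Caffe computes this inside the solver):

```cpp
#include <cassert>
#include <cmath>
#include <string>

// Illustrative re-implementation of a few of the policies quoted above.
double effective_lr(const std::string& policy, double base_lr, int iter,
                    double gamma, int step, double power, int max_iter) {
  if (policy == "fixed") return base_lr;                                  // always base_lr
  if (policy == "step")  return base_lr * std::pow(gamma, iter / step);   // int division = floor
  if (policy == "exp")   return base_lr * std::pow(gamma, iter);
  if (policy == "inv")   return base_lr * std::pow(1.0 + gamma * iter, -power);
  if (policy == "poly")  return base_lr * std::pow(1.0 - double(iter) / max_iter, power);
  return base_lr;  // unknown policy: fall back to base_lr
}
```

For example, with base_lr = 0.01, gamma = 0.1 and step = 100, the "step" policy drops the rate by a factor of 10 every 100 iterations.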

Nam Vo

Feb 21, 2016, 2:41:26 AM
to Caffe Users
That's not what I wanted to ask. The loss layer has a parameter called "loss_weight" and I want to know how it is used (mathematically).
I've already looked at those links; they don't provide a clear description of this parameter.
The Caffe INFO log output makes it look like a scaling factor on the loss, while it seems that it's actually not.

Jan C Peters

Feb 22, 2016, 5:35:05 AM
to Caffe Users
Afaik the loss weight is (as you said) just scaling the loss, but BEFORE it is backpropagated through the network! The learning rate is only applied in the UPDATE step. Since in backpropagation the loss is not "spread" linearly and does not affect all parameters equally, changing the learning rate (which does affect all parameters equally) can lead to totally different outcomes.

Jan

Nam Vo

Feb 22, 2016, 8:49:09 PM
to Caffe Users
It is linear; that's how the derivative is calculated.
Say L2(x) = 10 * L1(x),
then d(L2)/dx = 10 * d(L1)/dx.

I ended up digging into the code. It turns out that loss_weight is indeed a scaling factor on the loss, so a 10x loss weight is equivalent to a 10x learning rate.
However, Caffe's implementation is odd in that the solver does the scaling and accumulates the loss in the forward phase, but in the backward phase each layer has to scale its own gradient in its backward function (I didn't realize this because I was using my own loss function). Someone should fix this inconsistency.
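For a single loss under plain SGD, this equivalence can be sketched in a few lines (assuming no momentum or weight decay; the function and its signature are illustrative, not Caffe's actual solver code):

```cpp
#include <cassert>
#include <cmath>

// Minimal sketch of one SGD step on a scalar parameter. The loss weight
// scales the gradient in the backward pass; the learning rate is applied
// only in the solver's update, so with a single loss the two factors are
// interchangeable in the product lr * loss_weight.
double sgd_step(double param, double lr, double loss_weight,
                double dloss_dparam) {
  double grad = loss_weight * dloss_dparam;  // scaling done in backward()
  return param - lr * grad;                  // lr applied only here
}
```

With momentum, weight decay, or several losses the symmetry breaks, which is the caveat discussed later in the thread.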

Jan C Peters

Feb 23, 2016, 3:35:58 AM
to Caffe Users
Actually you are right about the linearity; I confused that with the non-linearity of the activation functions. But when thinking of the loss_weights as factors, the scaling itself is indeed linear everywhere.

But I am still not convinced that scaling the learning rate and the loss weight are basically the same thing. They are if you only consider the loss weight of the single topmost loss layer in your network (which is admittedly the most common case, I guess). If you have additional layers providing loss, changing their loss_weights may have a nontrivial effect on the training, different from anything you could achieve by changing the learning rate. And this is what loss_weight was actually meant for: weighing several losses in relation to each other.

Jan

Nam Vo

Feb 23, 2016, 5:03:02 PM
to Caffe Users
Yeah, you are right. I meant that if you have one loss function and every layer can be backpropagated from that loss, then learning rate and loss weight are equivalent.
For my task I have two loss functions of a different nature; individually they work best at substantially different learning rates. I have to rely on loss weights for the training to make stable progress on both losses.
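A sketch of what such a two-loss setup might look like in prototxt (all layer and blob names here are made up for illustration):

```protobuf
# Hypothetical two-loss network fragment; names are illustrative.
layer {
  name: "loss_cls"
  type: "SoftmaxWithLoss"
  bottom: "fc_cls"
  bottom: "label"
  top: "loss_cls"
  loss_weight: 1.0      # default weight for a loss layer
}
layer {
  name: "loss_reg"
  type: "EuclideanLoss"
  bottom: "fc_reg"
  bottom: "target"
  top: "loss_reg"
  loss_weight: 0.01     # scaled down so this loss trains stably at the shared base_lr
}
```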

Amir Abdi

Feb 23, 2016, 5:35:45 PM
to Caffe Users
As far as I know, a loss weight is associated with a layer, and it can be assigned to any layer (not only loss layers).
All layers have an implicit loss_weight of zero, so they don't contribute to the loss. But you can define a specific loss_weight for a layer, and it will affect all the layers below it during training.

And Nam Vo is correct in the sense that for the loss layer (the final layer), it doesn't matter whether you change the loss_weight or the learning rate. A loss layer has an implicit loss_weight of 1, and if you assign any other loss_weight to it, you are effectively changing the learning rate. But it is not the same for other layers, because changing the loss_weight of a middle layer has no effect on the learning of the layers above it.

Jan C Peters

Feb 24, 2016, 6:58:51 AM
to Caffe Users
@Amir Abdi

Yes, we know that. But additionally you could change the layer's individual learning rate multiplier, which is again somewhat similar to changing the loss_weight, but not quite: the computed loss has a global effect, whereas the layer's individual learning rate multiplier does not.

There is another difference (in concept): the learning rate will usually decrease following a specific scheme, whereas the loss_weights stay constant during training. I am sure you know that; I just wanted to point it out.

Jan

An Tran

May 11, 2016, 3:43:00 AM
to Caffe Users
Hi Jan and authors in this thread.
I found this thread very useful because I have the same question about loss weight. Furthermore, I have a related set of questions:
How is layer_loss computed, and why? Are there any theoretical guarantees for this procedure?
Why is the loss computed like this?

      // conceptually: layer_loss = top_data * top_diff
      const int count = top[top_id]->count();
      const Dtype* data = top[top_id]->cpu_data();
      const Dtype* loss_weights = top[top_id]->cpu_diff();
      loss += caffe_cpu_dot(count, data, loss_weights);
Thanks,
@An

Jan

May 11, 2016, 5:09:11 AM
to Caffe Users
Actually that is a very good question. I am not quite sure what that piece of code does exactly. I would have expected loss_weights to be something like this->loss(top_id), but that value doesn't seem to be really used anywhere... Maybe the core devs can shed some light on that? Evan?

Jan

An Tran

May 12, 2016, 1:40:58 AM
to Caffe Users
I think we need to figure it out by ourselves. The forum is very popular; Evan might not have time to explain it.
In the tutorial they mention some hints about the implementation, but it is rather confusing to me (section on loss weights: http://caffe.berkeleyvision.org/tutorial/loss.html).
Best regards,
@An

Evan Shelhamer

May 12, 2016, 2:09:31 AM
to An Tran, Caffe Users
The loss weights are kept in the top diff of layers that contribute to the loss: https://github.com/BVLC/caffe/blob/master/include/caffe/layer.hpp#L410-L428. This is just overloading the use of the top blob of loss layers.

I think we need to figure it out by ourselves. The forum is very popular; Evan might not have time to explain it.

Thanks for the DIY attitude, An! Now back to the NIPS deadline for me...

Evan Shelhamer


Jan

May 12, 2016, 6:32:40 AM
to Caffe Users, tran...@gmail.com
So... does that mean you can only use loss_weight with loss layers? I was under the impression it could be used with any kind of layer, but I have never tried it. And when I think about it, it does not really make much sense to use it with a non-loss layer. OK, then everything makes sense again :-).

Jan

P.S. I am all for the DIY attitude, but I'd rather ask somebody who knows the answer instead of spending hours trying to find something out. Usually you can find out things about Caffe quickly by looking at the proto and the code, but this one really confused me...

Evan Shelhamer

May 12, 2016, 7:35:55 PM
to Jan, Caffe Users, An Tran
So... does that mean you can only use the loss_weight with loss layers?

Every loss layer has a top whose data is the loss and whose diff is the loss weight. Loss layers are exactly those layers with loss weight > 0. By default all standard `*Loss` layers like `SoftmaxWithLoss`, `EuclideanLoss`, and so on have loss weight == 1. Other layers declared with a loss weight > 0 are interpreted the same way as standard loss layers, with top data == the loss and top diff == the loss weight.

For more detail see the PR that introduced this loss generalization: https://github.com/BVLC/caffe/pull/686.
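The accumulation therefore reduces to a dot product of top data and top diff. A standalone sketch of that contract (the function name is mine; it mirrors the caffe_cpu_dot call An quoted earlier):

```cpp
#include <cassert>
#include <cstddef>

// Sketch: for a (generalized) loss layer's top blob, data holds the loss
// value(s) and diff holds the loss weight(s); the layer's contribution to
// the total loss is their dot product.
double weighted_loss_contribution(const double* top_data,
                                  const double* top_diff, std::size_t count) {
  double loss = 0.0;
  for (std::size_t i = 0; i < count; ++i) {
    loss += top_data[i] * top_diff[i];
  }
  return loss;
}
```

For a singleton loss top with data 2.5 and diff (loss weight) 10, this yields a weighted loss of 25.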

Hope that helps,

Evan Shelhamer

Jan

May 13, 2016, 4:10:44 AM
to Caffe Users, jcpet...@gmail.com, tran...@gmail.com
Now I realize what really confused me about this: if you set the loss_weight of a regular hidden layer to 1, for instance (a layer with another layer underneath and another above), then the top diffs would be needed to store the gradients/deltas during the backward pass. So the loss_weight cannot simultaneously be stored there, right? I am not sure whether a setup like this even makes sense; I am just wondering whether it is "programmatically" possible. [The alternative would be that layers contributing to the loss always need to be "topmost" layers, i.e. their top blobs are not used as any other layer's bottoms. Which kind of makes sense, I think.]

Mhm, now that I have read through the PR I am confused again. The following section (taken from the PR) seems to be connected to my question, but I am not sure how to interpret it...

[...] The scale parameter is stored in the diff() of the top blob -- in the case of the loss layers that top blob is a singleton, so the loss layers had to be modified to multiply their gradients by a scale parameter specified by the singleton top blob diff, but all the other layers already knew how to backprop their diffs and could just be used as is. The only annoying thing was that to get top blobs to be both inputs to other layers and losses, I had to use split layers, as it's functionally the same thing as sending the output to two different layers [....]


It seems that the setup I described in the first paragraph can be done in Caffe, but it is not clear to me how both the loss_weight and the gradients can be stored in the top blobs.

Jan

P. S. by the way, thanks for your time Evan. I'd completely understand if you're too busy with NIPS prep to answer such questions ;-).

Evan Shelhamer

May 24, 2016, 2:42:10 PM
to An Tran, Caffe Users, Jan C Peters
eventually merge into Caffe master branch. From your provided link, I see it is merged into Caffe dev branch

It's already there. Everything that was in the dev branch is in the master branch. The dev branch was discontinued and everything was merged into master.

There was too much overhead in a dual-branch workflow, so now everything branches from and merges to master.
 

Evan Shelhamer

On Fri, May 13, 2016 at 1:36 AM, An Tran <tran...@gmail.com> wrote:
Hi,
Thanks to Evan for the link and explanations. We all appreciate the core devs who spend time developing the code and answering questions.

Hi Jan, what the paragraph means is that Caffe will internally split the top blob of a layer that wants to produce loss into two top blobs: one goes to compute the loss ("topmost") and one goes as output into the layer above. Only then does it make sense. I think this also causes confusion for users in the documentation (section on loss weights: http://caffe.berkeleyvision.org/tutorial/loss.html):

However, any layer can be used as a loss by adding a field loss_weight: <float> to a layer definition for each top blob produced by the layer.

Hi Evan, I would like to know whether PR 686 (https://github.com/BVLC/caffe/pull/686) was eventually merged into the Caffe master branch. From the link you provided, I see it was merged into the Caffe dev branch. I am keen to know why the core devs decided to follow the internal splitting design. Instead, we could let users define a loss layer on top of any intermediate layer that they want to produce loss. The latter approach is more transparent and safer for users.

Best regards,
@An