How to (a) use the sum of two loss functions and (b) multiply the pre-softmax output by a constant


Bharat Bhusan Sau

Mar 10, 2016, 2:35:22 PM
to Caffe Users, Adepu Ravi Sankar

I am facing the following difficulties while implementing the Dark Knowledge paper in Caffe:

1. How do I divide/multiply the pre-softmax output (i.e. the output of the last hidden layer, before the softmax layer) by some value T > 1? (This is required to soften the softmax probabilities; see the formula after this list.)

2. How do I use the sum of two loss functions in Caffe as my new loss function? (New loss function = cross_entropy(hard_labels, predicted_labels) + lambda * euclidean_loss(soft_output, predicted_output))
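
(For reference, the softening in question 1 replaces the standard softmax with a temperature-scaled one, p_i = exp(z_i / T) / sum_j exp(z_j / T), where the z_i are the pre-softmax logits and T > 1 is the temperature.)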

Section 2.1 of this paper describes the Dark Knowledge method in a compact way.

Please comment if anybody knows any tricks to solve these issues.

Nam Vo

Mar 10, 2016, 8:18:25 PM
to Caffe Users, cs14res...@iith.ac.in
1) Write your own layer to do that, though I don't think that softening will affect the learning at all.
2) Just define two loss layers (with loss_weight set accordingly); Caffe will automatically sum them.
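
A minimal prototxt sketch of point 2, assuming hypothetical blob names ("score", "label", "soft_pred", "soft_target") that you would adapt to your own net:

# Hard-label loss: cross-entropy against the ground-truth labels.
layer {
  name: "hard_loss"
  type: "SoftmaxWithLoss"
  bottom: "score"        # student's pre-softmax output
  bottom: "label"        # ground-truth labels
  top: "hard_loss"
  loss_weight: 1.0
}
# Soft-target loss: Euclidean distance to the teacher's soft output.
layer {
  name: "soft_loss"
  type: "EuclideanLoss"
  bottom: "soft_pred"    # student's softened prediction
  bottom: "soft_target"  # teacher's soft output
  top: "soft_loss"
  loss_weight: 0.5       # plays the role of lambda
}

Caffe's total objective is the loss_weight-weighted sum over every layer that produces a loss, so this gives hard_loss + 0.5 * soft_loss.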

Bharat Bhusan Sau

Mar 11, 2016, 10:14:43 AM
to Caffe Users, cs14res...@iith.ac.in

Thanks Nam, that's a really nice answer.
I have now solved both issues 1 and 2. The only remaining question: can I tune the loss_weight automatically? (i.e. I want the network to learn the loss_weight automatically.)

Thanks again.

Bharat Bhusan Sau

Mar 11, 2016, 10:15:40 AM
to Caffe Users, cs14res...@iith.ac.in
By the way, I used an "Eltwise" layer with the "PROD" operation to solve the first issue.
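
Note: another way to scale a blob by a constant 1/T, without feeding in a second blob, is Caffe's built-in "Power" layer, which computes y = (shift + scale * x)^power. A minimal sketch, with a hypothetical bottom blob name:

layer {
  name: "soften"
  type: "Power"
  bottom: "score"      # pre-softmax output (hypothetical name)
  top: "soft_score"
  power_param {
    power: 1.0
    scale: 0.5         # 1/T, e.g. T = 2
    shift: 0.0
  }
}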

uttisht

Jun 6, 2016, 11:25:01 AM
to Caffe Users, cs14res...@iith.ac.in
Hi Bharat, have you implemented it? I would like to know the approach.

Bharat Bhusan Sau

Jun 6, 2016, 11:32:58 AM
to uttisht, Caffe Users, Adepu Ravi Sankar
Do you mean the Dark Knowledge method, or something else?


uttisht

Jun 6, 2016, 11:52:02 AM
to Caffe Users, rupes...@gmail.com, cs14res...@iith.ac.in
Hi Bharat, 

Thanks for your reply. Yes, Dark Knowledge. 

Bharat Bhusan Sau

Jun 6, 2016, 2:29:26 PM
to uttisht, Caffe Users, Adepu Ravi Sankar
I have attached the training prototxt for reference. I ran the tests on the MNIST dataset.

mnist_dark_knowledge_train.prototxt

uttisht

Jun 6, 2016, 3:25:00 PM
to Caffe Users, rupes...@gmail.com, cs14res...@iith.ac.in
Thank you. I will try this out.

uttisht

Jun 8, 2016, 12:17:35 PM
to Caffe Users, rupes...@gmail.com, cs14res...@iith.ac.in
Hi Bharat,

I have a doubt: do I need to train both models separately? I tried to train them together using a single training prototxt file, but it didn't work.

Bharat Bhusan Sau

Jun 8, 2016, 2:02:13 PM
to Caffe Users, rupes...@gmail.com, cs14res...@iith.ac.in
The prototxt file I provided contains both the teacher net and the student net, but the learning rate of every layer of the teacher net is zero, which means the teacher net is never updated during training. So what you have to do is:

1. First train a teacher net that achieves good accuracy.

2. Once the teacher net is trained, load the teacher model with '-weights' in the terminal command (e.g. caffe train -solver solver.prototxt -weights teacher.caffemodel). The learnt weights of the teacher are then shared into the new model that contains both teacher and student, because Caffe copies weights between two layers when their names are the same (see the sketch after this list). The output of the teacher's softmax layer is fed to the student model for matching, and you can always extract the student model from this hybrid model later.

3. To test the model, use only the student architecture in the test prototxt.
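
A minimal sketch of a frozen teacher layer (hypothetical layer name and size). With lr_mult: 0 the solver never updates it, and because its name matches the corresponding layer in the pretrained teacher, '-weights' copies the teacher's learnt weights into it:

layer {
  name: "teacher_ip1"                   # same name as in the teacher's prototxt
  type: "InnerProduct"
  bottom: "data"
  top: "teacher_ip1"
  param { lr_mult: 0 decay_mult: 0 }    # freeze weights
  param { lr_mult: 0 decay_mult: 0 }    # freeze bias
  inner_product_param { num_output: 500 }
}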

This method makes it easy to retrain the student model when the temperature parameter has to be changed many times. However, for more difficult datasets like CIFAR10, where the teacher model is cumbersome, training will take a lot of time, because the hybrid net runs the teacher's forward pass at every iteration. In that case you may want to modify the method I suggested. For that case I follow another way, which requires a little bit of coding but is faster than the previous method:

1. Train the teacher model.
2. Get the soft outputs of the training data from the teacher model.
3. Create an HDF5 dataset that contains the training images, ground-truth labels, and soft outputs.
4. Use this HDF5 dataset to train the student model (see the sketch below).

You should be very careful while creating the HDF5 dataset: all the required preprocessing (e.g. the same mean subtraction and scaling the nets expect) must be applied properly; only then can it be used for training.
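
A minimal sketch of step 4, assuming the HDF5 files store datasets named "data", "label" and "soft_label" (hypothetical names; the top names must match the dataset names in the file):

layer {
  name: "train_data"
  type: "HDF5Data"
  top: "data"          # preprocessed training images
  top: "label"         # ground-truth labels
  top: "soft_label"    # teacher's soft outputs
  hdf5_data_param {
    source: "train_h5_list.txt"   # text file listing the .h5 files
    batch_size: 64
  }
  include { phase: TRAIN }
}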

uttisht

Jun 9, 2016, 1:53:44 PM
to Caffe Users, rupes...@gmail.com, cs14res...@iith.ac.in
Thanks a lot, Bharat. Everything is clear now.

Bob Zigon

Jun 17, 2016, 12:02:13 AM
to Caffe Users, cs14res...@iith.ac.in
Nam Vo,
I would like to ask you a question about your response #2. I have a network that splits into 2 pieces, and each piece uses the softmax loss to produce 8 possible outputs. If I weight each loss layer with a coefficient of 0.5, are you saying that the outputs of the 2 softmax layers will be identical, because they will have been averaged internally and the results then made available to the 16 outputs?

sjala...@gmail.com

Jan 5, 2018, 12:12:25 AM
to Caffe Users
Hello Sau, have you solved the problem of automatically tuning the loss_weight?

On Saturday, March 12, 2016 at 2:14:43 AM UTC+11, Bharat Bhusan Sau wrote: