Implementing Knowledge distillation


Suraj Srinivas

Aug 5, 2015, 8:54:39 AM
to Caffe Users
Hello all,

Has anyone here implemented knowledge distillation in Caffe? Specifically, has anyone tried doing something like the following:

layers {
  name: "loss"
  type: CROSS_ENTROPY_LOSS
  bottom: "fc8_small"
  bottom: "fc8_alexnet"
  top: "loss"
  loss_weight: 1
  include: { phase: TRAIN } 
}

I wanted to discuss the best approach to setting the hyperparameters for this.
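For concreteness, what I expect such a layer to compute is the cross-entropy between the softened teacher and student distributions. A rough NumPy sketch (the function and variable names are my own):

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    e = np.exp(z - np.max(z))
    return e / e.sum()

def soft_cross_entropy(student_logits, teacher_logits, T=1.0):
    # cross-entropy between the softened teacher distribution (the
    # target) and the softened student distribution
    p = softmax(np.asarray(teacher_logits) / T)
    q = softmax(np.asarray(student_logits) / T)
    return -np.sum(p * np.log(q))

# matching logits give the minimum (the entropy of the soft targets)
loss_same = soft_cross_entropy([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], T=2.0)
loss_diff = soft_cross_entropy([3.0, 2.0, 1.0], [1.0, 2.0, 3.0], T=2.0)
```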

Thanks

Nanne van Noord

Aug 5, 2015, 10:09:39 AM
to Caffe Users
I've been messing around with implementing this for a bit. You can find my branch with the code I'm using here: https://github.com/Nanne/caffe/tree/kdistil

I am using both a hard and a soft loss, so I've got one LMDB with images + labels and one LMDB (using convert_vectors) with the average logits of my ensemble. Then I have two loss layers in my prototxt like this:

layer {
  name: "hard_loss"
  type: "SoftmaxWithLoss"
  bottom: "distil_logits"
  bottom: "label"
  top: "hard_loss"
  include: { phase: TRAIN }
  loss_weight: 1
}
layer {
  name: "soft_loss"
  type: "TempSoftmaxCrossEntropyLoss"
  bottom: "distil_logits"
  bottom: "cumbersome_logits"
  top: "soft_loss"
  include: { phase: TRAIN }
  tempsoftmax_param {
    temperature: 6
  }
  # Loss_weight should be temp^2 according to Hinton et al.
  loss_weight: 36 
}

The TempSoftmaxCrossEntropyLoss layer is my elegantly named implementation of a layer that takes two logit vectors as input, zero-means them, divides by the temperature, and softmaxes them by passing them through an MVN and a Softmax layer. It calculates the gradient as described in Hinton et al.'s Distilling the Knowledge in a Neural Network paper.
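In plain NumPy, the forward and backward of that layer do roughly this (a sketch of what I described, not the actual Caffe code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def temp_softmax_xent(student_logits, teacher_logits, T):
    # zero-mean both logit vectors (the MVN step, mean only),
    # soften by the temperature, then take the cross-entropy
    z = student_logits - student_logits.mean()
    v = teacher_logits - teacher_logits.mean()
    q = softmax(z / T)            # student's softened distribution
    p = softmax(v / T)            # teacher's soft targets
    loss = -np.sum(p * np.log(q))
    grad = (q - p) / T            # eq. 4: dC/dz_i = (q_i - p_i)/T; since
                                  # q - p also shrinks roughly as 1/T, the
                                  # gradient scales as 1/T^2, hence the
                                  # temperature^2 loss_weight
    return loss, grad

z = np.array([1.0, 2.0, 3.0])
v = np.array([1.5, 1.5, 3.0])
loss, grad = temp_softmax_xent(z, v, T=6.0)
```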

I haven't added any tests for it yet though; I mostly just wrote the code and started training a model with it. So far the hard_loss seems to go down quite steadily, but the soft_loss seems to be lagging behind a bit. I am not sure whether this is normal behaviour, whether the soft loss is only going down as a result of the hard loss being optimised, or whether something is wrong with my implementation.

Not sure if this helps you in any way, but it might give you a starting point.  

Suraj Srinivas

Aug 5, 2015, 1:30:45 PM
to Caffe Users
Thanks for the reply Nanne! :)

I am facing the same kind of issue: the network appears to learn when both the hard loss and the soft loss are present. However, when I train with only the soft loss, things seem to go haywire.

I implemented it in the following way (I'm really lazy), so we can compare notes:
I wrote specifications for two networks in the same prototxt file. The image passes through AlexNet and through the small network at the same time, and I use the softened fc8 probabilities as 'labels' for the small network.
To apply the temperature, I simply used a Power layer to scale the fc8 pre-activations before the softmax. I also slightly modified the sigmoid cross-entropy loss layer to skip the sigmoid part.

The only thing I'm missing is the MVN part. Is it necessary? The paper doesn't seem to mention that.

Nanne van Noord

Aug 6, 2015, 3:10:50 AM
to Caffe Users
I implemented the cross-entropy gradient as described in equation 4 of the paper, which assumes that the logits have been zero-meaned; that's why I pass them through the MVN layer (with the flag set to not normalise the variance).
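Note that the zero-meaning doesn't change the probabilities themselves, since softmax is invariant to adding a constant to every logit; it only puts the logits in the form that eq. 4 assumes. A quick NumPy check:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

T = 6.0
logits = np.array([2.0, -1.0, 0.5])
q = softmax(logits / T)
q_zero_meaned = softmax((logits - logits.mean()) / T)
# subtracting the mean just shifts every logit by the same constant,
# which cancels out in the softmax, so the two results agree
```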

I figured it was probably best to implement it as a single layer so I could have more control over the backward step, since in your case each individual layer also modifies the gradient before it reaches your fc8. Could this be a/the problem in your case?

Siddharth Mohan

Nov 9, 2015, 1:36:24 AM
to Caffe Users
I found this thread very useful. @Nanne, I am going to try your implementation.
Regarding "So far the hard_loss seems to go down quite steadily, but the soft_loss seems to be lagging behind a bit": could it be that the relative weighting of hard_loss and soft_loss is not optimal? Hinton mentions in his paper that the best results were obtained with a considerably lower weight on the hard-loss objective, in addition to the temp^2 multiplier. So perhaps the loss_weight should be 0.1 on hard_loss and 0.9 * 36 on soft_loss (for a 10%/90% split).
Could this be the reason? Wondering if you have any further insights.
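To make the arithmetic concrete, the split I mean (in Python, illustrative numbers only):

```python
# a 10%/90% hard/soft split at T = 6, expressed as Caffe loss_weights
T = 6.0
hard_weight = 0.1            # lower weight on the true-label loss
soft_weight = 0.9 * T ** 2   # 0.9 * 36 = 32.4, temp^2 scaling included

def total_objective(hard_loss, soft_loss):
    # the weighted sum Caffe would optimise given these loss_weight values
    return hard_weight * hard_loss + soft_weight * soft_loss
```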

Siddharth Mohan

Nov 9, 2015, 5:42:52 PM
to Caffe Users
Nanne: one more question. How did you generate the soft targets at a high temperature for the cumbersome model? For example, can I just run GoogLeNet with some changes to the deploy.prototxt file to generate the softmax outputs at a high temperature?
I looked at your code changes and found the softmax cross-entropy loss implemented with a high temperature, but not a softmax with a high temperature. Just checking what the easiest way is here (or whether I should implement something myself).
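For what it's worth, softening offline is just a softmax over logits divided by T, so one option would be to extract the raw fc logits with pycaffe and soften them in NumPy (the blob name below is a guess about the model, not something from the repo):

```python
import numpy as np

def soften(logits, T):
    # softmax over logits / T; rows are examples, columns are classes
    z = np.asarray(logits, dtype=np.float64) / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# logits taken from the layer *before* the final softmax; with pycaffe
# something like net.forward(); net.blobs['fc8'].data
logits = np.array([[10.0, 5.0, 1.0]])
targets_T1 = soften(logits, T=1.0)   # near one-hot
targets_T6 = soften(logits, T=6.0)   # much flatter distribution
```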

劉博獻

Dec 11, 2015, 3:57:56 AM
to Caffe Users
Hello Nanne,
Recently I have also wanted to try implementing distillation, and I found this discussion and your repo.
The repo is nice and I tried it, but I ran into trouble reading the soft targets into the TempSoftmaxCrossEntropyLoss layer.
I already used your convert_vectors tool and followed the rules to convert all the soft targets into an LMDB, but how can I read these soft targets back in? Which data layer should I use? It seems I need to implement a data layer to read them?
Thanks in advance.

uttisht

Jun 6, 2016, 11:21:26 AM
to Caffe Users
Hi, has anyone implemented it? I would like to know the approach!