I've been messing around with implementing this for a bit. You can find my branch with the code I'm using here:
I am using both a hard and a soft loss, so I've got one LMDB with the images + labels and one LMDB (built with convert_vectors; a rough sketch of the idea follows the prototxt below) containing the averaged logits of my ensemble. Then I have two loss layers in my prototxt like this:
layer {
  name: "hard_loss"
  type: "SoftmaxWithLoss"
  bottom: "distil_logits"
  bottom: "label"
  top: "hard_loss"
  include: { phase: TRAIN }
  loss_weight: 1
}
layer {
  name: "soft_loss"
  type: "TempSoftmaxCrossEntropyLoss"
  bottom: "distil_logits"
  bottom: "cumbersome_logits"
  top: "soft_loss"
  include: { phase: TRAIN }
  tempsoftmax_param {
    temperature: 6
  }
  # loss_weight should be temp^2 according to Hinton et al.
  loss_weight: 36
}
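For what it's worth, the idea behind the soft-target LMDB is roughly the following. This is just a Python sketch of the approach, not the actual convert_vectors tool; write_logits_lmdb and the key format are made up for illustration:

import lmdb
import numpy as np
import caffe

def write_logits_lmdb(path, logits):
    # logits: (num_examples, num_classes) array of the ensemble's averaged logits.
    env = lmdb.open(path, map_size=logits.nbytes * 10)
    with env.begin(write=True) as txn:
        for i, vec in enumerate(logits):
            # Store each vector as a Cx1x1 float Datum; the keys have to
            # line up with the image/label LMDB so the batches stay in sync.
            datum = caffe.io.array_to_datum(
                vec.astype(np.float64).reshape(-1, 1, 1))
            txn.put('{:08d}'.format(i).encode('ascii'),
                    datum.SerializeToString())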
The TempSoftmaxCrossEntropyLoss layer is my elegantly named implementation of a layer that takes two logit vectors as input, zero-means them, divides them by the temperature, and softmaxes them by passing them through an MVN layer followed by a Softmax layer. It computes the gradient as described in Hinton et al.'s "Distilling the Knowledge in a Neural Network" paper.
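In case it's unclear, this is the math I intend the layer to compute. A quick numpy restatement (not the actual C++ code; temp_softmax and soft_loss_and_grad are just names for this sketch):

import numpy as np

def temp_softmax(z, T):
    z = z - z.mean(axis=1, keepdims=True)              # the MVN (zero-mean) step
    e = np.exp(z / T - (z / T).max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)            # softened softmax

def soft_loss_and_grad(student_logits, teacher_logits, T):
    q = temp_softmax(student_logits, T)   # student's softened probabilities
    p = temp_softmax(teacher_logits, T)   # teacher's soft targets
    n = student_logits.shape[0]
    loss = -np.sum(p * np.log(q)) / n     # cross-entropy, averaged over the batch
    # Hinton et al., eq. 2: dC/dz_i = (q_i - p_i) / T. Since (q - p) sums to
    # zero over the classes, the zero-mean step doesn't change this gradient.
    grad = (q - p) / (T * n)
    return loss, grad

The soft gradients scale as 1/T^2, which is why the paper says to multiply them by T^2; that's the loss_weight: 36 in the prototxt above.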
I haven't added any tests for it yet, though; I mostly just wrote the code and started training a model with it. So far the hard_loss goes down quite steadily, but the soft_loss lags behind. I'm not sure whether this is normal behaviour, whether the soft loss only goes down as a side effect of the hard loss being optimised, or whether something is wrong with my implementation.
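Until proper tests exist, a finite-difference check against the sketch above would at least tell whether the backward pass is wrong or the lagging soft loss is just how training goes (this reuses soft_loss_and_grad from the sketch, so it only checks my restatement of the math, not the C++ layer itself):

rng = np.random.RandomState(0)
s = rng.randn(4, 10)   # fake student logits: batch of 4, 10 classes
t = rng.randn(4, 10)   # fake teacher logits
T, eps = 6.0, 1e-5

loss, grad = soft_loss_and_grad(s, t, T)
numeric = np.zeros_like(s)
for idx in np.ndindex(*s.shape):
    d = np.zeros_like(s)
    d[idx] = eps
    numeric[idx] = (soft_loss_and_grad(s + d, t, T)[0]
                    - soft_loss_and_grad(s - d, t, T)[0]) / (2 * eps)
print(np.abs(numeric - grad).max())   # should be tiny, ~1e-9

The proper Caffe way would of course be a unit test against the C++ layer using the GradientChecker from test_gradient_check_util.hpp, like the other loss layers have.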
Not sure if this helps you in any way, but it might give you a starting point.