In the code of SoftmaxWithLossLayer<Dtype>::Backward_cpu, after computing bottom_diff as bottom_diff[i * dim + label_value * inner_num_ + j] -= 1;, there is another operation that scales the gradient:
Dtype loss_weight = top[0]->cpu_diff()[0] / get_normalizer(normalization_, count);
caffe_scal(prob_.count(), loss_weight, bottom_diff);
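To show the computation I mean, here is a minimal standalone sketch of those two steps (this is my own illustration, not Caffe's actual code; the sample count, class count, logits, labels, and the assumptions inner_num_ == 1 and normalization by batch size are all mine):

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Sketch of the backward pass described above:
// diff = softmax(logits); diff[label] -= 1; then scale by top_diff / N.
int main() {
  const int num = 2, classes = 3;            // N = 2 samples, 3 classes
  std::vector<double> logits = {1.0, 2.0, 0.5,
                                0.2, 0.1, 3.0};
  std::vector<int> labels = {1, 2};
  const double top_diff = 1.0;               // plays the role of top[0]->cpu_diff()[0]
  std::vector<double> diff(logits);

  for (int i = 0; i < num; ++i) {
    // Softmax over this sample's logits (max subtracted for stability).
    double max_v = diff[i * classes];
    for (int c = 1; c < classes; ++c)
      max_v = std::max(max_v, diff[i * classes + c]);
    double sum = 0.0;
    for (int c = 0; c < classes; ++c) {
      diff[i * classes + c] = std::exp(diff[i * classes + c] - max_v);
      sum += diff[i * classes + c];
    }
    for (int c = 0; c < classes; ++c) diff[i * classes + c] /= sum;
    // Subtract 1 at the ground-truth class: diff = p - 1{label}.
    diff[i * classes + labels[i]] -= 1.0;
  }

  // The scaling step in question: multiply by top diff / normalizer
  // (here I assume the normalizer is simply the batch size N).
  const double loss_weight = top_diff / num;
  for (double& d : diff) d *= loss_weight;

  for (int i = 0; i < num; ++i)
    for (int c = 0; c < classes; ++c)
      std::printf("diff[%d][%d] = % .4f\n", i, c, diff[i * classes + c]);
  return 0;
}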
But as far as I know, according to the equation for the softmax loss, its bottom_diff should be calculated like this (from the UFLDL deep learning tutorial), where $p_j$ is the softmax probability of class $j$, $y$ is the ground-truth label, and $N$ is the number of samples:

$$\frac{\partial L}{\partial z_j} = \frac{1}{N}\left(p_j - \mathbf{1}\{y = j\}\right)$$
In the equation above, bottom_diff is only scaled by the count $N$; there is no multiplication by the loss diff top[0]->cpu_diff()[0]. So I am curious why Caffe implements it this way. Are there other considerations, or is my understanding of the theory wrong?
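To make the discrepancy concrete, here is a side-by-side restatement in the same notation (my own summary, not taken from the tutorial or the Caffe source):

$$\text{tutorial:}\quad \frac{\partial L}{\partial z_j} = \frac{1}{N}\bigl(p_j - \mathbf{1}\{y = j\}\bigr), \qquad \text{Caffe:}\quad \texttt{bottom\_diff}_j = \delta \cdot \frac{1}{N}\bigl(p_j - \mathbf{1}\{y = j\}\bigr),$$

where $\delta$ denotes top[0]->cpu_diff()[0] and $N$ is get_normalizer(normalization_, count). The two expressions agree exactly when $\delta = 1$.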
Thanks; any reply would be greatly appreciated.