In `SoftmaxWithLossLayer<Dtype>::Backward_cpu`, after bottom_diff is computed as `bottom_diff[i * dim + label_value * inner_num_ + j] -= 1;`, there is another operation that scales the gradient:
Dtype loss_weight = top[0]->cpu_diff()[0] / get_normalizer(normalization_, count);
caffe_scal(prob_.count(), loss_weight, bottom_diff);
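For context, this is roughly how I read the surrounding backward pass (a paraphrased sketch, not the exact source; the ignore_label handling is left out):

```cpp
// Paraphrased sketch of the surrounding backward pass (ignore_label handling omitted)
Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
const Dtype* prob_data = prob_.cpu_data();
caffe_copy(prob_.count(), prob_data, bottom_diff);   // start from the softmax output p
const Dtype* label = bottom[1]->cpu_data();
int dim = prob_.count() / outer_num_;
int count = 0;
for (int i = 0; i < outer_num_; ++i) {
  for (int j = 0; j < inner_num_; ++j) {
    const int label_value = static_cast<int>(label[i * inner_num_ + j]);
    bottom_diff[i * dim + label_value * inner_num_ + j] -= 1;  // p - 1 at the true class
    ++count;                                                   // samples contributing to the loss
  }
}
// the scaling I am asking about:
Dtype loss_weight = top[0]->cpu_diff()[0] / get_normalizer(normalization_, count);
caffe_scal(prob_.count(), loss_weight, bottom_diff);
```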
But as far as I know, according to the gradient equation of the softmax loss, bottom_diff should be calculated like this (from the UFLDL deep learning tutorial):
$$
\nabla_{\theta_j} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ x^{(i)} \left( 1\{ y^{(i)} = j \} - p(y^{(i)} = j \mid x^{(i)}; \theta) \right) \right]
$$
In the equation above, bottom_diff is only scaled by the count (the 1/m factor), without being multiplied by the loss diff `top[0]->cpu_diff()[0]`. So I am curious why Caffe implements it this way: is there some other consideration, or is my understanding of the theory wrong?
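To make what I expect concrete, here is a minimal standalone sketch (my own toy code with made-up numbers, not Caffe; it computes the gradient with respect to the softmax input, which by the same derivation is p minus the one-hot label, scaled only by 1/m):

```cpp
#include <cstdio>
#include <vector>

int main() {
  const int m = 2, num_classes = 3;                 // toy batch of 2 samples, 3 classes
  std::vector<std::vector<double>> prob = {         // softmax outputs p (made-up numbers)
      {0.7, 0.2, 0.1},
      {0.1, 0.3, 0.6}};
  std::vector<int> label = {0, 2};                  // ground-truth classes y

  std::vector<std::vector<double>> bottom_diff = prob;  // start from p
  for (int i = 0; i < m; ++i) {
    bottom_diff[i][label[i]] -= 1.0;                // p_j - 1{y = j}
    for (int j = 0; j < num_classes; ++j) {
      bottom_diff[i][j] /= m;                       // scale by 1/m only, no extra factor
    }
  }

  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < num_classes; ++j) {
      std::printf("%+.4f ", bottom_diff[i][j]);
    }
    std::printf("\n");
  }
  return 0;
}
```

With these numbers it prints `-0.1500 +0.1000 +0.0500` for the first sample, i.e. the count-only scaling, with no multiplication by `top[0]->cpu_diff()[0]`.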
Thanks, any reply would be greatly appreciated.