backpropagation LRN layer

Vincent

Nov 19, 2014, 6:04:35 AM
to caffe...@googlegroups.com
Can anyone explain to me how the gradients are calculated in the LRN layer (LRNComputeDiff in lrn_layer.cu)?
I understand the forward pass, x / [1 + (α/n)·∑_i x_i²]^β, where i runs over the channel neighborhood, but I can't figure out how the back-propagation is computed.
Thank you
Vincent

Andrei Pokrovsky

Apr 9, 2015, 4:27:41 PM
to caffe...@googlegroups.com
I've annotated the Caffe LRNComputeDiff code; hopefully this will help you. I think it also sheds some light on why the LRN function is designed the way it is: the derivative reuses a portion of the forward computation, which speeds up the backward pass.
HTH.
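
Here is a quick sketch of the derivation behind it, writing n for size, a_i / b_i for the bottom / top values at channel i, and s_i for the "scale" value s_i = k + (alpha/n) * sum of a_j^2 over the neighborhood N(i):

    % forward: b_i = a_i * s_i^{-beta}
    \frac{\partial b_i}{\partial a_c} = s_i^{-\beta}\,[i = c] \;-\; \frac{2\alpha\beta}{n}\, a_i\, a_c\, s_i^{-\beta-1}
    \quad\Longrightarrow\quad
    \frac{\partial L}{\partial a_c} = \frac{\partial L}{\partial b_c}\, s_c^{-\beta}
        \;-\; \frac{2\alpha\beta}{n}\, a_c \sum_{i:\, c \in N(i)} \frac{\partial L}{\partial b_i}\, \frac{b_i}{s_i}

The first term is the top_diff * pow(scale, negative_beta) line below, the constant 2*alpha*beta/n is cache_ratio, and the running sum over the window is accum_ratio. Rewriting a_i * s_i^(-beta-1) as b_i / s_i is what lets the backward pass reuse top_data and scale from the forward pass.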

    Dtype accum_ratio = 0;
    // forward pass must be performed first as a prerequisite for this function to work correctly
    // scale should be filled with k+alpha/size * sum( a[i]^2 ) from forward pass
    // accumulate values
    while (head < post_pad) {
      // top_data is b[n_wh] = a[n_wh]*(k+alpha*sum(a[n_wh]^2))^-beta, so dividing it by
      // scale gives a[n_wh]*(k+alpha*sum(a[n_wh]^2))^(-beta-1); in other words:
      // accum_ratio += top_diff * a[n_wh]*(k+alpha*sum(a[n_wh]^2))^(-beta-1)
      accum_ratio += top_diff[head * step] * top_data[head * step] / scale[head * step];
      ++head;
    }
    // until we reach size, nothing needs to be subtracted
    while (head < size) {
      // continue adding top_diff * a[n_wh]*(k+alpha*sum(a[n_wh]^2))^(-beta-1)
      accum_ratio += top_diff[head * step] * top_data[head * step] / scale[head * step];
      bottom_diff[(head - post_pad) * step] =
          // top_diff  *  ( k+alpha*sum(a[n_wh]^2) )^-beta
          top_diff[(head - post_pad) * step] * pow(scale[(head - post_pad) * step], negative_beta)
          // cache_ratio = Dtype(2. * alpha_ * beta_ / size_)
          // accum_ratio = sum( top_diff[n_wh] * a[n_wh]*(k+alpha*sum(a[n_wh]^2))^(-beta-1) )
          // a[n_wh] is bottom_data
          // so the subtracted term is 2*alpha*beta/size * a[n_wh] * sum( top_diff * a[n_wh]*(k+alpha*sum(a^2))^(-beta-1) ),
          // which matches the derivative d/da[i] of the LRN outputs whose neighborhood contains a[i]
          - cache_ratio * bottom_data[(head - post_pad) * step] * accum_ratio;
      // putting it all together, the full expression is:
      // top_diff * (k+alpha*sum(a[n_wh]^2))^-beta
      //   - 2*alpha*beta/size * a[n_wh] * sum( top_diff[n_wh] * a[n_wh]*(k+alpha*sum(a[n_wh]^2))^(-beta-1) )

      // from differentiating the forward formula, for an output channel c:
      // dLrn[c]/da[c]     = (alpha*sumsq(a)+k)^-beta - 2*alpha*beta*a[c]^2*(alpha*sumsq(a)+k)^(-beta-1)
      // dLrn[c]/da[other] = -2*alpha*beta*a[c]*a[other]*(alpha*sumsq(a)+k)^(-beta-1)
      // so the gradient is correctly accumulated over all output channels whose neighborhood contains the current input channel
      ++head;
    }
    // (the remainder of the kernel, where the oldest window element is also subtracted from accum_ratio, is omitted here)
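
For completeness, here is a small standalone CPU sketch (my own plain C++, not Caffe code; the function names lrn_scale / lrn_forward / lrn_backward are just for illustration) that applies the same formula at a single spatial position and checks it against a central-difference gradient of sum(top_diff * b). For simplicity it uses a symmetric window of size channels (odd size), which is what Caffe's pre_pad / post_pad amount to for the usual odd window sizes:

// Across-channel LRN forward/backward at one spatial position, with a
// finite-difference gradient check. Not the Caffe kernel; names are made up.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// scale[c] = k + (alpha / size) * sum of a[j]^2 over the window around c
std::vector<double> lrn_scale(const std::vector<double>& a, int size,
                              double alpha, double k) {
  const int C = (int)a.size(), half = size / 2;
  std::vector<double> s(C, k);
  for (int c = 0; c < C; ++c)
    for (int j = std::max(0, c - half); j <= std::min(C - 1, c + half); ++j)
      s[c] += (alpha / size) * a[j] * a[j];
  return s;
}

// forward: b[c] = a[c] * scale[c]^-beta
std::vector<double> lrn_forward(const std::vector<double>& a,
                                const std::vector<double>& s, double beta) {
  std::vector<double> b(a.size());
  for (size_t c = 0; c < a.size(); ++c) b[c] = a[c] * std::pow(s[c], -beta);
  return b;
}

// backward: bottom_diff[c] = top_diff[c]*scale[c]^-beta
//   - (2*alpha*beta/size) * a[c] * sum_{i: c in window(i)} top_diff[i]*b[i]/scale[i]
std::vector<double> lrn_backward(const std::vector<double>& a,
                                 const std::vector<double>& b,
                                 const std::vector<double>& s,
                                 const std::vector<double>& top_diff,
                                 int size, double alpha, double beta) {
  const int C = (int)a.size(), half = size / 2;
  std::vector<double> bottom_diff(C);
  for (int c = 0; c < C; ++c) {
    double accum = 0;  // plays the role of accum_ratio in the kernel
    for (int i = std::max(0, c - half); i <= std::min(C - 1, c + half); ++i)
      accum += top_diff[i] * b[i] / s[i];
    bottom_diff[c] = top_diff[c] * std::pow(s[c], -beta)
                   - (2.0 * alpha * beta / size) * a[c] * accum;
  }
  return bottom_diff;
}

int main() {
  const int size = 5;
  const double alpha = 0.5, beta = 0.75, k = 1.0, eps = 1e-6;
  std::vector<double> a        = {0.3, -1.2, 0.8, 2.0, -0.5, 1.1, 0.05, -0.7};
  std::vector<double> top_diff = {1.0, -0.4, 0.2, 0.9, -1.5, 0.3, 0.6, -0.8};

  std::vector<double> s = lrn_scale(a, size, alpha, k);
  std::vector<double> b = lrn_forward(a, s, beta);
  std::vector<double> analytic = lrn_backward(a, b, s, top_diff, size, alpha, beta);

  // numerical check of d( sum_i top_diff[i]*b[i] ) / d a[c]
  for (int c = 0; c < (int)a.size(); ++c) {
    double plus = 0, minus = 0;
    for (int sign = -1; sign <= 1; sign += 2) {
      std::vector<double> ap = a;
      ap[c] += sign * eps;
      std::vector<double> sp = lrn_scale(ap, size, alpha, k);
      std::vector<double> bp = lrn_forward(ap, sp, beta);
      double dot = 0;
      for (size_t i = 0; i < bp.size(); ++i) dot += top_diff[i] * bp[i];
      (sign > 0 ? plus : minus) = dot;
    }
    double numeric = (plus - minus) / (2 * eps);
    std::printf("c=%d  analytic=%+.6f  numeric=%+.6f\n", c, analytic[c], numeric);
  }
  return 0;
}

The analytic and numeric columns it prints should agree closely; accum corresponds to accum_ratio and 2*alpha*beta/size to cache_ratio in the kernel above.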