Gradient computation after L1 norm


Brij Mohan Lal Srivastava

Jan 28, 2021, 6:40:27 AM
to kaldi-developers
Dear Dan,

I have implemented a simple component (L1NormComponent) that divides the input to the layer by its L1 norm. I am unsure about my implementation of the Backprop function for this component.

I simply multiply the gradients by out_value and divide by in_value, which just scales the gradients, independently of the L1-norm division performed in the Propagate function.
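In standalone form (plain C++ with std::vector, not the actual Kaldi CuMatrix code), this scaling rule reduces to dividing the gradient by the L1 norm, since out_value / in_value = 1/S:

```cpp
// Sketch of the scaling rule under discussion. Since out_i = x_i / S with
// S = sum_k |x_k|, the ratio out_value / in_value is 1/S for every element,
// so the rule divides the whole gradient row by S.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  std::vector<double> in = {1.0, -2.0, 3.0};        // one row of input
  std::vector<double> out_deriv = {0.1, 0.2, 0.3};  // gradient from above

  double s = 0.0;                                   // S = L1 norm of the row
  for (double x : in) s += std::fabs(x);

  for (size_t i = 0; i < in.size(); ++i) {
    double out = in[i] / s;                         // Propagate: out_i = x_i / S
    // Backprop rule under discussion; produces NaN if in[i] == 0.
    double in_deriv = out_deriv[i] * out / in[i];
    std::printf("in_deriv[%zu] = %g  (out_deriv[i]/S = %g)\n",
                i, in_deriv, out_deriv[i] / s);
  }
  return 0;
}
```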

Another thought is to take the partial derivative of the operation directly. Let X be the input; then, element-wise:

\partial (X/|X|) = \partial \mathrm{sgn}(X), which is discontinuous at 0.

But perhaps the derivative of sign can be computed using the Heaviside operation available in Kaldi.
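For the scalar case, making that explicit (my working):

```latex
% For scalar x \ne 0 we have x/|x| = \mathrm{sgn}(x), so
\frac{d}{dx}\,\frac{x}{|x|} = \frac{d}{dx}\,\mathrm{sgn}(x) = 0 \quad (x \neq 0),
% undefined at x = 0; and \mathrm{sgn}(x) = 2H(x) - 1, with H the Heaviside step.
```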

Could you please advise on how to use this for the gradient computation?

Thanks,
Brij

Daniel Povey

Jan 28, 2021, 6:44:57 AM
to kaldi-developers
Looks plausible in general, but might generate NaNs for zero components of the input.
That is what NormalizeComponent already does, so I don't think it's very necessary.


Brij Mohan Lal Srivastava

Jan 28, 2021, 8:12:55 AM
to kaldi-developers
OK, so do you think simply scaling the gradient (out_deriv * out_value / in_value) via SetMatMatDivMat is sufficient to get the correct gradient for the weights before this layer?

I am confused because the L1 norm is a data-dependent operation. Looking at the element-wise operation for a row X, it is `f(X_i) = X_i / (|X_i| + C)`, where C is the sum of the absolute values of the rest of the elements in the row, and C can be treated as a constant when taking the partial derivative with respect to X_i.

So maybe we cannot think of this operation as simply scaling the input by a constant. Please let me know what you think.
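Writing the partials out explicitly (my notation, with S = sum_k |x_k| and f_i = x_i / S):

```latex
% Jacobian of f_i(x) = x_i / S, with S = \sum_k |x_k|, valid where x_j \ne 0:
\frac{\partial f_i}{\partial x_j}
  = \frac{\delta_{ij}}{S} - \frac{x_i\,\mathrm{sgn}(x_j)}{S^{2}}
  = \frac{1}{S}\left(\delta_{ij} - f_i\,\mathrm{sgn}(x_j)\right)
```

So the backprop would be in_deriv_j = (1/S) * (out_deriv_j - sgn(x_j) * sum_i out_deriv_i * f_i): the simple 1/S scaling plus a data-dependent correction term.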

Daniel Povey

Jan 28, 2021, 8:31:28 AM
to kaldi-developers
Oh yes, you're right, there is another term in the derivative. See what NormalizeComponent does.

Brij Mohan Lal Srivastava

Jan 29, 2021, 1:35:22 PM
to kaldi-developers
Thanks! I have manually computed the gradient, and it is simply: old_gradient * (1 - |out|).
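Expanding the partials above, the diagonal Jacobian entry comes out as (1 - |out_i|) / S, so up to the 1/S factor this matches that diagonal term; the full gradient also carries a cross term over the row. A standalone finite-difference check of the full expression (plain C++, my own helper names, not Kaldi code):

```cpp
// Finite-difference check of the L1-normalization backprop
// in_deriv_j = (1/S) * (out_deriv_j - sgn(x_j) * sum_i out_deriv_i * out_i).
#include <cmath>
#include <cstdio>
#include <vector>

// Forward pass: out_i = x_i / S, with S = sum_k |x_k|.
static std::vector<double> Propagate(const std::vector<double> &x) {
  double s = 0.0;
  for (double v : x) s += std::fabs(v);
  std::vector<double> out(x.size());
  for (size_t i = 0; i < x.size(); ++i) out[i] = x[i] / s;
  return out;
}

int main() {
  std::vector<double> x = {1.0, -2.0, 3.0};
  std::vector<double> d = {0.1, 0.2, 0.3};  // out_deriv
  std::vector<double> out = Propagate(x);

  double s = 0.0, dot = 0.0;
  for (double v : x) s += std::fabs(v);
  for (size_t i = 0; i < x.size(); ++i) dot += d[i] * out[i];

  const double eps = 1e-6;
  for (size_t j = 0; j < x.size(); ++j) {
    // Analytic gradient, including the data-dependent cross term.
    double sgn = (x[j] > 0) - (x[j] < 0);
    double analytic = (d[j] - sgn * dot) / s;
    // Numerical gradient of L = sum_i d_i * out_i w.r.t. x_j (central diff).
    std::vector<double> xp = x, xm = x;
    xp[j] += eps; xm[j] -= eps;
    std::vector<double> op = Propagate(xp), om = Propagate(xm);
    double lp = 0.0, lm = 0.0;
    for (size_t i = 0; i < x.size(); ++i) { lp += d[i] * op[i]; lm += d[i] * om[i]; }
    double numeric = (lp - lm) / (2 * eps);
    std::printf("j=%zu analytic=%.8f numeric=%.8f\n", j, analytic, numeric);
  }
  return 0;
}
```

On this toy row the analytic and numeric values agree, which is the usual sanity check before wiring the expression into a component's Backprop.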