Separate Affine Layer for Chain Training Xent Regularization


al...@i2x.ai
May 15, 2018, 12:27:52 PM
to kaldi-help

From the configs for the tdnn model in the swbd recipe it looks like there are separate affine layers for each head: one for the nnet outputs that are fed to `ComputeChainObjfAndDeriv` and another for the xent outputs.  As I understand it, the xent regularization is performed by driving a scaled xent gradient (computed as part of the chain loss op) back through the network.  This makes sense to me, but I don't understand the advantage of having separate affine layers for each head.  Wouldn't it make more sense to share the parameters between the two heads?  Or is that what it is actually doing underneath?
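
To make sure we are talking about the same structure, here is a rough numpy sketch of how I picture the two heads (the names and dimensions are mine, not taken from the recipe):

import numpy as np

rng = np.random.default_rng(0)
num_frames, hidden_dim, num_pdfs = 8, 625, 6000   # made-up sizes

stem_out = rng.standard_normal((num_frames, hidden_dim))   # output of the shared stem

# Two separate, independently initialized affine layers, one per head.
W_chain = 0.01 * rng.standard_normal((num_pdfs, hidden_dim))
W_xent  = 0.01 * rng.standard_normal((num_pdfs, hidden_dim))

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

chain_out = stem_out @ W_chain.T               # fed to `ComputeChainObjfAndDeriv`
xent_out  = log_softmax(stem_out @ W_xent.T)   # normalized log-probs for the xent output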

Daniel Povey
May 15, 2018, 12:48:24 PM
to kaldi-help
It's not sharing the parameters between the regular output and the xent output. Empirically, having separate outputs is better, possibly because the regular output does not want values that are close to what the xent output needs; to avoid damaging it with a shared layer you'd have to set the xent regularization scale so low as to be useless.

al...@i2x.ai
May 16, 2018, 5:46:58 AM
to kaldi-help
Are the affine layers for the two heads initialized with the same parameters?  Otherwise the xent gradients just seem mismatched with the final affine layer in the xent head, i.e., the gradients depend only on the outputs of the chain loss head, which could be produced by a completely different affine transform than the one in the xent head.

Daniel Povey
May 16, 2018, 1:18:52 PM
to kaldi-help
They are not initialized with the same parameters.
The posterior distribution over states is determined by the numerator lattice or supervision FST (which reflects the transcript) and by the chain output (which affects the alignment). But the gradient depends more strongly on the xent output, because that determines the predicted posterior: if it matches the distribution from the soft alignment of the transcript given the chain output, there is no gradient.

The chain output won't affect that distribution super strongly because it's highly constrained by the transcript.


Dan

Ilya Edrenkin
May 17, 2018, 4:11:17 PM
to kaldi-help
Hi Dan, thanks for the answer!

Frankly, I still fail to understand what is going on with the xent regularization. Apparently the function that computes the chain loss and both the chain and xent derivatives has only one input, nnet_output, which I assume to be the output of the chain head of the network: https://github.com/kaldi-asr/kaldi/blob/f8b678a61e932f4858115dbe2d11caed48a7dbac/src/chain/chain-training.h#L118

However, if xent regularization is used, the xent gradients are pushed through another head of the network, if I understand correctly: https://github.com/kaldi-asr/kaldi/blob/f8b678a61e932f4858115dbe2d11caed48a7dbac/src/nnet3/nnet-chain-training.cc#L343

I don't see why this can work. Given that the supervision head is different from the xent head, I don't understand how we can compute the gradient of the xent loss with respect to the supervision output. Probably my reasoning goes wrong at some earlier point?

Best,
Ilya

Daniel Povey
May 17, 2018, 4:14:11 PM
to kaldi-help
Oh-- the LogSoftmax is part of the network. The network outputs
normalized log-probs, and the derivatives w.r.t. those do not depend
on the log-probs themselves, only on the supervision.
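
Concretely, writing p_j for the soft-alignment posterior of pdf j on a given frame and \ell_j for the network's normalized log-prob output for that pdf, the per-frame xent objective and its derivative are

    F = \sum_j p_j \ell_j, \qquad \frac{\partial F}{\partial \ell_j} = p_j,

so the derivative involves only the supervision posteriors, not the log-probs themselves.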

Dan

Ilya Edrenkin
May 17, 2018, 4:39:40 PM
to kaldi-help
Thanks! Yes, I see that the LogSoftmax is part of the network, namely of the xent head. However, the xent head seems to be unused in computing the xent_deriv matrix.

The xent output seems to be fetched at this line: https://github.com/kaldi-asr/kaldi/blob/f8b678a61e932f4858115dbe2d11caed48a7dbac/src/nnet3/nnet-chain-training.cc#L316 . As it doesn't seem to influence xent_deriv, it looks like the xent head gets updated with a gradient that might not be correct.

Consider an extreme case where the computation performed by the xent head is the negation of the computation performed by the supervision head. Then, if we estimate gradients for the xent head using the supervision (chain) output, we are actually increasing the loss.
Realistically, in most cases I would expect the xent and supervision heads to be uncorrelated, turning this update into random gradient noise.

I believe I am missing something trivial early on that would explain the rationale?

Daniel Povey
May 17, 2018, 4:44:20 PM
to kaldi-help
The gradients that get propagated back through the LogSoftmax will depend strongly on the xent output values, but it happens that the gradients at the output of the LogSoftmax (w.r.t. the normalized log-probs) are invariant to them. That's just the way the cross-entropy objective works.
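
A quick numpy check of both statements (toy numbers, nothing Kaldi-specific): the derivative at the LogSoftmax output is just the posterior vector, whatever the xent values are, while backprop through the LogSoftmax turns it into posteriors minus the xent softmax, which does depend on them (and vanishes when they match):

import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

rng = np.random.default_rng(0)
p = rng.random(5); p /= p.sum()          # soft-alignment posteriors (the supervision)
z = rng.standard_normal(5)               # xent head's pre-LogSoftmax activations
q = np.exp(log_softmax(z))               # xent head's predicted posteriors

# Objective F = sum_j p_j * log_softmax(z)_j.
grad_at_logsoftmax_output = p                     # invariant to z
grad_at_logsoftmax_input = p - q * p.sum()        # = p - q, depends on z

# Finite-difference check that p - q really is dF/dz.
def F(z):
    return float(p @ log_softmax(z))
eps = 1e-6
fd = np.array([(F(z + eps * np.eye(5)[i]) - F(z - eps * np.eye(5)[i])) / (2 * eps)
               for i in range(5)])
print(np.allclose(fd, grad_at_logsoftmax_input, atol=1e-5))   # True; and p - q == 0 when q matches p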

Ilya Edrenkin
May 17, 2018, 5:18:57 PM
to kaldi-help
Thanks, I understand what the derivative of cross-entropy is; however, it doesn't really address my concern.

My current understanding is that in the swbd nnet3-chain config we have a network with a common stem, which splits into two branches after a certain layer close to the top. Both branches are parameterized and don't share parameters. One of them ends with the "supervision" or "chain" output (linear), the other ends with the xent output (the last transformation on the xent branch is a LogSoftmax).

When we call ComputeChainObjfAndDeriv we pass the output of the first head as an argument, the nnet_output matrix. We get back two matrices of derivatives, nnet_output_deriv and xent_deriv. Later, xent_deriv is not modified (except for scaling by weights).

xent_output is used to compute the xent_objf, but doesn't affect xent_deriv: https://github.com/kaldi-asr/kaldi/blob/f8b678a61e932f4858115dbe2d11caed48a7dbac/src/nnet3/nnet-chain-training.cc#L320
I assume that at this point xent_deriv is not really a derivative but a soft alignment (per-frame posteriors with respect to the output of the supervision head?). The comment also seems to say these are posteriors, not gradients.

It looks plausible that backprop through the LogSoftmax handles these posteriors correctly as its input. But I still find it strange that we update the xent head with a gradient whose computation didn't use the xent head's parameters (e.g. the affine transforms in it) in the forward pass.

Could you please correct my understanding?

Thanks for your patience!
Ilya

Ilya Edrenkin
May 17, 2018, 6:06:33 PM
to kaldi-help
OK, it seems I've grasped why this is correct. If xent_deriv is indeed not a gradient but just a soft alignment (in effect, targets for the cross-entropy objective), passing it back through the xent head should work. What actually happens here is that the xent head adapts to the output of the chain head (constrained by the transcript).
It was just the naming xent_deriv that confused me, but now I see why the procedure makes sense.
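
To spell out my understanding, here is a rough numpy sketch of one training step (the names and sizes are mine, and the posteriors that ComputeChainObjfAndDeriv would return as xent_deriv are simply made up so the sketch runs):

import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
num_frames, hidden_dim, num_pdfs = 4, 16, 10
stem_out = rng.standard_normal((num_frames, hidden_dim))       # shared stem output
W_chain = 0.1 * rng.standard_normal((num_pdfs, hidden_dim))    # chain head's affine layer
W_xent  = 0.1 * rng.standard_normal((num_pdfs, hidden_dim))    # xent head's affine layer

chain_out = stem_out @ W_chain.T                # goes into the chain objective
xent_out  = log_softmax(stem_out @ W_xent.T)    # normalized log-probs

# Stand-in for the xent_deriv that ComputeChainObjfAndDeriv returns: per-frame
# posteriors (the soft alignment implied by chain_out and the transcript).
xent_deriv = np.exp(log_softmax(chain_out))     # rows sum to 1; made up here

# Treat those posteriors as cross-entropy targets for the xent head: the gradient
# w.r.t. xent_out is the (scaled) posteriors; backprop through the LogSoftmax gives
# the gradient at the xent affine layer, and only the xent head's parameters get it.
xent_regularize = 0.1
g = xent_regularize * xent_deriv
grad_pre_softmax = g - np.exp(xent_out) * g.sum(axis=-1, keepdims=True)   # proportional to posteriors - xent softmax
grad_W_xent = grad_pre_softmax.T @ stem_out

# So the xent head is trained to match the chain head's soft alignment, while the
# chain head is updated from nnet_output_deriv; both contributions add up in the stem.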
Thanks a lot Dan!

Daniel Povey
May 17, 2018, 6:14:09 PM
to kaldi-help
The derivatives of the objective function w.r.t. the log outputs of the xent branch *are* the soft posteriors. That's how the cross-entropy objective works. Write down the equations and you'll see it.

Ilya Edrenkin
May 22, 2018, 10:50:02 AM
to kaldi-help
Makes perfect sense, thanks Dan!

(The source of my confusion was that for some reason I erroneously thought xent_deriv contained log-posteriors, not properly normalized probabilities.)