input-node name=input dim=23
component-node name=tdnn1.affine component=tdnn1.affine input=Append(Offset(input, -2), Offset(input, -1), input, Offset(input, 1), Offset(input, 2))
component-node name=tdnn1.relu component=tdnn1.relu input=tdnn1.affine
component-node name=tdnn1.batchnorm component=tdnn1.batchnorm input=tdnn1.relu
component-node name=tdnn2.affine component=tdnn2.affine input=Append(Offset(tdnn1.batchnorm, -2), tdnn1.batchnorm, Offset(tdnn1.batchnorm, 2))
component-node name=tdnn2.relu component=tdnn2.relu input=tdnn2.affine
component-node name=tdnn2.batchnorm component=tdnn2.batchnorm input=tdnn2.relu
component-node name=tdnn3.affine component=tdnn3.affine input=Append(Offset(tdnn2.batchnorm, -3), tdnn2.batchnorm, Offset(tdnn2.batchnorm, 3))
component-node name=tdnn3.relu component=tdnn3.relu input=tdnn3.affine
component-node name=tdnn3.batchnorm component=tdnn3.batchnorm input=tdnn3.relu
component-node name=stats-extraction-0-10000 component=stats-extraction-0-10000 input=tdnn3.batchnorm
component-node name=stats-pooling-0-10000 component=stats-pooling-0-10000 input=stats-extraction-0-10000
component-node name=tdnn4.affine component=tdnn4.affine input=Round(stats-pooling-0-10000, 1)
component-node name=tdnn4.relu component=tdnn4.relu input=tdnn4.affine
component-node name=tdnn4.batchnorm component=tdnn4.batchnorm input=tdnn4.relu
component-node name=output.affine component=output.affine input=tdnn4.batchnorm
component-node name=output.log-softmax component=output.log-softmax input=output.affine
output-node name=output input=output.log-softmax objective=linear
foo [ 4.334455 2.886454 -0.5860164 -0.5215476 -0.00700723 -0.4594012 -0.4913049 -0.5872197 1.665164 -0.3344883 ...]
foo [4.277137998923537, 2.8752495376927265, -0.5860262049083739, -0.5234862913436259, 0.005449407837680619, -0.4635228347653537, -0.4941208037070489, -0.5888299701448544, 1.692088950771982, -0.3419027789861246 ...]
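Putting the first ten dimensions of the two printed vectors (presumably the Kaldi output above and the PyTorch output below) side by side shows they agree only to about two decimal places, which is far larger than float32-vs-float64 rounding noise:

```python
import numpy as np

# First ten dimensions copied from the two "foo" vectors above.
kaldi = np.array([4.334455, 2.886454, -0.5860164, -0.5215476, -0.00700723,
                  -0.4594012, -0.4913049, -0.5872197, 1.665164, -0.3344883])
pytorch = np.array([4.277137998923537, 2.8752495376927265, -0.5860262049083739,
                    -0.5234862913436259, 0.005449407837680619, -0.4635228347653537,
                    -0.4941208037070489, -0.5888299701448544, 1.692088950771982,
                    -0.3419027789861246])

diff = np.abs(kaldi - pytorch)
print(diff.max())  # largest absolute difference among the shown dimensions, ~0.057
```

A discrepancy of this size points at a real modeling difference (e.g. edge handling or pooling), not numerical precision.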
To compare the frame-level outputs of PyTorch and the Kaldi x-vector model, I use nnet3-compute to decode one utterance, and use "nnet3-copy --nnet-config=extract.config exp/xvector_nnet_2c/final.raw - |" to modify final.raw so that the output is taken from the "tdnn3.batchnorm" layer. The input utterance has 326 frames in total, and at that layer the outputs of nnet3-compute and the PyTorch model are exactly the same.
Are there some other tricky operations that I have overlooked?
Are you sure that all frames of tdnn3.batchnorm are identical? It seems likely to me that there would be some differences in how the frame-level temporal context is handled, possibly causing differences in frames at the beginning or end of the recording.
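As a quick sanity check on that point, the offsets in the config above can be summed to get the total temporal context of the frame-level stack. The snippet below (offsets copied from the config) shows that each frame of tdnn3.batchnorm depends on input frames t-7 through t+7, so the first and last 7 frames involve edge handling that the two implementations may do differently:

```python
# Offset() values used at each frame-level layer, taken from the nnet3 config:
layer_offsets = [
    (-2, -1, 0, 1, 2),  # tdnn1: Append(Offset(input, -2) ... Offset(input, 2))
    (-2, 0, 2),         # tdnn2
    (-3, 0, 3),         # tdnn3
]

# Contexts of stacked layers add up.
left = sum(min(offs) for offs in layer_offsets)
right = sum(max(offs) for offs in layer_offsets)
print(left, right)  # -7 7
```

If the PyTorch model pads (or truncates) those 14 boundary frames differently from nnet3, the pooled statistics will differ even when all interior frames match.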
Also, it might be helpful to dump the output of nnet3-compute to disk, then compute the mean "manually," to see if it's the same as what you'd get from nnet3-xvector-compute.
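The "manual" check could look like the sketch below. It assumes the dumped tdnn3.batchnorm frames have been loaded as a (num_frames, feat_dim) NumPy array (the random array and the dimension 512 are stand-ins for the real dump), and that the stats-pooling layer appends the per-dimension mean and standard deviation, as in the standard Kaldi x-vector recipe:

```python
import numpy as np

# Stand-in for the real tdnn3.batchnorm dump from nnet3-compute:
# shape (num_frames, feat_dim); 326 frames as in the utterance above,
# 512 is an assumed feature dimension.
np.random.seed(0)
frames = np.random.randn(326, 512)

mean = frames.mean(axis=0)
# Stddev via E[x^2] - E[x]^2; clamp tiny negatives from rounding before sqrt.
var = np.maximum((frames ** 2).mean(axis=0) - mean ** 2, 0.0)
stddev = np.sqrt(var)

# The pooled vector that feeds tdnn4.affine in this architecture.
pooled = np.concatenate([mean, stddev])
print(pooled.shape)  # (1024,)
```

Comparing this vector against the corresponding layer output from nnet3-xvector-compute should localize whether the mismatch is in the pooling itself or upstream of it.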