input-node name=input dim=23
component-node name=tdnn1.affine component=tdnn1.affine input=Append(Offset(input, -2), Offset(input, -1), input, Offset(input, 1), Offset(input, 2))
component-node name=tdnn1.relu component=tdnn1.relu input=tdnn1.affine
component-node name=tdnn1.batchnorm component=tdnn1.batchnorm input=tdnn1.relu
component-node name=tdnn2.affine component=tdnn2.affine input=Append(Offset(tdnn1.batchnorm, -2), tdnn1.batchnorm, Offset(tdnn1.batchnorm, 2))
component-node name=tdnn2.relu component=tdnn2.relu input=tdnn2.affine
component-node name=tdnn2.batchnorm component=tdnn2.batchnorm input=tdnn2.relu
component-node name=tdnn3.affine component=tdnn3.affine input=Append(Offset(tdnn2.batchnorm, -3), tdnn2.batchnorm, Offset(tdnn2.batchnorm, 3))
component-node name=tdnn3.relu component=tdnn3.relu input=tdnn3.affine
component-node name=tdnn3.batchnorm component=tdnn3.batchnorm input=tdnn3.relu
component-node name=stats-extraction-0-10000 component=stats-extraction-0-10000 input=tdnn3.batchnorm
component-node name=stats-pooling-0-10000 component=stats-pooling-0-10000 input=stats-extraction-0-10000
component-node name=tdnn4.affine component=tdnn4.affine input=Round(stats-pooling-0-10000, 1)
component-node name=tdnn4.relu component=tdnn4.relu input=tdnn4.affine
component-node name=tdnn4.batchnorm component=tdnn4.batchnorm input=tdnn4.relu
component-node name=output.affine component=output.affine input=tdnn4.batchnorm
component-node name=output.log-softmax component=output.log-softmax input=output.affine
output-node name=output input=output.log-softmax objective=linear
foo [ 4.334455 2.886454 -0.5860164 -0.5215476 -0.00700723 -0.4594012 -0.4913049 -0.5872197 1.665164 -0.3344883 ...]
foo [4.277137998923537, 2.8752495376927265, -0.5860262049083739, -0.5234862913436259, 0.005449407837680619, -0.4635228347653537, -0.4941208037070489, -0.5888299701448544, 1.692088950771982, -0.3419027789861246 ...]
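Putting the first ten dimensions of the two printed vectors (presumably the Kaldi output above and the PyTorch output below) side by side shows they agree only to about two decimal places, which is far larger than float32-vs-float64 rounding noise:

```python
import numpy as np

# First ten dimensions copied from the two "foo" vectors above.
kaldi = np.array([4.334455, 2.886454, -0.5860164, -0.5215476, -0.00700723,
                  -0.4594012, -0.4913049, -0.5872197, 1.665164, -0.3344883])
pytorch = np.array([4.277137998923537, 2.8752495376927265, -0.5860262049083739,
                    -0.5234862913436259, 0.005449407837680619, -0.4635228347653537,
                    -0.4941208037070489, -0.5888299701448544, 1.692088950771982,
                    -0.3419027789861246])

diff = np.abs(kaldi - pytorch)
print(diff.max())  # largest absolute difference among the shown dimensions, ~0.057
```

A discrepancy of this size points at a real modeling difference (e.g. edge handling or pooling), not numerical precision.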
To compare the frame-level outputs of PyTorch and the Kaldi x-vector model, I use nnet3-compute to decode one utterance, and use "nnet3-copy --nnet-config=extract.config exp/xvector_nnet_2c/final.raw - |" to modify final.raw so that the output is taken from the "tdnn3.batchnorm" layer. The input utterance has 326 frames in total, and at that layer the outputs of nnet3-compute and the PyTorch model are exactly the same.
Are there some other tricky operations that I have overlooked?
Are you sure that all frames of tdnn3.batchnorm are identical? It seems likely to me that there would be some differences in how the frame-level temporal context is handled, possibly causing differences in frames at the beginning or end of the recording.
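As a quick sanity check on that point, the offsets in the config above can be summed to get the total temporal context of the frame-level stack. The snippet below (offsets copied from the config) shows that each frame of tdnn3.batchnorm depends on input frames t-7 through t+7, so the first and last 7 frames involve edge handling that the two implementations may do differently:

```python
# Offset() values used at each frame-level layer, taken from the nnet3 config:
layer_offsets = [
    (-2, -1, 0, 1, 2),  # tdnn1: Append(Offset(input, -2) ... Offset(input, 2))
    (-2, 0, 2),         # tdnn2
    (-3, 0, 3),         # tdnn3
]

# Contexts of stacked layers add up.
left = sum(min(offs) for offs in layer_offsets)
right = sum(max(offs) for offs in layer_offsets)
print(left, right)  # -7 7
```

If the PyTorch model pads (or truncates) those 14 boundary frames differently from nnet3, the pooled statistics will differ even when all interior frames match.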
Also, it might be helpful to dump the output of nnet3-compute to disk, then compute the mean "manually," to see if it's the same as what you'd get from nnet3-xvector-compute.
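The "manual" check could look like the sketch below. It assumes the dumped tdnn3.batchnorm frames have been loaded as a (num_frames, feat_dim) NumPy array (the random array and the dimension 512 are stand-ins for the real dump), and that the stats-pooling layer appends the per-dimension mean and standard deviation, as in the standard Kaldi x-vector recipe:

```python
import numpy as np

# Stand-in for the real tdnn3.batchnorm dump from nnet3-compute:
# shape (num_frames, feat_dim); 326 frames as in the utterance above,
# 512 is an assumed feature dimension.
np.random.seed(0)
frames = np.random.randn(326, 512)

mean = frames.mean(axis=0)
# Stddev via E[x^2] - E[x]^2; clamp tiny negatives from rounding before sqrt.
var = np.maximum((frames ** 2).mean(axis=0) - mean ** 2, 0.0)
stddev = np.sqrt(var)

# The pooled vector that feeds tdnn4.affine in this architecture.
pooled = np.concatenate([mean, stddev])
print(pooled.shape)  # (1024,)
```

Comparing this vector against the corresponding layer output from nnet3-xvector-compute should localize whether the mismatch is in the pooling itself or upstream of it.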