Phonetic decoding with large amounts of silence

Xavier Anguera

Nov 22, 2015, 8:12:16 PM
to kaldi-help
Hi,
I am trying to perform phonetic decoding using Kaldi and a GMM model, and I get very different results when my audio file starts with a long stretch (1 minute or so) of silence. The same file without the leading silence works fine.
My phonetic decoding pipeline is as follows:
steps/decode_nolats.sh --> lattice-align-phones --> ali-to-phones
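For reference, the last step concretely looks like this in my setup (the paths are placeholders, so treat it as a sketch):

    # write per-frame phone labels as a CTM, with start times and durations
    ali-to-phones --ctm-output exp/mono/final.mdl \
      "ark:gunzip -c exp/mono/decode_test/ali.1.gz|" decode_test.phones.ctm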

Thanks a lot,

X. Anguera

Daniel Povey

Nov 22, 2015, 8:18:19 PM
to kaldi-help
Possibly this is the effect of cepstral mean normalization.
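With per-utterance normalization the mean is estimated over every frame of the file, so a minute of leading silence pulls the mean toward the silence spectrum and shifts all of the normalized features. Roughly, the standard per-utterance version is (placeholder paths):

    # one mean (and variance) estimate per utterance...
    compute-cmvn-stats scp:data/test/feats.scp ark:cmvn.ark
    # ...subtracted from every frame of that utterance
    apply-cmvn ark:cmvn.ark scp:data/test/feats.scp ark:feats_norm.ark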
David Snyder is going to commit an example script soon showing speech-silence-music segmentation, which might be useful here.
It could also be the effect of roundoff in the decoder, simply from the length of the file.  Recompiling with -DKALDI_DOUBLEPRECISION=1 in kaldi.mk would show if that is the case.  This is unlikely though.
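(Concretely: in src/, flip the flag in kaldi.mk and rebuild, something like

    # assumes the flag is set directly in kaldi.mk, as in the standard setup
    sed -i 's/KALDI_DOUBLEPRECISION=0/KALDI_DOUBLEPRECISION=1/' kaldi.mk
    make clean && make -j 8

then rerun the decode and compare.)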
Most of the effect of cepstral mean normalization will disappear, though, if you use adaptation in your decoding pipeline.


Dan


Xavier Anguera

Nov 22, 2015, 8:31:09 PM
to kaldi...@googlegroups.com
Right, I hadn't thought of that. Indeed, it must be the CMS.
Would you say that I could skip CMS altogether if I use adaptation?
Also, would another option be to compute the CMS stats over just the speech frames?
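(I am thinking of something along the lines of the energy-based VAD tools from the ivector recipes, i.e. accumulate the stats only on frames marked as speech. Just a sketch, with placeholder paths:

    # mark speech frames with a simple energy-based VAD
    compute-vad scp:data/test/feats.scp ark:vad.ark
    # keep only the voiced frames and accumulate CMVN stats over them
    select-voiced-frames scp:data/test/feats.scp ark:vad.ark ark:- | \
      compute-cmvn-stats ark:- ark:cmvn_speech.ark

and then the stats would be applied to all frames with apply-cmvn.)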

Thanks!

X. Anguera

Daniel Povey

Nov 22, 2015, 8:33:19 PM
to kaldi-help
Generally speaking it's best to do both CMS and adaptation.
Computing CMS stats over just the speech frames means you first need to work out which frames are speech, which implies speech/silence separation; we don't yet have example scripts for that. The best approach is generally to decode in two passes, and rely on the adaptation to correct for deficiencies in the CMS.
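For a SAT-trained GMM system the two-pass recipe is steps/decode_fmllr.sh, which does a first pass with the speaker-independent alignment model, estimates fMLLR transforms, and decodes again on the adapted features. Roughly (with exp/tri3 standing in for your SAT system):

    steps/decode_fmllr.sh --nj 4 exp/tri3/graph data/test exp/tri3/decode_test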
Dan
