Hey,
I am trying to examine how my ASR model (based on the WSJ s5 nnet3 recipe) deals with silence segments.
So I created a silence test set containing "utterances" of pure silence (recordings with no speech in them).
I expected the model to output only SIL phones for those segments, i.e., an empty decoded transcript for each utterance.
But for each utterance the model outputs a single word, most of the time the word "EH" - so it does not treat pure silence as silence.
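To check this, I scan the decoded hypotheses with a small script along these lines (a rough sketch; the file path and the "text"-style format - utterance ID followed by the hypothesized words - are assumptions about my setup, not part of the recipe itself):

    # Count hypothesized words per utterance in a Kaldi-style "text"
    # hypothesis file: each line is "<utt-id> <word1> <word2> ...".
    # The path below is a placeholder for my decode directory.
    hyp_path = "exp/nnet3/tdnn/decode_silence_test/scoring/hyp.txt"

    with open(hyp_path) as f:
        for line in f:
            parts = line.split()
            utt_id, words = parts[0], parts[1:]
            # For a pure-silence utterance I would expect zero words here,
            # but I consistently see a single word such as "EH" instead.
            print(utt_id, len(words), words)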
Is it normal behavior for such an ASR model to output a single word for a whole segment of non-speech? Or does this indicate a problem with how my setup handles non-speech sections?
Thanks.
Bar