Speech activity detection and endpointing in real time ASR


Sray Chen

Aug 29, 2019, 4:09:00 PM
to kaldi-help
Hi all,

I'm currently using the Kaldi GStreamer server with VAD for a phone-conversation scenario. My understanding is that the VAD here just filters out silence and does not actually segment the audio. As an aside, in our system the audio is cut every 3 minutes (segmentation).

Recently I've been trying to apply endpointing or SAD. Can anyone clarify whether endpointing or SAD should help real-time ASR, and why? My understanding is that SAD can segment the audio and help decoding, since long segments degrade decoding performance, while endpointing follows the same idea but is less effective than SAD.

So far I have only tried endpointing. My experiments show that endpointing only sometimes helps compared with my original system, where the LM is trained with long text on one line (since we are effectively decoding long audio, about 3 minutes):
Results:
test_set1
WER 17.3% (endpointing + LM trained with long text in one line)
WER 18.22% (endpointing + LM trained with segmented text by steps/cleanup/segment_long_utterances_nnet3.sh)
WER 16.82% (LM trained with long text in one line, no endpointing)

test_set2, where endpointing helps:
WER 32.1% (endpointing + LM trained with long text in one line)
WER 32.48% (endpointing + LM trained with segmented text by steps/cleanup/segment_long_utterances_nnet3.sh)
WER 33.63% (LM trained with long text in one line, no endpointing)

Please correct me if anything is wrong. Thank you.
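[Editor's note: for readers unfamiliar with endpointing, the core idea can be sketched as a rule that closes an utterance once trailing non-speech exceeds a threshold. The sketch below is a simplified illustration with made-up thresholds, not Kaldi's actual endpointing (which applies several configurable rules over the decoder's best path):]

```python
# Simplified illustration of frame-based endpointing: end the utterance
# once `min_trailing_silence` consecutive low-energy frames are seen.
# Thresholds are illustrative; real systems use decoder state, not raw energy.

def detect_endpoint(frame_energies, energy_threshold=0.01,
                    min_trailing_silence=50):
    """Return the frame index at which to endpoint, or None."""
    silence_run = 0
    for i, energy in enumerate(frame_energies):
        if energy < energy_threshold:
            silence_run += 1
            if silence_run >= min_trailing_silence:
                return i  # a long enough pause: endpoint here
        else:
            silence_run = 0
    return None  # no endpoint found; keep decoding

# 30 speech frames followed by 60 low-energy frames:
frames = [0.5] * 30 + [0.001] * 60
print(detect_endpoint(frames))  # → 79 (50 silent frames after frame 29)
```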

Daniel Povey

Aug 29, 2019, 4:22:09 PM
to kaldi-help
In general people tend to find that doing speech detection prior to ASR can only hurt the WER performance, not help, in typical scenarios.  The normal motivation for doing it is to speed things up.


--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/6c4e92a3-e4e9-4566-af60-6830e73fa0c9%40googlegroups.com.

Sray Chen

Aug 29, 2019, 4:24:59 PM
to kaldi-help
Sorry, just a footnote: I think the main reason a good neural-net-based SAD model can help is that it segments better for feature extraction, which is better than rule-based endpointing. Please correct me if this is wrong.

Thanks.


On Thursday, August 29, 2019 at 3:09:00 PM UTC-5, Sray Chen wrote:

Daniel Povey

Aug 29, 2019, 6:07:34 PM
to kaldi-help
That would only affect the CMN (if used) and the ivector (if used).  It would depend on your feature type, and it may not even help in all situations even if you use those things.
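[Editor's note: to illustrate the point about CMN, cepstral mean normalization subtracts a mean computed over a segment, so where the audio is cut changes the statistics fed to the model. A minimal per-segment CMN sketch with arbitrary feature values:]

```python
import numpy as np

def apply_cmn(features):
    """Subtract the per-dimension mean over the segment (per-segment CMN)."""
    return features - features.mean(axis=0, keepdims=True)

# The same 10-frame, 2-dim "feature" matrix, segmented two ways:
feats = np.arange(20, dtype=float).reshape(10, 2)

whole = apply_cmn(feats)                    # one long segment
halves = np.vstack([apply_cmn(feats[:5]),   # split into two segments:
                    apply_cmn(feats[5:])])  # different means subtracted

# The normalized features differ, so segmentation changes the model's input.
print(np.allclose(whole, halves))  # → False
```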

Dan



Itai Peer

Sep 2, 2019, 3:03:13 AM
to kaldi-help
Hi Dan, a follow-up question please:

Does the language model in Kaldi model words at the beginning of a sentence differently from words in the middle and at the end of a sentence?

If there is any difference in the LM part, then segmenting the audio into different sentences/utterances should improve accuracy.

Thanks

On Friday, August 30, 2019 at 01:07:34 UTC+3, Dan Povey wrote:

Daniel Povey

Sep 3, 2019, 5:26:56 AM
to kaldi-help


Does the language model in Kaldi model words at the beginning of a sentence differently from words in the middle and at the end of a sentence?

Yes, it does, as all language models do, via the beginning-of-sentence and end-of-sentence markers <s> and </s>.
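[Editor's note: a toy illustration of what the sentence markers do. In an n-gram LM, words after <s> are scored with sentence-initial context, so P(word | <s>) generally differs from P(word | some_other_word). The corpus and counts below are invented:]

```python
from collections import Counter

# Toy corpus; <s> and </s> mark sentence boundaries as in standard n-gram LMs.
sentences = [["hello", "world"], ["hello", "there"], ["the", "world"]]

bigrams = Counter()
unigrams = Counter()
for sent in sentences:
    tokens = ["<s>"] + sent + ["</s>"]
    unigrams.update(tokens[:-1])                 # contexts
    bigrams.update(zip(tokens[:-1], tokens[1:])) # (context, word) pairs

def p(word, context):
    """Maximum-likelihood bigram probability (no smoothing)."""
    return bigrams[(context, word)] / unigrams[context]

# "hello" is likely sentence-initially, but never follows "the":
print(p("hello", "<s>"))   # → 2/3
print(p("hello", "the"))   # → 0.0
```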
 

If there is any difference in the LM part, then segmenting the audio into different sentences/utterances should improve accuracy.

In theory, maybe; but bear in mind that often the language modeling text will be segmented by sentence; and sentence boundaries and silences will not always coincide.

In practice I think you'd be more likely to see slightly better results from keeping utterances in one piece, as more relevant context would be available.

Dan

 

Itai Peer

Sep 4, 2019, 3:59:42 AM
to kaldi-help
Thank you for the answer, it actually helps a lot. I validated your answer in a short experiment (longer vs. shorter utterances) and saw no effect on accuracy.

On Tuesday, September 3, 2019 at 12:26:56 UTC+3, Dan Povey wrote:


Does the language model in Kaldi model words at the beginning of a sentence differently from words in the middle and at the end of a sentence?

Mikel Esparza

Sep 5, 2019, 3:45:33 AM
to kaldi...@googlegroups.com
Hi!

Regarding this: I've seen that Kaldi tends to work better with audio segments between 10 and 30 seconds long. If you have a long audio file and split it to get segments of that length, you will probably also cut some words in the middle. Wouldn't it be better to run VAD first and cut the audio at the silences it detects, to avoid cutting words?

Thanks in advance.
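[Editor's note: a sketch of what Mikel describes, i.e. cutting at detected silences rather than at fixed intervals so that cuts fall between words. Simple energy thresholding stands in for a real VAD here, and all parameters are illustrative:]

```python
def split_on_silence(frame_energies, energy_threshold=0.01,
                     min_silence_frames=30, max_segment_frames=3000):
    """Return (start, end) frame ranges cut at silences (illustrative VAD).

    Frames below `energy_threshold` count as silence; a run of at least
    `min_silence_frames` of them triggers a cut, and `max_segment_frames`
    (e.g. ~30 s at 100 frames/s) forces a cut in pause-free speech.
    """
    segments = []
    start = 0
    silence_run = 0
    for i, energy in enumerate(frame_energies):
        silence_run = silence_run + 1 if energy < energy_threshold else 0
        long_pause = silence_run >= min_silence_frames
        too_long = i - start + 1 >= max_segment_frames
        if long_pause or too_long:
            segments.append((start, i + 1))
            start = i + 1
            silence_run = 0
    if start < len(frame_energies):
        segments.append((start, len(frame_energies)))
    return segments

# 100 speech frames, a 40-frame pause, then 100 more speech frames:
energies = [0.5] * 100 + [0.001] * 40 + [0.5] * 100
print(split_on_silence(energies))  # → [(0, 130), (130, 240)]
```

In practice the minimum silence length and maximum segment length interact, which matches Jan's point below about extra tuning.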

Jan Trmal

Sep 9, 2019, 10:57:28 AM
to kaldi-help
Yes, it's definitely possible... It might need extra tuning of the minimal segment length and so on.
y.
