Hi all,
I'm using kaldi gstreamer with VAD for a phone-conversation scenario right now. In my understanding, the VAD here just filters out silence and does not actually segment the audio. In our system, by the way, the audio is also cut every 3 minutes (segmentation).
Recently, I've been trying to apply endpointing or SAD. Can anyone clarify whether endpointing or SAD should help real-time ASR, and why? My understanding is that SAD can achieve segmentation and thereby help decoding, since long segments degrade decoding performance, while endpointing follows the same idea but is less effective than SAD.
I have only tried endpointing so far. My experiments show that endpointing helps only sometimes when compared to my original system, where the LM is trained with long text on one line (since we are effectively decoding long audio, ~3 minutes):
Results:
test_set1:
WER 17.3% (endpointing + LM trained with long text in one line)
WER 18.22% (endpointing + LM trained with segmented text by steps/cleanup/segment_long_utterances_nnet3.sh)
WER 16.82% (LM trained with long text in one line, no endpointing)
test_set2, where endpointing helps:
WER 32.1% (endpointing + LM trained with long text in one line)
WER 32.48% (endpointing + LM trained with segmented text by steps/cleanup/segment_long_utterances_nnet3.sh)
WER 33.63% (LM trained with long text in one line, no endpointing)
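For context, this is roughly how I understand endpointing is enabled in Kaldi's online2 decoders, via the OnlineEndpointConfig options. The values below are illustrative placeholders, not my actual settings:

```
# Sketch of Kaldi online2 endpointing options (values illustrative).
# Phone IDs (from phones.txt) to treat as silence for endpointing:
--endpoint.silence-phones=1:2:3:4:5
# Rule 2: once something has been decoded, end the utterance after
# 0.5 s of trailing silence if the best path's relative cost is low.
--endpoint.rule2.min-trailing-silence=0.5
--endpoint.rule2.max-relative-cost=2.0
# Rule 3: allow a higher relative cost after a longer (1.0 s) silence.
--endpoint.rule3.min-trailing-silence=1.0
```

If my endpointing setup itself looks off, pointers on tuning these rules would also be appreciated.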
Please correct me if anything is wrong. Thank you.