Hey,
I am trying to examine how my ASR model (based on the WSJ s5 nnet3 recipe) deals with silence segments.
So I created a silence test set containing "utterances" of pure silence (recordings with no speech in them).
I expected the model to output only SIL phones for those segments, i.e., an empty decoded transcript for each utterance.
But for each utterance the model outputs a single word, most of the time the word "EH" - so it does not treat pure silence as silence.
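To check this, I scan the decoded hypotheses with a small script along these lines (a rough sketch; the file path and the "text"-style format - utterance ID followed by the hypothesized words - are assumptions about my setup, not part of the recipe itself):

    # Count hypothesized words per utterance in a Kaldi-style "text"
    # hypothesis file: each line is "<utt-id> <word1> <word2> ...".
    # The path below is a placeholder for my decode directory.
    hyp_path = "exp/nnet3/tdnn/decode_silence_test/scoring/hyp.txt"

    with open(hyp_path) as f:
        for line in f:
            parts = line.split()
            utt_id, words = parts[0], parts[1:]
            # For a pure-silence utterance I would expect zero words here,
            # but I consistently see a single word such as "EH" instead.
            print(utt_id, len(words), words)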
Is it normal behavior for such an ASR model to output a single word for a whole segment of non-speech? Or does this indicate a problem with how my setup handles non-speech sections?
Thanks.
Bar