Filtering Bad Utterances


sxk

Feb 11, 2021, 11:37:13 PM
to kaldi-help
I have some utterances where certain words are missing from the transcripts, mostly numbers. For example, people say "5000", "5 million", or "9th January" in conversation, but the transcript contains every word except the numerics, i.e. 5000, 5, 9, etc. Is there a way to filter out utterances like these before training? Or will Kaldi flag the faulty utterances during training?

Daniel Povey

Feb 12, 2021, 12:32:53 AM
to kaldi-help
You could use find_bad_utts.sh, maybe, to find utterances that don't align well.
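The script Dan mentions (steps/cleanup/find_bad_utts.sh in the Kaldi tree) aligns the training transcripts against an existing model and reports per-utterance alignment errors. Once you have such a report, filtering it down to a bad-utterance list is a few lines of Python. This is a minimal sketch assuming a hypothetical whitespace-separated report format of "utt-id num-errors num-words"; check the actual output files of find_bad_utts.sh before wiring this in:

```python
def filter_bad_utts(lines, max_wer=0.5):
    """Return IDs of utterances whose error rate exceeds max_wer.

    Each line is assumed to be "<utt-id> <num-errors> <num-words>"
    (hypothetical format; adapt to the real find_bad_utts.sh output).
    """
    bad = []
    for line in lines:
        utt_id, errs, words = line.split()
        errs, words = int(errs), int(words)
        if words > 0 and errs / words > max_wer:
            bad.append(utt_id)
    return bad

report = [
    "utt001 1 10",   # 10% error rate -> keep
    "utt002 6 8",    # 75% error rate -> flag as bad
    "utt003 0 5",    # perfect alignment -> keep
]
print(filter_bad_utts(report))  # -> ['utt002']
```

The resulting ID list can then be passed to utils/filter_scp.pl (with --exclude) to subset the data directory.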



sxk

Feb 12, 2021, 12:49:58 AM
to kaldi-help
Hi Dan, 

I previously used steps/cleanup/segment_long_utterances_nnet3.sh to get to where I am now. Before doing that, though, I stripped all the numerics out of the raw transcripts, assuming the final output would discard any utterance containing numbers. Instead, I now have utterances whose corresponding text is simply missing the specific numbers, and these make up roughly 20% of the whole corpus. Should I go back and redo the same process with the numerics left in the corpus, or should I train an HMM-GMM model now and run clean-and-segment afterwards to remove these utterances? Which would yield better results? :/

Daniel Povey

Feb 12, 2021, 12:56:38 AM
to kaldi-help
You shouldn't have removed the numerics; you should have converted the easy/unambiguous ones to words and maybe left the rest there to be converted to unk.
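The normalization Dan describes can be sketched as a token-level pass: spell out the numerics that have one obvious reading, and map everything ambiguous to the unknown-word symbol so the lexicon handles it. This is a minimal illustration assuming English and treating only small cardinals (0-99) as "easy"; real text normalization would also handle ordinals, dates, and magnitudes properly rather than punting them to <unk>:

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def spell_small_int(n):
    """Spell out 0..99; anything larger is treated as ambiguous here."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])

def normalize_token(tok, unk="<unk>"):
    """Convert an easy numeric token to words; map the rest to <unk>."""
    if tok.isdigit():
        n = int(tok)
        if n < 100:        # treat only 0..99 as unambiguous in this sketch
            return spell_small_int(n)
        return unk         # e.g. "5000": "five thousand" vs. digit-by-digit
    if re.fullmatch(r"\d+(st|nd|rd|th)", tok):
        return unk         # ordinals like "9th" need proper normalization
    return tok

def normalize_line(line):
    return " ".join(normalize_token(t) for t in line.split())

print(normalize_line("meeting on the 9th at 5 with 5000 people"))
# -> "meeting on the <unk> at five with <unk> people"
```

The point of keeping the ambiguous tokens as <unk> rather than deleting them is that the alignment still has a word position to attach the spoken number to, instead of a silent gap in the transcript.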


sxk

Feb 12, 2021, 1:00:51 AM
to kaldi-help
Hmm, thanks. Seems like that was a very painful mistake! 10M utterances and a week of compute. Anyway, going back to stage 0! Thanks for sharing your thoughts!