Filtering Bad Utterances


sxk

Feb 11, 2021, 11:37:13 PM
to kaldi-help
I have some utterances where certain words are missing from the transcripts, mostly numbers. For example, people say "5000", "5 million", or "9th January" in conversation, but the transcript contains every word except the numerics, i.e. 5000, 5, 9, etc. Is there a way to filter out utterances like these before training? Or will Kaldi flag the faulty utterances during training?

Daniel Povey

Feb 12, 2021, 12:32:53 AM
to kaldi-help
You could use find_bad_utts.sh, maybe, to find utterances that don't align well.
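The script Dan mentions (steps/cleanup/find_bad_utts.sh in the Kaldi tree) aligns the training transcripts against an existing model and reports per-utterance alignment errors. Once you have such a report, filtering it down to a bad-utterance list is a few lines of Python. This is a minimal sketch assuming a hypothetical whitespace-separated report format of "utt-id num-errors num-words"; check the actual output files of find_bad_utts.sh before wiring this in:

```python
def filter_bad_utts(lines, max_wer=0.5):
    """Return IDs of utterances whose error rate exceeds max_wer.

    Each line is assumed to be "<utt-id> <num-errors> <num-words>"
    (hypothetical format; adapt to the real find_bad_utts.sh output).
    """
    bad = []
    for line in lines:
        utt_id, errs, words = line.split()
        errs, words = int(errs), int(words)
        if words > 0 and errs / words > max_wer:
            bad.append(utt_id)
    return bad

report = [
    "utt001 1 10",   # 10% error rate -> keep
    "utt002 6 8",    # 75% error rate -> flag as bad
    "utt003 0 5",    # perfect alignment -> keep
]
print(filter_bad_utts(report))  # -> ['utt002']
```

The resulting ID list can then be passed to utils/filter_scp.pl (with --exclude) to subset the data directory.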



sxk

Feb 12, 2021, 12:49:58 AM
to kaldi-help
Hi Dan, 

I previously used steps/cleanup/segment_long_utterances_nnet3.sh to get to where I am now. Before doing that, though, I stripped all the numerics out of the raw transcripts, assuming the final output would discard any utterance containing numbers. Instead, I now have utterances whose corresponding text is simply missing the specific numbers, and these make up roughly 20% of the whole corpus. Should I go back and redo the same process with the numerics left in the corpus, or should I train an HMM-GMM model now and run clean-and-segment afterwards to remove these utterances? Which would yield better results? :/

Daniel Povey

Feb 12, 2021, 12:56:38 AM
to kaldi-help
You shouldn't have removed the numerics; you should have converted the easy/unambiguous ones to words and maybe left the rest there to be converted to unk.
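The normalization Dan describes can be sketched as a token-level pass: spell out the numerics that have one obvious reading, and map everything ambiguous to the unknown-word symbol so the lexicon handles it. This is a minimal illustration assuming English and treating only small cardinals (0-99) as "easy"; real text normalization would also handle ordinals, dates, and magnitudes properly rather than punting them to <unk>:

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def spell_small_int(n):
    """Spell out 0..99; anything larger is treated as ambiguous here."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])

def normalize_token(tok, unk="<unk>"):
    """Convert an easy numeric token to words; map the rest to <unk>."""
    if tok.isdigit():
        n = int(tok)
        if n < 100:        # treat only 0..99 as unambiguous in this sketch
            return spell_small_int(n)
        return unk         # e.g. "5000": "five thousand" vs. digit-by-digit
    if re.fullmatch(r"\d+(st|nd|rd|th)", tok):
        return unk         # ordinals like "9th" need proper normalization
    return tok

def normalize_line(line):
    return " ".join(normalize_token(t) for t in line.split())

print(normalize_line("meeting on the 9th at 5 with 5000 people"))
# -> "meeting on the <unk> at five with <unk> people"
```

The point of keeping the ambiguous tokens as <unk> rather than deleting them is that the alignment still has a word position to attach the spoken number to, instead of a silent gap in the transcript.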


sxk

Feb 12, 2021, 1:00:51 AM
to kaldi-help
Hmm, thanks. Seems like that was a very painful mistake! 10M utterances and a week of compute. Anyway, going back to stage 0! Thanks for sharing your thoughts!