Training YES/NO model with Kaldi.

Shantam Garg

Aug 31, 2016, 3:20:33 AM

to kaldi-help

Hi,

We are building an ASR system for Indian-accented English that recognises the words YES and NO. We have around 10K wave files for YES but only 700 wave files for NO.

(Note: each wave file comes from a different speaker.)

As you can see, the number of wav files for training NO is very small. Is there any way in Kaldi to train our models with only the YES wav files?

We actually tried an experiment in which we trained a YES model with 6K YES files and a language model containing only one word, YES. While decoding, we found that because the model is trained only on YES wav files, it recognises every test file as some form of YES.

Example results we got:

Actual transcript YES -- Recognised as YES

Actual transcript YES -- Recognised as YES YES or YES YES YES etc.

(We think it might also be recognising some noise in the wave file as YES, leading to multiple YES outputs even though there is only one YES in the wav file.)

Actual transcript NO -- Recognised as YES or YES YES etc.

So, is there any way in Kaldi to set a threshold score such that only decoding results above that threshold are considered accurate?

Regards,

Shantam Garg

Daniel Povey

Aug 31, 2016, 2:58:01 PM

to kaldi-help

There isn't such a way. You should train on both yes and no.

Dan

Shantam Garg

Aug 31, 2016, 3:29:03 PM

to kaldi-help, dpo...@gmail.com

Thanks for the reply Dan.

If there isn't any way to achieve it, can we train on both YES and NO (with far fewer NO files than YES files), keeping the YES:NO ratio similar in test and train, or do we need to collect more data for NO?

Regards,

Shantam Garg

Daniel Povey

Aug 31, 2016, 3:45:36 PM

to Shantam Garg, kaldi-help

The ratio between 'yes' and 'no' doesn't have to be the same in test and train.
Just make sure your language model FST has suitable weights, e.g. the
same for yes and no. For example:

fstcompile <<EOF > data/lang/G.fst
0 1 2 2
0 1 3 3
1 0.0
EOF

is an FST that just accepts either a single "yes" or a single "no"
with equal weight [assuming words.txt assigns label 2 to "yes" and 3
to "no", which it might not...].
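Since the labels in words.txt may differ from 2 and 3, one way to avoid hard-coding them is to look them up before writing the FST text. The following is only a sketch in Python: the helper names `read_symbol_table` and `single_word_grammar` are made up for illustration, and the example words.txt contents are hypothetical. The string it produces is the kind of text input that fstcompile reads.

```python
def read_symbol_table(lines):
    """Parse 'symbol id' lines (the words.txt layout) into a dict."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            table[parts[0]] = int(parts[1])
    return table


def single_word_grammar(words, table, costs=None):
    """Return FST text that accepts exactly one of `words`.

    `costs` optionally maps a word to its arc weight; 0.0 is used
    for any word that is not listed, so all words are equally likely.
    """
    costs = costs or {}
    lines = []
    for word in words:
        label = table[word]  # KeyError if the word is not in words.txt
        lines.append(f"0 1 {label} {label} {costs.get(word, 0.0)}")
    lines.append("1 0.0")  # state 1 is final
    return "\n".join(lines) + "\n"


# Hypothetical words.txt contents; a real file will likely use other ids.
example_words_txt = ["<eps> 0", "!SIL 1", "yes 2", "no 3"]
table = read_symbol_table(example_words_txt)
g_txt = single_word_grammar(["yes", "no"], table)
print(g_txt)
```

With a real setup, the lines would come from data/lang/words.txt and the printed text would be piped to fstcompile as in the command above.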

Read the HTK Book to get some background understanding of speech recognition.

Dan

Shantam Garg

Sep 6, 2016, 9:56:54 AM

to kaldi-help, shanta...@gmail.com, dpo...@gmail.com

Hi Dan,

As you suggested, I trained the model with a dictionary-based language model for 'yes' and 'no', and I got 5% WER when decoding.

When I tried the trained model on some random wav files, it correctly detected the yes/no wav files, but when a wav file was empty or had some random transcript (words not in the corpus), it was still mapped to either 'yes' or 'no'.

So, is there any way to detect such instances with random transcripts, or any workaround to avoid mapping them to the words in the corpus?

Thanks,

Shantam Garg

Daniel Povey

Sep 6, 2016, 2:06:28 PM

to Shantam Garg, kaldi-help

The only easy way to do that is to train a complete LVCSR system. Or
you could have some data that's not "yes" or "no" and just mark it as
some 3rd class, like "garbage", and treat that as a separate word.

Dan

Shantam Garg

Sep 17, 2016, 9:11:08 AM

to kaldi-help, shanta...@gmail.com, dpo...@gmail.com

Hi Dan,

I trained the model with 'yes', 'no' and 'garbage' (for unknown or random wav files), as we don't have enough data to train an LVCSR system.

After testing the models on a few files the precision and recall were:

(Data used to train: 15K Yes files, 15K garbage files, 1K No files)

Precision_Yes: 86.0826563052 Recall_Yes: 98.1869460113

Precision_No: 48.5815602837 Recall_No: 91.3333333333

Precision_Unknown: 98.6041874377 Recall_Unknown: 85.0143266476

Although the recall is quite good, I am concerned about the precision of the models; currently the precision of 'yes' and 'no' is low.
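For reference, per-class precision and recall can be computed from a confusion table as in the Python sketch below. The function name and the counts are made up for illustration and are not the results reported in this thread. The toy numbers show one way a rare class can end up with high recall but low precision: a small fraction of errors from the much larger classes is enough to outnumber its correct detections.

```python
def precision_recall(confusion, label):
    """confusion[ref][hyp] = number of files with reference `ref`
    that were recognised as `hyp`."""
    true_pos = confusion[label][label]
    # Everything the recogniser output as `label`, right or wrong.
    predicted = sum(row.get(label, 0) for row in confusion.values())
    # Everything whose reference transcript is `label`.
    actual = sum(confusion[label].values())
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Made-up counts: 100 yes files, 10 no files, 100 garbage files.
toy = {
    "yes": {"yes": 90, "no": 5, "garbage": 5},
    "no": {"yes": 1, "no": 9, "garbage": 0},
    "garbage": {"yes": 9, "no": 6, "garbage": 85},
}

for word in toy:
    p, r = precision_recall(toy, word)
    print(f"{word}: precision={p:.2f} recall={r:.2f}")
```

In this toy table 'no' has recall 0.90 but precision 0.45, because the 11 yes and garbage files misrecognised as 'no' outnumber the 9 correct detections.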

Is there any change we can make to the way we train our model to increase the precision?

Also, when we recognise a wav file in Kaldi, can we somehow get a score or probability value for it that can then be used as a threshold to determine whether the recognition was accurate?

Thanks,

Shantam Garg

Daniel Povey

Sep 17, 2016, 5:23:55 PM

to Shantam Garg, kaldi-help

I think I am going to have to start answering fewer newbie questions
and focus on the harder ones. I wonder if there is someone else who
can answer this?

Nickolay Shmyrev

Sep 18, 2016, 3:03:45 AM

to kaldi-help, shanta...@gmail.com, dpo...@gmail.com

You need to play with the weights of the grammar you use for your task. If you use a simple loop FST, increase the weight (i.e. the probability) of garbage and decrease the weights of yes and no. More recordings will then be recognised as garbage, but the detections of yes and no will be more reliable.

Listen to the misrecognised recordings and try to understand why they were recognised as a word instead of garbage; maybe add more garbage types to the training data.

Further down the road, you should make sure you are using a wide-context LSTM network, not simple GMM models. With 15k input files, you should be able to train an LSTM.

Shantam Garg

Sep 19, 2016, 6:54:39 AM

to kaldi-help, shanta...@gmail.com, dpo...@gmail.com

Thanks for the help @Dan, @Nickolay

Currently I am using this grammar (G.fst):

0 1 3 3
0 1 4 4
0 1 5 5
1

where the label for 'yes' is 5, 'no' is 4 and 'garbage' is 3, and the weights are equal for all three. I will try adjusting the weights a little: increase the weight for 'garbage' and decrease the weights of 'yes' and 'no'.

Thanks,

Shantam Garg

Daniel Povey

Sep 19, 2016, 9:21:10 AM

to Shantam Garg, kaldi-help

The 5th field is the weight (default 0.0), but remember it is a
negative natural log, so a positive value (like 0.5 or 1.0) will
suppress the occurrence of that word.
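To make the negative-log convention concrete, here is a small Python sketch that turns prior probabilities into the 5th-field weights. The helper names are hypothetical, the labels follow the grammar quoted earlier in the thread (3 = garbage, 4 = no, 5 = yes), and the probabilities are arbitrary illustration values rather than tuned settings.

```python
import math


def prior_to_cost(prob):
    """Negative natural log of a probability: the form the weight field uses."""
    return -math.log(prob)


def weighted_grammar(label_priors):
    """Build FST text with one arc per word label.

    `label_priors` maps an integer word label to its prior probability.
    A higher probability gives a lower cost, so that word is favoured.
    """
    lines = [
        f"0 1 {label} {label} {prior_to_cost(prob):.4f}"
        for label, prob in sorted(label_priors.items())
    ]
    lines.append("1")  # final state, default weight
    return "\n".join(lines) + "\n"


# Arbitrary example: garbage twice as likely a priori as yes or no.
priors = {3: 0.5, 4: 0.25, 5: 0.25}
print(weighted_grammar(priors))
```

Here garbage gets the smallest weight (0.6931) and yes/no the largest (1.3863), so making garbage "more likely" means lowering its number in the 5th field, not raising it.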
