I'm building an ASR system for a low-resource Semitic language. I have very limited data, 2.5 hours in total, of which I used 2 hours for training and 0.5 hours for testing.
Monophone training went okay; I'm getting around 50% WER on my test set.
But when I move to triphone training, the WER goes up to 70%. I'm not sure whether this is really overfitting to my limited training data or whether I'm missing something obvious.
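For completeness, this is roughly how I score (a minimal sketch of word-level edit distance; my real numbers come from my toolkit's scoring script, so treat this as illustration only):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: (subs + ins + dels) / len(ref), via Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deleting i words
    for j in range(len(h) + 1):
        d[0][j] = j  # inserting j words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("a b c d", "a x c"))  # one substitution + one deletion over 4 ref words -> 0.5
```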
Here are some more details about my setup:
My lexicon is 5k words, and I have 30 phones in total, of which 6 are vowels (excluding the sil and spn phones).
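For context, here's a back-of-the-envelope count of the triphone space (rough arithmetic; the actual number of tied states depends on the decision-tree questions, so this is only an upper bound):

```python
phones = 30                       # my phone inventory, excluding sil/spn
logical_triphones = phones ** 3   # every left-context / phone / right-context combination
hmm_states = logical_triphones * 3  # assuming the usual 3 emitting states per triphone

print(logical_triphones, hmm_states)  # 27000 logical triphones, 81000 states before tying
```

With only ~2 hours of audio, the vast majority of those 27k logical triphones never occur in training, which is why I assumed the number of leaves (tied states) is the critical knob here.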
I built a 3-gram LM (with Kneser-Ney discounting) from the training set's transcripts. I know this corpus is far too small for an LM, but I couldn't find much other written material in this language to estimate the LM from.
PPL on the training set is around 50; on the test set it is 400. (I know I should build a better LM, but I don't think this is the cause of my poor triphone performance relative to monophone.)
My data is segmented and I have silence in boundaries.
During triphone training, aligning data looks fine.
I tried lowering number of leaves and total number of Gaussian components, they did not help much, also tried lowering number of training iterations (in case of over fitting), no luck.
In monophone training, avg likelihood goes from around -105 to -86
In triphone training, from around -90 to -82
So my question is: what could be wrong in my setup? Is it simply over-fitting to my limited training data?
Should I even try more with this setup or I should focus on getting more data, at least 10h?