to kaldi-help
Hello,
I am training on a dataset of 20 different speakers, with 10 utterances of 5 seconds each per speaker. My test set has 2 utterances per speaker (i.e. 40 utterances), each 5 seconds long. I train a monophone model on delta features and a triphone model with LDA+MLLT. On decoding I get a WER of around 45-50% for both, which is a lot. I had prepared the language model accordingly.
So I wanted to know how I can improve my results. Do I need a more balanced dataset in terms of training and testing?
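For reference, the kind of Kaldi pipeline being described here usually looks roughly like the sketch below. This assumes the standard egs/*/s5 script layout with data/train, data/lang, and data/test directories already prepared; the leaf and Gaussian counts are purely illustrative, not taken from the original post:

    # monophone system on MFCC+delta features
    steps/train_mono.sh --nj 4 --cmd run.pl data/train data/lang exp/mono
    steps/align_si.sh --nj 4 --cmd run.pl data/train data/lang exp/mono exp/mono_ali

    # LDA+MLLT triphone system on top of the monophone alignments
    # (with only ~15 minutes of audio, the leaf/Gaussian counts would have to be
    #  far smaller than in the standard recipes)
    steps/train_lda_mllt.sh --cmd run.pl 500 5000 data/train data/lang exp/mono_ali exp/tri1

    # graph construction with the test-time language model, then decoding and scoring
    utils/mkgraph.sh data/lang_test exp/tri1 exp/tri1/graph
    steps/decode.sh --nj 4 --cmd run.pl exp/tri1/graph data/test exp/tri1/decode
    grep WER exp/tri1/decode/wer_* | utils/best_wer.sh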
Olumide
May 10, 2017, 8:33:38 AM
to kaldi...@googlegroups.com
Prabhjit, I think you've got too little training data. You need hundreds of hours of training data. How many hours of speech have you got?
Armando
to kaldi-help
20 × 10 × 5 sec = 1000 sec, so about 17 minutes of training data? For triphones? I'm surprised you even got 45-50% WER, and that was probably because you used the same speakers in the test data. I'd say you'd need at least a few dozen hours as a minimal requirement.
Prabhjit Singh Thind
May 11, 2017, 4:25:03 AM
to kaldi-help
Yeah, the 40-50% WER is because I took the test data from the same speakers. I'll get more data for training. Thanks Armando and Olumide. I also wanted to ask: I am using wav files at 11025 Hz. What higher frequencies can I use? Basically, how do I decide at what sampling rate the data should be trained?
Armando
May 11, 2017, 8:37:07 AM
to kaldi-help
The same as your training and testing data. If you downsample you degrade the audio quality, and you don't gain anything by increasing the sampling rate anyway; the important thing is to keep the same rate in both training and testing. Most of the time you find training corpora at either 8 or 16 kHz. Let's say you have a training corpus of a few dozen hours at 8 kHz; after training the acoustic models, in that case, you might as well downsample your test data from 11 to 8 kHz (well, you could instead modify the high-frequency parameter in the feature-extraction binary accordingly, but forget that for now).
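A minimal sketch of what that looks like in practice, assuming sox is installed and the usual conf/mfcc.conf setup (the file names below are made up for illustration):

    # downsample an 11025 Hz test wav to 8 kHz to match an 8 kHz training corpus
    sox utt001_11k.wav -r 8000 utt001_8k.wav

    # keep feature extraction consistent with the audio, e.g. in conf/mfcc.conf:
    #   --use-energy=false
    #   --sample-frequency=8000

    # (the "high-frequency parameter" mentioned above is the --high-freq option of
    #  compute-mfcc-feats, which can be left at its default if the rates already match)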