to kaldi-help
Hello,
I am training on a dataset of 20 different speakers, with 10 utterances of 5 seconds each per speaker. My test set has 2 utterances per speaker (i.e. 40 utterances), each 5 seconds long. I train a monophone model on delta features and a triphone model with LDA+MLLT. On decoding I get a WER of around 45-50% for both, which is a lot. I had prepared the language model accordingly.
So I wanted to know how I can improve my results. Do I need a more balanced dataset in terms of training and testing?
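For reference, the kind of Kaldi pipeline being described here usually looks roughly like the sketch below. This assumes the standard egs/*/s5 script layout with data/train, data/lang, and data/test directories already prepared; the leaf and Gaussian counts are purely illustrative, not taken from the original post:

    # monophone system on MFCC+delta features
    steps/train_mono.sh --nj 4 --cmd run.pl data/train data/lang exp/mono
    steps/align_si.sh --nj 4 --cmd run.pl data/train data/lang exp/mono exp/mono_ali

    # LDA+MLLT triphone system on top of the monophone alignments
    # (with only ~15 minutes of audio, the leaf/Gaussian counts would have to be
    #  far smaller than in the standard recipes)
    steps/train_lda_mllt.sh --cmd run.pl 500 5000 data/train data/lang exp/mono_ali exp/tri1

    # graph construction with the test-time language model, then decoding and scoring
    utils/mkgraph.sh data/lang_test exp/tri1 exp/tri1/graph
    steps/decode.sh --nj 4 --cmd run.pl exp/tri1/graph data/test exp/tri1/decode
    grep WER exp/tri1/decode/wer_* | utils/best_wer.sh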
Olumide
May 10, 2017, 8:33:38 AM
to kaldi...@googlegroups.com
Prabhjit, I think you've got too little training data. You need hundreds of hours of training data. How many hours of speech have you got?
Armando
to kaldi-help
20 × 10 × 5 sec = 1000 sec, so about 17 minutes of training data? For triphones? I'm surprised you even got 45-50% WER, and that was probably because you used the same speakers in the test data. I'd say you'd need at least a few dozen hours as a minimal requirement.
Prabhjit Singh Thind
May 11, 2017, 4:25:03 AM
to kaldi-help
Yeah, the 40-50% WER is because I took the test data from the same speakers. I'll get more data for training. Thanks Armando and Olumide. I also wanted to ask: I am using wav files at 11025 Hz. What higher frequencies can I use? Basically, how do I decide at what sampling rate the data should be trained?
Armando
May 11, 2017, 8:37:07 AM
to kaldi-help
The same as your training and testing data. If you downsample you degrade the audio quality, and you don't gain anything by increasing the sampling rate anyway; the important thing is to keep the same rate in both training and testing. Most of the time you find training corpora at either 8 or 16 kHz. Let's say you have a training corpus of a few dozen hours at 8 kHz; after training the acoustic models, in that case, you might as well downsample your test data from 11 to 8 kHz (well, you could instead modify the high-frequency parameter in the feature-extraction binary accordingly, but forget that for now).
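A minimal sketch of what that looks like in practice, assuming sox is installed and the usual conf/mfcc.conf setup (the file names below are made up for illustration):

    # downsample an 11025 Hz test wav to 8 kHz to match an 8 kHz training corpus
    sox utt001_11k.wav -r 8000 utt001_8k.wav

    # keep feature extraction consistent with the audio, e.g. in conf/mfcc.conf:
    #   --use-energy=false
    #   --sample-frequency=8000

    # (the "high-frequency parameter" mentioned above is the --high-freq option of
    #  compute-mfcc-feats, which can be left at its default if the rates already match)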