Kaldi WER when using very small lexicon


oren...@gmail.com

Oct 20, 2016, 3:10:25 PM
to kaldi-help
I'm developing an app which uses ASR with a vocabulary of about 30 words. The words are known only at runtime. I currently use PocketSphinx, and I read that Kaldi is about twice as accurate when using a large vocabulary. Does this also apply when using a vocabulary of 20-30 words?

And another question: I have to reject words not in the lexicon. I currently do this with a phone-loop which I put inside a JSGF grammar. Is this approach suitable for Kaldi?

I can't check all of this right away, since I'm not yet familiar with Linux or with Kaldi, so I'm asking here first.

Daniel Povey

Oct 20, 2016, 4:31:22 PM
to kaldi-help
> I'm developing an app which uses ASR with a vocabulary of about 30 words. The words are known only at runtime. I currently use PocketSphinx, and I read that Kaldi is about twice as accurate when using a large vocabulary. Does this also apply when using a vocabulary of 20-30 words?

Yes, it should.  (Assuming both systems were trained on the same data, etc.) 
The Kaldi runtime is a bit harder to deal with than PocketSphinx, as there are many more options, model types, and ways to use it; it's less slickly packaged, more of a low-level library.

> And another question: I have to reject words not in the lexicon. I currently do this with a phone-loop which I put inside a JSGF grammar. Is this approach suitable for Kaldi?

Yes, you can do that.  Right now in the tedlium s5_r2 example scripts, there is a script local/run_unk_model.sh that demonstrates how to do decoding where the unknown-word <unk> maps to an n-gram model on phones.  That is a good way to handle OOV words.
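
In outline, that unk-model approach looks something like the sketch below. This is only an illustrative sketch with placeholder paths and model directories, not the exact recipe; local/run_unk_model.sh in egs/tedlium/s5_r2 is the authoritative version.

    # Estimate a phone-level LM for the unknown word from the training lexicon
    # (all paths here are placeholders).
    utils/lang/make_unk_lm.sh data/local/dict exp/unk_lang_model

    # Rebuild the lang directory so that <unk> expands into that phone-level FST.
    utils/prepare_lang.sh --unk-fst exp/unk_lang_model/unk_fst.txt \
      data/local/dict "<unk>" data/local/lang_tmp data/lang_unk

    # After adding G.fst to the new lang directory, build the decoding graph as usual.
    utils/mkgraph.sh data/lang_unk exp/tri3 exp/tri3/graph_unk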
 
> I can't check all of this right away, since I'm not yet familiar with Linux or with Kaldi, so I'm asking here first.

If you are not familiar with Linux, expect it to be an uphill struggle.  Kaldi is intended for experienced people or teams to use, and we generally assume you have your own data and are training your own models.  There will be a learning curve.

Dan


 


Ognjen Todic

Oct 21, 2016, 1:52:56 AM
to kaldi-help
You will definitely get better performance with Kaldi. The extent of improvement will depend on a lot of things:
  1. For small vocabularies, performance should already be very high (though see the PocketSphinx-related comments below), unless other factors push the envelope (a very noisy environment, non-native speech, acoustically similar words, etc.).
  2. I've done some experiments in the past with PocketSphinx acoustic models retrained (MAP-adapted, to be more precise) on domain-specific data (similar to yours: about 100 words and maybe 1 hour of data). I did the same thing with Kaldi (using GMMs, not DNNs, since data was limited) and got much better results. I can't remember the exact numbers off the top of my head, but it was a substantial improvement, something like 92-93% to 97-98%.
  3. With the NNet models/decoder (which you should use), you are likely to get even better performance.
  4. If I recall correctly, quite a few issues with PocketSphinx were related to a not-so-great silence model (this was the case both with the stock PS acoustic models and with my own custom-trained ones); miscellaneous noises like lip smacks, coughing, etc. would trigger their VAD and get recognized as one of the words (a garbage model helped a bit, but it did not work that well). It's possible I was doing something wrong, but I spent quite a bit of time trying to get it to work before switching to Kaldi.
Not sure what platform you are developing for. If iOS, you can experiment with our proof-of-concept app (http://github.com/keenresearch/kaldi-ios-poc). Just take one of the demos, and plug in your own words. There is no filler/garbage model yet, but it will be available soon. 

If you have any questions about the proof-of-concept app, please don't post here; contact me directly at o...@keenresearch.com.

oren...@gmail.com

Oct 21, 2016, 3:59:33 PM
to kaldi-help
Thanks a lot, Dan and Ognjen. I guess that building Kaldi on Windows is not necessarily easier than learning to use it on Linux... I will check both options.

Daniel Povey

Oct 21, 2016, 4:04:28 PM
to kaldi-help
Even if you can get Kaldi to compile on Windows, it won't do you any good, because Kaldi relies heavily on scripting, and that only really works on UNIX.
You could try it on Cygwin, but Cygwin is very buggy and limited.  Better to just use UNIX.
Early on in Kaldi development I thought it might be possible to have a DOS or PowerShell version of the scripts, but when I realized how limited DOS is, and how strange and un-UNIXy PowerShell is, I had to abandon that idea.

Dan


Ri ki

Jul 27, 2017, 3:48:21 AM
to kaldi-help, dpo...@gmail.com
Hi,


The original post states the following:

----> I'm developing an app which uses ASR with a vocabulary of about 30 words. The words are known only at runtime.

So if the words are known only at runtime, is it possible to generate a new G.fst (maybe by adding the new words to the existing lexicon and words.txt, generating the pronunciations, etc.) and then create the HCLG.fst, all at runtime?

Thanks in advance.

o...@keenresearch.com

Jul 27, 2017, 12:37:08 PM
to kaldi-help, dpo...@gmail.com
It depends what your definition of runtime is... but if the question is whether you could build HCLG.fst in your own binary that's doing decoding, the answer is yes. You'd have to figure out all the steps involved in doing that (e.g. follow the logic from the scripts and the corresponding binaries).
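
For orientation, the graph-building steps those scripts wrap look roughly like this at the shell level; anything done "at runtime" would have to replicate the same sequence by calling the underlying binaries or library code. Paths, model directories and the LM file below are placeholders, not a tested recipe.

    # Lexicon and phone sets -> L.fst plus the rest of the lang directory.
    utils/prepare_lang.sh data/local/dict "<unk>" data/local/lang_tmp data/lang

    # Grammar/LM -> G.fst.  Here it comes from an ARPA LM; a fixed word list
    # could instead be compiled into a small grammar FST by hand.
    arpa2fst --disambig-symbol=#0 --read-symbol-table=data/lang/words.txt \
      lm.arpa data/lang/G.fst

    # Compose H, C, L and G into the final decoding graph HCLG.fst.
    utils/mkgraph.sh data/lang exp/model exp/model/graph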

Ri ki

Jul 27, 2017, 2:29:14 PM
to kaldi-help, dpo...@gmail.com
Hi Ognjen,

Thank you for the prompt response.

Since the original poster mentioned an app and PocketSphinx, my understanding was that this person is trying to use Kaldi on a mobile device. And "at runtime" could mean that the words become available once the app is running, for example when reading the song list or the phone's contact list.

So my question was whether it is possible to do this once the app is loaded into memory, i.e. to pick up the song list on the device, create pronunciations for newly added words (if they don't already exist), create the G.fst, and all that. (I believe you answered my question, though.)

Is there an easier or alternate way of doing this?

Thanks

Daniel Povey

Jul 27, 2017, 2:48:35 PM
to Ri ki, kaldi-help
There isn't a super-easy way, and the conversion of spelling into pronunciation would normally be done with external tools (Phonetisaurus or Sequitur), so the integration would be tricky. The original poster stated that he is not that familiar with Linux; all of this would be extremely difficult for him.
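
For completeness, with Phonetisaurus the pronunciation-generation step would look roughly like the following. The model and file names are placeholders, and flags may differ between versions, so treat this as a sketch rather than a recipe.

    # Generate pronunciations for new words with a pre-trained G2P model
    # (g2p_model.fst and new_words.txt are hypothetical inputs).
    phonetisaurus-apply --model g2p_model.fst --word_list new_words.txt \
      > new_lexicon_entries.txt

    # Append the entries to the lexicon, then rebuild lang/, G.fst and
    # HCLG.fst along the lines of the earlier sketches.
    cat new_lexicon_entries.txt >> data/local/dict/lexicon.txt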

Ri ki

Jul 27, 2017, 2:57:20 PM
to kaldi-help, maria...@gmail.com, dpo...@gmail.com
OK, thank you Dan.