Mixing language models efficiently and handling out-of-vocabulary words

yasir....@sybrid.com

Dec 7, 2021, 6:20:14 AM
to kaldi-help
Dear Kaldi Community,

I have built an Urdu ASR system that is very focused and domain-specific, with only around 7,000 words in its vocabulary. It gives a WER of around 22%, but due to the limited vocabulary it is, understandably, unable to recognize out-of-vocabulary words, especially proper nouns and names.
To counter this, I scraped data from online sources (500,000+ sentences, a vocabulary of 100,000+ words) and mixed the two language models.
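(As a note, the OOV rate itself can be measured with SRILM's ngram tool, which prints an OOV count when scoring text; heldout.txt below stands for a hypothetical held-out test file:

ngram -ppl heldout.txt -lm data/local/tmp/lm_1.arpa

The "OOVs" figure in the output is the number of test tokens that fall outside the LM vocabulary.)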

Corpus_1 is from my data, Corpus_2 is from online sources.
The command used to prepare the language model for each corpus, with lm_order = 5, is as follows:

ngram-count -order 5 -write-vocab data/local/tmp/vocab-full_1.txt -wbdiscount -text data/local/corpus_1.txt -lm data/local/tmp/lm_1.arpa

ngram-count -order 5 -write-vocab data/local/tmp/vocab-full_2.txt -wbdiscount -text data/local/corpus_2.txt -lm data/local/tmp/lm_2.arpa

Mixing the models using:
ngram -order <order> -lm <first-lm> -mix-lm <second-lm> -lambda <lambda> -write-lm <final.lm>
with lambda = 0.90.

This has degraded my system's accuracy considerably.

Any recommendations for handling this problem?
1- Would changing the LM order to 5 for the first LM and 1 for the second LM help?

Best regards,
Yasir

Daniel Povey

Dec 7, 2021, 6:34:43 AM
to kaldi-help
I think Kneser-Ney discounting is more common than Witten-Bell.
You will have to tune the LM orders and the mixture weights.
The more generic LM may of course degrade WER on data from the limited domain.
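A rough sketch of both steps (untested; assumes SRILM's compute-best-mix script is on the PATH and heldout.txt is a held-out in-domain text):

# Rebuild with modified Kneser-Ney instead of Witten-Bell
ngram-count -order 5 -kndiscount -interpolate -text data/local/corpus_1.txt -lm data/local/tmp/lm_1.arpa

# Score the held-out data with each LM separately, then estimate the best mixture weights
ngram -debug 2 -ppl heldout.txt -lm data/local/tmp/lm_1.arpa > lm_1.ppl
ngram -debug 2 -ppl heldout.txt -lm data/local/tmp/lm_2.arpa > lm_2.ppl
compute-best-mix lm_1.ppl lm_2.ppl

The weight compute-best-mix reports for the first model can then be passed to ngram as -lambda.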


Jan Yenda Trmal

Dec 7, 2021, 1:50:25 PM
to kaldi-help
I suggest trying different weights on some held-out in-domain data.
Plus, the -lambda parameter might not mean what you think it means (you might actually mean 1 - 0.9 = 0.1) -- I find the lambda setting in SRILM quite confusing
(-lambda is for the main model, while the other lambdas are for the -mix-lm models).
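A quick way to sweep weights without writing intermediate LMs (a rough sketch; heldout.txt stands for your held-out in-domain text) is to let ngram score through the interpolated pair directly:

for lambda in 0.5 0.6 0.7 0.8 0.9 0.95; do
  echo "lambda=$lambda"
  ngram -order 5 -lm data/local/tmp/lm_1.arpa -mix-lm data/local/tmp/lm_2.arpa -lambda $lambda -ppl heldout.txt
done

Then pick the lambda that gives the lowest perplexity.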
y.

yasir....@sybrid.com

Dec 8, 2021, 12:35:50 AM
to kaldi-help
This is very helpful.
If I understand correctly, with
ngram -order 5 -lm lm1.arpa -mix-lm lm2.arpa -lambda 0.90 -write-lm lm.arpa
lm2.arpa (the generic LM) would get a higher weight than lm1.arpa (the domain LM)?
In that case, the following would make more sense for me, since I want to assign 0.90 to the domain LM and 0.10 to lm2:
ngram -order 5 -lm lm1.arpa -mix-lm lm2.arpa -lambda 0.10 -write-lm lm.arpa
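(One way to check this empirically, assuming heldout.txt is in-domain held-out text: score it under both settings and see which value of lambda gives the lower in-domain perplexity, i.e. which one actually puts more weight on the domain LM:

ngram -order 5 -lm lm1.arpa -mix-lm lm2.arpa -lambda 0.90 -ppl heldout.txt
ngram -order 5 -lm lm1.arpa -mix-lm lm2.arpa -lambda 0.10 -ppl heldout.txt)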

Asking this because the -lambda semantics are confusing to me.
Best regards.
