Mixing language models efficiently and handling out-of-vocabulary words

yasir....@sybrid.com

Dec 7, 2021, 6:20:14 AM
to kaldi-help
Dear Kaldi Community,

I have built an Urdu ASR system that is very focused and domain-specific, with only around 7,000 words in its vocabulary. It gives a WER of around 22%, but due to the limited vocabulary it is, understandably, unable to recognize out-of-vocabulary words, especially proper nouns and names.
To counter this, I scraped data from online sources (500,000+ sentences, a vocabulary of 100,000+ words) and mixed the two language models.
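(As a note, the OOV rate itself can be measured with SRILM's ngram tool, which prints an OOV count when scoring text; heldout.txt below stands for a hypothetical held-out test file:

ngram -ppl heldout.txt -lm data/local/tmp/lm_1.arpa

The "OOVs" figure in the output is the number of test tokens that fall outside the LM vocabulary.)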

Corpus_1 is from my data, Corpus_2 is from online sources.
The command used to prepare the language model for each corpus, with lm_order = 5, is as follows:

ngram-count -order 5 -write-vocab data/local/tmp/vocab-full_1.txt -wbdiscount -text data/local/corpus_1.txt -lm data/local/tmp/lm_1.arpa

ngram-count -order 5 -write-vocab data/local/tmp/vocab-full_2.txt -wbdiscount -text data/local/corpus_2.txt -lm data/local/tmp/lm_2.arpa

Mixing the models using:
ngram -order <order> -lm <first-lm> -mix-lm <second-lm> -lambda <lambda> -write-lm <final.lm>
with lambda = 0.90.

This has degraded my system's accuracy considerably.

Any recommendations for handling this problem?
1- Would changing the LM order to 5 for the first LM and 1 for the second LM help?

Best regards,
Yasir

Daniel Povey

Dec 7, 2021, 6:34:43 AM
to kaldi-help
I think Kneser-Ney discounting is more common than Witten-Bell.
You will have to tune the LM orders and the mixture weights.
The more generic LM may of course degrade WER on data from the limited domain.
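A rough sketch of both steps (untested; assumes SRILM's compute-best-mix script is on the PATH and heldout.txt is a held-out in-domain text):

# Rebuild with modified Kneser-Ney instead of Witten-Bell
ngram-count -order 5 -kndiscount -interpolate -text data/local/corpus_1.txt -lm data/local/tmp/lm_1.arpa

# Score the held-out data with each LM separately, then estimate the best mixture weights
ngram -debug 2 -ppl heldout.txt -lm data/local/tmp/lm_1.arpa > lm_1.ppl
ngram -debug 2 -ppl heldout.txt -lm data/local/tmp/lm_2.arpa > lm_2.ppl
compute-best-mix lm_1.ppl lm_2.ppl

The weight compute-best-mix reports for the first model can then be passed to ngram as -lambda.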


Jan Yenda Trmal

Dec 7, 2021, 1:50:25 PM
to kaldi-help
I suggest trying different weights on some held-out in-domain data.
Plus, the -lambda parameter might not mean what you think it means (you might actually mean 1 - 0.9 = 0.1) -- I find the lambda setting in SRILM quite confusing
(-lambda is for the main model, while the other lambdas are for the -mix-lm models).
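A quick way to sweep weights without writing intermediate LMs (a rough sketch; heldout.txt stands for your held-out in-domain text) is to let ngram score through the interpolated pair directly:

for lambda in 0.5 0.6 0.7 0.8 0.9 0.95; do
  echo "lambda=$lambda"
  ngram -order 5 -lm data/local/tmp/lm_1.arpa -mix-lm data/local/tmp/lm_2.arpa -lambda $lambda -ppl heldout.txt
done

Then pick the lambda that gives the lowest perplexity.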
y.

yasir....@sybrid.com

Dec 8, 2021, 12:35:50 AM
to kaldi-help
This is very helpful.
If I understand correctly, with
ngram -order 5 -lm lm1.arpa -mix-lm lm2.arpa -lambda 0.90 -write-lm lm.arpa
lm2.arpa (the generic LM) would get a higher weight than lm1.arpa (the domain LM)?
In that case, the following would make more sense for me, since I want to assign 0.90 to the domain LM and 0.10 to lm2:
ngram -order 5 -lm lm1.arpa -mix-lm lm2.arpa -lambda 0.10 -write-lm lm.arpa
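(One way to check this empirically, assuming heldout.txt is in-domain held-out text: score it under both settings and see which value of lambda gives the lower in-domain perplexity, i.e. which one actually puts more weight on the domain LM:

ngram -order 5 -lm lm1.arpa -mix-lm lm2.arpa -lambda 0.90 -ppl heldout.txt
ngram -order 5 -lm lm1.arpa -mix-lm lm2.arpa -lambda 0.10 -ppl heldout.txt)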

Asking this because the -lambda semantics are confusing to me.
Best regards.
