Dear Kaldi Community,
I have built a narrowly focused, domain-specific Urdu ASR system with a vocabulary of only around 7,000 words. It gives a WER of around 22%, but because of the limited vocabulary it understandably cannot recognize out-of-vocabulary words, especially proper nouns and names.
To counter this, I scraped data from online sources (500,000+ sentences, with a vocabulary of 100,000+ words) and mixed the two language models.
Corpus_1 is from my data, Corpus_2 is from online sources.
The commands used to prepare the language models, with lm_order = 5 for both corpora, are as follows:
ngram-count -order 5 -write-vocab data/local/tmp/vocab-full_1.txt -wbdiscount -text data/local/corpus_1.txt -lm data/local/tmp/lm_1.arpa
ngram-count -order 5 -write-vocab data/local/tmp/vocab-full_2.txt -wbdiscount -text data/local/corpus_2.txt -lm data/local/tmp/lm_2.arpa
The two models were then mixed using:
ngram -order <order> -lm <first-lm> -mix-lm <second-lm> -lambda <lambda> -write-lm <final.lm>
with lambda = 0.90.
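For readers following the mixing step, here is a rough sketch of what linear interpolation does to a single word probability. In SRILM, -lambda is the weight given to the main model (the one passed with -lm), so the -mix-lm model receives 1 - lambda. The helper name mix_prob and all probabilities below are invented for illustration and are not SRILM output; treating a word unknown to the first model as having probability 0 there is also a simplifying assumption.

```shell
#!/bin/sh
# Toy illustration of linear interpolation between two LM probabilities.
# mix_prob WEIGHT P1 P2 prints WEIGHT*P1 + (1-WEIGHT)*P2.
# mix_prob is a hypothetical helper; the numbers are made-up examples.
mix_prob() {
    awk -v w="$1" -v p1="$2" -v p2="$3" \
        'BEGIN { printf "%.6f\n", w * p1 + (1 - w) * p2 }'
}

# A word known to both models: P1 = 0.02, P2 = 0.01
mix_prob 0.90 0.02 0.01

# A proper noun known only to the second model: P1 = 0, P2 = 0.001
mix_prob 0.90 0 0.001
```

Under these assumptions, a word that only the second model knows keeps just a tenth of its original probability when lambda = 0.90, so a different lambda may be worth trying on held-out data.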
The mixed model has degraded my system's recognition accuracy a lot.
Any recommendations for handling this problem?
1- Will changing lm_order to 5 for the 1st LM and to 1 for the 2nd LM help?
Best regards,
Yasir