Please help: a simple but confusing problem about the LM in Kaldi


shafeng

Oct 16, 2019, 9:44:07 AM
to kaldi-help

I know that with SRILM we can use -limit-vocab -vocab vocab_file to train the LM, so that the number of unigrams matches the number of words in the vocab.
When I use KenLM, I also used -limit_vocab, just as with SRILM. But I found that the LM trained by KenLM has a different number of words from the vocab. Specifically, the vocab contains 136922 words (I used wc -l vocab to count them), but the header of the LM trained by KenLM shows:

# Input file: /data2/shafeng/code_jucai/corpus_process_and_lm_demo/lmtrain/16319.txt.utf-8.3.final
# Token count: 463745
# Smoothing: Modified Kneser-Ney
\data\
ngram 1=25846

We can see there are only 25846 unigrams in the LM.

How can I keep the number of unigrams consistent with the number of words in the vocab?

And another question:
If I use an LM with 25846 unigrams to build an HCLG.fst and use that HCLG.fst to train and decode, can I use an LM with more unigrams to rescore?
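
To see exactly which vocab words did not make it into the LM, one option is to read the unigram section of the ARPA file and compare it against the vocab list. The following is only a minimal sketch: the function name read_arpa_unigrams and the toy data are made up for illustration, and it assumes a plain-text ARPA file with one whitespace-separated unigram entry per line.

```python
import io


def read_arpa_unigrams(arpa):
    """Return (declared_count, words) from an ARPA-format LM text stream.

    declared_count comes from the "ngram 1=N" header line; words is the set
    of tokens actually listed in the \\1-grams: section.
    """
    declared = None
    words = set()
    in_unigrams = False
    for raw in arpa:
        line = raw.strip()
        if line.startswith("ngram 1="):
            declared = int(line.split("=", 1)[1])
        elif line == "\\1-grams:":
            in_unigrams = True
        elif in_unigrams:
            if line.startswith("\\"):
                break  # reached the next section (\2-grams: or \end\)
            if line:
                # Unigram entry: log10 probability, word, optional backoff.
                words.add(line.split()[1])
    return declared, words


# Toy ARPA content standing in for the real LM file (illustrative only).
toy_arpa = io.StringIO(
    "\\data\\\n"
    "ngram 1=4\n"
    "\n"
    "\\1-grams:\n"
    "-1.0\t<unk>\n"
    "-0.5\t<s>\t-0.3\n"
    "-0.7\t</s>\n"
    "-0.9\thello\t-0.2\n"
    "\n"
    "\\end\\\n"
)
declared, lm_words = read_arpa_unigrams(toy_arpa)

# Toy vocab standing in for the lines of the vocab file.
vocab = ["hello", "world"]
missing = [w for w in vocab if w not in lm_words]
print(declared, len(lm_words), missing)
```

On a real setup, the StringIO object would be replaced by an open file handle on the ARPA file, and vocab by the stripped lines of the vocab file. Note that the unigram count also includes <unk>, <s> and </s>, so it would not equal the wc -l figure exactly even if every vocab word were present.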

Jan Trmal

Oct 16, 2019, 11:04:14 AM
to kaldi-help
KenLM does have the option '--limit_vocab_file' (not just -limit_vocab, but I assume you used the correct one), but I'm not sure whether it also extends the LM with vocab words that are missing from the training text. That's more a question for Ken (https://github.com/kpu/kenlm) than a Kaldi question.
Depending on how the <unk> probability and the unigram discounts are calculated, those words might get the same probability as <unk> anyway, in which case it might not matter whether they are included in the LM.
But as long as you have the words in the lexicon, Kaldi should be able to recognize them.
y.


Daniel Povey

Oct 16, 2019, 1:55:57 PM
to kaldi-help
.. but if the words don't appear in the LM and you make the HCLG from that LM, they won't be recognized.
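
A quick way to check this before building the graph is to list the lexicon words that have no unigram in the LM. This is a rough sketch only: the helper name words_missing_from_lm and the toy entries are hypothetical, and it assumes the usual lexicon layout of one entry per line, with the word first and its phones after it.

```python
def words_missing_from_lm(lexicon_lines, lm_words):
    """Return lexicon words (in first-seen order) that have no LM unigram."""
    missing = []
    seen = set()
    for line in lexicon_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word = fields[0]  # first field is the word, the rest are phones
        if word not in lm_words and word not in seen:
            seen.add(word)
            missing.append(word)
    return missing


# Toy lexicon with two pronunciations for "world" (illustrative phones).
lexicon = [
    "hello HH AH L OW",
    "world W ER L D",
    "world W AO L D",
    "<unk> SPN",
]
# Toy set of words that have a unigram in the LM.
lm_words = {"<s>", "</s>", "<unk>", "hello"}

print(words_missing_from_lm(lexicon, lm_words))
```

Any word this reports is in the lexicon but absent from the LM, so it falls under the case described above.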

shafeng

Oct 16, 2019, 10:06:03 PM
to kaldi-help
Thanks for your explanation!
