I know that when using SRILM, we can pass -limit-vocab -vocab vocab_file when training the LM, so the number of 1-grams will match the number of words in the vocab file.
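For example, an ngram-count invocation roughly like this (file names here are placeholders for my actual paths):

    ngram-count -order 3 -text corpus.txt -vocab vocab_file -limit-vocab -kndiscount -interpolate -lm srilm.arpa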
When I use KenLM, I also pass --limit_vocab_file with the same vocab, just like with SRILM. But I found that the LM trained by KenLM has a different number of words than the vocab. Specifically, the vocab contains 136922 words (I use wc -l vocab to count them), but the number of 1-grams in the LM trained by KenLM is:
# Input file: /data2/shafeng/code_jucai/corpus_process_and_lm_demo/lmtrain/16319.txt.utf-8.3.final
# Token count: 463745
# Smoothing: Modified Kneser-Ney
\data\
ngram 1=25846
We can see there are only 25846 1-grams in the LM.
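For reference, my lmplz command is roughly like this (paths are placeholders, and minor flags may differ from what I actually ran):

    lmplz -o 3 --limit_vocab_file vocab --text 16319.txt.utf-8.3.final --arpa kenlm.arpa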
How can I keep the number of 1-grams consistent with the number of words in the vocab?
And another question:
If I use an LM with 25846 1-grams to build an HCLG.fst, and use that HCLG.fst for training and decoding, can I rescore with an LM that has more 1-grams?