I know that when using SRILM, we can pass -limit-vocab -vocab vocab_file when training the LM, so the number of 1-grams will match the number of words in the vocab file.
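For example, an ngram-count invocation roughly like this (file names here are placeholders for my actual paths):

    ngram-count -order 3 -text corpus.txt -vocab vocab_file -limit-vocab -kndiscount -interpolate -lm srilm.arpa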
When I use KenLM, I also pass --limit_vocab_file with the same vocab, just like with SRILM. But I found that the LM trained by KenLM has a different number of words than the vocab. Specifically, the vocab contains 136922 words (I use wc -l vocab to count them), but the number of 1-grams in the LM trained by KenLM is:
# Input file: /data2/shafeng/code_jucai/corpus_process_and_lm_demo/lmtrain/16319.txt.utf-8.3.final
# Token count: 463745
# Smoothing: Modified Kneser-Ney
\data\
ngram 1=25846
We can see there are only 25846 1-grams in the LM.
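For reference, my lmplz command is roughly like this (paths are placeholders, and minor flags may differ from what I actually ran):

    lmplz -o 3 --limit_vocab_file vocab --text 16319.txt.utf-8.3.final --arpa kenlm.arpa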
How can I keep the number of 1-grams consistent with the number of words in the vocab?
And another question:
If I use an LM with 25846 1-grams to build an HCLG.fst, and use that HCLG.fst for training and decoding, can I rescore with an LM that has more 1-grams?