Hi,
I am using the librispeech example for ASR training and have trained a GMM model up to tri4b.
I wanted to add some more text to the corpus and build a new language model.
I built the language model from the older corpus plus some new corpus text.
The language model and lexicon generation both succeeded.
While building the graph using the librispeech example code from run.sh,
I get this error:
utils/dict_dir_add_pronprobs.sh: number of lines differs from data/local/dict/lexicon.txt 200001 vs data/local/dict/lexiconp.txt 200130
Probably something went wrong (e.g. input prons were generated from a different lexicon
than data/local/dict_nosp, or you used pron_counts.txt when you should have used pron_counts_nowb.txt
or something else. Make sure the prons in data/local/dict_nosp/lexicon.txt exp/tri4b/pron_counts_nowb.txt look
the same
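The comparison that the message asks for could be sketched roughly as below. This is only a sketch and not part of the Kaldi scripts: `missing_prons` is a hypothetical helper name, and it assumes that each line of pron_counts_nowb.txt has the form `<count> <word> <phones...>` and that each line of lexicon.txt has the form `<word> <phones...>`.

```shell
# Print "word pron" pairs that occur in a pron-counts file but not in a lexicon.
# Assumed line formats (an assumption, check against your own files):
#   counts file: <count> <word> <phone> <phone> ...
#   lexicon:     <word> <phone> <phone> ...
missing_prons() {
  counts=$1
  lexicon=$2
  tmp=$(mktemp -d) || return 1
  # Drop the count column and normalise whitespace to single spaces.
  awk 'NF > 1 {$1 = ""; sub(/^ +/, ""); print}' "$counts" | LC_ALL=C sort -u > "$tmp/seen"
  # Normalise whitespace in the lexicon the same way (tabs become spaces).
  awk 'NF > 0 {$1 = $1; print}' "$lexicon" | LC_ALL=C sort -u > "$tmp/lex"
  # Keep lines that are only in the first file: prons with counts but no lexicon entry.
  LC_ALL=C comm -23 "$tmp/seen" "$tmp/lex"
  rm -rf "$tmp"
}
```

It could be called as `missing_prons exp/tri4b/pron_counts_nowb.txt data/local/dict_nosp/lexicon.txt | head`. Any lines it prints are word/pronunciation pairs that have counts but no matching entry in the lexicon.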
I am attaching the code for reference.
I am using g2p from cmusphinx for the lexicon, and I used the same g2p model before and after.
I preserved phones.txt by passing --phone_symbol_table to utils/prepare_lang.sh.
I tried both with and without preserving phones.txt, and I get the same error both times.
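One way to confirm that the phone table really was preserved is to compare the table in the new lang directory with the one stored next to the model. The snippet below is a minimal sketch: `same_phone_table` is a hypothetical helper name, and the paths in the usage note are the ones from the commands in this post.

```shell
# Report whether two phone symbol tables are byte-for-byte identical.
# Prints "identical", or "different" followed by the first few differing lines.
same_phone_table() {
  if cmp -s "$1" "$2"; then
    echo "identical"
  else
    echo "different"
    # Show at most ten lines of the diff for a quick look.
    diff "$1" "$2" | head -n 10
  fi
}
```

For example, `same_phone_table exp/tri4b/phones.txt data/lang_nosp/phones.txt` would compare the model's table with the one written by utils/prepare_lang.sh.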
Code:
local/lm/train_lm.sh $LM_CORPUS_ROOT \
  data/local/lm/norm_old/tmp data/local/lm/norm_old/norm_texts data/local/lm

# Optional G2P training scripts.
# As the LM training scripts above, this script is intended primarily to
# document our G2P model creation process
#local/g2p/train_g2p.sh data/local/dict/cmudict data/local/lm

# when "--stage 3" option is used below we skip the G2P steps, and use the
local/prepare_dict.sh --stage 3 --nj 20 --cmd "$train_cmd" \
  data/local/lm data/local/lm data/local/dict_nosp

utils/prepare_lang.sh --phone_symbol_table exp/tri4b/phones.txt data/local/dict_nosp \
  "<UNK>" data/local/lang_tmp_nosp data/lang_nosp

local/format_lms.sh --src-dir data/lang_nosp data/local/lm

# Create ConstArpaLm format language model for full 3-gram and 4-gram LMs
utils/build_const_arpa_lm.sh data/local/lm/lm_tglarge.arpa.gz \
  data/lang_nosp data/lang_nosp_test_tglarge
utils/build_const_arpa_lm.sh data/local/lm/lm_fglarge.arpa.gz \
  data/lang_nosp data/lang_nosp_test_fglarge

steps/get_prons.sh --cmd "$train_cmd" \
  feature_data/train_clean_100 data/lang_nosp exp/tri4b

utils/dict_dir_add_pronprobs.sh --max-normalize true \
  data/local/dict_nosp \
  exp/tri4b/pron_counts_nowb.txt exp/tri4b/sil_counts_nowb.txt \
  exp/tri4b/pron_bigram_counts_nowb.txt data/local/dict

utils/prepare_lang.sh --phone_symbol_table exp/tri4b/phones.txt data/local/dict \
  "<UNK>" data/local/lang_tmp data/lang

local/format_lms.sh --src-dir data/lang data/local/lm

utils/mkgraph.sh \
  data/lang_test_tgsmall exp/tri4b exp/tri4b/graph_tgsmall_new
Thanks in advance,