Build new graph HCLG.fst from new language model


sandeep cb

Apr 19, 2018, 6:45:56 AM
to kaldi-help
Hi,

I am using the librispeech example for ASR training, and I have trained a GMM model up to tri4b.
I wanted to add some more text to the corpus and build a new language model.
I tried building the language model with the old corpus plus some new corpus text.
I was successful with the language model and lexicon generation.
But while building the graph using the librispeech example code from run.sh,
I am getting this error:

utils/dict_dir_add_pronprobs.sh: number of lines differs from data/local/dict/lexicon.txt 200001 vs data/local/dict/lexiconp.txt 200130
Probably something went wrong (e.g. input prons were generated from a different lexicon
than data/local/dict_nosp, or you used pron_counts.txt when you should have used pron_counts_nowb.txt
or something else.  Make sure the prons in data/local/dict_nosp/lexicon.txt exp/tri4b/pron_counts_nowb.txt look
the same

I am attaching the code for reference.
Here I am using the g2p from cmusphinx for the lexicon, and I used the same g2p model before and after.
I preserved phones.txt with --phone_symbol_table while running utils/prepare_lang.sh.
I tried both with and without preserving phones.txt; I get the same error both times.
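(For anyone hitting the same line-count error: the check the script performs can be reproduced with standard tools. A toy sketch, with made-up file contents rather than the real librispeech files:)

```shell
# Toy demo of the consistency check dict_dir_add_pronprobs.sh complains about:
# lexiconp.txt here has one pron that lexicon.txt lacks.
mkdir -p /tmp/dict_demo && cd /tmp/dict_demo
printf 'hello HH AH L OW\nworld W ER L D\n' > lexicon.txt
printf 'hello 1.0 HH AH L OW\nworld 1.0 W ER L D\nworld 0.5 W AO R L D\n' > lexiconp.txt
# The same line counts the script compares:
wc -l lexicon.txt lexiconp.txt
# Strip the prob column from lexiconp.txt and list prons it has that lexicon.txt lacks:
comm -13 <(sort lexicon.txt) \
         <(awk '{o=$1; for(i=3;i<=NF;i++) o=o" "$i; print o}' lexiconp.txt | sort)
```

Running the `comm` line on the real pair of files shows exactly which entries cause the 200001-vs-200130 mismatch.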

Code :
local/lm/train_lm.sh  $LM_CORPUS_ROOT \
  data/local/lm/norm_old/tmp data/local/lm/norm_old/norm_texts data/local/lm

# Optional G2P training scripts.
# As the LM training scripts above, this script is intended primarily to
# document our G2P model creation process
#local/g2p/train_g2p.sh  data/local/dict/cmudict data/local/lm

# when "--stage 3" option is used below we skip the G2P steps, and use the
# lexicon we have already downloaded from openslr.org/11/

local/prepare_dict.sh --stage 3 --nj 20 --cmd "$train_cmd" \
   data/local/lm data/local/lm data/local/dict_nosp

utils/prepare_lang.sh --phone_symbol_table exp/tri4b/phones.txt data/local/dict_nosp \
  "<UNK>" data/local/lang_tmp_nosp data/lang_nosp

local/format_lms.sh --src-dir data/lang_nosp data/local/lm

# Create ConstArpaLm format language model for full 3-gram and 4-gram LMs
utils/build_const_arpa_lm.sh data/local/lm/lm_tglarge.arpa.gz \
  data/lang_nosp data/lang_nosp_test_tglarge
utils/build_const_arpa_lm.sh data/local/lm/lm_fglarge.arpa.gz \
  data/lang_nosp data/lang_nosp_test_fglarge

steps/get_prons.sh --cmd "$train_cmd" \
  feature_data/train_clean_100 data/lang_nosp exp/tri4b

utils/dict_dir_add_pronprobs.sh --max-normalize true \
  data/local/dict_nosp \
  exp/tri4b/pron_counts_nowb.txt exp/tri4b/sil_counts_nowb.txt \
  exp/tri4b/pron_bigram_counts_nowb.txt data/local/dict

utils/prepare_lang.sh --phone_symbol_table exp/tri4b/phones.txt  data/local/dict \
  "<UNK>" data/local/lang_tmp data/lang

local/format_lms.sh --src-dir data/lang data/local/lm

utils/mkgraph.sh \
   data/lang_test_tgsmall exp/tri4b exp/tri4b/graph_tgsmall_new



Thanks in advance,

Xiaohui Zhang

Apr 20, 2018, 6:35:58 PM
to kaldi-help
You may have expanded the vocab and overwritten the dict-dir with the original vocab without deleting the old lexicons. Can you try putting everything in a new dict-dir? Let me know if this doesn't work.
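(A toy sketch of what "a new dict-dir" means here, with made-up paths; the real dirs would live under data/local/:)

```shell
# Stage the expanded lexicon in a brand-new dict dir instead of overwriting
# the old one, so no stale derived files can linger.
old=/tmp/dict_old; new=/tmp/dict_new
mkdir -p "$old" "$new"
printf 'hello HH AH L OW\n'      > "$old/lexicon.txt"
printf 'hello 1.0 HH AH L OW\n'  > "$old/lexiconp.txt"   # stale derived file
printf 'hello HH AH L OW\nnewword N UW W ER D\n' > "$new/lexicon.txt"
# Note: no lexiconp.txt in $new, so nothing stale can mismatch.
ls "$new"
```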

Xiaohui

sandeep cb

Apr 21, 2018, 6:08:13 AM
to kaldi-help
I backed up the old dict* and lang* directories, deleted them, and then ran the script.
I am pretty sure that the old dict was not there.

Daniel Povey

Apr 21, 2018, 2:10:02 PM
to kaldi-help
lexiconp.txt and lexicon.txt are auto-generated from each other -- when you write to one of them you need to delete the other, if it is present, to avoid a mismatch.
I think the issue is likely in data/local/dict/.
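(In shell terms, Dan's point amounts to this; a toy sketch with a made-up path, where the real dir would be data/local/dict:)

```shell
# lexicon.txt and lexiconp.txt are counterparts: if you rewrite one,
# delete the other so Kaldi regenerates it instead of using a stale copy.
dict=/tmp/dict_fix_demo             # stand-in for data/local/dict
mkdir -p "$dict"
printf 'stale 1.0 S T EY L\n' > "$dict/lexiconp.txt"   # left over from the old vocab
printf 'hello HH AH L OW\n'   > "$dict/lexicon.txt"    # freshly generated lexicon
rm -f "$dict/lexiconp.txt"          # remove the stale counterpart before prepare_lang.sh
ls "$dict"
```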



Xiaohui Zhang

Apr 21, 2018, 5:11:08 PM
to kaldi-help
I think I figured out your problem. The error message says "Make sure the prons in data/local/dict_nosp/lexicon.txt exp/tri4b/pron_counts_nowb.txt look the same", which means you need to make sure the pronunciation list in data/local/dict_nosp/lexicon.txt and the pronunciation stats file (pron_counts_nowb.txt) from tri4b are exactly the same. The likely reason they differ here (you should check) is that the same lexicon must be used both for building the lang dir that generated the alignments in exp/tri4b and for estimating pronunciation probs afterwards -- but apparently you changed the lexicon after generating exp/tri4b. To resolve the mismatch, after getting the new dict/lang_nosp, you should re-align with exp/tri4b and then run get_prons.sh, dict_dir_add_pronprobs.sh, etc. on top of the new alignments. Let me know if this is not clear.
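(Concretely, the re-alignment might look like this; a sketch, not from the thread -- the alignment dir name exp/tri4b_ali_newlang, paths, and job counts are assumptions, and since tri4b is a SAT model, align_fmllr.sh is used. This requires a working Kaldi recipe dir, so it is not runnable standalone:)

```shell
# 1) Re-align with tri4b using the NEW lang dir, so alignments match the new lexicon.
steps/align_fmllr.sh --nj 20 --cmd "$train_cmd" \
  data/train_clean_100 data/lang_nosp exp/tri4b exp/tri4b_ali_newlang

# 2) Collect pronunciation stats from those new alignments.
steps/get_prons.sh --cmd "$train_cmd" \
  data/train_clean_100 data/lang_nosp exp/tri4b_ali_newlang

# 3) Build the pron-prob dict from the matching stats.
utils/dict_dir_add_pronprobs.sh --max-normalize true \
  data/local/dict_nosp \
  exp/tri4b_ali_newlang/pron_counts_nowb.txt exp/tri4b_ali_newlang/sil_counts_nowb.txt \
  exp/tri4b_ali_newlang/pron_bigram_counts_nowb.txt data/local/dict
```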



sandeep cb

Apr 23, 2018, 4:47:21 AM
to kaldi-help
Thanks Dan and Xiaohui.
I was able to run the script after fixing the original lexicon,
which had different phones from the older one.