Doubt in the WSJ recipe.


migue...@gmail.com

Sep 4, 2015, 6:22:55 AM
to kaldi-help
Hello,

The WSJ recipe uses a big dictionary.
It is used in the following code:

# Trying the larger dictionary ("big-dict"/bd) + locally produced LM.
utils/mkgraph.sh data/lang_nosp_test_bd_tgpr \
  exp/tri3b exp/tri3b/graph_nosp_bd_tgpr || exit 1;

I don't understand how the folder data/lang_nosp_test_bd_tgpr is created.

I have a WSJ distribution with subdirectories 'doc', 'si_et_05', etc. directly under the wsj0 or wsj1 directories.

Thanks in advance

Daniel Povey

Sep 4, 2015, 12:53:08 PM
to kaldi-help
I don't see a question in there.  Perhaps you wonder where the big dictionary comes from?  It's derived from the original dictionary by automatically creating rules for things like plurals and other suffixes/prefixes.
Dan


--
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Miguel Matos

Sep 4, 2015, 2:16:03 PM
to kaldi...@googlegroups.com
Sorry for my poor English.
Yes.

How do you get the big dictionary? Or, how can I generate it?

I looked in my database, which is in the format I mentioned, and I can't find the big dictionary.

Thanks


Daniel Povey

Sep 4, 2015, 2:23:01 PM
to kaldi-help
It is generated by the scripts; it's not something that you have to download.
As long as the data preparation scripts succeeded there should not be a problem.  If you get an error running a specific stage of the run.sh, let us know what stage failed and what the error message was.
Dan

Miguel Matos

Sep 7, 2015, 3:25:40 PM
to kaldi...@googlegroups.com
Thank you for your help.

Today (Monday) I ran the latest version of the code again.
My WSJ database is in the following format:

'doc', 'si_et_05', etc.

So I ran just the following commands, and they succeeded:

local/cstr_wsj_data_prep.sh $corpus

if [ -f data/local/dict/lexiconp.txt ];then
    rm data/local/dict/lexiconp.txt
fi

local/wsj_prepare_dict.sh --dict-suffix "_nosp" || exit 1;

utils/prepare_lang.sh data/local/dict_nosp \
  "<SPOKEN_NOISE>" data/local/lang_tmp_nosp data/lang_nosp || exit 1;

local/wsj_format_data.sh --lang-suffix "_nosp" || exit 1;


In the end I didn't get the folder ./data/lang_nosp_test_bd_tgpr/
I obtained the following folders:
wsj/s5/data> ls
dev_dt_05  lang_nosp   lang_nosp_test_bg_5k  lang_nosp_test_tg_5k  lang_nosp_test_tgpr_5k  test_dev93     test_eval92     test_eval93     train_si284
dev_dt_20  lang_nosp_test_bg  lang_nosp_test_tg  lang_nosp_test_tgpr   local  test_dev93_5k  test_eval92_5k  test_eval93_5k


I think there must be some error in the recipe or else some part that I'm not running.

Thanks,
Miguel

Daniel Povey

Sep 7, 2015, 3:27:38 PM
to kaldi-help
You didn't get to the place that creates that yet; see below, where it says
 (
   local/wsj_extend_dict.sh --dict-suffix "_nosp" $wsj1/13-32.1  && \
   utils/prepare_lang.sh data/local/dict_nosp_larger \
     "<SPOKEN_NOISE>" data/local/lang_tmp_nosp_larger data/lang_nosp_bd && \
   local/wsj_train_lms.sh --dict-suffix "_nosp" &&
   local/wsj_format_local_lms.sh --lang-suffix "_nosp" # &&
  ....

That is where it happens.

Dan
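
A minimal sketch of how one might check which lang directories are already usable before calling utils/mkgraph.sh. The function names lang_dir_ready and report_lang_dirs are hypothetical (they are not part of Kaldi), and the check assumes that a usable lang directory contains at least L_disambig.fst and G.fst.

```shell
# Hypothetical helper (not part of Kaldi): returns 0 if a lang directory
# contains both L_disambig.fst and G.fst, and non-zero otherwise.
lang_dir_ready() {
  dir=$1
  [ -f "$dir/L_disambig.fst" ] && [ -f "$dir/G.fst" ]
}

# Prints one line per directory saying whether it looks ready.
report_lang_dirs() {
  for dir in "$@"; do
    if lang_dir_ready "$dir"; then
      echo "ready:   $dir"
    else
      echo "missing: $dir"
    fi
  done
}
```

For example, report_lang_dirs data/lang_nosp_test_tgpr data/lang_nosp_test_bd_tgpr would flag the big-dict directory as missing until the block quoted above has been run.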

Miguel Matos

Sep 9, 2015, 8:00:03 PM
to kaldi...@googlegroups.com
Thanks for your help.

I ran the suggested code and found a bug, which I fixed.

Later in the code another bug appeared, which I still can't solve.

As a reminder, I have a WSJ database in the format:

'doc', 'si_et_05', etc.

When running
local/wsj_extend_dict.sh --dict-suffix "_nosp" $wsj1/13-32.1

The following errors occur:
gzip: /13-32.1/wsj1/doc/lng_modl/lm_train/np_data/87/*.z: No such file or directory
gzip: /13-32.1/wsj1/doc/lng_modl/lm_train/np_data/88/*.z: No such file or directory
gzip: /13-32.1/wsj1/doc/lng_modl/lm_train/np_data/89/*.z: No such file or directory

These files are in the folder:
wsj1/doc/lng_modl/lm_train/np_data/8?/*.z

However, when I correct the path, another error appears:
local/wsj_extend_dict.sh --dict-suffix "_nosp" $wsj1

The error message is:
Expecting the argument to this script to end in 13-32.1

It's easy to work around this bug, but it should be resolved in the official recipe.
Just change the check conditions inside the script local/wsj_extend_dict.sh.

The entire script ran without error until it reached the next line in run.sh:
utils/mkgraph.sh data/lang_nosp_test_bd_tgpr \
  exp/tri3b exp/tri3b/graph_nosp_bd_tgpr


The resulting error was as follows:

fsttablecompose data/lang_nosp_test_bd_tgpr/L_disambig.fst data/lang_nosp_test_bd_tgpr/G.fst
fstminimizeencoded
fstdeterminizestar --use-log=true
fstisstochastic data/lang_nosp_test_bd_tgpr/tmp/LG.fst
0.000488639 -1.4022
[info]: LG not stochastic.
fstcomposecontext --context-size=3 --central-position=1 --read-disambig-syms=data/lang_nosp_test_bd_tgpr/phones/disambig.int --write-disambig-syms=data/lang_nosp_test_bd_tgpr/tmp/disambig_ilabels_3_1.int data/lang_nosp_test_bd_tgpr/tmp/ilabels_3_1
fstisstochastic data/lang_nosp_test_bd_tgpr/tmp/CLG_3_1.fst
0.000488639 -1.4022
[info]: CLG not stochastic.
make-h-transducer --disambig-syms-out=exp/tri3b/graph_nosp_bd_tgpr/disambig_tid.int --transition-scale=1.0 data/lang_nosp_test_bd_tgpr/tmp/ilabels_3_1 exp/tri3b/tree exp/tri3b/final.mdl
ERROR (make-h-transducer:TopologyForPhone():hmm-topology.cc:279) TopologyForPhone(), phone 88 not covered.
ERROR (make-h-transducer:TopologyForPhone():hmm-topology.cc:279) TopologyForPhone(), phone 88 not covered.

[stack trace: ]
kaldi::KaldiGetStackTrace()
kaldi::KaldiErrorMessage::~KaldiErrorMessage()
kaldi::HmmTopology::TopologyForPhone(int) const
kaldi::GetHmmAsFst(std::vector<int, std::allocator<int> >, kaldi::ContextDependencyInterface const&, kaldi::TransitionModel const&, kaldi::HTransducerConfig const&, std::tr1::unordered_map<std::pair<int, std::vector<int, std::allocator<int> > >, fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >*, kaldi::HmmCacheHash, std::equal_to<std::pair<int, std::vector<int, std::allocator<int> > > >, std::allocator<std::pair<std::pair<int, std::vector<int, std::allocator<int> > > const, fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >*> > >*)
kaldi::GetHTransducer(std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const&, kaldi::ContextDependencyInterface const&, kaldi::TransitionModel const&, kaldi::HTransducerConfig const&, std::vector<int, std::allocator<int> >*)
make-h-transducer(main+0x383) [0x59bb60]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f3e441c58c5]
make-h-transducer(_start+0x29) [0x59b719]

I don't know how to solve this error.
Do you have any suggestions?


Thanks,
Miguel
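
A minimal sketch for narrowing down the "phone 88 not covered" error. The message appears to mean that make-h-transducer met a phone id for which the model in exp/tri3b has no topology entry. One possible cause, which is an assumption and not confirmed in this thread, is that the lang directory used for the graph was built with a different phone set than the one used for training. The helper same_phone_set is hypothetical (not part of Kaldi) and simply compares the phones.txt symbol tables of two lang directories.

```shell
# Hypothetical helper (not part of Kaldi): succeeds only if both lang
# directories have a phones.txt and the two files are byte-identical.
same_phone_set() {
  a=$1/phones.txt
  b=$2/phones.txt
  [ -f "$a" ] && [ -f "$b" ] && cmp -s "$a" "$b"
}
```

For example: same_phone_set data/lang_nosp data/lang_nosp_test_bd_tgpr || echo "phone sets differ". If the files differ, the big-dict lang directory was probably prepared with different options than the lang directory used to train the model.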


Daniel Povey

Sep 9, 2015, 8:08:50 PM
to kaldi-help
These are not bugs in the script.  The issue is that your data is not in the standard format that LDC releases WSJ with - either it is an older format, or most likely someone has helpfully tried to reorganize the data into a better format.  The script should be called with arguments something like the following:

local/wsj_data_prep.sh \
  /export/corpora5/LDC/LDC93S6B/11-1.1 /export/corpora5/LDC/LDC93S6B/11-2.1 \
  /export/corpora5/LDC/LDC93S6B/11-3.1 /export/corpora5/LDC/LDC93S6B/11-4.1 \
  /export/corpora5/LDC/LDC93S6B/11-5.1 /export/corpora5/LDC/LDC93S6B/11-6.1 \
  /export/corpora5/LDC/LDC93S6B/11-13.1 /export/corpora5/LDC/LDC93S6B/11-14.1 \
  /export/corpora5/LDC/LDC93S6B/11-15.1 /export/corpora5/LDC/LDC94S13B/13-1.1 \
  /export/corpora5/LDC/LDC94S13B/13-2.1 /export/corpora5/LDC/LDC94S13B/13-3.1 \
  /export/corpora5/LDC/LDC94S13B/13-4.1 /export/corpora5/LDC/LDC94S13B/13-5.1 \
  /export/corpora5/LDC/LDC94S13B/13-6.1 /export/corpora5/LDC/LDC94S13B/13-7.1 \
  /export/corpora5/LDC/LDC94S13B/13-11.1 /export/corpora5/LDC/LDC94S13B/13-13.1 \
  /export/corpora5/LDC/LDC94S13B/13-14.1 /export/corpora5/LDC/LDC94S13B/13-15.1 \
  /export/corpora5/LDC/LDC94S13B/13-16.1 /export/corpora5/LDC/LDC94S13B/13-18.1 \
  /export/corpora5/LDC/LDC94S13B/13-19.1 /export/corpora5/LDC/LDC94S13B/13-20.1 \
  /export/corpora5/LDC/LDC94S13B/13-21.1 /export/corpora5/LDC/LDC94S13B/13-32.1 \
  /export/corpora5/LDC/LDC94S13B/13-33.1 /export/corpora5/LDC/LDC94S13B/13-34.1

See if you can dig up the original disks.  Otherwise you could try to figure out how to change the script, but I doubt you would be able to do that.
Dan



Daniel Povey

Sep 9, 2015, 8:20:57 PM
to kaldi-help
Oh, I just noticed that there is an alternative data-preparation script for your (older) WSJ distribution format, and that you are using it.
The issue seems to be that the wsj_extend_dict.sh script is not able to deal with the older-format WSJ distribution.
I don't think it is worth anyone's time to fix this, as few people have the older-format WSJ corpus.  I would advise you to just skip the parts of the scripts that require the big dictionary.

Dan
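
A minimal sketch of a sanity check for the gzip errors reported earlier in this thread. Those paths start at /13-32.1, which suggests that $wsj1 expanded to an empty string in that shell. The helper check_np_data is hypothetical (not part of Kaldi); it assumes the language-model text lives under wsj1/doc/lng_modl/lm_train/np_data/{87,88,89}, as described in the thread, and verifies that a corpus root is non-empty and contains .z files in each of those directories.

```shell
# Hypothetical helper (not part of Kaldi): checks that the corpus root is
# set and that np_data/87, 88 and 89 each contain at least one .z file.
check_np_data() {
  root=$1
  if [ -z "$root" ]; then
    echo "corpus root is empty (is the variable set?)" >&2
    return 1
  fi
  for year in 87 88 89; do
    dir="$root/wsj1/doc/lng_modl/lm_train/np_data/$year"
    # ls fails when the glob matches nothing, so this detects empty
    # or missing directories.
    if ! ls "$dir"/*.z >/dev/null 2>&1; then
      echo "no .z files in $dir" >&2
      return 1
    fi
  done
  return 0
}
```

Calling check_np_data "$corpus" before running the dictionary-extension step would report an unset variable or a missing directory up front, rather than through gzip errors.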



Miguel Matos

Sep 10, 2015, 8:52:08 AM
to kaldi...@googlegroups.com
Yes, I'm using the script local/cstr_wsj_data_prep.sh, which deals with the old format of WSJ.

I ran the whole recipe without the big dictionary. To do that, I created a link from the lang_nosp_test_bg folder to the folder lang_nosp_test_bd_tgpr.
Is there another lang folder that would be better than lang_nosp_test_bg?

I got the following results:

# sMBR training (1+4 iterations, lattices+alignment updated after 1st iteration)
%WER 9.56 exp/dnn5b_pretrain-dbn_dnn_smbr_i1lats/decode_nosp_bd_tgpr_dev93_iter4/wer_13_0.5
%WER 6.57 exp/dnn5b_pretrain-dbn_dnn_smbr_i1lats/decode_nosp_bd_tgpr_eval92_iter4/wer_12_1.0

instead of the following values that are in RESULTS file:
# sMBR training (1+4 iterations, lattices+alignment updated after 1st iteration)
%WER 6.15 exp/dnn5b_pretrain-dbn_dnn_smbr_i1lats/decode_bd_tgpr_dev93_iter4/wer_11
%WER 3.56 exp/dnn5b_pretrain-dbn_dnn_smbr_i1lats/decode_bd_tgpr_eval92_iter4/wer_13
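
A minimal sketch that quantifies the gap between the two sets of numbers above. The helper rel_wer_increase is hypothetical; it uses awk to compute the relative WER increase in percent. The two runs differ in configuration, so this only measures the size of the gap and says nothing about its cause.

```shell
# Relative WER increase of the obtained value ($1) over the reference
# value ($2), in percent, printed with one decimal place.
rel_wer_increase() {
  awk -v got="$1" -v ref="$2" 'BEGIN { printf "%.1f\n", (got - ref) / ref * 100 }'
}

rel_wer_increase 9.56 6.15   # dev93:  obtained vs. RESULTS file
rel_wer_increase 6.57 3.56   # eval92: obtained vs. RESULTS file
```

This works out to roughly 55% (dev93) and 85% (eval92) relative.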


I analysed the script local/wsj_extend_dict.sh.
This script only needs the files $srcdir/wsj1/doc/lng_modl/lm_train/np_data/{87,88,89}/*.z,
which are also present in the old format of WSJ, as the following files:

np_data/87/w7_001.z
...
np_data/87/w7_126.z

np_data/88/w8_001.z
...
np_data/88/w8_107.z

np_data/89/w9_01.z
...
np_data/89/w9_41.z


I ran the scripts:
 
> local/wsj_extend_dict.sh --dict-suffix "_nosp" $corpus
  (Output in output_commands.txt)

> utils/prepare_lang.sh data/local/dict_nosp_larger "<SPOKEN_NOISE>" data/local/lang_tmp_nosp_larger data/lang_nosp_bd
  (Output in output_commands.txt)

> local/wsj_train_lms.sh --dict-suffix "_nosp"
  (Output in wsj_train_lms.txt)

> local/wsj_format_local_lms.sh --lang-suffix "_nosp"
  (Output in output_commands.txt)

and the output seems ok to me.

I put the output of each script in the attached files.
If you have time, could you check the files and tell me what you think?

I talked to my colleagues to see if we can get the WSJ in the new format, but I have no guarantees.

Thanks for all the help,
Miguel


output_commands.txt
wsj_train_lms.txt

Daniel Povey

Sep 10, 2015, 4:00:35 PM
to kaldi-help
Your output seems the same as what we got when we ran it locally.

On Thu, Sep 10, 2015 at 8:52 AM, Miguel Matos <migue...@gmail.com> wrote:
> I created a link from the lang_nosp_test_bg folder to the folder lang_nosp_test_bd_tgpr.

bg means bigram.  The bigram LM would be substantially worse than 'tgpr', which is a pruned trigram.
Various things can go wrong with sMBR training, so without knowing whether the results with the regular dictionary were in the right range, it's hard to know what went wrong.  
I don't have time to analyze this in depth and figure out what the issue was.
Dan
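
A minimal sketch of choosing a lang directory by preference instead of linking to the bigram one. The helper first_existing_dir is hypothetical; it prints the first directory in its argument list that exists. The preference order in the usage note is an assumption based on the remark above that the bigram LM is substantially worse than 'tgpr'.

```shell
# Hypothetical helper: prints the first existing directory among its
# arguments and returns 0, or returns 1 if none of them exists.
first_existing_dir() {
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      echo "$dir"
      return 0
    fi
  done
  return 1
}
```

For example, first_existing_dir data/lang_nosp_test_bd_tgpr data/lang_nosp_test_tgpr data/lang_nosp_test_bg would pick the small-dictionary pruned trigram when the big-dict directory is absent, and fall back to the bigram only as a last resort.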

hariv...@gmail.com

Apr 9, 2016, 4:56:33 AM
to kaldi-help, migue...@gmail.com
Hi Miguel, I am also using the same data set that you mentioned, and I am getting the same error (phone 88 not covered). Can you tell me the procedure to resolve it?

Daniel Povey

Apr 9, 2016, 1:49:47 PM
to kaldi-help, migue...@gmail.com
I don't think there was any progress since my last email; those "big-dict" scripts don't work with these "old-style" WSJ distributions.  Someone would have to fix the script, and we don't have time.  Just comment out the parts of run.sh that use the "big dictionary" (i.e. have "bd" in the command line).
Dan
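
A minimal sketch for locating the big-dictionary parts of run.sh before commenting them out. The helper list_bd_lines is hypothetical, and the pattern "_bd" is an assumption based on the directory names seen in this thread (lang_nosp_bd, lang_nosp_test_bd_tgpr, graph_nosp_bd_tgpr, decode_nosp_bd_tgpr_...).

```shell
# Hypothetical helper: prints the numbered lines of a script that mention
# "_bd". Returns non-zero if no line matches (standard grep behaviour).
list_bd_lines() {
  grep -n '_bd' "$1"
}
```

Running list_bd_lines run.sh only shows the lines that match. A command split over several lines with backslashes has to be commented out as a whole, including continuation lines that do not contain "_bd" themselves.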

Daniel Povey

Apr 13, 2016, 2:38:01 PM
to kaldi-help, Miguel Matos

Update: I see in the script that there *is* a script to prepare the big-dict from the older format of WSJ:

 # NOTE: If you have a setup corresponding to the older cstr_wsj_data_prep.sh style,
 # use local/cstr_wsj_extend_dict.sh --dict-suffix "_nosp" $corpus/wsj1/doc/ instead.



Miguel Matos

Apr 13, 2016, 3:07:18 PM
to dpo...@gmail.com, hariv...@gmail.com, kaldi-help
Hello,

I solved the problem as follows:

In the run.sh:

local/cstr_wsj_extend_dict.sh --dict-suffix "_nosp" $corpus/wsj1/doc/

local/wsj_extend_dict.sh --dict-suffix "_nosp" $corpus

utils/prepare_lang.sh --position-dependent-phones ${position_dependent_phones} data/local/dict_nosp_larger \
   "<SPOKEN_NOISE>" data/local/lang_tmp_nosp_larger data/lang_nosp_bd

local/wsj_train_lms.sh --dict-suffix "_nosp"
local/wsj_format_local_lms.sh --lang-suffix "_nosp"



In the script local/wsj_extend_dict.sh, comment out or delete these lines:

#if [ "`basename $1`" != 13-32.1 ]; then
#  echo "Expecting the argument to this script to end in 13-32.1"
#  exit 1
#fi


If you have problems, delete the "data" and "exp" folders and run the whole process again with the changes in the code.

Best Regards,
Miguel


Daniel Povey

Apr 13, 2016, 3:12:44 PM
to Miguel Matos, Harikrishna Vydana, kaldi-help
local/cstr_wsj_extend_dict.sh is supposed to be a replacement for local/wsj_extend_dict.sh for when you have the old-style data, so running the latter is not needed.
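
A minimal sketch of choosing between the two dictionary-extension scripts based on the corpus layout. The helper extend_dict_command is hypothetical; it only prints the command to run. It assumes that the old-style layout has wsj1/doc directly under the corpus root and that the LDC layout has a 13-32.1 disk directory under the root passed in.

```shell
# Hypothetical helper: prints the dictionary-extension command that matches
# the corpus layout found under the directory given as $1.
extend_dict_command() {
  corpus=$1
  if [ -d "$corpus/wsj1/doc" ]; then
    # Older (CSTR-style) layout: 'doc' sits directly under wsj1.
    echo "local/cstr_wsj_extend_dict.sh --dict-suffix \"_nosp\" $corpus/wsj1/doc/"
  elif [ -d "$corpus/13-32.1" ]; then
    # LDC disk layout: wsj_extend_dict.sh expects the 13-32.1 directory.
    echo "local/wsj_extend_dict.sh --dict-suffix \"_nosp\" $corpus/13-32.1"
  else
    echo "unrecognized WSJ layout under $corpus" >&2
    return 1
  fi
}
```

With an old-style corpus this prints only the cstr_wsj_extend_dict.sh command, in line with the point above that the two scripts are alternatives and not consecutive steps.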

Miguel Matos

Apr 13, 2016, 3:17:18 PM
to Daniel Povey, Harikrishna Vydana, kaldi-help

Thanks for the information

Best regards,
Miguel