Mono decoding failing for custom corpus


saurabh vyas

Sep 22, 2019, 2:55:18 AM
to kaldi...@googlegroups.com
I trained a model on my own corpus with Kaldi (around 5-10 hours of Hindi data; I transliterated the Hindi script into Roman letters and used the online CMU tool to generate the lexicon file). After carefully following the documentation, I created the files for the "data" part and the "lang" part. Each part succeeded, I had all the files (including the language model files), and mono training completed successfully. However, when decoding the test data, I get the following error:

steps/decode.sh --nj 1 --cmd "$decode_cmd" exp/mono/graph data/test exp/mono/inference
steps/decode.sh --nj 1 --cmd utils/run.pl exp/mono/graph data/test exp/mono/inference
decode.sh: feature type is delta
steps/diagnostic/analyze_lats.sh --cmd utils/run.pl exp/mono/graph exp/mono/inference
run.pl: job failed, log is in exp/mono/inference/log/analyze_alignments.log
local/score.sh --cmd utils/run.pl data/test exp/mono/graph exp/mono/inference
local/score.sh: scoring with word insertion penalty=0.0,0.5,1.0

cat exp/mono/inference/log/analyze_alignments.log

# gunzip -c exp/mono/inference/phone_stats.*.gz | steps/diagnostic/analyze_phone_length_stats.py exp/mono/graph
# Started at Sun Sep 22 15:43:51 JST 2019
#
Traceback (most recent call last):
  File "steps/diagnostic/analyze_phone_length_stats.py", line 170, in <module>
    assert num_utterances > 0
AssertionError
# Accounting: time=0 threads=1
# Ended (code 1) at Sun Sep 22 15:43:51 JST 2019, elapsed time 0 seconds

Some stats from mono training:

gmm-info exp/mono/final.mdl
number of phones 166
number of pdfs 127
number of transition-ids 1116
number of transition-states 518
feature dimension 39
number of gaussians 999

Do you have any idea what could have gone wrong?

saurabh vyas

Sep 22, 2019, 3:56:54 AM
to kaldi...@googlegroups.com
I have now gone through the decoded files, and the WER for all of them is 100%:

%WER 100.00 [ 58 / 58, 0 ins, 58 del, 0 sub ]
%SER 100.00 [ 4 / 4 ]
Scored 4 sentences, 0 not present in hyp.

I feel something is very wrong.
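
The score line above can be read arithmetically. The following is a minimal sketch, assuming the `%WER` line has exactly the layout shown in this post; the function names are illustrative and not part of Kaldi. It recomputes WER as (ins + del + sub) / reference words and derives how many words the hypotheses contained.

```python
import re


def parse_wer_line(line):
    """Parse a '%WER ... [ E / N, I ins, D del, S sub ]' summary line."""
    pattern = (r"%WER\s+([\d.]+)\s+\[\s*(\d+)\s*/\s*(\d+),\s*"
               r"(\d+)\s+ins,\s*(\d+)\s+del,\s*(\d+)\s+sub\s*\]")
    m = re.search(pattern, line)
    if m is None:
        raise ValueError("not a %WER line: " + line)
    return {
        "wer": float(m.group(1)),
        "errors": int(m.group(2)),
        "ref_words": int(m.group(3)),
        "ins": int(m.group(4)),
        "del": int(m.group(5)),
        "sub": int(m.group(6)),
    }


def recomputed_wer(stats):
    """WER in percent: (insertions + deletions + substitutions) / reference words."""
    total = stats["ins"] + stats["del"] + stats["sub"]
    return 100.0 * total / stats["ref_words"]


def hypothesis_word_count(stats):
    """Words in the hypotheses: reference words minus deletions plus insertions."""
    return stats["ref_words"] - stats["del"] + stats["ins"]


stats = parse_wer_line("%WER 100.00 [ 58 / 58, 0 ins, 58 del, 0 sub ]")
print("WER:", recomputed_wer(stats))
print("hypothesis words:", hypothesis_word_count(stats))
```

With 58 deletions, no insertions, and no substitutions against 58 reference words, the hypothesis word count comes out as zero: the decoder produced no words for any of the 4 test sentences, rather than producing wrong words.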

saurabh vyas

Sep 22, 2019, 4:04:16 AM
to kaldi...@googlegroups.com
I think I found the error. Searching the forums for the 100 percent WER problem, I found this: https://mail.google.com/mail/u/0/#search/kaldi+help+100++wer/FMfcgxwChmKfSNSSkWHTrNkQfksFQPlb

Since I used the online CMU dict tool to generate the lexicon, every entry in my lexicon.txt is in uppercase, but the text file (in data/train), G.fst, and corpus.txt (the language model files) are all in lowercase.
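
A case mismatch of this kind can be checked directly. The sketch below assumes the usual layouts: lexicon.txt lines are `word phone1 phone2 ...` and the Kaldi `text` file lines are `utt-id word1 word2 ...`. The function names and the sample words are hypothetical, not taken from the poster's corpus. It lists transcript words missing from the lexicon and flags those that are missing only because of letter case.

```python
def lexicon_words(lexicon_lines):
    """Return the set of words (first field) from lexicon.txt-style lines."""
    words = set()
    for line in lexicon_lines:
        fields = line.split()
        if fields:
            words.add(fields[0])
    return words


def transcript_words(text_lines):
    """Return the set of words from Kaldi 'text'-style lines (utt-id first)."""
    words = set()
    for line in text_lines:
        fields = line.split()
        words.update(fields[1:])  # skip the utterance id
    return words


def case_report(lexicon_lines, text_lines):
    """Return (out-of-vocabulary words, those that differ only by case)."""
    lex = lexicon_words(lexicon_lines)
    lex_folded = {w.lower() for w in lex}
    oov = sorted(w for w in transcript_words(text_lines) if w not in lex)
    case_only = [w for w in oov if w.lower() in lex_folded]
    return oov, case_only


# Hypothetical sample data for illustration only.
lexicon = ["NAMASTE N AH M AH S T EY", "DUNIYA D UH N IY Y AH"]
text = ["utt1 namaste duniya", "utt2 namaste"]
oov, case_only = case_report(lexicon, text)
print("OOV words:", oov)
print("OOV only because of case:", case_only)
```

Both functions accept any iterable of lines, so open file handles for lexicon.txt and the `text` file in data/train can be passed in place of the sample lists. If every transcript word shows up in the second list, the vocabulary mismatch is purely a matter of case.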

I will try converting one side to match the other's case and see whether that solves the problem.
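
One possible way to do the conversion on the lexicon side is sketched here. It assumes lexicon.txt lines have the form `word phone1 phone2 ...`, lowercases only the word column, and leaves the phone symbols untouched. The function name is illustrative, and skipping entries that start with `<` or `!` (such as `<UNK>` or `!SIL`) is an assumption about special symbols, not something stated in this thread.

```python
def lowercase_lexicon(lexicon_lines):
    """Lowercase the word column of lexicon.txt-style lines; keep phones as-is."""
    seen = set()
    out = []
    for line in lexicon_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word, phones = fields[0], fields[1:]
        # Assumption: entries such as <UNK> or !SIL are special symbols,
        # so they are left unchanged.
        if not word.startswith(("<", "!")):
            word = word.lower()
        entry = " ".join([word] + phones)
        if entry not in seen:  # lowercasing may create duplicate entries
            seen.add(entry)
            out.append(entry)
    return out


# Hypothetical sample entries for illustration only.
sample = ["NAMASTE N AH M AH S T EY", "<UNK> SPN", "Namaste N AH M AH S T EY"]
for entry in lowercase_lexicon(sample):
    print(entry)
```

Because the lexicon feeds the "lang" part, the lang directory and the graph need to be regenerated after a change like this, and training rerun, before decoding again.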

Daniel Povey

Sep 22, 2019, 4:53:38 AM
to kaldi-help
OK. The failure in the analyze_alignments log may come from a bug that was fixed a month or two ago.

saurabh vyas

Sep 22, 2019, 5:25:49 AM
to kaldi...@googlegroups.com
Thank you for your response, Dr. Povey. It seems the uppercase lexicon was the issue: after converting lexicon.txt to lowercase and rerunning the "lang" part and then mono training, I got around 75% WER, which sounds about right for a mono model with the amount of data I have. I will also update Kaldi to the latest commit.


