I trained a monophone model with Kaldi on my own corpus (around 5-10 hours of Hindi data; I transliterated the Hindi script into a Roman/English version and used the online CMU tool to generate the lexicon file). Carefully following the documentation, I prepared the "data" part and the "lang" part; each step succeeded and all the files (including the language model files) were in place, and the monophone training completed successfully. However, when decoding with the test data, I get the following error:
steps/decode.sh --nj 1 --cmd "$decode_cmd" exp/mono/graph data/test exp/mono/inference
steps/decode.sh --nj 1 --cmd utils/run.pl exp/mono/graph data/test exp/mono/inference
decode.sh: feature type is delta
steps/diagnostic/analyze_lats.sh --cmd utils/run.pl exp/mono/graph exp/mono/inference
run.pl: job failed, log is in exp/mono/inference/log/analyze_alignments.log
local/score.sh --cmd utils/run.pl data/test exp/mono/graph exp/mono/inference
local/score.sh: scoring with word insertion penalty=0.0,0.5,1.0
cat exp/mono/inference/log/analyze_alignments.log
# gunzip -c exp/mono/inference/phone_stats.*.gz | steps/diagnostic/analyze_phone_length_stats.py exp/mono/graph
# Started at Sun Sep 22 15:43:51 JST 2019
#
Traceback (most recent call last):
File "steps/diagnostic/analyze_phone_length_stats.py", line 170, in <module>
assert num_utterances > 0
AssertionError
# Accounting: time=0 threads=1
# Ended (code 1) at Sun Sep 22 15:43:51 JST 2019, elapsed time 0 seconds
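In case it narrows things down: analyze_phone_length_stats.py asserts num_utterances > 0 after reading the gzipped phone stats, so the error apparently means the decode produced no per-utterance stats at all (empty or missing phone_stats.*.gz). A rough, self-contained reconstruction of that counting logic (my simplification for illustration, not the real script):

```python
import glob
import gzip


def count_stats_lines(stats_paths):
    """Count non-empty lines across gzipped phone-stats files.

    Simplified stand-in for the counting done in
    steps/diagnostic/analyze_phone_length_stats.py: if every
    phone_stats.*.gz is empty or missing, the count stays at zero
    and the script's `assert num_utterances > 0` fails.
    """
    count = 0
    for path in stats_paths:
        with gzip.open(path, "rt") as f:
            for line in f:
                if line.strip():
                    count += 1
    return count


# With no stats files at all, the count is zero -- the same condition
# that trips the AssertionError in the log above.
print(count_stats_lines(glob.glob("exp/mono/inference/phone_stats.*.gz")))
```

So the assertion itself is just a symptom; the real question is why decoding left those stats files empty.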
Some stats from the mono training:
gmm-info exp/mono/final.mdl
number of phones 166
number of pdfs 127
number of transition-ids 1116
number of transition-states 518
feature dimension 39
number of gaussians 999
Do you have any idea what could have gone wrong?