Kaldi-help

Patrice Yemmene

Jun 23, 2017, 1:18:58 AM

to kaldi-help, Daniel Povey, Chiang, Chi-Lung

Dear Dan,

Below are the results I am getting. Much better now, but I still need some guidance:

The test and train folders in kaldi-trunk/egs/digits/data contain the same data in their respective utt2spk files. However, after I run the script, the utt2spk file in the train folder is emptied. This is why the "Invalid line in utt2spk file:" error message is thrown. I am curious how to fix this.

Also, for the feature extraction, mono training, and MAKING G.fst steps, I am curious what I may have done wrong.

Any help would be appreciated.

Thank you

===== PREPARING ACOUSTIC DATA =====

Invalid line in utt2spk file:

===== FEATURES EXTRACTION =====

steps/make_mfcc.sh --nj 1 --cmd run.pl data/train exp/make_mfcc/train mfcc

utils/validate_data_dir.sh: empty file spk2utt

steps/make_mfcc.sh --nj 1 --cmd run.pl data/test exp/make_mfcc/test mfcc

utils/validate_data_dir.sh: empty file spk2utt

steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train mfcc

make_cmvn.sh: no such file data/train/feats.scp

steps/compute_cmvn_stats.sh data/test exp/make_mfcc/test mfcc

make_cmvn.sh: no such file data/test/feats.scp

===== PREPARING LANGUAGE DATA =====

utils/prepare_lang.sh data/local/dict <UNK> data/local/lang data/lang

Checking data/local/dict/silence_phones.txt ...

--> reading data/local/dict/silence_phones.txt

--> data/local/dict/silence_phones.txt is OK

Checking data/local/dict/optional_silence.txt ...

--> reading data/local/dict/optional_silence.txt

--> data/local/dict/optional_silence.txt is OK

Checking data/local/dict/nonsilence_phones.txt ...

--> reading data/local/dict/nonsilence_phones.txt

--> ERROR: empty line in data/local/dict/nonsilence_phones.txt (line 18)

--> ERROR: empty line in data/local/dict/nonsilence_phones.txt (line 19)

--> ERROR: empty line in data/local/dict/nonsilence_phones.txt (line 20)

Checking disjoint: silence_phones.txt, nonsilence_phones.txt

--> disjoint property is OK.

Checking data/local/dict/lexicon.txt

--> reading data/local/dict/lexicon.txt

--> ERROR: data/local/dict/lexicon.txt contains Carriage Return (^M) characters.

Checking data/local/dict/extra_questions.txt ...

--> data/local/dict/extra_questions.txt is empty (this is OK)

--> ERROR validating dictionary directory data/local/dict (see detailed error messages above)

*Error validating directory data/local/dict*

===== LANGUAGE MODEL CREATION =====

===== MAKING lm.arpa =====

===== MAKING G.fst =====

./run.sh: line 93: arpa2fst: command not found

===== MONO TRAINING =====

steps/train_mono.sh --nj 1 --cmd run.pl data/train data/lang exp/mono

cat: data/lang/oov.int: No such file or directory
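
Two of the dictionary errors in the log above (empty lines in nonsilence_phones.txt and carriage returns in lexicon.txt) are file-format problems that can usually be cleaned up mechanically. The following is only a minimal sketch, assuming the files need nothing more than CR characters and blank lines removed; clean_dict_file is a hypothetical helper name, not a Kaldi script.

```bash
# clean_dict_file FILE
# Hypothetical helper (not part of Kaldi): removes carriage returns and
# blank or whitespace-only lines from FILE, rewriting it in place.
clean_dict_file() {
  local f="$1"
  local tmp
  tmp=$(mktemp) || return 1
  # tr deletes every CR character; sed then drops empty/whitespace-only lines.
  tr -d '\r' < "$f" | sed '/^[[:space:]]*$/d' > "$tmp"
  # Write back through redirection so the original file keeps its permissions.
  cat "$tmp" > "$f"
  rm -f "$tmp"
}

# Possible usage from the recipe directory (paths taken from the log):
#   clean_dict_file data/local/dict/lexicon.txt
#   clean_dict_file data/local/dict/nonsilence_phones.txt
```

Re-running utils/prepare_lang.sh afterwards would show whether these two errors were the only ones left in data/local/dict.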

Below is the script I am running:

#!/bin/bash

. ./path.sh || exit 1

. ./cmd.sh || exit 1

nj=1 # number of parallel jobs - 1 is perfect for such a small data set

lm_order=1 # language model order (n-gram quantity) - 1 is enough for digits grammar

# Safety mechanism (possible running this script with modified arguments)

. utils/parse_options.sh || exit 1

[[ $# -ge 1 ]] && { echo "Wrong arguments!"; exit 1; }

# Removing previously created data (from last run.sh execution)

rm -rf exp mfcc data/train/spk2utt data/train/cmvn.scp data/train/feats.scp data/train/split1 data/test/spk2utt data/test/cmvn.scp data/test/feats.scp data/test/split1 data/local/lang data/lang data/local/tmp data/local/dict/lexiconp.txt

echo

echo "===== PREPARING ACOUSTIC DATA ====="

echo

# Needs to be prepared by hand (or using self written scripts):

#

# spk2gender [<speaker-id> <gender>]

# wav.scp [<utteranceID> <full_path_to_audio_file>]

# text [<utteranceID> <text_transcription>]

# utt2spk [<utteranceID> <speakerID>]

# corpus.txt [<text_transcription>]

# Making spk2utt files

utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt

utils/utt2spk_to_spk2utt.pl data/test/utt2spk > data/test/spk2utt

echo

echo "===== FEATURES EXTRACTION ====="

echo

# Making feats.scp files

mfccdir=mfcc

# Uncomment and modify arguments in scripts below if you have any problems with data sorting

# utils/validate_data_dir.sh data/train # script for checking prepared data - here: for data/train directory

# utils/fix_data_dir.sh data/train # tool for data proper sorting if needed - here: for data/train directory

steps/make_mfcc.sh --nj $nj --cmd "$train_cmd" data/train exp/make_mfcc/train $mfccdir

steps/make_mfcc.sh --nj $nj --cmd "$train_cmd" data/test exp/make_mfcc/test $mfccdir

# Making cmvn.scp files

steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train $mfccdir

steps/compute_cmvn_stats.sh data/test exp/make_mfcc/test $mfccdir

echo

echo "===== PREPARING LANGUAGE DATA ====="

echo

# Needs to be prepared by hand (or using self written scripts):

#

# lexicon.txt [<word> <phone 1> <phone 2> ...]

# nonsilence_phones.txt [<phone>]

# silence_phones.txt [<phone>]

# optional_silence.txt [<phone>]

# Preparing language data

utils/prepare_lang.sh data/local/dict "<UNK>" data/local/lang data/lang

echo

echo "===== LANGUAGE MODEL CREATION ====="

echo "===== MAKING lm.arpa ====="

echo

loc=`which ngram-count`;
if [ -z "$loc" ]; then
  if uname -a | grep 64 >/dev/null; then
    sdir=$KALDI_ROOT/tools/srilm/bin/i686-m64
  else
    sdir=$KALDI_ROOT/tools/srilm/bin/i686
  fi
  if [ -f $sdir/ngram-count ]; then
    echo "Using SRILM language modelling tool from $sdir"
    export PATH=$PATH:$sdir
  else
    echo "SRILM toolkit is probably not installed.
    Instructions: tools/install_srilm.sh"
    exit 1
  fi
fi

local=data/local

mkdir $local/tmp

ngram-count -order $lm_order -write-vocab $local/tmp/vocab-full.txt -wbdiscount -text $local/corpus.txt -lm $local/tmp/lm.arpa

echo

echo "===== MAKING G.fst ====="

echo

lang=data/lang

arpa2fst --disambig-symbol=#0 --read-symbol-table=$lang/words.txt $local/tmp/lm.arpa $lang/G.fst

echo

echo "===== MONO TRAINING ====="

echo

steps/train_mono.sh --nj $nj --cmd "$train_cmd" data/train data/lang exp/mono || exit 1

echo

echo "===== MONO DECODING ====="

echo

utils/mkgraph.sh --mono data/lang exp/mono exp/mono/graph || exit 1

steps/decode.sh --config conf/decode.config --nj $nj --cmd "$decode_cmd" exp/mono/graph data/test exp/mono/decode

echo

echo "===== MONO ALIGNMENT ====="

echo

steps/align_si.sh --nj $nj --cmd "$train_cmd" data/train data/lang exp/mono exp/mono_ali || exit 1

echo

echo "===== TRI1 (first triphone pass) TRAINING ====="

echo

steps/train_deltas.sh --cmd "$train_cmd" 2000 11000 data/train data/lang exp/mono_ali exp/tri1 || exit 1

echo

echo "===== TRI1 (first triphone pass) DECODING ====="

echo

utils/mkgraph.sh data/lang exp/tri1 exp/tri1/graph || exit 1

steps/decode.sh --config conf/decode.config --nj $nj --cmd "$decode_cmd" exp/tri1/graph data/test exp/tri1/decode

echo

echo "===== run.sh script is finished ====="

echo

Daniel Povey

Jun 23, 2017, 1:23:13 AM

to Patrice Yemmene, kaldi-help, Chiang, Chi-Lung

I haven't heard of this problem before; you'll have to run it step by
step and figure out which command removes utt2spk.
The problem with arpa2fst not being found means you need to make sure Kaldi
is installed, and then ensure your path is set up. Because your
directory is one level shallower than the normal experiment directories, path.sh
should refer to ../.. where it would normally refer to ../../..
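
Both suggestions can be sketched in shell. The helpers below are hypothetical (require_tools and check_after are illustrative names, not Kaldi utilities): the first reports which binaries are missing from PATH after path.sh has been sourced, and the second runs one step and then checks that a watched file such as data/train/utt2spk still has content.

```bash
# require_tools TOOL...
# Hypothetical helper: reports every named program that is not on PATH.
require_tools() {
  local missing=0
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "not on PATH: $tool (check KALDI_ROOT in path.sh)" >&2
      missing=1
    fi
  done
  return "$missing"
}

# check_after WATCHED CMD...
# Hypothetical helper: runs CMD, then fails if WATCHED is missing or empty.
check_after() {
  local watched="$1"
  shift
  "$@"
  local status=$?
  if [ ! -s "$watched" ]; then
    echo "after '$*': $watched is missing or empty" >&2
    return 1
  fi
  return "$status"
}

# Possible usage inside run.sh, after ". ./path.sh":
#   require_tools arpa2fst ngram-count || exit 1
#   check_after data/train/utt2spk steps/make_mfcc.sh --nj 1 --cmd run.pl \
#     data/train exp/make_mfcc/train mfcc || exit 1
```

Wrapping each step of run.sh this way narrows the search to the first command after which utt2spk is empty.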

Patrice Yemmene

Jun 23, 2017, 7:44:36 AM

to Daniel Povey, kaldi-help, Chiang, Chi-Lung

Thank you. I will review what I may have missed

veerender reddy

Oct 20, 2017, 1:04:28 PM

to kaldi-help

Respected Dan,
I was having difficulty with MFCC extraction; the following error occurred when I tried the an4 dataset:

"steps/make_mfcc.sh --nj 1 --cmd data/test exp/make_mfcc/test mfcc
steps/make_mfcc.sh: empty argument to --cmd option

steps/compute_cmvn_stats.sh data/test exp/make_mfcc/test mfcc
make_cmvn.sh: no such file data/test/feats.scp

fix_data_dir.sh: kept all 130 utterances.
fix_data_dir.sh: old files are kept in data/test/.backup
steps/make_mfcc.sh --nj 1 --cmd data/train exp/make_mfcc/train mfcc
steps/make_mfcc.sh: empty argument to --cmd option

steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train mfcc
make_cmvn.sh: no such file data/train/feats.scp

fix_data_dir.sh: kept all 948 utterances.
fix_data_dir.sh: old files are kept in data/train/.backup"

I am new to Kaldi and unable to sort this out; please help me with this issue.

Daniel Povey

Oct 20, 2017, 1:09:45 PM

to kaldi-help

Likely you did not set the "train_cmd" shell variable; it should be something like
train_cmd=run.pl
or
train_cmd=queue.pl

Check your cmd.sh.
That recipe does in run.sh:
. cmd.sh
instead of (what I recommend)
. ./cmd.sh
so if you have another file called cmd.sh on your path somewhere, it
wouldn't be invoking the current directory's cmd.sh.
Change the run.sh to say:
. ./cmd.sh
that might help.
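
The difference can be sketched as follows. A name without a slash given to "." is looked up along PATH, while "./cmd.sh" names the file in the current directory explicitly. load_cmd is a hypothetical helper (not part of any recipe) that sources cmd.sh by explicit path and then confirms that train_cmd is non-empty, which would have caught the "empty argument to --cmd option" error earlier.

```bash
# load_cmd [DIR]
# Hypothetical helper: sources DIR/cmd.sh (default: the current directory)
# by explicit path, then verifies that train_cmd was actually set.
load_cmd() {
  local dir="${1:-.}"
  if [ ! -f "$dir/cmd.sh" ]; then
    echo "no cmd.sh in $dir" >&2
    return 1
  fi
  # The path always contains a slash, so the shell does not search PATH.
  . "$dir/cmd.sh"
  if [ -z "${train_cmd:-}" ]; then
    echo "train_cmd is empty after sourcing $dir/cmd.sh" >&2
    return 1
  fi
}

# Possible usage at the top of run.sh or in an interactive terminal:
#   load_cmd . || exit 1
#   steps/make_mfcc.sh --nj 1 --cmd "$train_cmd" data/train exp/make_mfcc/train mfcc
```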

> --
> Go to http://kaldi-asr.org/forums.html find out how to join
> ---
> You received this message because you are subscribed to the Google Groups
> "kaldi-help" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to kaldi-help+...@googlegroups.com.
> To post to this group, send email to kaldi...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/kaldi-help/eaa5f33e-d194-4188-a78d-32610c91582b%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

veerender reddy

Oct 20, 2017, 1:21:50 PM

to kaldi-help

Thanks, it worked, sir. I had forgotten to execute . ./cmd.sh when I opened the terminal.

veerender reddy

Nov 16, 2017, 8:25:17 PM

to kaldi-help

Respected Dan,
I placed the TIMIT dataset at "/home/veerender/speech-datasets/TIMIT", but when I executed the line "timit= /home/veerender/speech-datasets/TIMIT" after setting the acoustic model parameters, I got the following error:

"bash: /home/veerender/speech-datasets/TIMIT: Is a directory"

When I worked with the an4 dataset, I placed it in a similar path and it worked, but in this case the shell reports the path as a directory instead of setting the variable to point to its location. May I know how to handle this?

Daniel Povey

Nov 16, 2017, 8:28:02 PM

to kaldi-help

Remove the space after '='.
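
A short illustration of why the space matters, using the path from the question. With a space, the shell treats "timit=" as a temporary empty assignment and then tries to execute the path as a command, which is where "Is a directory" comes from.

```bash
# Wrong: sets timit to the empty string only for the "command" that follows,
# then tries to run the directory as a program.
#   timit= /home/veerender/speech-datasets/TIMIT

# Right: no whitespace on either side of '=', so the whole path is assigned.
timit=/home/veerender/speech-datasets/TIMIT
echo "timit is set to: $timit"
```

Quoting the value (timit="/path/with spaces") is only needed when the path itself contains whitespace.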

veerender reddy

Nov 17, 2017, 1:22:28 PM

to kaldi-help

Respected Dan,
I tried to run "local/timit_data_prep.sh $timit" and I encountered the following error:

" wav-to-duration --read-entire-file=true scp:train_wav.scp ark,t:train_dur.ark

LOG (wav-to-duration[5.2.97~1-5846a2d]:main():wav-to-duration.cc:92) Printed duration for 3696 audio files.
LOG (wav-to-duration[5.2.97~1-5846a2d]:main():wav-to-duration.cc:94) Mean duration was 3.06336, min and max durations were 0.91525, 7.78881
awk: line 12: function gensub never defined "

Is it related to the train and test files not loading properly? I would like to know what these errors mean and how to fix them.

Daniel Povey

Nov 17, 2017, 1:23:25 PM

to kaldi-help, Karel Vesely

That is an awk version issue. In general I prefer to use perl for things like this, as it has fewer compatibility problems.

Karel, do you have time to fix this?
Dan
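
For context, gensub() is a GNU awk extension, so awk implementations such as mawk reject it with the error shown above. The sketch below is an assumption-laden illustration, not the fix applied to the recipe: awk_has_gensub probes the installed awk, and strip_ext shows how one hypothetical gensub() use (dropping a file extension) can be written with sub(), which POSIX awk provides. What timit_data_prep.sh actually uses gensub() for is not shown in this thread.

```bash
# awk_has_gensub
# Succeeds only if the default awk implements the gawk extension gensub().
awk_has_gensub() {
  awk 'BEGIN { x = gensub(/a/, "b", "g", "aaa"); exit (x == "bbb" ? 0 : 1) }' \
    >/dev/null 2>&1
}

# strip_ext
# Portable stand-in for gensub(/\.[^.]*$/, "", 1, name): sub() edits its
# target in place, so the line is copied into a variable first.
strip_ext() {
  awk '{ name = $0; sub(/\.[^.]*$/, "", name); print name }'
}

if awk_has_gensub; then
  echo "awk supports gensub"
else
  echo "awk lacks gensub; installing gawk or rewriting the call is needed"
fi
```

Installing gawk and making it the default awk is the other common workaround when editing the script is not an option.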
