Error detected while creating train/valid egs

Rachel Beeson

Jun 28, 2022, 8:20:58 AM

to kaldi-help

Context: I am doing a project on speech recognition from articulatory (lip and tongue) positions rather than acoustic features. My features are vectors of the x,y positions of various articulators, extracted using DeepLabCut, so I'm not using Kaldi to create MFCCs or anything similar. I am also extremely new to Kaldi. I have cobbled together a run.sh script from the Kaldi tutorial and mini-librispeech (attached in case it's useful). I run monophone and triphone alignment, and then train an nnet2 DNN.

Problem: With one lexicon, my script runs through training and scoring without problems. However, when I swap that lexicon for another one (American vs. British pronunciation), I get the following terminal output (log files and lexicons attached):

steps/nnet2/train_tanh_fast.sh --stage -10 --num-threads 16 --parallel-opts --num-threads 16 --minibatch-size 128 --num-jobs-nnet 8 --samples-per-iter 400000 --mix-up 8000 --initial-learning-rate 0.01 --final-learning-rate 0.001 --num-hidden-layers 4 --hidden-layer-dim 1024 --cmd run.pl data/train data/lang exp/tri3b_ali exp/nnet5c
steps/nnet2/train_tanh_fast.sh: calling get_lda.sh
steps/nnet2/get_lda.sh --transform-dir exp/tri3b_ali --splice-width 4 --cmd run.pl data/train data/lang exp/tri3b_ali exp/nnet5c
steps/nnet2/get_lda.sh: feature type is lda
steps/nnet2/get_lda.sh: using transforms from exp/tri3b_ali
feat-to-dim 'ark,s,cs:utils/subset_scp.pl --quiet 2500 data/train/split4/1/feats.scp | apply-cmvn --utt2spk=ark:data/train/split4/1/utt2spk scp:data/train/split4/1/cmvn.scp scp:- ark:- | splice-feats --left-context=3 --right-context=3 ark:- ark:- | transform-feats exp/nnet5c/final.mat ark:- ark:- | transform-feats --utt2spk=ark:data/train/split4/1/utt2spk ark:exp/tri3b_ali/trans.1 ark:- ark:- |' -
transform-feats exp/nnet5c/final.mat ark:- ark:-
apply-cmvn --utt2spk=ark:data/train/split4/1/utt2spk scp:data/train/split4/1/cmvn.scp scp:- ark:-
splice-feats --left-context=3 --right-context=3 ark:- ark:-
transform-feats --utt2spk=ark:data/train/split4/1/utt2spk ark:exp/tri3b_ali/trans.1 ark:- ark:-
WARNING (feat-to-dim[5.5.1032~1-ac29a]:Close():kaldi-io.cc:515) Pipe utils/subset_scp.pl --quiet 2500 data/train/split4/1/feats.scp | apply-cmvn --utt2spk=ark:data/train/split4/1/utt2spk scp:data/train/split4/1/cmvn.scp scp:- ark:- | splice-feats --left-context=3 --right-context=3 ark:- ark:- | transform-feats exp/nnet5c/final.mat ark:- ark:- | transform-feats --utt2spk=ark:data/train/split4/1/utt2spk ark:exp/tri3b_ali/trans.1 ark:- ark:- | had nonzero return status 36096
feat-to-dim 'ark,s,cs:utils/subset_scp.pl --quiet 2500 data/train/split4/1/feats.scp | apply-cmvn --utt2spk=ark:data/train/split4/1/utt2spk scp:data/train/split4/1/cmvn.scp scp:- ark:- | splice-feats --left-context=3 --right-context=3 ark:- ark:- | transform-feats exp/nnet5c/final.mat ark:- ark:- | transform-feats --utt2spk=ark:data/train/split4/1/utt2spk ark:exp/tri3b_ali/trans.1 ark:- ark:- | splice-feats --left-context=4 --right-context=4 ark:- ark:- |' -
transform-feats exp/nnet5c/final.mat ark:- ark:-
splice-feats --left-context=3 --right-context=3 ark:- ark:-
transform-feats --utt2spk=ark:data/train/split4/1/utt2spk ark:exp/tri3b_ali/trans.1 ark:- ark:-
apply-cmvn --utt2spk=ark:data/train/split4/1/utt2spk scp:data/train/split4/1/cmvn.scp scp:- ark:-
splice-feats --left-context=4 --right-context=4 ark:- ark:-
WARNING (feat-to-dim[5.5.1032~1-ac29a]:Close():kaldi-io.cc:515) Pipe utils/subset_scp.pl --quiet 2500 data/train/split4/1/feats.scp | apply-cmvn --utt2spk=ark:data/train/split4/1/utt2spk scp:data/train/split4/1/cmvn.scp scp:- ark:- | splice-feats --left-context=3 --right-context=3 ark:- ark:- | transform-feats exp/nnet5c/final.mat ark:- ark:- | transform-feats --utt2spk=ark:data/train/split4/1/utt2spk ark:exp/tri3b_ali/trans.1 ark:- ark:- | splice-feats --left-context=4 --right-context=4 ark:- ark:- | had nonzero return status 36096
steps/nnet2/get_lda.sh: Accumulating LDA statistics.
steps/nnet2/get_lda.sh: Finished estimating LDA
steps/nnet2/train_tanh_fast.sh: calling get_egs.sh
steps/nnet2/get_egs.sh --transform-dir exp/tri3b_ali --splice-width 4 --samples-per-iter 400000 --num-jobs-nnet 8 --stage 0 --cmd run.pl --io-opts --max-jobs-run 5 data/train data/lang exp/tri3b_ali exp/nnet5c
steps/nnet2/get_egs.sh: feature type is lda
steps/nnet2/get_egs.sh: using transforms from exp/tri3b_ali
steps/nnet2/get_egs.sh: working out number of frames of training data
steps/nnet2/get_egs.sh: Every epoch, splitting the data up into 1 iterations,
steps/nnet2/get_egs.sh: giving samples-per-iteration of 13202 (you requested 400000).
Getting validation and training subset examples.
steps/nnet2/get_egs.sh: extracting validation and training-subset alignments.
copy-int-vector ark:- ark,t:-
LOG (copy-int-vector[5.5.1032~1-ac29a]:main():copy-int-vector.cc:83) Copied 50 vectors of int32.
run.pl: job failed, log is in exp/nnet5c/log/create_valid_subset.log
run.pl: job failed, log is in exp/nnet5c/log/create_train_subset.log
Error detected while creating train/valid egs

I have validated my data/train directory as well as all the split directories, and they are all fine. I'm confused because I've seen others hit the "No posterior for key..." error, but usually every utterance fails, because of a sorting issue. I am not sure what changing my lexicon does to cause this.

Attachments: lexicon.txt, create_valid_subset.log, create_train_subset.log, run.sh

Rachel Beeson

Jun 28, 2022, 12:47:24 PM

to kaldi-help

Hi all, I believe I found the issue, so this can be closed. I am using a small data set to try out my run.sh. The lexicon that doesn't complete has about 4 more phones than the first one, and I believe something goes wrong when building triphone GMMs with these additional phones on such a small amount of data. When I set prepare_lang to build position-independent phone models, the pipeline runs to completion with the other lexicon. If that seems like a reasonable explanation, then this thread can probably be closed/ignored.
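To get a feel for how a few extra phones can stretch a small data set, here is a rough Python sketch that counts the distinct phones in a lexicon and estimates how many symbols word-position-dependent modelling would create from them. It assumes a plain `WORD PHONE1 PHONE2 ...` lexicon (no pronunciation-probability column) and four position variants per phone (`_B`, `_I`, `_E`, `_S`); the toy lexicons are hypothetical stand-ins, not the attached files.

```python
# Sketch: count the distinct phones in a Kaldi-style lexicon and estimate how
# many symbols word-position-dependent modelling would create from them.


def lexicon_phones(lines):
    """Return the set of phones used in 'WORD PHONE1 PHONE2 ...' lines."""
    phones = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed entries
        phones.update(fields[1:])
    return phones


def position_dependent_estimate(num_phones, variants=4):
    """Assumption: each phone gets _B, _I, _E and _S variants (4 per phone)."""
    return num_phones * variants


# Hypothetical toy lexicons standing in for the two pronunciation variants.
lexicon_a = ["cat k ae t", "dog d ao g"]
lexicon_b = ["cat k a t", "dog d o g", "bath b aa th"]

phones_a = lexicon_phones(lexicon_a)
phones_b = lexicon_phones(lexicon_b)

print("phones in A:", len(phones_a))
print("position-dependent estimate for A:", position_dependent_estimate(len(phones_a)))
print("phones only in B:", sorted(phones_b - phones_a))
```

Under that assumption each extra phone adds several symbols that all need training frames, which is at least consistent with the position-independent setup being easier to train on a small set.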

Sage Khan

Jun 28, 2022, 12:53:46 PM

to kaldi-help

Just curious... Does your wav.scp in each split (train, test, and the further splits) have the correct paths?

Sage Khan

Jun 28, 2022, 12:54:32 PM

to kaldi-help

Also, check that the number of jobs is less than the number of speakers...
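A quick way to check this is to count the distinct speakers in utt2spk and compare that with the job count. The sketch below assumes the usual `utterance-id speaker-id` layout and treats "each job needs at least one speaker" as the constraint; the speaker IDs in the example are hypothetical.

```python
# Sketch: compare a requested number of parallel jobs with the number of
# distinct speakers listed in a Kaldi utt2spk file.


def count_speakers(utt2spk_lines):
    """Count distinct speaker IDs in 'utterance-id speaker-id' lines."""
    speakers = set()
    for line in utt2spk_lines:
        fields = line.split()
        if len(fields) == 2:
            speakers.add(fields[1])
    return len(speakers)


def jobs_fit_speakers(num_jobs, utt2spk_lines):
    """Assumption: a per-speaker split needs at least one speaker per job."""
    return num_jobs <= count_speakers(utt2spk_lines)


# Hypothetical utt2spk content; the speaker IDs are made up for illustration.
utt2spk = [
    "01fi-002_cal 01fi",
    "01fi-091_aud 01fi",
    "02mo-001_cal 02mo",
]

print(count_speakers(utt2spk))
print(jobs_fit_speakers(4, utt2spk))
print(jobs_fit_speakers(2, utt2spk))
```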

Rachel Beeson

Jun 29, 2022, 12:10:39 PM

to kaldi-help

I am actually now more confused than ever and am wondering whether there is something inherently wrong with my features. With the help of a friend, I confirmed that the issue was not a mismatch between the alignment and training splits. Inspecting more of the log files, we found lots of decoding failures in align.1.log. I thought this might be due to poor transcript normalization (sentences ending with periods, commas interspersed), so I normalized everything to reduce OOV mistakes. After doing this, almost nothing gets decoded. The align.1.log is chock full of the following:

WARNING (gmm-boost-silence[5.5.1032~1-ac29a]:main():gmm-boost-silence.cc:82) The pdfs for the silence phones may be shared by other phones (note: this probably does not matter.)
LOG (gmm-boost-silence[5.5.1032~1-ac29a]:main():gmm-boost-silence.cc:93) Boosted weights for 5 pdfs, by factor of 1
LOG (gmm-boost-silence[5.5.1032~1-ac29a]:main():gmm-boost-silence.cc:103) Wrote model to -
add-deltas ark:- ark:-
apply-cmvn --utt2spk=ark:data/train/split1/1/utt2spk scp:data/train/split1/1/cmvn.scp scp:data/train/split1/1/feats.scp ark:-
LOG (gmm-align-compiled[5.5.1032~1-ac29a]:main():gmm-align-compiled.cc:127) 01fi-002_cal
WARNING (gmm-align-compiled[5.5.1032~1-ac29a]:AlignUtteranceWrapper():decoder-wrappers.cc:617) Retrying utterance 01fi-002_cal with beam 100
WARNING (gmm-align-compiled[5.5.1032~1-ac29a]:AlignUtteranceWrapper():decoder-wrappers.cc:626) Did not successfully decode file 01fi-002_cal, len = 313
LOG (gmm-align-compiled[5.5.1032~1-ac29a]:main():gmm-align-compiled.cc:127) 01fi-091_aud
WARNING (gmm-align-compiled[5.5.1032~1-ac29a]:AlignUtteranceWrapper():decoder-wrappers.cc:617) Retrying utterance 01fi-091_aud with beam 100
WARNING (gmm-align-compiled[5.5.1032~1-ac29a]:AlignUtteranceWrapper():decoder-wrappers.cc:626) Did not successfully decode file 01fi-091_aud, len = 226

In my analyze_alignments.log, only about 60 phones are mentioned, a far cry from the 192 phones in phones.txt in exp/mono. This makes me think that several phones are not being aligned to any data, if that is possible.
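One way to make that comparison concrete is to list the phones in phones.txt that never show up in the alignments. The sketch below assumes the standard `symbol id` layout of phones.txt and skips `<eps>` and `#N` disambiguation symbols; how the set of seen phones is obtained (for example, by parsing the output of a tool such as ali-to-phones) is left out, and the sample data is hypothetical.

```python
# Sketch: list phones from phones.txt that never occur in the alignments.


def read_phone_table(lines):
    """Parse 'symbol id' lines, skipping <eps> and #N disambiguation symbols."""
    phones = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue
        symbol = fields[0]
        if symbol == "<eps>" or symbol.startswith("#"):
            continue
        phones.append(symbol)
    return phones


def unseen_phones(phone_table, seen):
    """Return phones in table order that are absent from the 'seen' set."""
    return [p for p in phone_table if p not in seen]


# Hypothetical data: a tiny phones.txt and the phones found in alignments.
phones_txt = ["<eps> 0", "SIL 1", "AA_B 2", "AA_E 3", "#0 4"]
seen_in_alignments = {"SIL", "AA_B"}

table = read_phone_table(phones_txt)
print(unseen_phones(table, seen_in_alignments))
```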

I ran my pipeline with regular audio/MFCC features, and it ran fine with few decoding failures. This makes me think that something is wrong with the features I'm using, though I'm not sure what to do about it. My feats.scp is much larger for the MFCC features than for my positional ones, which makes me suspect a problem with data sampling: my positional features are extracted at 60 FPS, with no sliding window. Or maybe somewhere down the pipeline there is an assumption about how much "time" elapses with each frame (MFCCs have a typical sampling/windowing pattern of around 100 FPS). I don't really know how to confirm any of these hypotheses. I could try "upsampling" my features and see what happens. I also wonder whether my positional features need to be "gaussianized" somehow, or otherwise made to "act" more like MFCCs.
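The "upsampling" idea could be tried with plain linear interpolation between neighbouring frames, as in the sketch below. This is only one possible resampling scheme, not something from the thread, and it assumes integer frame rates and frames stored as lists of floats. One reason the frame rate may matter: as far as I know, the default topology created by prepare_lang.sh uses three emitting states per non-silence phone with no skip transitions, so each phone needs at least three frames (about 50 ms at 60 FPS versus 30 ms at 100 FPS).

```python
# Sketch: linearly interpolate feature frames from one frame rate to another.


def resample_frames(frames, src_fps, dst_fps):
    """Return frames resampled from src_fps to dst_fps (integer rates)."""
    if not frames:
        return []
    if len(frames) == 1:
        return [list(frames[0])]
    # Number of output frames that fit between the first and last input frame.
    num_out = ((len(frames) - 1) * dst_fps) // src_fps + 1
    out = []
    for i in range(num_out):
        pos = i * src_fps / dst_fps          # position in source-frame units
        lo = min(int(pos), len(frames) - 2)  # left neighbour index
        frac = pos - lo                      # weight of the right neighbour
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(frames[lo], frames[lo + 1])])
    return out


# Hypothetical 2-dimensional positions, doubled in rate for easy checking.
doubled = resample_frames([[0.0, 10.0], [2.0, 20.0]], src_fps=1, dst_fps=2)
print(doubled)

# Seven frames at 60 FPS cover 0.1 s, which becomes eleven frames at 100 FPS.
seven = [[float(t)] for t in range(7)]
print(len(resample_frames(seven, src_fps=60, dst_fps=100)))
```

Interpolation adds no new information, so this would only test whether the alignment failures are tied to the number of frames per phone.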

Rachel Beeson

Jul 13, 2022, 6:25:34 PM

to kaldi-help

I was able to get this working on my small data set by standardizing my features (zero mean and unit variance) before passing them into Kaldi. Applying PCA to reduce the dimensionality of the feature vectors also helped.
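For readers who want to try the same thing, here is a minimal sketch of the standardization step in plain Python. It assumes the statistics are computed globally over all frames passed in (the post does not say whether they were global or per speaker) and it leaves out the PCA step.

```python
# Sketch: scale every feature dimension to zero mean and unit variance.
from statistics import fmean, pstdev


def standardize(frames):
    """Standardize a list of equal-length feature vectors, column by column."""
    columns = list(zip(*frames))
    means = [fmean(col) for col in columns]
    # A constant dimension has zero spread; divide by 1.0 instead of 0.0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(frame, means, stds)]
            for frame in frames]


# Hypothetical two-frame example: one varying and one constant dimension.
scaled = standardize([[1.0, 10.0], [3.0, 10.0]])
print(scaled)
```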

I've now scaled up my data set to about 15000 utterances, split about 10% and 90% on training and testing. I have therefore created new data directory files for my test and training sets, along with new feats.scp and feats.ark files. I have not changed anything else about my setup since running through my mini-demo.

New issue: Data directory issues. If I run validate_data_dir I get the following:

utils/validate_data_dir.sh: file data/train/feats.scp is not sorted or has duplicates
utils/validate_data_dir.sh: file data/test/feats.scp is not sorted or has duplicates

If I then run fix_data_dir:

utils/validate_data_dir.sh: file data/train/feats.scp is not sorted or has duplicates
utils/fix_data_dir.sh: file data/train/feats.scp is not in sorted order or not unique, sorting it
utils/fix_data_dir.sh: file data/train/spk2gender is not in sorted order or not unique, sorting it
fix_data_dir.sh: kept 133 utterances out of 13024
utils/fix_data_dir.sh: filtered data/train/spk2gender from 81 to 71 lines based on filter /tmp/kaldi.Ohr0/speakers.
fix_data_dir.sh: old files are kept in data/train/.backup

If I don't run fix_data_dir, then it fails after the first pass at monophone alignment, and the mono align log shows the following kind of error:

WARNING (gmm-align-compiled[5.5.1035~1-3dd90]:main():gmm-align-compiled.cc:103) No features for utterance 01fi-021_xaud
WARNING (gmm-align-compiled[5.5.1035~1-3dd90]:main():gmm-align-compiled.cc:103) No features for utterance 01fi-022_xaud
WARNING (gmm-align-compiled[5.5.1035~1-3dd90]:main():gmm-align-compiled.cc:103) No features for utterance 01fi-023_xaud
WARNING (gmm-align-compiled[5.5.1035~1-3dd90]:main():gmm-align-compiled.cc:103) No features for utterance 01fi-024_xaud
WARNING (gmm-align-compiled[5.5.1035~1-3dd90]:main():gmm-align-compiled.cc:103) No features for utterance 01fi-025_xaud
WARNING (gmm-align-compiled[5.5.1035~1-3dd90]:main():gmm-align-compiled.cc:103) No features for utterance 01fi-026_xaud

In my data/train I have the following files when I start my run.sh:

feats.scp trainfeats.ark utt2spk text wav.scp spk2gender

I have attached my utt2spk and feats.scp from data/train as a sanity check... I think it is formatted correctly but at this point I don't trust myself. I'm hoping someone can point out to me something I'm very obviously doing wrong.

I have export LC_ALL=C in my path.sh.
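A quick independent check of the "not sorted or has duplicates" complaint is to compare the first field of each line in byte order, which is how `sort` behaves under `LC_ALL=C`. The sketch below is a simplified stand-in for what the validation script looks at, not a copy of it, and the archive offsets in the example are hypothetical.

```python
# Sketch: find out-of-order and duplicate keys in an scp-style file, using
# byte-wise comparison to mimic sorting under LC_ALL=C.


def scp_key_problems(lines):
    """Return (out_of_order, duplicates) for the first field of each line."""
    keys = [line.split(None, 1)[0] for line in lines if line.strip()]
    out_of_order = []
    duplicates = []
    seen = set()
    prev = None
    for key in keys:
        if key in seen:
            duplicates.append(key)
        elif prev is not None and key.encode("utf-8") < prev.encode("utf-8"):
            out_of_order.append((prev, key))
        seen.add(key)
        prev = key
    return out_of_order, duplicates


# Hypothetical feats.scp lines; the archive offsets are made up.
feats_scp = [
    "01fi-021_xaud trainfeats.ark:10",
    "01fi-002_cal trainfeats.ark:500",
    "01fi-002_cal trainfeats.ark:900",
]

out_of_order, duplicates = scp_key_problems(feats_scp)
print(out_of_order)
print(duplicates)
```

If both lists come back empty for the real file, the complaint is more likely to come from a mismatch with another file in the directory than from feats.scp itself.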

Attachments: feats.scp, utt2spk

Rachel Beeson

Jul 13, 2022, 6:36:32 PM

to kaldi-help

spk2gender file as well:

spk2gender

Sage Khan

Jul 14, 2022, 12:48:10 AM

to kaldi-help

I faced a similar issue. It was caused by commas in one of the files, either text or wav.scp, I think. There may also be something wrong with the naming: for example, the file name is U0010 but I ended up writing U010 or something like that. It also happens when the file extension is missing from a path in wav.scp.

Rachel Beeson

Jul 14, 2022, 7:44:40 AM

to kaldi-help

I double-checked to be sure, and there are no commas in my text or wav.scp files. The wav.scp entries have file extensions. Also, the naming is done automatically by a script (I'm not writing these files by hand), the same one I used for my demo project, which did work, and the naming conventions for utterances and speakers seem consistent across the files.

Rachel Beeson

Jul 14, 2022, 10:46:28 AM

to kaldi-help

Fixed. I had an old cmvn.scp file left over in data/train, and the fix_data_dir check was running before the new CMVN files were made. After I deleted the old cmvn.scp, the data check stopped complaining about a discrepancy between cmvn.scp and feats.scp.
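A leftover cmvn.scp of this kind can be spotted before running anything by comparing its keys with the speakers in utt2spk. The sketch below assumes cmvn.scp is keyed by speaker ID (the per-speaker case) and uses hypothetical IDs and archive names.

```python
# Sketch: detect a stale cmvn.scp by comparing its keys with the speakers
# listed in utt2spk.


def stale_cmvn_report(utt2spk_lines, cmvn_lines):
    """Return speakers missing from cmvn.scp and keys found only in cmvn.scp."""
    speakers = {
        line.split()[1] for line in utt2spk_lines if len(line.split()) == 2
    }
    cmvn_keys = {line.split()[0] for line in cmvn_lines if line.strip()}
    return {
        "missing_from_cmvn": sorted(speakers - cmvn_keys),
        "only_in_cmvn": sorted(cmvn_keys - speakers),
    }


# Hypothetical files: the new utt2spk has a speaker the old cmvn.scp lacks.
utt2spk = ["01fi-002_cal 01fi", "02mo-001_cal 02mo"]
old_cmvn = ["01fi old_cmvn.ark:5", "99zz old_cmvn.ark:90"]

report = stale_cmvn_report(utt2spk, old_cmvn)
print(report)
```

Any non-empty list in the report suggests that cmvn.scp was generated from a different version of the data directory and should be regenerated.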

Sage Khan

Jul 14, 2022, 11:41:38 PM

to Daniel Povey Kaldi ASR

Oh yeah. Whenever you rerun the scripts, make sure to get rid of old derived files like cmvn.scp, feats.scp, etc.
