Cannot create feats.scp


tbz

Jan 11, 2017, 4:14:35 PM
to kaldi-help
Hi,
I am trying the gale_arabic example, testing it step by step.
local/gale_data_prep_audio.sh   "${audio[@]}" $galeData
....
mfccdir=mfcc
for x in train test ; do
  steps/make_mfcc.sh --cmd "$train_cmd" --nj $nJobs \
    data/$x exp/make_mfcc/$x $mfccdir
  utils/fix_data_dir.sh data/$x # some files fail to get mfcc for many reasons
  steps/compute_cmvn_stats.sh data/$x exp/make_mfcc/$x $mfccdir
done

After running these steps, I still don't get feats.scp. I suspect that wav.scp is not created correctly. This is one line from wav.scp:
ALAM_WITHEVENT_ARB_20070116_205800 sox  -r 16000 -t wav - |

and the shell output looks like this:
steps/make_mfcc.sh --cmd run.pl  --nj 120 data/train exp/make_mfcc/train mfcc
utils/validate_data_dir.sh: Successfully validated data-directory data/train
steps/make_mfcc.sh [info]: segments file exists: using that.
run.pl: 120 / 120 failed, log is in exp/make_mfcc/train/make_mfcc_train.*.log
fix_data_dir.sh: kept all 47644 utterances.
fix_data_dir.sh: old files are kept in data/train/.backup
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train mfcc
make_cmvn.sh: no such file data/train/feats.scp
steps/make_mfcc.sh --cmd run.pl  --nj 120 data/test exp/make_mfcc/test mfcc
utils/validate_data_dir.sh: Successfully validated data-directory data/test
steps/make_mfcc.sh [info]: segments file exists: using that.
run.pl: 120 / 120 failed, log is in exp/make_mfcc/test/make_mfcc_test.*.log
fix_data_dir.sh: kept all 676 utterances.
fix_data_dir.sh: old files are kept in data/test/.backup
steps/compute_cmvn_stats.sh data/test exp/make_mfcc/test mfcc
make_cmvn.sh: no such file data/test/feats.scp

Also, all the log files contain the same error:
# extract-segments scp,p:data/test/wav.scp exp/make_mfcc/test/segments.1 ark:- | compute-mfcc-feats --verbose=2 --config=conf/mfcc.conf ark:- ark:- | copy-feats --compress=true ark:- ark,scp:/home/user/kaldi/egs/gale_arabic/s5/mfcc/raw_mfcc_test.1.ark,/home/user/kaldi/egs/gale_arabic/s5/mfcc/raw_mfcc_test.1.scp
# Started at Tue Jan 10 14:12:43 EST 2017
#
copy-feats --compress=true ark:- ark,scp:/home/user/kaldi/egs/gale_arabic/s5/mfcc/raw_mfcc_test.1.ark,/home/user/kaldi/egs/gale_arabic/s5/mfcc/raw_mfcc_test.1.scp
compute-mfcc-feats --verbose=2 --config=conf/mfcc.conf ark:- ark:-
extract-segments scp,p:data/test/wav.scp exp/make_mfcc/test/segments.1 ark:-
sox FAIL sox: Not enough input filenames specified

ERROR (extract-segments:Read():wave-reader.cc:119) WaveData: expected RIFF or RIFX, got sox:

By the way, I also tried gale_data_prep_audio.sh from the gale_mandarin example but got the same errors.

The data I tested is from LDC: LDC2013S02, which is already .wav audio.

Could anyone please help me fix this problem and check whether
gale_data_prep_audio.sh is producing wav.scp correctly?

Thanks, and Best Regards
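
One quick way to confirm that wav.scp is the culprit is to list the entries whose sox command has no input file. This is only a minimal sketch, assuming the `<utt-id> sox <file> -r 16000 -t wav - |` layout shown in the post; `check_wav_scp` is a hypothetical helper name, not part of Kaldi, and the "good" entry in the demo is made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Kaldi): print the utterance ids of wav.scp
# entries where the token right after "sox" is missing, an option, or the pipe,
# i.e. where no input file was written. Exit status is non-zero if any is found.
check_wav_scp() {
  awk '$2 == "sox" && ($3 == "" || $3 ~ /^-/ || $3 == "|") { print $1; bad = 1 }
       END { exit bad }' "$1"
}

# Demo on a throwaway file: the broken line from the post plus an invented good one.
tmp=$(mktemp)
printf '%s\n' \
  'ALAM_WITHEVENT_ARB_20070116_205800 sox  -r 16000 -t wav - |' \
  'GOOD_UTT sox /abs/path/GOOD_UTT.wav -r 16000 -t wav - |' > "$tmp"
check_wav_scp "$tmp" || echo "wav.scp has entries with no input file"
rm -f "$tmp"
```

If this prints any utterance ids for data/train/wav.scp, the problem is in the data preparation step rather than in make_mfcc.sh.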

Daniel Povey

Jan 11, 2017, 4:17:33 PM
to kaldi-help, Ahmed M. Ali
Ahmed [cc'd] may be able to figure out what is going wrong, he is
working on a new version of the gale-arabic recipe [s5b].
Ahmed, can you please also update the original s5 recipe in your PR,
to do more careful checking of error conditions?

Dan

Ahmed Ali

Jan 11, 2017, 4:28:24 PM
to kaldi-help, am...@hbku.edu.qa, dpo...@gmail.com

Most likely your script failed in the data preparation. Make sure your script is pointing to the right location for the LDC data on your machine,

i.e. edit this section in run.sh:

audio=(
  /data/sls/scratch/amali/data/GALE/LDC2013S02
  /data/sls/scratch/amali/data/GALE/LDC2013S07
  /data/sls/scratch/amali/data/GALE/LDC2014S07
)

text=(
  /data/sls/scratch/amali/data/GALE/LDC2013T17.tgz
  /data/sls/scratch/amali/data/GALE/LDC2013T04.tgz
  /data/sls/scratch/amali/data/GALE/LDC2014T17.tgz
)

Sure, I will push some checks so that the script fails if it doesn't pass this data preparation stage.

 

Ahmed

tbz

Jan 12, 2017, 10:50:18 AM
to kaldi-help, am...@hbku.edu.qa, dpo...@gmail.com
OK,
the path that I am using is exactly where the audio wav files are. The files are in four folders because I received 4 CDs from LDC for the LDC2013S02 catalog entry, so this is how I refer to them:

audio=(
  data/GALE/GALEPhase2Part1/gale_p2_arb_bc_speech_p1_d1/data
  data/GALE/GALEPhase2Part1/gale_p2_arb_bc_speech_p1_d2/data
  data/GALE/GALEPhase2Part1/gale_p2_arb_bc_speech_p1_d3/data
  data/GALE/GALEPhase2Part1/gale_p2_arb_bc_speech_p1_d4/data)
text=(data/GALE/LDC2013T04.tgz)

Daniel Povey

Jan 12, 2017, 12:34:28 PM
to tbz, kaldi-help, Ahmed M. Ali
It's generally best to use absolute pathnames for the audio dirs, but
the script should check for that.

Ahmed-- I notice that the script local/gale_data_prep_audio.sh doesn't
do any checking of its args [e.g. to make sure they contain the right
data, and are absolute pathnames if required], or any checking
internally [e.g. to make sure variables that should be nonempty are
nonempty; to make sure that file-lists have the right size].
It looks to me like here:
for w in `find $wavedir -name *.flac` ; do
base=`basename $w .flac`
fullpath=`readlink -f $w`
echo "$base sox $fullpath -r 16000 -t wav - |"
done
fullpath may end up empty.
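
A guarded version of that loop might look roughly like the following. It is only a sketch under stated assumptions, not the recipe's actual code: `make_wav_scp_entries` is a hypothetical name, the sox command line is copied from the loop above, and sorting the `find` output is added purely to make the result deterministic.

```shell
#!/usr/bin/env bash
# Hypothetical rewrite of the loop quoted above. The function body runs in a
# subshell, so "exit" only ends the function and no variables leak out.
# Limitation: file names containing newlines are not handled.
make_wav_scp_entries() (
  wavedir=$1; ext=$2
  [ -d "$wavedir" ] || { echo "no such directory: $wavedir" >&2; exit 1; }
  # Quote the glob so the shell does not expand it before find sees it.
  list=$(find "$wavedir" -name "*.$ext" | sort)
  [ -n "$list" ] || { echo "no .$ext files under $wavedir" >&2; exit 1; }
  printf '%s\n' "$list" | while IFS= read -r w; do
    base=$(basename "$w" ".$ext")
    fullpath=$(readlink -f "$w")
    # Fail loudly instead of writing a sox command with no input file.
    [ -n "$fullpath" ] || { echo "cannot resolve $w" >&2; exit 1; }
    echo "$base sox $fullpath -r 16000 -t wav - |"
  done
)
```

With a relative or mistyped directory, this returns a non-zero status and prints a message on stderr rather than producing entries like `sox  -r 16000 -t wav - |`.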

Could you please try to make the script a bit more robust?
Also, I notice that the script creates a subdirectory of the Gale
data-directory, called '$galeData/wav'. I don't think this is a good
idea, because often the source data will be on a volume that is not
writable by the user. In any case, it's just creating soft links, so
it shouldn't be necessary anyway, you can point to the original
locations of the files. It would be better to have the script write
only locally, to data/local/[something].
And the script should document itself by printing "Usage: ... " and
"e.g.: ... " to explain its args, if called with no args.

Also, the following lines could use a little explanation via a comment
in the script, as they are using some obscure bash features:

galeData=$(readlink -f "${@: -1}" );
length=$(($#-1))
args=${@:1:$length}


Dan
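
For readers unfamiliar with the three bash lines quoted above, here is a small standalone demo of what each expansion does. `split_args` and the example paths are hypothetical, used only for illustration; unlike the script, the demo skips `readlink -f` and stores the remaining arguments in an array rather than a flat string.

```shell
#!/usr/bin/env bash
# Demo of the positional-parameter slicing used in the script (bash-specific).
# split_args is a hypothetical wrapper, not part of the recipe.
split_args() {
  # "${@: -1}" is the last positional argument. The space before -1 is needed
  # so bash does not read it as the ${var:-default} expansion.
  local last="${@: -1}"
  # $# is the number of arguments, so this is the count of all but the last.
  local length=$(($# - 1))
  # "${@:1:$length}" is a slice: $length arguments starting at position 1.
  local rest=("${@:1:$length}")
  echo "last=$last"
  echo "rest=${rest[*]}"
}

split_args /corpus/disc1 /corpus/disc2 /corpus/out
# prints:
#   last=/corpus/out
#   rest=/corpus/disc1 /corpus/disc2
```

In the script, the last argument plays the role of `$galeData` and the remaining ones are the audio directories.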

tbz

Jan 13, 2017, 10:02:35 AM
to kaldi-help, tiba...@gmail.com, am...@hbku.edu.qa, dpo...@gmail.com
OK, I used absolute pathnames but still get no feats.scp. I guess the problem is the following:
after I ran  bash -x local/gale_data_prep_audio.sh    "${audio[@]}" $galeData

I noticed that the 'fullpath' variable is empty, which makes wav.scp incorrect.

This is one line from wav.scp:

ALAM_IRAQNOW_ARB_20070109_085800 sox  -r 16000 -t wav - |

The sox arguments are missing the file name, which should come from the 'fullpath' variable.

The relevant code in gale_data_prep_audio.sh:
for w in `find $wavedir -name *.wav` ; do
  base=`basename $w .wav`
  fullpath=`readlink -f $w`
  echo "$base sox $fullpath -r 16000 -t wav - |"
done

Please find the attached log files.

Best Regards
make_mfcc_train.1.log
run.log

Ahmed Ali

Jan 13, 2017, 10:34:10 AM
to kaldi...@googlegroups.com, tiba...@gmail.com, am...@hbku.edu.qa, dpo...@gmail.com
You are not using the full path:
find: 'home/user/kaldi/egs/gale_arabic/s5/data/GALE/GALEPhase2Part1': No such file or directory
audio=(home/user/kaldi/egs/gale_arabic/s5/data/GALE/GALEPhase2Part1)..etc
An absolute path starts with '/', i.e. audio=(/home/user/...)

Yes, this script is not robust and will exit with 0 even if it fails. A few edits have been made that I don't recognize, but it is still working for most others.
Initially, the script created a local GALE folder, copied the flac files there and ran sox on them locally. Now it is all done on the fly during the MFCC extraction, so I assume there is no need to create the local GALE folder.

I will push an updated, more robust version in a couple of days or so...

Ahmed
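
Until the updated script is available, a guard along these lines could be called before the data preparation step. It is a sketch only; `check_audio_dirs` is a hypothetical name and is not part of the recipe.

```shell
#!/usr/bin/env bash
# Hypothetical guard (not in the recipe): reject any audio directory that is
# not an absolute path or that does not exist.
check_audio_dirs() {
  for d in "$@"; do
    case "$d" in
      /*) ;;  # starts with '/', so it is an absolute path
      *) echo "not an absolute path: $d" >&2; return 1 ;;
    esac
    [ -d "$d" ] || { echo "no such directory: $d" >&2; return 1; }
  done
}

# The relative path from the log above is rejected:
check_audio_dirs home/user/kaldi/egs/gale_arabic/s5/data/GALE/GALEPhase2Part1 \
  || echo "fix the audio=(...) list in run.sh"
```

In run.sh this would be used as `check_audio_dirs "${audio[@]}" || exit 1`, so that the run stops before an unusable wav.scp is written.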


tbz

Jan 13, 2017, 10:43:50 AM
to kaldi-help, tiba...@gmail.com, am...@hbku.edu.qa, dpo...@gmail.com
I added the '/' but still get the same error. Please find the attached log. Sorry for taking your time; I think I should wait for your updates and see how it works.
Thanks again
run.log

Julian Fritsch

Jul 17, 2017, 12:47:07 PM
to kaldi-help

Hi, I'm facing a similar issue.

I'm currently trying to add more data to my corpus. I have quite a lot of files, around 90k.

When I add just a few files, everything works fine, but once I add more than around 40k,
I get an error around steps/compute_cmvn_stats.sh because feats.scp is not created for some reason.

ERROR (apply-cmvn[5.1.80~1-e5275]:SequentialTableReader():util/kaldi-table-inl.h:876) Error constructing TableReader: rspecifier is scp:data/train/split8/1/feats.scp

Do you have a hint on why this happens?

Thanks in advance for your help.

Julian

Daniel Povey

Jul 17, 2017, 12:48:17 PM
to kaldi-help
I suspect there was an earlier error that you did not notice.  



Ahmed Ali

Jul 17, 2017, 12:53:22 PM
to kaldi...@googlegroups.com
If you have too many files, this could be a disk I/O problem; you can try fewer jobs for feature extraction.

Ahmed 


Julian Fritsch

Jul 17, 2017, 12:57:37 PM
to kaldi-help, dpo...@gmail.com
Hey Dan,

can this be anything, or is it for sure related to files like wav.scp, utt2spk, spk2gender...?
Any hint on what I could look for, some specific scripts or log files maybe?

Daniel Povey

Jul 17, 2017, 1:04:16 PM
to Julian Fritsch, kaldi-help
It would have been printed on the screen when you ran the scripts.
Also any time you paste errors, paste a few lines of context.
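
If the screen output has already scrolled away, one option is to capture it to a file and look for the first suspicious line together with a little context. The helper below is a rough sketch: `first_error` is a hypothetical name, and the search patterns are guesses based on the messages that appear in this thread, not an exhaustive list of Kaldi errors.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first line of a log that looks like a failure,
# with line numbers and two lines of context before and after it.
first_error() {
  grep -n -i -m 1 -B 2 -A 2 -E 'error|fail|not in sorted order|no such file' "$1"
}
```

A run could be captured with `bash run.sh 2>&1 | tee run.log` and then inspected with `first_error run.log`; the matching line is the one marked with `N:` rather than `N-`.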

Julian Fritsch

Jul 21, 2017, 10:54:16 AM
to kaldi-help, fritsc...@googlemail.com, dpo...@gmail.com
Hey Dan, I'm still struggling with this issue.

for x in train dev test ; do
    utils/fix_data_dir.sh data/$x # some files fail to get mfcc for many reasons
    steps/make_mfcc.sh --cmd "$train_cmd" --nj $nJobs data/$x exp/make_mfcc/$x $mfccdir
    utils/fix_data_dir.sh data/$x # some files fail to get mfcc for many reasons
    steps/compute_cmvn_stats.sh data/$x exp/make_mfcc/$x $mfccdir
    utils/fix_data_dir.sh data/$x
done

steps/train_mono.sh --nj $nJobs --cmd "$train_cmd" \
  data/train data/lang exp/mono || exit 1;


Here's my error message; does this maybe tell you something?
Thanks in advance for your help, that's so great! :)


training jobs: 8
decode jobs: 5
- data/train/utt2spk differ: char 1176906, line 17027
utt2spk is not in sorted order when sorted first on speaker-id    <-- is this the problem
(fix this by making speaker-ids prefixes of utt-ids)
steps/make_mfcc.sh --cmd utils/run.pl --nj 8 data/train exp/make_mfcc/train mfcc
utils/validate_data_dir.sh: utt2spk is not in sorted order when sorted first on speaker-id
(fix this by making speaker-ids prefixes of utt-ids)
utils/fix_data_dir.sh: file data/train/spk2utt is not in sorted order or not unique, sorting it
- data/train/utt2spk differ: char 1176906, line 17027
utt2spk is not in sorted order when sorted first on speaker-id
(fix this by making speaker-ids prefixes of utt-ids)

steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train mfcc
make_cmvn.sh: no such file data/train/feats.scp
utils/fix_data_dir.sh: file data/train/spk2utt is not in sorted order or not unique, sorting it
- data/train/utt2spk differ: char 1176906, line 17027
utt2spk is not in sorted order when sorted first on speaker-id
(fix this by making speaker-ids prefixes of utt-ids)
fix_data_dir.sh: kept all 1084 utterances.
fix_data_dir.sh: old files are kept in data/dev/.backup
steps/make_mfcc.sh --cmd utils/run.pl --nj 8 data/dev exp/make_mfcc/dev mfcc
utils/validate_data_dir.sh: Successfully validated data-directory data/dev
steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
It seems not all of the feature files were successfully processed (1088 != 1084);
consider using utils/fix_data_dir.sh data/dev
Succeeded creating MFCC features for dev
utils/fix_data_dir.sh: file data/dev/feats.scp is not in sorted order or not unique, sorting it
fix_data_dir.sh: kept 1080 utterances out of 1084
fix_data_dir.sh: old files are kept in data/dev/.backup
steps/compute_cmvn_stats.sh data/dev exp/make_mfcc/dev mfcc
Succeeded creating CMVN stats for dev
5. runHMM dataset: dev , mfccdir: mfcc
fix_data_dir.sh: kept all 1080 utterances.
fix_data_dir.sh: old files are kept in data/dev/.backup
fix_data_dir.sh: kept all 1025 utterances.

fix_data_dir.sh: old files are kept in data/test/.backup
steps/make_mfcc.sh --cmd utils/run.pl --nj 8 data/test exp/make_mfcc/test mfcc

utils/validate_data_dir.sh: Successfully validated data-directory data/test
steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
It seems not all of the feature files were successfully processed (1029 != 1025);
consider using utils/fix_data_dir.sh data/test
Succeeded creating MFCC features for test
utils/fix_data_dir.sh: file data/test/feats.scp is not in sorted order or not unique, sorting it
fix_data_dir.sh: kept 1021 utterances out of 1025

fix_data_dir.sh: old files are kept in data/test/.backup
steps/compute_cmvn_stats.sh data/test exp/make_mfcc/test mfcc
Succeeded creating CMVN stats for test
fix_data_dir.sh: kept all 1021 utterances.

fix_data_dir.sh: old files are kept in data/test/.backup
steps/train_mono.sh --nj 8 --cmd utils/run.pl data/train data/lang exp/mono
steps/train_mono.sh: Initializing monophone system.
feat-to-dim 'ark,s,cs:apply-cmvn --utt2spk=ark:data/train/split8/1/utt2spk scp:data/train/split8/1/cmvn.scp scp:data/train/split8/1/feats.scp ark:- | add-deltas ark:- ark:- |' -
apply-cmvn --utt2spk=ark:data/train/split8/1/utt2spk scp:data/train/split8/1/cmvn.scp scp:data/train/split8/1/feats.scp ark:-
WARNING (apply-cmvn[5.1.80~1-e5275]:Open():util/kaldi-table-inl.h:106) Failed to open script file data/train/split8/1/feats.scp

ERROR (apply-cmvn[5.1.80~1-e5275]:SequentialTableReader():util/kaldi-table-inl.h:876) Error constructing TableReader: rspecifier is scp:data/train/split8/1/feats.scp

[ Stack-Trace: ]

kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::SequentialTableReader<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::SequentialTableReader(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
main
__libc_start_main
_start

add-deltas ark:- ark:-
ERROR (feat-to-dim[5.1.80~1-e5275]:main():feat-to-dim.cc:58) Could not read any features (empty archive?)

[ Stack-Trace: ]

kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
main
__libc_start_main
_start

error getting feature dimension

Julian Fritsch

Jul 21, 2017, 11:05:25 AM
to kaldi-help, fritsc...@googlemail.com, dpo...@gmail.com
Forgot this: this is how my utt2spk file looks:


02dae828-4f10-4451-a8de-85538da6fdec_2014-03-17-14-03-55 02dae828-4f10-4451-a8de-85538da6fdec
049415f9-6afc-4e0c-b2b7-9d995e9d001f_2014-08-27-11-40-53 049415f9-6afc-4e0c-b2b7-9d995e9d001f
049415f9-6afc-4e0c-b2b7-9d995e9d001f_2014-08-27-11-41-12 049415f9-6afc-4e0c-b2b7-9d995e9d001f
049415f9-6afc-4e0c-b2b7-9d995e9d001f_2014-08-27-11-41-22 049415f9-6afc-4e0c-b2b7-9d995e9d001f
049415f9-6afc-4e0c-b2b7-9d995e9d001f_2014-08-27-11-41-33 049415f9-6afc-4e0c-b2b7-9d995e9d001f
049415f9-6afc-4e0c-b2b7-9d995e9d001f_2014-08-27-11-41-42 049415f9-6afc-4e0c-b2b7-9d995e9d001f
049415f9-6afc-4e0c-b2b7-9d995e9d001f_2014-08-27-11-41-55 049415f9-6afc-4e0c-b2b7-9d995e9d001f
049415f9-6afc-4e0c-b2b7-9d995e9d001f_2014-08-27-11-42-02 049415f9-6afc-4e0c-b2b7-9d995e9d001f
049415f9-6afc-4e0c-b2b7-9d995e9d001f_2014-08-27-11-42-22 049415f9-6afc-4e0c-b2b7-9d995e9d001f
049415f9-6afc-4e0c-b2b7-9d995e9d001f_2014-08-27-11-42-38 049415f9-6afc-4e0c-b2b7-9d995e9d001f
049415f9-6afc-4e0c-b2b7-9d995e9d001f_2014-08-27-11-42-44 049415f9-6afc-4e0c-b2b7-9d995e9d001f
.
.
.
xemilien_2017-01-01-18-50-16 xemilien
xemilien_2017-01-01-19-11-22 xemilien
xemilien_2017-01-01-19-3-4 xemilien
zeNF3_2017-01-01-13-13-26 zeNF3
zeNF3_2017-01-01-14-11-20 zeNF3
zeroone_2017-01-01-13-55-53 zeroone
zeroone_2017-01-01-15-2-56 zeroone
zeroone_2017-01-01-17-32-19 zeroone
zeroone_2017-01-01-19-8-41 zeroone
zhyniq_2017-01-01-16-28-32 zhyniq
ziehel_2017-01-01-13-32-44 ziehel
ziehel_2017-01-01-14-58-16 ziehel
ziehel_2017-01-01-15-59-29 ziehel
ziehel_2017-01-01-16-20-38 ziehel
ziehel_2017-01-01-16-37-42 ziehel
ziehel_2017-01-01-17-55-3 ziehel
ziehel_2017-01-01-18-25-16 ziehel
ziehel_2017-01-01-18-34-39 ziehel
ziehel_2017-01-01-18-54-21 ziehel
ziehel_2017-01-01-19-37-18 ziehel
ziehel_2017-01-01-19-41-11 ziehel
ziehel_2017-01-01-19-44-41 ziehel
zulu34sx_2017-01-01-15-15-44 zulu34sx
zulu34sx_2017-01-01-15-48-7 zulu34sx

Daniel Povey

Jul 21, 2017, 1:59:03 PM
to Julian Fritsch, kaldi-help
Yes, it's that sorting issue.  Do what it suggests in the printed error message.
Read the documentation on data preparation at kaldi-asr.org/doc/.
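
The two conditions named in the printed message (speaker-ids as prefixes of utt-ids, and sorted order) can be checked on an utt2spk file before running the recipe. The following is a minimal sketch, assuming two whitespace-separated columns per line; `check_utt2spk` is a hypothetical helper, not a Kaldi script, and only approximates what utils/validate_data_dir.sh does. The demo data is invented.

```shell
#!/usr/bin/env bash
# Hypothetical check (not part of Kaldi) for a two-column utt2spk file.
check_utt2spk() {
  # 1. Each utt-id (field 1) must start with its speaker-id (field 2).
  awk 'index($1, $2) != 1 { print "speaker-id is not a prefix of utt-id: " $0; bad = 1 }
       END { exit bad }' "$1" || return 1
  # 2. Lines must already be sorted and unique in the C locale.
  LC_ALL=C sort -u "$1" | cmp -s - "$1" \
    || { echo "not sorted (or not unique) by utt-id"; return 1; }
  # 3. Sorting first on speaker-id must not change the order either.
  LC_ALL=C sort -k2 "$1" | cmp -s - "$1" \
    || { echo "not sorted when sorted first on speaker-id"; return 1; }
}

# Demo with invented data: the first speaker-id differs from the utt-id prefix in case.
tmp=$(mktemp)
printf '%s\n' 'alice_2017-01-01 Alice' 'bob_2017-01-01 bob' > "$tmp"
check_utt2spk "$tmp" || echo "utt2spk needs fixing"
rm -f "$tmp"
```

Note that the comparison is byte-wise (`LC_ALL=C`), so upper-case letters sort before lower-case ones and '-' sorts before '_'.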

Julian Fritsch

Jul 22, 2017, 9:50:28 AM
to kaldi-help, fritsc...@googlemail.com, dpo...@gmail.com
I've been able to solve it now.
I had some issues with speaker IDs and capitalisation; I'm now only using GUIDs.

Thanks a lot, Dan!