Combining data from multiple sources


Stefan Watson

Jul 31, 2017, 1:28:23 PM
to kaldi-help
Hello 

I am trying to combine data from two datasets using the utils/combine_data.sh script. At the validation step, it keeps removing one of the datasets through the filters. The issue is shown below:

utils/combine_data.sh: combined utt2spk
utils/combine_data.sh [info]: not combining utt2lang as it does not exist
utils/combine_data.sh [info]: not combining utt2dur as it does not exist
utils/combine_data.sh [info]: not combining feats.scp as it does not exist
utils/combine_data.sh: combined text
utils/combine_data.sh [info]: not combining cmvn.scp as it does not exist
utils/combine_data.sh [info]: not combining reco2file_and_channel as it does not exist
utils/combine_data.sh: combined wav.scp
utils/combine_data.sh [info]: not combining spk2gender as it does not exist
utils/validate_data_dir.sh: Error: in data/combine_AmJm_2000_tmp, recording-ids extracted from segments and wav.scp
utils/validate_data_dir.sh: differ, partial diff is:
73a74,4896
> sp0.9-fabm2aa1
> sp0.9-fabm2ab2
> sp0.9-fabm2ac1
> sp0.9-fabm2ad2
> sp0.9-fabm2ae2
...
> sp1.1-mwjk2dq2
> sp1.1-mwjk2dr2
> sp1.1-mwjk2ds2
> sp1.1-mwjk2du2
> sp1.1-mwjk2dv2
> sp1.1-mwjk2dw2
[Lengths are kaldi.FfVS/recordings=146 versus kaldi.FfVS/recordings.wav=9792]
steps/make_mfcc.sh --cmd run.pl --nj 50 data/combine_AmJm_2000_tmp exp/make_mfcc/combine_AmJm_2000_tmp mfcc_perturbed
utils/validate_data_dir.sh: Error: in data/combine_AmJm_2000_tmp, recording-ids extracted from segments and wav.scp
utils/validate_data_dir.sh: differ, partial diff is:
73a74,4896
> sp0.9-fabm2aa1
> sp0.9-fabm2ab2
> sp0.9-fabm2ac1
> sp0.9-fabm2ad2
> sp0.9-fabm2ae2
...
> sp1.1-mwjk2dq2
> sp1.1-mwjk2dr2
> sp1.1-mwjk2ds2
> sp1.1-mwjk2du2
> sp1.1-mwjk2dv2
> sp1.1-mwjk2dw2
[Lengths are kaldi.gdNw/recordings=146 versus kaldi.gdNw/recordings.wav=9792]

Can anyone give me some insight into fixing this issue?

Thanks in advance

Daniel Povey

Jul 31, 2017, 3:02:59 PM
to kaldi-help
I think either your Kaldi copy is very old or you have not pasted the
entire output, or you have modified the script, because it should say
either:
echo "$0: combined segments"
or
echo "$0 [info]: not combining segments as it does not exist"
A long time ago the script would not do the right thing when combining
one source that had a segments file and one that did not.

Dan
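
For anyone hitting the same validation error, one quick way to see which recordings are affected is to compare the recording-ids in the two files directly. This is a minimal sketch, assuming the usual data-directory layout (the second column of segments and the first column of wav.scp are recording-ids). The function name and the demo ids are made up for illustration, and this is not the code that validate_data_dir.sh runs internally.

```shell
# Sketch: list recording-ids present in wav.scp but absent from segments.
export LC_ALL=C   # keep sort and comm on the same byte-wise ordering

missing_from_segments() {
  data_dir=$1
  tmp=$(mktemp -d)
  # Column 2 of segments and column 1 of wav.scp are recording-ids.
  awk '{ print $2 }' "$data_dir/segments" | sort -u > "$tmp/seg_recs"
  awk '{ print $1 }' "$data_dir/wav.scp" | sort -u > "$tmp/wav_recs"
  # comm -13 keeps only the lines unique to the second file (wav.scp).
  comm -13 "$tmp/seg_recs" "$tmp/wav_recs"
  rm -rf "$tmp"
}

# Demo on a throwaway directory with hypothetical ids and paths.
demo=$(mktemp -d)
printf 'rec1-0001 rec1 0.0 2.5\n' > "$demo/segments"
printf 'rec1 /path/rec1.wav\nsp0.9-rec2 /path/rec2.wav\n' > "$demo/wav.scp"
missing_from_segments "$demo"
```

If the list is long and all the ids come from one source, that source most likely had no segments file when the directories were combined.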

Stefan Watson

Jul 31, 2017, 9:17:40 PM
to kaldi-help, dpo...@gmail.com
Thanks, Dan. 

I found the issue. The script that creates the segments file was missing.

hazir

Aug 1, 2017, 6:02:53 AM
to kaldi-help
Hi,

When combining two different datasets, if one of them has a segments file and the other does not (because its audio files are already segmented), is there a way to combine them without creating a segments file for the second one? Or is it necessary to create one in which each segment covers the entire audio?

Thank you in advance !

Daniel Povey

Aug 1, 2017, 4:02:52 PM
to kaldi-help
The current version of Kaldi automatically creates a segments file
consisting of the entire audio, when combining in that situation, so
the user has to do nothing. Either Stefan had an old copy of Kaldi or
he changed something.
Dan
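
For reference, a segments file that covers the entire audio has one line per recording of the form "<utt-id> <recording-id> 0 <duration>". Below is a rough sketch of building one by hand from an utt2dur file, assuming the utterance-ids and recording-ids are identical. The function name and demo values are hypothetical, and this is an illustration rather than the code combine_data.sh actually runs.

```shell
# Sketch: write a whole-recording segments file from an utt2dur file.
# utt2dur lines:  <utt-id> <duration-in-seconds>
# segments lines: <utt-id> <recording-id> <start> <end>
make_whole_segments() {
  utt2dur_file=$1
  segments_file=$2
  # Assumption: each utterance is a full recording, so recording-id = utt-id.
  awk '{ print $1, $1, 0, $2 }' "$utt2dur_file" > "$segments_file"
}

# Demo with made-up utterance ids and durations.
demo_dir=$(mktemp -d)
printf 'utt_a 3.25\nutt_b 10.5\n' > "$demo_dir/utt2dur"
make_whole_segments "$demo_dir/utt2dur" "$demo_dir/segments"
cat "$demo_dir/segments"
```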

tbz

Aug 4, 2017, 7:34:26 AM
to kaldi-help, dpo...@gmail.com
Hi,
If I have two folders of training data, data/train1c and data/train2c (one-channel and two-channel), do I need to combine them into one folder, data/train?
I am now trying to run:
steps/align_si.sh --nj $nJobs --cmd "$train_cmd" data/train1c data/lang exp/mono exp/mono_ali || exit 1;
steps/align_si.sh --nj $nJobs --cmd "$train_cmd" data/train2c data/lang exp/mono exp/mono_ali || exit 1;

But would the second command cancel out the first one? Do I need to use the utils/combine_data.sh script from the beginning?
How would the training steps be applied?
If there is a repository similar to my case, please refer me to it.

Thanks, with all respect.
tzd

tbz

Aug 4, 2017, 11:18:02 AM
to kaldi-help, dpo...@gmail.com
Here is what I will do:
I will cat and sort the files from both of my train folders,
then redo the rest.
If you have any better suggestions, I would appreciate you sharing them.
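
For a single file such as text, the manual cat-and-sort idea amounts to roughly the following. This is only a sketch under assumptions: the function name and demo contents are invented, and a real data directory has several files (utt2spk, wav.scp, and so on) that all need the same treatment plus validation afterwards.

```shell
# Sketch: merge one Kaldi-style file from several sources and flag duplicate ids.
export LC_ALL=C   # byte-wise sort order, the ordering Kaldi data files are expected to use

merge_sorted() {
  dest=$1
  shift
  # Concatenate all source files and sort on the first field (the utterance-id).
  cat "$@" | sort -k1,1 > "$dest"
  # Print any utterance-id that occurs more than once; empty output means no clash.
  awk '{ print $1 }' "$dest" | uniq -d
}

# Demo with made-up transcripts.
demo=$(mktemp -d)
printf 'utt_b hello world\n' > "$demo/text1"
printf 'utt_a good morning\n' > "$demo/text2"
merge_sorted "$demo/text" "$demo/text1" "$demo/text2"
cat "$demo/text"
```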

tbz

Aug 4, 2017, 9:10:55 PM
to kaldi-help, dpo...@gmail.com
Issue solved. It was very simple. Just to record it for others who may need to do the same, I simply called:
utils/combine_data.sh data/train data/train1c data/train2c

Thanks to Dan :)