My thanks to both of you for replying.
I checked the page (https://kaldi-asr.org/doc/data_prep.html), and I prepared some files named "text", "utt2spk", and "wav.scp".
Then I ran
$ utils/fix_data_dir.sh data/train
and spk2utt was created automatically.
So the directory (sre16/v2/data/train/) now contains:
text, utt2spk, spk2utt, wav.scp, data (a directory containing the wav files)
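For reference, spk2utt is the inverse mapping of utt2spk: one line per speaker, followed by that speaker's utterance IDs. The following is only a minimal Python sketch of that relationship, not the Kaldi script itself; the function name utt2spk_to_spk2utt is hypothetical, and the sample lines follow the shape of my utt2spk.

```python
from collections import defaultdict


def utt2spk_to_spk2utt(utt2spk_lines):
    """Invert "<utterance-id> <speaker-id>" lines into "<speaker-id> <utt1> <utt2> ..." lines."""
    spk2utt = defaultdict(list)
    for line in utt2spk_lines:
        if not line.strip():
            continue  # skip blank lines
        utt, spk = line.split()
        spk2utt[spk].append(utt)
    # One output line per speaker, speakers in sorted order.
    return [f"{spk} {' '.join(utts)}" for spk, utts in sorted(spk2utt.items())]


# Sample lines in the same shape as my utt2spk file.
sample = [
    "jvs001-001_jvs001-001 jvs001",
    "jvs001-002_jvs001-002 jvs001",
    "jvs004-100_jvs004-100 jvs004",
]

spk2utt_lines = utt2spk_to_spk2utt(sample)
for line in spk2utt_lines:
    print(line)
```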
Then I tried
$ steps/make_mfcc.sh data/train
and the result was:
steps/make_mfcc.sh data/train
utils/validate_data_dir.sh: Error: in data/train, utterance lists extracted from utt2spk and wav.scp
utils/validate_data_dir.sh: differ, partial diff is:
--- /tmp/kaldi.Cuc2/utts 2022-08-02 18:11:37.127440531 +0900
+++ /tmp/kaldi.Cuc2/utts.wav 2022-08-02 18:11:37.135440438 +0900
@@ -1,400 +1,400 @@
-jvs001-001_jvs001-001
-jvs001-002_jvs001-002
-jvs001-003_jvs001-003
...
+jvs004-095
+jvs004-096
+jvs004-097
+jvs004-098
+jvs004-099
+jvs004-100
[Lengths are /tmp/kaldi.Cuc2/utts=400 versus /tmp/kaldi.Cuc2/utts.wav=400]
How can I solve this?
(The corpus I plan to use is the JVS corpus, specifically the parallel100 set.)
I prepared the IDs and filenames as follows:
- speaker-id = jvs$1 ($1 is a number from 001 to 004)
- recording-id = jvs$1-$2 ($2 is a number from 001 to 100)
- utterance-id = jvs$1-$2_jvs$1-$2 (I want to treat one file as one utterance for now, so I formed it by repeating the recording-id)
- extended-filename = path for the data files
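The naming scheme above can be sketched in Python as follows. This is only an illustration of how I built the IDs, not a Kaldi tool; the helper names make_ids and make_wav_path are hypothetical, and the wav path pattern is taken from my wav.scp.

```python
def make_ids(speaker_num, recording_num):
    """Build (speaker-id, recording-id, utterance-id) following the scheme in this post."""
    speaker_id = f"jvs{speaker_num:03d}"
    recording_id = f"{speaker_id}-{recording_num:03d}"
    # The utterance-id is the recording-id repeated, joined by an underscore.
    utterance_id = f"{recording_id}_{recording_id}"
    return speaker_id, recording_id, utterance_id


def make_wav_path(speaker_id, recording_num):
    """Path pattern as it appears in my wav.scp."""
    return f"./speakers/{speaker_id}/VOICEACTRESS100_{recording_num:03d}.wav"


utt2spk_lines = []
wav_scp_lines = []
for spk_num in range(1, 5):        # speakers 001 to 004
    for rec_num in range(1, 101):  # recordings 001 to 100
        spk, rec, utt = make_ids(spk_num, rec_num)
        utt2spk_lines.append(f"{utt} {spk}")  # keyed by utterance-id
        wav_scp_lines.append(f"{rec} {make_wav_path(spk, rec_num)}")  # keyed by recording-id

print(utt2spk_lines[0])
print(wav_scp_lines[0])
print(len(utt2spk_lines), len(wav_scp_lines))
```

Note that in this sketch, as in my files, utt2spk is keyed by the utterance-id while wav.scp is keyed by the recording-id.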
So the files look like this:
text;
jvs001-001_jvs001-001 *Japanese text*
jvs001-002_jvs001-002 *Japanese text*
...
jvs004-100_jvs004-100 *Japanese text*
utt2spk;
jvs001-001_jvs001-001 jvs001
jvs001-002_jvs001-002 jvs001
...
jvs004-100_jvs004-100 jvs004
wav.scp;
jvs001-001 ./speakers/jvs001/VOICEACTRESS100_001.wav
jvs001-002 ./speakers/jvs001/VOICEACTRESS100_002.wav
...
jvs004-100 ./speakers/jvs004/VOICEACTRESS100_100.wav
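To make the reported difference concrete, the first columns of utt2spk and wav.scp can be compared with a short Python sketch. This is only an illustration, not Kaldi's own validation code; the helper names read_keys and compare_keys are hypothetical, and the sample lines are copied from the listings above.

```python
def read_keys(lines):
    """Return the first whitespace-separated field of each non-empty line."""
    return [line.split(maxsplit=1)[0] for line in lines if line.strip()]


def compare_keys(utt2spk_lines, wav_scp_lines):
    """Return (IDs only in utt2spk, IDs only in wav.scp) as sorted lists."""
    utts = set(read_keys(utt2spk_lines))
    wavs = set(read_keys(wav_scp_lines))
    return sorted(utts - wavs), sorted(wavs - utts)


# Sample lines copied from my utt2spk and wav.scp.
utt2spk = [
    "jvs001-001_jvs001-001 jvs001",
    "jvs001-002_jvs001-002 jvs001",
]
wav_scp = [
    "jvs001-001 ./speakers/jvs001/VOICEACTRESS100_001.wav",
    "jvs001-002 ./speakers/jvs001/VOICEACTRESS100_002.wav",
]

only_utt, only_wav = compare_keys(utt2spk, wav_scp)
print("only in utt2spk:", only_utt)
print("only in wav.scp:", only_wav)
```

With these samples every ID appears on only one side, which matches the diff in the error output, where all utt2spk entries are removed and all wav.scp entries are added.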
Regards.