WSJ data training: where can I get transcriptions for all the audio data?

hyungwon yang

Apr 25, 2017, 2:50:36 AM
to kaldi-help
Hello all,

After downloading all the WSJ data (csr1 and csr2) from LDC, I ran the ./wsj/s5 script.
Few problems occurred during training, and I got a good result when it finished. (So I have no complaints at all.)
However, I recently realized that not all audio files have a corresponding transcription file. For ASR training, each audio file should have its own transcription file (e.g., abc.wav and abc.txt).
To find out how many audio files are missing their transcription files, I collected all the audio files from csr1 and csr2; there were 131,075 in total.
By contrast, the number of transcription files I collected (the dot-format files in the wsj directory; wsj_data_prep.sh shows where to find them) was just 39,923, which is almost a third of the number of audio files.
Does this mean that I used only 1/3 of the wsj audio files for training?
If so, where can I get the rest of the transcription files so that I can train on all the audio data?
I will email LDC about this, but I am also asking here in case someone has already raised this question and knows how to get all those transcriptions.
So please let me know if anyone knows where I can get transcriptions for all the WSJ audio files.

Thanks in advance.

Daniel Povey

Apr 25, 2017, 2:49:50 PM
to kaldi-help
When I do a word-count of the prepared data from kaldi, I get this:

 wc data/train_si{84,284}/utt2spk
   7138  14276  92794 data/train_si84/utt2spk
  37416  74832 486408 data/train_si284/utt2spk

(note: si84 is a subset of si-284).
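The same check can be sketched in Python. This is a minimal illustration, not part of the Kaldi recipe: it assumes only the standard Kaldi utt2spk layout (one "utterance-id speaker-id" pair per line), and the helper names `read_utt2spk`, `summarize`, and `is_subset` are hypothetical.

```python
from pathlib import Path


def read_utt2spk(path):
    """Parse a utt2spk file into {utterance_id: speaker_id}.

    Assumes one '<utterance-id> <speaker-id>' pair per line.
    """
    mapping = {}
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        utt_id, spk_id = line.split()
        mapping[utt_id] = spk_id
    return mapping


def summarize(path):
    """Return (number of utterances, number of distinct speakers)."""
    utt2spk = read_utt2spk(path)
    return len(utt2spk), len(set(utt2spk.values()))


def is_subset(small_path, large_path):
    """True if every utterance ID in small_path also appears in large_path."""
    return set(read_utt2spk(small_path)) <= set(read_utt2spk(large_path))
```

For example, `summarize("data/train_si284/utt2spk")` would report the utterance and speaker counts for that set, and `is_subset("data/train_si84/utt2spk", "data/train_si284/utt2spk")` would check the subset relationship mentioned above.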

I think you have misunderstood something but I don't think it's a good use of my time to figure out exactly what you have misunderstood.

Dan



--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

hyungwon yang

Apr 25, 2017, 10:05:53 PM
to kaldi-help, dpo...@gmail.com
Thanks for your response!


I got the same word-count result you showed me. (So I think I ran the code successfully and did not miss any data processing steps.)

I think I should have phrased my question differently. (I should not have asked, "Does this mean that I used only 1/3 of the wsj audio files for training?")

I want transcription files that match every wave file in WSJ, and I wonder where I can get them.

Since the total number of wave files in csr1 and csr2 is 131,075, I thought we would also need 131,075 transcription files before training or testing models.

The number of text files I mentioned above (39,923) comes from collecting all the text files in every directory under the "data" directory (not only train_*** but also test_***).

I raised this question because it seems a shame not to use all the wave files for training and testing models.

Please let me know if anyone knows how to get the full set of transcription files, so that I can add more audio and transcriptions to the training and testing steps, or whether there is simply no way to get them.
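One way to make the comparison concrete is to group audio files by utterance ID before counting, rather than comparing raw file counts. The sketch below is only an illustration under stated assumptions: it assumes the file name without its extension is the utterance ID, and that several audio files may share one ID (for example, the same name with different extensions). Neither assumption is confirmed in this thread, and the function names are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePath


def group_audio_by_utterance(audio_paths):
    """Group audio file paths by utterance ID.

    Assumption: the utterance ID is the file name without its extension,
    compared case-insensitively.
    """
    groups = defaultdict(list)
    for path in audio_paths:
        groups[PurePath(path).stem.lower()].append(path)
    return dict(groups)


def compare_with_transcripts(audio_paths, transcribed_ids):
    """Compare audio files against a collection of transcribed utterance IDs."""
    audio_paths = list(audio_paths)
    groups = group_audio_by_utterance(audio_paths)
    transcribed = {utt_id.lower() for utt_id in transcribed_ids}
    audio_ids = set(groups)
    return {
        "audio_files": len(audio_paths),           # raw file count
        "distinct_utterances": len(audio_ids),     # after grouping by ID
        "with_transcript": len(audio_ids & transcribed),
        "without_transcript": sorted(audio_ids - transcribed),
    }
```

If `distinct_utterances` is much smaller than `audio_files`, the gap between the two counts is partly explained by multiple files per utterance rather than by missing transcriptions; the `without_transcript` list then shows which IDs genuinely have no transcript in the collected set.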



Daniel Povey

Apr 25, 2017, 10:08:14 PM
to hyungwon yang, kaldi-help
I think WSJ contains the same utterances spoken by many speakers.  Anyway, it's all explained in the README files in the LDC distribution and/or the paper about WSJ.