OK, given that the flashlight recipe used to train the Facebook models is public and the models themselves are available, you can analyze what is different. To approach their WER I would try:
1. Sync the feature extraction (80 mel bins instead of 40).
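In Kaldi the mel-bin count is set in the fbank config file; a minimal `conf/fbank.conf` for 80-dim features might look like this (the exact option set beyond `--num-mel-bins` depends on your recipe):

```
--num-mel-bins=80
--sample-frequency=16000
```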
2. Sync the CNN frontend (try a CNN-TDNN model instead of a pure TDNN; there is an example in the multi_cn recipe).
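The CNN frontend in the cnn-tdnn-style recipes is declared with xconfig `conv-relu-batchnorm-layer` lines placed before the TDNN-F stack. A sketch, assuming 80-dim input (the filter counts and offsets here are illustrative, not the recipe's exact values):

```
input dim=80 name=input
conv-relu-batchnorm-layer name=cnn1 height-in=80 height-out=80 time-offsets=-1,0,1 height-offsets=-1,0,1 num-filters-out=48
conv-relu-batchnorm-layer name=cnn2 height-in=80 height-out=40 height-subsample-out=2 time-offsets=-1,0,1 height-offsets=-1,0,1 num-filters-out=64
```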
3. Significantly increase the model size and the number of layers. Their AM is 1 GB as published on GitHub, while yours is just 50 MB; it is unrealistic to expect your model to match theirs. Increase the number of layers to about 50 and the layer size to 4096 with a bottleneck of 512, so that your model is at least 300 MB. You will see much better accuracy then.
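As a sanity check on those numbers, a back-of-the-envelope parameter count (assuming float32 storage and a simplified factorized TDNN-F layer shape; `tdnnf_params` is a hypothetical helper, not Kaldi code) shows that 50 layers of 4096 with a 512 bottleneck lands well above the 300 MB mark:

```python
def tdnnf_params(num_layers=50, hidden_dim=4096, bottleneck=512, context=3):
    """Rough parameter count for a stack of factorized TDNN (TDNN-F) layers.

    Each layer is modeled as hidden -> bottleneck with temporal context
    splicing, then bottleneck -> hidden; biases and batchnorm are ignored.
    """
    per_layer = hidden_dim * bottleneck * context + bottleneck * hidden_dim
    return num_layers * per_layer

params = tdnnf_params()
size_mb = params * 4 / 1024 ** 2  # float32 = 4 bytes per parameter
print(f"{params / 1e6:.0f}M parameters, ~{size_mb:.0f} MB on disk")
```

This is only an estimate, but it shows the scale involved: such a configuration is hundreds of millions of parameters, far beyond a 50 MB model.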
4. Use more advanced dropout, train longer, and enable SpecAugment if you have not already.
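If SpecAugment is not yet in your pipeline, the core idea is just zeroing random time and frequency stripes of the log-mel spectrogram. A minimal numpy sketch (the mask counts and widths are illustrative defaults, not the paper's exact policy):

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, max_freq_width=27,
                 num_time_masks=2, max_time_width=100, rng=None):
    """Zero random frequency and time stripes of a (time, mel) spectrogram."""
    rng = rng if rng is not None else np.random.default_rng()
    out = spec.copy()
    T, F = out.shape
    for _ in range(num_freq_masks):
        w = int(rng.integers(1, max_freq_width + 1))   # stripe width >= 1
        f0 = int(rng.integers(0, F - w + 1))
        out[:, f0:f0 + w] = 0.0
    for _ in range(num_time_masks):
        w = int(rng.integers(1, min(max_time_width, T) + 1))
        t0 = int(rng.integers(0, T - w + 1))
        out[t0:t0 + w, :] = 0.0
    return out
```

In practice you would apply this per utterance during training only, not at test time.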
5. They report 2.0 / 2.5 WER on LibriSpeech when training on the LibriSpeech training set with a 4-gram LM. It is probably easier to reproduce that number first, before you start training on MLS. Even the most advanced Kaldi-style system reported, the Multistream CNN from ASAPP (https://arxiv.org/pdf/2005.10469.pdf), only reached 2.6 / 2.8 WER, not 2.0 / 2.5.
6. In general it is more about model size than about the architecture. You can gain a lot just by increasing the size of the CNN encoder frontend. For example, Nvidia's Citrinet models get very good results with a strong CNN encoder and a simple decoder. There is not much need for advanced transformers, but the CNN layers have to be large.
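The "CNN layers have to be large" point is easy to quantify: a 1-D convolution's parameter count grows quadratically with channel width, so widening the encoder is where most of the capacity comes from. A small sketch (`conv1d_params` is a hypothetical helper for the estimate):

```python
def conv1d_params(in_ch, out_ch, kernel):
    # weight tensor of shape (out_ch, in_ch, kernel) plus a bias per output channel
    return in_ch * out_ch * kernel + out_ch

narrow = conv1d_params(256, 256, 11)
wide = conv1d_params(1024, 1024, 11)
print(narrow, wide, round(wide / narrow, 1))  # 4x the channels -> ~16x the parameters
```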
To run all those experiments, make sure you have access to 64 V100 GPUs.
That's it.