OK, given that the flashlight recipe used to train the Facebook models is public and the models themselves are available, you can analyze what is different. To approach their WER I would try:
1. Sync the feature extraction (80 mel bins instead of 40).
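In Kaldi the mel-bin count is set in the fbank config file; a minimal `conf/fbank.conf` for 80-dim features might look like this (the exact option set beyond `--num-mel-bins` depends on your recipe):

```
--num-mel-bins=80
--sample-frequency=16000
```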
2. Sync the CNN frontend (try a CNN-TDNN model instead of a pure TDNN; there is an example in the multi_cn recipe).
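The CNN frontend in the cnn-tdnn-style recipes is declared with xconfig `conv-relu-batchnorm-layer` lines placed before the TDNN-F stack. A sketch, assuming 80-dim input (the filter counts and offsets here are illustrative, not the recipe's exact values):

```
input dim=80 name=input
conv-relu-batchnorm-layer name=cnn1 height-in=80 height-out=80 time-offsets=-1,0,1 height-offsets=-1,0,1 num-filters-out=48
conv-relu-batchnorm-layer name=cnn2 height-in=80 height-out=40 height-subsample-out=2 time-offsets=-1,0,1 height-offsets=-1,0,1 num-filters-out=64
```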
3. Significantly increase the model size and the number of layers. Their AM is 1 GB as published on GitHub, while yours is just 50 MB; it is unrealistic to expect your model to match theirs. Increase the number of layers to about 50 and the layer size to 4096 with a bottleneck of 512, so that your model is at least 300 MB. You will see much better accuracy then.
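As a sanity check on those numbers, a back-of-the-envelope parameter count (assuming float32 storage and a simplified factorized TDNN-F layer shape; `tdnnf_params` is a hypothetical helper, not Kaldi code) shows that 50 layers of 4096 with a 512 bottleneck lands well above the 300 MB mark:

```python
def tdnnf_params(num_layers=50, hidden_dim=4096, bottleneck=512, context=3):
    """Rough parameter count for a stack of factorized TDNN (TDNN-F) layers.

    Each layer is modeled as hidden -> bottleneck with temporal context
    splicing, then bottleneck -> hidden; biases and batchnorm are ignored.
    """
    per_layer = hidden_dim * bottleneck * context + bottleneck * hidden_dim
    return num_layers * per_layer

params = tdnnf_params()
size_mb = params * 4 / 1024 ** 2  # float32 = 4 bytes per parameter
print(f"{params / 1e6:.0f}M parameters, ~{size_mb:.0f} MB on disk")
```

This is only an estimate, but it shows the scale involved: such a configuration is hundreds of millions of parameters, far beyond a 50 MB model.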
4. Use more advanced dropout, train longer, and enable SpecAugment if you have not already.
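If SpecAugment is not yet in your pipeline, the core idea is just zeroing random time and frequency stripes of the log-mel spectrogram. A minimal numpy sketch (the mask counts and widths are illustrative defaults, not the paper's exact policy):

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, max_freq_width=27,
                 num_time_masks=2, max_time_width=100, rng=None):
    """Zero random frequency and time stripes of a (time, mel) spectrogram."""
    rng = rng if rng is not None else np.random.default_rng()
    out = spec.copy()
    T, F = out.shape
    for _ in range(num_freq_masks):
        w = int(rng.integers(1, max_freq_width + 1))   # stripe width >= 1
        f0 = int(rng.integers(0, F - w + 1))
        out[:, f0:f0 + w] = 0.0
    for _ in range(num_time_masks):
        w = int(rng.integers(1, min(max_time_width, T) + 1))
        t0 = int(rng.integers(0, T - w + 1))
        out[t0:t0 + w, :] = 0.0
    return out
```

In practice you would apply this per utterance during training only, not at test time.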
5. They report 2.0 / 2.5 WER on LibriSpeech when training on the LibriSpeech training set with a 4-gram LM. It is probably easier to reproduce that number first, before you start training on MLS. Even the most advanced Kaldi-style system reported, the Multistream CNN from ASAPP (https://arxiv.org/pdf/2005.10469.pdf), only reached 2.6 / 2.8 WER, not 2.0 / 2.5.
6. In general it is more about model size than about the architecture. You can gain a lot just by increasing the size of the CNN encoder frontend. For example, Nvidia's Citrinet models get very good results with a strong CNN encoder and a simple decoder. There is not much need for advanced transformers, but the CNN layers have to be large.
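The "CNN layers have to be large" point is easy to quantify: a 1-D convolution's parameter count grows quadratically with channel width, so widening the encoder is where most of the capacity comes from. A small sketch (`conv1d_params` is a hypothetical helper for the estimate):

```python
def conv1d_params(in_ch, out_ch, kernel):
    # weight tensor of shape (out_ch, in_ch, kernel) plus a bias per output channel
    return in_ch * out_ch * kernel + out_ch

narrow = conv1d_params(256, 256, 11)
wide = conv1d_params(1024, 1024, 11)
print(narrow, wide, round(wide / narrow, 1))  # 4x the channels -> ~16x the parameters
```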
To run all those experiments, make sure you have access to 64 V100 GPUs.
That's it.