SSL features instead of MFCC in Kaldi?

Max Lvov

Mar 18, 2024, 11:43:53 AM

to kaldi-developers

Following this paper for End2End models:

"AN EXPLORATION OF SELF-SUPERVISED PRETRAINED REPRESENTATIONS FOR
END-TO-END SPEECH RECOGNITION"

Has anyone tried extracting features with SSL pretrained models (such as HuBERT) instead of MFCCs, and then training a hybrid model on top of them?

ondrej...@gmail.com

Mar 19, 2024, 4:47:37 AM

to kaldi-developers

Hi Max,

We trained a small TDNN-F model on top of features extracted with XLSR-53 in "Comparing Self-Supervised Pre-Training and Semi-Supervised Training for Speech Recognition in Languages with Weak Language Models", and the SSL features helped a lot. Another benefit of SSL features is that you can continue self-supervised pretraining on untranscribed data, even when semi-supervised training does not work well because of a weak language model.
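For readers wondering how externally computed features can be handed to Kaldi at all, here is a minimal sketch, not the code used in the paper. It assumes the SSL features have already been extracted by some other tool into plain Python lists (one list of floats per frame); the function names `format_kaldi_text_matrix` and `write_text_ark` and the utterance ID are hypothetical. It writes Kaldi's text-format matrix archive, which Kaldi tools can read with the `ark,t:` specifier.

```python
import io


def format_kaldi_text_matrix(utt_id, frames):
    """Render one utterance as an entry of a Kaldi text-format matrix archive.

    `frames` is a list of equal-length lists of floats (one per frame).
    How the frames were produced (e.g. by an SSL model) is outside this sketch.
    """
    if frames and any(len(f) != len(frames[0]) for f in frames):
        raise ValueError(f"{utt_id}: frames have inconsistent dimensions")
    rows = ["  " + " ".join(f"{v:.6g}" for v in frame) for frame in frames]
    # Kaldi text matrices look like: "utt_id  [\n  r0c0 r0c1\n  r1c0 r1c1 ]\n"
    return f"{utt_id}  [\n" + "\n".join(rows) + " ]\n"


def write_text_ark(features, stream):
    """Write a dict {utt_id: frames} to `stream`, sorted by utterance ID."""
    for utt_id in sorted(features):
        stream.write(format_kaldi_text_matrix(utt_id, features[utt_id]))


# Hypothetical toy input: one utterance, two frames, two dimensions.
buf = io.StringIO()
write_text_ark({"utt1": [[1.0, 2.5], [3.0, 4.0]]}, buf)
ark_text = buf.getvalue()
```

A text archive written this way can be converted to the usual binary archive plus index with `copy-feats ark,t:feats.txt ark,scp:feats.ark,feats.scp`. For real 1024-dimensional features a binary writer would be much more compact; the text format is used here only because it is easy to inspect.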

Best regards,

Ondrej

Max Lvov

Mar 19, 2024, 3:20:30 PM

to kaldi-developers

Thanks Ondrej!

Did you try any SSL pretrained models other than XLSR, such as HuBERT or WavLM?

ondrej...@gmail.com

Mar 20, 2024, 5:09:53 AM

to kaldi-developers

I tried XLS-R, XLSR-53, wav2vec 2.0, and HuBERT. They all worked better than MFCC features, but XLS-R worked best for low-resource languages.

Aditya Parikh

Apr 1, 2024, 6:51:20 AM

to kaldi-developers

Hi Ondrej,

Thanks for mentioning the paper. Is there any codebase available to replicate the methodology?

I am specifically talking about these lines:

"We trained a five-lingual (four Bantu languages + English) South African acoustic models which used either 40-dimensional MFCC features or 1024-dimensional XLSR-53 features as inputs. Both types of models were trained using Kaldi toolkit [34] and used the same alignments obtained with a standard GMM model."

I am trying to use this method to train a phoneme recognition model.
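One practical detail when reusing GMM alignments with SSL features: wav2vec 2.0-style encoders (including XLSR-53 and HuBERT) emit one frame about every 20 ms, while Kaldi's default MFCC frame shift is 10 ms. Neither the thread nor the quoted passage says how this was handled, so the sketch below is only one possible approach and an assumption on my part: repeat each SSL frame twice, then trim or pad to the frame count of the MFCC-based alignment. The helper names `repeat_frames` and `match_length` are hypothetical.

```python
def repeat_frames(frames, factor=2):
    """Upsample by repeating every frame `factor` times (e.g. 20 ms -> 10 ms)."""
    return [list(frame) for frame in frames for _ in range(factor)]


def match_length(frames, target_len):
    """Trim, or pad by repeating the last frame, to exactly `target_len` frames."""
    if not frames:
        raise ValueError("cannot match the length of an empty feature matrix")
    out = [list(frame) for frame in frames[:target_len]]
    while len(out) < target_len:
        out.append(list(out[-1]))
    return out


# Hypothetical toy input: 2 SSL frames, aligned against 5 MFCC frames.
ssl_frames = [[1.0, 2.0], [3.0, 4.0]]
upsampled = repeat_frames(ssl_frames)   # 4 frames at the finer frame shift
aligned = match_length(upsampled, 5)    # padded to the alignment length
```

Other options, such as extracting MFCCs with a 20 ms frame shift for the GMM stage or adjusting the frame subsampling factor of the neural model, may be equally valid; which one matches the paper is not stated here.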

Thanks,

Aditya.
