Kaldi Non-English Features Trouble


Travis C

Jul 2, 2025, 12:09:52 PM
to kaldi-developers

Using the Montreal Forced Aligner (MFA) as both our target and our reference for how to use Kaldi, we implemented a bare-bones C++ aligner that calls Kaldi directly. We align by generating MFCCs, applying CMVN, running GMM alignment, and then trying to produce a CTM from the output. Disclaimer: I don't actually know how most of this works yet; I'm very much still learning.
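
To make the CMVN step of that pipeline concrete, here is a minimal, self-contained sketch of per-utterance cepstral mean and variance normalization. It does not use Kaldi's own classes; `Features` and `ApplyCmvn` are hypothetical names chosen for illustration. Whether variance normalization should be on is an assumption that has to match how the acoustic model was trained, so it is left as a parameter.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One utterance of features: frames x dims (for example 13-dim MFCCs).
// Hypothetical type alias, not a Kaldi type.
using Features = std::vector<std::vector<double>>;

// Per-utterance CMVN, applied in place.
// Always subtracts the per-dimension mean; if norm_vars is true it also
// scales each dimension to unit (population) variance.
void ApplyCmvn(Features& feats, bool norm_vars) {
  if (feats.empty()) return;
  const std::size_t dim = feats[0].size();
  const double n = static_cast<double>(feats.size());
  for (std::size_t d = 0; d < dim; ++d) {
    double sum = 0.0, sum_sq = 0.0;
    for (const auto& frame : feats) {
      assert(frame.size() == dim);  // all frames must share one dimension
      sum += frame[d];
      sum_sq += frame[d] * frame[d];
    }
    const double mean = sum / n;
    const double var = sum_sq / n - mean * mean;
    // Floor the variance so a constant dimension does not divide by zero.
    const double scale =
        norm_vars ? 1.0 / std::sqrt(std::max(var, 1e-20)) : 1.0;
    for (auto& frame : feats) frame[d] = (frame[d] - mean) * scale;
  }
}
```

If the normalization applied at alignment time differs from what the model saw in training (means only versus means and variances, per utterance versus per speaker), the GMM likelihoods can degrade badly, so this setting is worth comparing against the model's configuration.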


We are currently ingesting MFA pre-trained models, and we use the MFA CLI tool to extract the files and features our code needs from them. This works reliably for the English model, but as we've tried to support other languages with additional features such as pitch and deltas, we've run into trouble.
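
For reference, delta features are regression slopes computed over neighbouring frames and appended to the static features. The sketch below is a self-contained illustration of first-order deltas only, with edge frames clamped to the nearest valid frame; `AppendDeltas` and the default window of 2 are assumptions made for this example, not the thread author's code or Kaldi's API. Note that appending deltas changes the feature dimension, which has to match what the model expects.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type alias: frames x dims.
using Features = std::vector<std::vector<double>>;

// Appends first-order regression deltas to every frame:
//   delta[t] = sum_{k=1..window} k * (x[t+k] - x[t-k]) / (2 * sum_k k^2)
// Frames past the utterance edges are replaced by the nearest valid frame.
// Output dimension is twice the input dimension.
Features AppendDeltas(const Features& feats, int window = 2) {
  assert(window >= 1);
  const int num_frames = static_cast<int>(feats.size());
  if (num_frames == 0) return {};
  const std::size_t dim = feats[0].size();

  double norm = 0.0;
  for (int k = 1; k <= window; ++k) norm += 2.0 * k * k;

  Features out(num_frames, std::vector<double>(2 * dim, 0.0));
  for (int t = 0; t < num_frames; ++t) {
    assert(feats[t].size() == dim);  // all frames must share one dimension
    for (std::size_t d = 0; d < dim; ++d) {
      double acc = 0.0;
      for (int k = 1; k <= window; ++k) {
        const int next = std::min(t + k, num_frames - 1);
        const int prev = std::max(t - k, 0);
        acc += k * (feats[next][d] - feats[prev][d]);
      }
      out[t][d] = feats[t][d];        // static coefficient
      out[t][dim + d] = acc / norm;   // first-order delta
    }
  }
  return out;
}
```

A setup that also uses second-order (delta-delta) features would apply the same regression again to the deltas and append that result too; this sketch stops at first order to stay short.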


With the Spanish model we're currently using, a short script of 20 words comes out with the first 19 words incorrectly crammed into the first half second and the last word taking up the remaining 7 seconds of audio. Adding pitch and delta features didn't seem to make a difference, so we're at a loss as to what we're missing, since we're really inexperienced here.
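
A symptom like this can be caught automatically before a CTM is written. The sketch below is a hedged, self-contained illustration: it converts frame-indexed word segments to seconds and counts words whose duration is implausibly short. The struct names, the 10 ms frame shift, and the 50 ms threshold are all assumptions for this example; the frame shift in particular must match the value used for feature extraction. The check only flags the collapse, it does not identify its cause.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical containers for illustration; not Kaldi types.
struct WordSegment {
  std::string word;
  int start_frame;
  int num_frames;
};

struct TimedWord {
  std::string word;
  double start_sec;
  double dur_sec;
};

// Converts frame indices to seconds using a fixed frame shift
// (assumed 10 ms here; use the value from your feature config).
std::vector<TimedWord> ToSeconds(const std::vector<WordSegment>& segs,
                                 double frame_shift_sec = 0.01) {
  assert(frame_shift_sec > 0.0);
  std::vector<TimedWord> out;
  out.reserve(segs.size());
  for (const auto& s : segs) {
    out.push_back({s.word, s.start_frame * frame_shift_sec,
                   s.num_frames * frame_shift_sec});
  }
  return out;
}

// Counts words shorter than min_dur_sec. Many such words in a row
// suggest the alignment has collapsed rather than tracked the audio.
int CountImplausiblyShort(const std::vector<TimedWord>& words,
                          double min_dur_sec = 0.05) {
  int count = 0;
  for (const auto& w : words) {
    if (w.dur_sec < min_dur_sec) ++count;
  }
  return count;
}
```

Running this over the word segments of a known-good English alignment and the failing Spanish one gives a quick numeric comparison to track while changing feature settings.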


I’m hoping to find one of two things here:
1. An idea of what could be missing such that English works okay but Spanish and other languages don't work at all.

2. Someone willing to work with us more directly to get this implementation into a better spot; we are willing to compensate you for your time.


Thanks, and happy to answer any questions. I know this is a bit open-ended.


Nickolay Shmyrev

Jul 6, 2025, 5:25:07 AM
to kaldi-developers
It is better to use nnet3 models than GMMs; they are more accurate for alignment on diverse inputs in real-life conditions. You can use pretrained models for Spanish.

You certainly do not need the pitch feature for alignment; it has very specific use cases and is not really practical.

Travis C

Jul 11, 2025, 11:50:43 AM
to kaldi-developers
Thanks Nickolay, good to know about the pitch feature; it was tough for me to work out how important it is for our goal. I'll look into nnet3 models for sure. I was just hoping we could at least get our GMM setup working correctly before experimenting too much. Sounds like I've got some more learning to do.