Using the Montreal Forced Aligner (MFA) as both our target and our reference for how to use Kaldi, we implemented a barebones C++ aligner that interfaces with Kaldi directly. We align by generating MFCCs, applying CMVN, running GMM alignment, and then trying to derive a CTM from the output. *Disclaimer: I do not actually know how most of this works and am very much still learning.
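To make the last step of that pipeline concrete, here is a minimal sketch of turning a per-frame label sequence into CTM-style segments. It is not our actual code and does not call Kaldi. The `CtmEntry` struct, the `FramesToCtm` name, and the 10 ms frame shift are assumptions for illustration. It also merges any run of identical labels, so two adjacent instances of the same phone would be fused; a real implementation would need the model's transition information to find true boundaries.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical CTM-style entry: a label with a start time and duration in seconds.
struct CtmEntry {
    std::string label;
    double start;
    double duration;
};

// Collapse a per-frame label sequence into contiguous segments.
// frame_shift is the hop between frames in seconds (0.01 is an assumed value;
// it has to match whatever the feature extraction actually used).
std::vector<CtmEntry> FramesToCtm(const std::vector<std::string>& frame_labels,
                                  double frame_shift = 0.01) {
    assert(frame_shift > 0.0);
    std::vector<CtmEntry> out;
    std::size_t seg_start = 0;
    for (std::size_t t = 1; t <= frame_labels.size(); ++t) {
        // Close the current segment at the end of input or when the label changes.
        if (t == frame_labels.size() || frame_labels[t] != frame_labels[seg_start]) {
            out.push_back({frame_labels[seg_start],
                           static_cast<double>(seg_start) * frame_shift,
                           static_cast<double>(t - seg_start) * frame_shift});
            seg_start = t;
        }
    }
    return out;
}
```

If the frame shift used here differs from the one used during feature extraction, every timestamp in the CTM is scaled wrongly, so that value is worth double-checking per model.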
We are currently ingesting MFA pre-trained models, and we even use the MFA CLI tool to extract the files and features our code needs from them. This works reliably for the English model, but as we've tried to support other languages whose models use additional features like pitch and deltas, we've run into trouble.
Currently, with the Spanish model we’re using, a short 20-word script comes out with the first 19 words incorrectly crammed into the first half second and the last word stretched across the remaining 7 seconds of audio. Adding pitch and delta features didn’t seem to make a difference, so we’re at a loss as to what we’re missing, especially since we’re really inexperienced here.
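For anyone who wants to reproduce or regression-test the symptom, this is a rough sketch of a check that flags a "collapsed" alignment like the one described above. It does not diagnose the cause. The `WordSpan` struct, the `LooksCollapsed` name, and both thresholds are made up for illustration and would need tuning.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical word timing taken from a CTM: start and end in seconds.
struct WordSpan {
    double start;
    double end;
};

// Returns true when at least `word_fraction` of the words end within the first
// `time_fraction` of the audio. Both thresholds are arbitrary assumptions.
bool LooksCollapsed(const std::vector<WordSpan>& words, double audio_seconds,
                    double time_fraction = 0.1, double word_fraction = 0.8) {
    assert(audio_seconds > 0.0);
    if (words.empty()) return false;
    const double cutoff = audio_seconds * time_fraction;
    std::size_t early = 0;
    for (const WordSpan& w : words) {
        if (w.end <= cutoff) ++early;
    }
    return static_cast<double>(early) >=
           word_fraction * static_cast<double>(words.size());
}
```

With the Spanish output described above (19 of 20 words ending within the first half second of roughly 7.5 seconds of audio), this returns true, while a roughly even spread of words over the same audio returns false.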
I’m hoping to find one of two things here:
1. An idea of what we could be missing that would let English work fine while Spanish and other languages don't work at all.
2. Someone who’s willing to work with us more directly to get this implementation into a better spot. We are willing to compensate you for your time.
Thanks, and I'm happy to answer any questions. I know this is a bit open-ended.