Implementation of per-speaker dictionaries

Dan Wells

Dec 14, 2021, 11:45:54 AM
to MFA Users
Hi, I'm curious about the implementation of per-speaker dictionaries in MFA and how to replicate it in my own Kaldi recipes.

Checking the logs of a recent MFA (v2.0.0b8) run using two dictionaries for different speaker groups, it seems like it does the following:
  1. Merge phones (and words?) from both dictionaries so that numeric indices line up and HMM topo files end up identical (?)
  2. For context-dependent models, cluster phones and build decision trees using info from a single dictionary (fine because all phones are present and we have stats from initial monophone model)
  3. Compile two sets of training graphs, for utterances from each dictionary
  4. Run two separate alignments, for utterances from each dictionary
  5. Accumulate statistics and update models once, pooling data from all utterances across both dictionaries
  6. Iterate alignment per dictionary with global updates until model is trained
Do I have the right idea there?
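
Concretely, here's roughly how I was picturing steps 3 to 6 in my own recipe, calling standard Kaldi binaries from Python. This is just a sketch of my understanding, not MFA's code: the paths, dictionary names and the pre-built tree/model are placeholders, and I've left out CMVN/deltas and the iteration loop.

import subprocess

dicts = ["dict_a", "dict_b"]   # placeholder names for my two speaker groups
exp = "exp/mono_multidict"     # assumes tree and 0.mdl were already built from the merged phone set

for d in dicts:
    feats = f"scp:data/{d}/feats.scp"  # a real recipe would add CMVN/deltas here
    text = f"ark:sym2int.pl -f 2- data/lang_{d}/words.txt data/{d}/text|"

    # 3. Training graphs per dictionary, using that dictionary's own L.fst
    subprocess.run(["compile-train-graphs", f"{exp}/tree", f"{exp}/0.mdl",
                    f"data/lang_{d}/L.fst", text, f"ark:{exp}/fsts.{d}.ark"], check=True)

    # 4. Alignment per dictionary against the shared model
    subprocess.run(["gmm-align-compiled", "--beam=10", "--retry-beam=40", f"{exp}/0.mdl",
                    f"ark:{exp}/fsts.{d}.ark", feats, f"ark:{exp}/ali.{d}.ark"], check=True)

    # 5. Stats accumulation per dictionary
    subprocess.run(["gmm-acc-stats-ali", f"{exp}/0.mdl", feats,
                    f"ark:{exp}/ali.{d}.ark", f"{exp}/0.{d}.acc"], check=True)

# 5./6. A single global update pooling the accumulators from both dictionaries
accs = " ".join(f"{exp}/0.{d}.acc" for d in dicts)
subprocess.run(["gmm-est", f"{exp}/0.mdl", f"gmm-sum-accs - {accs}|", f"{exp}/1.mdl"],
               check=True)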

To me it seems like matching up the phone and word indices so that both dictionaries can be used consistently is the main practical consideration. Anything else, like how much variation you expect in the realisation of phones using the same symbol across two speaker groups, is a modelling question, and could maybe be addressed by adjusting the number of GMM components or decision tree leaves or something... Does that line up with your experience implementing this feature?
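
For reference, as I understand the standard Kaldi recipes, those knobs are just the first two arguments to the tree-building stage, something like:

import subprocess

# Hypothetical triphone stage: 2000 tree leaves, 10000 total Gaussians (numbers are only examples).
# In the per-speaker-dictionary setup, the lang and alignment dirs would presumably be whichever
# ones the tree gets built from.
subprocess.run(["steps/train_deltas.sh", "--cmd", "run.pl", "2000", "10000",
                "data/train", "data/lang_dict_a", "exp/mono_ali", "exp/tri1"], check=True)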

Thanks!

michael.e...@gmail.com

Dec 14, 2021, 3:58:46 PM
to MFA Users
I did a blog post a while back about my experiences here: https://memcauliffe.com/speaker-dictionaries-and-multilingual-ipa.html.

I haven't tried out the training aspect too much (other than making sure that it runs). I've mostly used it as a way to apply US English models to UK speech without blowing up the lexicon size. That said, I'm hoping to start experimenting with multilingual/multi-dialectal training to see if it outperforms monolingual training.

The implementation is basically:
  1. Scan through all the dictionaries, collect all their phones, and generate all the relevant files from this merged phone set (topo, roots, sets, phone IDs, etc.); steps 1 and 2 are roughly sketched in the code after this list
  2. Generate word IDs within each dictionary (to avoid blowing up the lexicon with dispreferred variants)
    1. I'm not sure there's any benefit to having consistent word IDs across the dictionaries; my mind was on multiple languages with completely different word lists (or very different pronunciations for cognates or faux amis). But if you know there's a largely overlapping set of words, or you need the IDs to be consistent, you could implement it that way too.
    2. I did have some issues when originally
  3. Construct per-dictionary L.fst files for alignment
  4. Compile per-dictionary, per-job training graphs and train on those (you can see an example call here: https://montreal-forced-aligner.readthedocs.io/en/latest/_modules/montreal_forced_aligner/alignment/multiprocessing.html#align_func)
    1. This does lead to an unfortunate increase in the number of files generated
  5. And then yeah, global updates to the model across all dictionary/job combos
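
To make steps 1 and 2 a bit more concrete, here's a minimal sketch of the phone/word bookkeeping. This isn't MFA's actual code; the dictionary filenames and the "WORD phone phone ..." file format are just placeholders for illustration.

def load_dict(path):
    """Read a pronunciation dictionary with one "WORD phone phone ..." entry per line."""
    lexicon = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            if line.strip():
                word, *phones = line.split()
                lexicon.setdefault(word, []).append(phones)
    return lexicon

def shared_phone_table(lexicons):
    """One phone table across all dictionaries, so phone IDs (and topo/roots/sets) line up."""
    phones = sorted({p for lex in lexicons for prons in lex.values()
                     for pron in prons for p in pron})
    return {p: i + 1 for i, p in enumerate(phones)}  # 0 stays reserved for <eps>

def per_dict_word_table(lexicon):
    """A separate word table per dictionary, so one group's variants don't bloat the other's."""
    return {w: i + 1 for i, w in enumerate(sorted(lexicon))}

lex_us = load_dict("english_us.dict")   # placeholder filenames
lex_uk = load_dict("english_uk.dict")
phone_ids = shared_phone_table([lex_us, lex_uk])   # shared across both dictionaries
word_ids = {"us": per_dict_word_table(lex_us),     # independent per dictionary
            "uk": per_dict_word_table(lex_uk)}
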
So yeah, I haven't really played around too much with the modeling implications or experimented with the GMMs or decision trees yet (though for the multilingual IPA, I've been putting together some IPA-specific extra_questions and ARPA-specific extra_questions based on the LibriSpeech eg). Hopefully that helps, and let me know if you have any questions along the way! I'd also be super curious if you find anything during your modeling investigations, since that's largely been an area where I've just copied existing recipes and not dived too deeply into tuning parameters.
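
And just to illustrate the kind of thing I mean by IPA-specific extra_questions (purely illustrative, not the actual lists I'm using): one option is to generate extra_questions.txt lines that group phones sharing a base symbol, so the tree can ask about e.g. plain vs. aspirated vs. long variants together.

import unicodedata

def base_symbol(phone):
    # Drop combining marks and modifier letters (e.g. the aspiration and length marks)
    # so that t, tʰ and tː all share the same base symbol
    return "".join(c for c in unicodedata.normalize("NFD", phone)
                   if not unicodedata.combining(c) and unicodedata.category(c) != "Lm")

def ipa_extra_questions(phones):
    groups = {}
    for p in phones:
        groups.setdefault(base_symbol(p), []).append(p)
    # One question per line, only for base symbols that actually have variants
    return [" ".join(sorted(g)) for g in groups.values() if len(g) > 1]

print("\n".join(ipa_extra_questions(["t", "tʰ", "tː", "d", "a", "aː"])))
# t tʰ tː
# a aː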

Cheers,
Michael

Dan Wells

Dec 15, 2021, 9:17:06 AM
to MFA Users
Thanks very much for the detailed description! I think I have a handle on how to implement this now.

I hadn't made the link to your efforts on multilingual training, but what you say about keeping cross-lingual homographs out of the word lists/lexicon FSTs for each speaker group makes perfect sense. My mistake about also unifying word IDs: the corpus I'm using has a lot of repeated prompts across speakers, which happens to produce identical word lists for my two speaker groups.

It may be a while before I get round to experimenting with this properly, but I'll make sure to feed back anything that could make for useful guidance on model configuration :)

Thanks
Dan