Forced alignment to phonemes, without a dictionary?


Joseph Turian

Apr 4, 2021, 10:50:44 AM
to MFA Users
Dear MFA,

I did my postdoc with Yoshua at l'Université de Montréal, so I love everything Montreal. :)

I am curious how to align text to existing phonemes, but without a dictionary. I don't care about graphemes.

I have a multi-speaker corpus of English speakers saying sentences with nonsense words, and phoneme transcriptions of those sentences. I want to align the speech to the phonemes, but I don't care about the graphemes. And I can't use a dictionary since most of the words are not in the dictionary.

How can I use MFA to align the speech to phonemes, without the use of a dictionary?

Best,
   Joseph

michael.e...@gmail.com

Apr 4, 2021, 12:13:20 PM
to MFA Users
A couple of ideas I can think of:

One would be to use a dummy dictionary with entries of the format "utterance_id s t r i n g o f p h o n e s", so each utterance-id "word" corresponds to the phones that were transcribed. Then, for each utterance, the transcription would just be that single utterance_id "word". The downside is that this doesn't capture silence between the phones very well, so it might not be the best.

Another approach would be to have the dictionary just be an identity mapping of phones to themselves, so something like "AA1 AA1" as an example; that might be the best?

In general, though, if you can create a dictionary of the nonsense words with their phone transcriptions and then convert your labels to orthographic transcriptions, that'll probably work the best, since that's the intended use case. As mentioned with the silence modelling, a lot of the internals of MFA and Kaldi assume a word-level representation.
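The phone-identity dictionary can be generated mechanically. A minimal sketch (the phone inventory below is an illustrative ARPABET subset, not from the thread, and the space-separated "WORD pron" line format follows standard MFA dictionary conventions):

```python
# Sketch: build the "identity" dictionary described above, where each
# phone maps to itself as its own one-phone word.
# The phone inventory here is an illustrative ARPABET subset.
phones = ["AA1", "AE1", "B", "D", "IY0", "S", "T"]

# One dictionary line per phone: the "word" and its pronunciation
# are the same symbol.
entries = [f"{p} {p}" for p in phones]
print("\n".join(entries))
```

Each line (e.g. `AA1 AA1`) is then a valid dictionary entry, and the "transcript" for each utterance is simply its space-separated phone string.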

Joseph Turian

Apr 4, 2021, 12:21:55 PM
to MFA Users, michael.e...@gmail.com
Michael,

Thank you for the feedback. Right now I am preparing a PHONE => PHONE dictionary as you propose.

I don't understand what you mean by "As mentioned with the silence modelling, a lot of internals of MFA and Kaldi assume a word level representation." Does this mean a PHONE => PHONE dictionary would be worse? I don't understand how having "NONSENSEWORD => list of PHONES" would be worse at modeling silence.

Best,
   Joseph


michael.e...@gmail.com

Apr 4, 2021, 12:32:55 PM
to MFA Users
Right, so "NONSENSEWORD => list of PHONES" would be the best, because we have optional models for silence between words but not within words (the assumption being that each word is fluently pronounced). With your example, if you model each phone as its own word, we're now allowing the model to insert silence within words (so you might run the risk of longer stop closures being modelled as "silence" rather than as part of the closure, for instance). My original solution of "UTTERANCE => list of phones" is probably the worst performing of the lot, since the model wouldn't have the ability to insert silence inside the "word", so you'd end up having silent sections attributed to neighboring phones, or you'd need a really large beam width to deal with it.

Hope that helps!

David Lukeš

Apr 5, 2021, 3:56:53 PM
to michael.e...@gmail.com, MFA Users

Just chiming in to say I’ve been using NONSENSEWORD => list of PHONES for this purpose for a while now and it’s been working well :) Here’s a concrete example for the benefit of anyone googling this discussion at some point in the future — if your utterance is word1 word2 word1, and your phonetic transcript is pron1 pron2 pron3, then you need a pronunciation dictionary which looks something like this:

word1_pron1 p r o n 1
word2_pron2 p r o n 2
word1_pron3 p r o n 3

In other words, NONSENSEWORD should be a unique identifier for the given word + pronunciation combination. It can be anything, but it can be helpful to make it human-readable, in which case concatenating the basic and phonetic transcript is a good choice imho.

Before running the aligner, you should also remember to convert the utterance from word1 word2 word1 to word1_pron1 word2_pron2 word1_pron3 — your utterances need to consist of words which can actually be found in the dictionary.
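The whole recipe (pair each word with its pronunciation, mint a unique token, emit the dictionary entries, and rewrite the utterance) can be sketched as follows. The helper name and the word_phones joining convention are illustrative, not from any MFA tooling:

```python
# Sketch of the recipe above: build "NONSENSEWORD => list of PHONES"
# dictionary entries and rewrite the utterance so every token actually
# exists in the dictionary.

def build_entries(words, prons):
    """Pair each word with its pronunciation; return (dictionary entries,
    rewritten utterance tokens)."""
    entries, tokens = [], []
    for word, pron in zip(words, prons):
        # Unique, human-readable token: word + "_" + phones without spaces.
        token = f"{word}_{pron.replace(' ', '')}"
        entries.append(f"{token} {pron}")
        tokens.append(token)
    return entries, tokens

words = "word1 word2 word1".split()
prons = ["p r o n 1", "p r o n 2", "p r o n 3"]
entries, tokens = build_entries(words, prons)

print("\n".join(entries))   # dictionary lines
print(" ".join(tokens))     # rewritten utterance
```

This reproduces the example above: the dictionary lines come out as `word1_pron1 p r o n 1` etc., and the rewritten utterance is `word1_pron1 word2_pron2 word1_pron3`.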

Best,

David
