Hello. We've been using MFA align with a pretrained English acoustic model to align a relatively large sample of speech data collected in our lab. Without going into all the details, participants are asked to speak, for example, a prescribed sentence, but because of the experimental manipulations, these particular utterances contain a large number of disfluencies, distortions, etc.
What I've been wondering is this: is it possible to recover any measure of the "quality" (cost / posterior probability) of the alignment to our provided ("target") transcript using MFA? I've been trying to dig into how MFA uses Kaldi to optimize the alignment, but I figured it might be better to post a question here before spending too much time. What we'd ideally like is a measure of alignment / recognition quality for an individual segment, but even a measure at the utterance level would be helpful.
The reason we'd like to get this information is that we're explicitly interested in errors / disfluencies the participants made, and we'd love to be able to "flag" individual trials for further (manual) analysis based on these kinds of metrics.
As far as we've been able to tell, there's no explicit way to output any of this information through MFA align. But I have to think these kinds of metrics are being computed during the alignment process in order to choose the best-fitting path. Any help or suggestions are greatly appreciated! We're willing to do a little work to make this possible; we're just not sure whether it's feasible.
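For what it's worth, here's a rough sketch of the kind of post-processing we'd be happy to do ourselves, assuming the underlying Kaldi alignment binary (e.g. gmm-align-compiled) emits a per-utterance "Log-like per frame" line that we could capture and parse. The log format, utterance IDs, and threshold below are all assumptions on our part, not something we've confirmed MFA exposes:

```python
import re

# Assumed log-line format from Kaldi's gmm-align-compiled; we have not
# verified that MFA surfaces these logs, so this is only a sketch.
LOG_LINE = re.compile(
    r"Log-like per frame for utterance (\S+) is (-?[\d.]+) over (\d+) frames"
)

def per_utterance_loglike(log_text):
    """Return {utterance_id: (avg_log_likelihood_per_frame, n_frames)}."""
    scores = {}
    for m in LOG_LINE.finditer(log_text):
        utt, loglike, frames = m.group(1), float(m.group(2)), int(m.group(3))
        scores[utt] = (loglike, frames)
    return scores

def flag_low_quality(scores, threshold=-10.0):
    """Flag utterances whose average per-frame log-likelihood falls below
    a threshold (the value here is a placeholder; any real cutoff would
    have to be calibrated against our own data)."""
    return [utt for utt, (loglike, _) in scores.items() if loglike < threshold]

# Example with fabricated log text, just to show the intended workflow:
sample_log = (
    "LOG (gmm-align-compiled) Log-like per frame for utterance "
    "trial_001 is -7.56 over 345 frames.\n"
    "LOG (gmm-align-compiled) Log-like per frame for utterance "
    "trial_002 is -12.30 over 290 frames.\n"
)
scores = per_utterance_loglike(sample_log)
flagged = flag_low_quality(scores)  # trials to send for manual review
```

Even something this crude at the utterance level would let us automatically queue trials for manual analysis.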
Thanks for any help!
--
Jason Bohland, PhD
Assistant Professor, Communication Science and Disorders
University of Pittsburgh, School of Health and Rehabilitation Sciences