Hello. We've been using MFA align with a pretrained English acoustic model to align a relatively large sample of speech data collected in our lab. Without going into all the details, participants are asked to speak, for example, a prescribed sentence, but because of the experimental manipulations, these particular utterances contain a large number of disfluencies, distortions, etc.
What I've been wondering is this: is it possible to recover any measure of the "quality" (cost / posterior probability) of the alignment to our provided ("target") transcript using MFA? I've been trying to dig into how MFA uses Kaldi to optimize the alignment, but I figured it might be better to post a question here before spending too much time. What we'd ideally like is a measure of alignment / recognition quality for an individual segment, but even a measure at the utterance level would be helpful.
The reason we'd like to get this information is that we're explicitly interested in errors / disfluencies the participants made, and we'd love to be able to "flag" individual trials for further (manual) analysis based on these kinds of metrics.
As far as we've been able to tell, there's no explicit way to output any of this information through MFA align. But I have to think these kinds of metrics are being computed during the alignment process in order to choose the best-fitting path. Any help or suggestions are greatly appreciated! We're willing to do a little work to make this possible; we're just not sure whether it's feasible.
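For what it's worth, here's a rough sketch of the kind of post-processing we'd be happy to do ourselves, assuming the underlying Kaldi alignment binary (e.g. gmm-align-compiled) emits a per-utterance "Log-like per frame" line that we could capture and parse. The log format, utterance IDs, and threshold below are all assumptions on our part, not something we've confirmed MFA exposes:

```python
import re

# Assumed log-line format from Kaldi's gmm-align-compiled; we have not
# verified that MFA surfaces these logs, so this is only a sketch.
LOG_LINE = re.compile(
    r"Log-like per frame for utterance (\S+) is (-?[\d.]+) over (\d+) frames"
)

def per_utterance_loglike(log_text):
    """Return {utterance_id: (avg_log_likelihood_per_frame, n_frames)}."""
    scores = {}
    for m in LOG_LINE.finditer(log_text):
        utt, loglike, frames = m.group(1), float(m.group(2)), int(m.group(3))
        scores[utt] = (loglike, frames)
    return scores

def flag_low_quality(scores, threshold=-10.0):
    """Flag utterances whose average per-frame log-likelihood falls below
    a threshold (the value here is a placeholder; any real cutoff would
    have to be calibrated against our own data)."""
    return [utt for utt, (loglike, _) in scores.items() if loglike < threshold]

# Example with fabricated log text, just to show the intended workflow:
sample_log = (
    "LOG (gmm-align-compiled) Log-like per frame for utterance "
    "trial_001 is -7.56 over 345 frames.\n"
    "LOG (gmm-align-compiled) Log-like per frame for utterance "
    "trial_002 is -12.30 over 290 frames.\n"
)
scores = per_utterance_loglike(sample_log)
flagged = flag_low_quality(scores)  # trials to send for manual review
```

Even something this crude at the utterance level would let us automatically queue trials for manual analysis.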
Thanks for any help!
--
Jason Bohland, PhD
Assistant Professor, Communication Science and Disorders
University of Pittsburgh, School of Health and Rehabilitation Sciences