Adding out of vocabulary words

Brian Macwhinney

Jun 10, 2022, 2:22:02 PM

to mfa-...@googlegroups.com

Dear MFA,

So far, I am delighted with the usability and accuracy of MFA. We have a lot of data in talkbank.org with transcripts and media that could benefit from alignment - much in English, but lots in other languages too. Right now, we have bumped into three problems.

The first problem relates to the way in which *.lab data is represented in the *.TextGrid output after alignment. The “” symbol in the TextGrid is used both for pauses and for ends of sentences. In our program that reads the TextGrid to convert back to our CHAT format, we had to disentangle the two: one method aligns words and pauses, and a second method uses a template from our original input to work out where the sentences end. My programmer found ways to do this, but it would have been easier if the two things had not been conflated in the TextGrid output. So this is solved, but I thought it best to draw your attention to the issue; maybe we were just doing something wrong.
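For readers facing the same issue, here is a minimal sketch of the template idea described above. It is not the actual TalkBank converter. It assumes the word tier has already been read into (start, end, label) tuples, that pauses carry an empty label, and that the aligned words appear in the same order and number as in the original transcript. The function name `group_by_template` and the sample data are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float, str]  # (start, end, label); "" is assumed to mark a pause
Sentence = Tuple[float, float, List[str]]


def group_by_template(intervals: List[Interval],
                      template: List[List[str]]) -> List[Sentence]:
    """Group aligned words into sentences using the original transcript.

    Empty-label intervals are treated only as pauses. Sentence boundaries
    come from the template (one list of words per sentence), not from the
    TextGrid itself.
    """
    words = [iv for iv in intervals if iv[2] != ""]
    sentences: List[Sentence] = []
    pos = 0
    for expected in template:
        if not expected:
            continue  # nothing to align for an empty sentence
        chunk = words[pos:pos + len(expected)]
        if len(chunk) != len(expected):
            raise ValueError("fewer aligned words than the template expects")
        # Sentence span runs from the first word's start to the last word's end.
        sentences.append((chunk[0][0], chunk[-1][1], [w[2] for w in chunk]))
        pos += len(expected)
    return sentences


# Hypothetical data: two sentences separated by a pause.
intervals = [
    (0.0, 0.2, ""), (0.2, 0.5, "hello"), (0.5, 0.9, "there"),
    (0.9, 1.4, ""), (1.4, 1.8, "bye"), (1.8, 2.0, ""),
]
template = [["hello", "there"], ["bye"]]
result = group_by_template(intervals, template)
```

Counting words rather than matching strings keeps the sketch robust to case or punctuation differences between the transcript and the aligned labels, but it fails if the aligner drops or splits a word.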

A second, minor problem is a command given in Use Case #1. I am running MFA on a Mac Studio with macOS 12.4 Monterey, and the inspect command is supposed to work, but it gives this error:

(base) macw@BRIAN mfa_data % mfa model inspect acoustic english_us_arpa
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/opt/miniconda3/bin/mfa", line 11, in <module>
    sys.exit(main())
  File "/opt/miniconda3/lib/python3.9/site-packages/montreal_forced_aligner/command_line/mfa.py", line 994, in main
    run_model(args)
  File "/opt/miniconda3/lib/python3.9/site-packages/montreal_forced_aligner/command_line/model.py", line 167, in run_model
    manager = ModelManager(token=args.github_token)
AttributeError: 'Namespace' object has no attribute 'github_token'
(base) macw@BRIAN mfa_data %

We figured this was not fatal, because the validate command that follows works well.

The third problem, which is the crucial one blocking our full use of MFA, is understanding how to add out-of-vocabulary words to the lexicon. Because we are dealing with spoken language, including exclamations, child forms, and so on, many forms are missing.

For my test file, I see that validate creates oovs_found_english_us_arpa_1.txt with two missing words, but I don’t know how to add new forms to english_us_arpa.

I am trying to follow the instructions at “Generating a pronunciation dictionary with a pretrained G2P model”.

But then I run into the second problem noted above, which is that the inspect command is not working, so going further with these instructions seems chancy. What I was expecting was some way of combining a few new words in ARPABET notation with the existing forms in english_us_arpa, but I can’t see from the instructions how this would be done. I saw a couple of postings vaguely related to this issue, but nothing exactly on the topic, probably because I am a very new user.

Many thanks for help on this last problem,

—Brian MacWhinney, Prof CMU Psychology and LTI

michael.e...@gmail.com

Jun 17, 2022, 5:59:19 PM

to MFA Users

Hi Brian,

If you upgrade to the latest version, those github_token errors will be resolved.

There's a bit of a UX gap with adding forms at the moment. When models and dictionaries are downloaded, they're saved to ~/Documents/MFA/pretrained_models. Once the OOVs have been run through G2P, you can just add the pronunciations to the end of the dictionary in the pretrained_models/dictionary directory, and they will get picked up the next time you invoke it. It would be nice to update a dictionary from a text file directly in MFA, rather than having to navigate paths and copy-paste. I'll think about what the best way to implement that would be.
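In the meantime, the manual step could be scripted along these lines. This is only a sketch, not an MFA feature: the helper name `append_pronunciations`, the file name english_us_arpa.dict, and the sample word with its pronunciation are assumptions for illustration. It also assumes one entry per line in the form of a word, a tab, and space-separated phones. If your copy of the dictionary has extra columns, match that layout instead.

```python
from pathlib import Path
from typing import Dict, List


def append_pronunciations(dict_path: Path,
                          new_entries: Dict[str, List[str]]) -> int:
    """Append word/pronunciation lines to a dictionary file.

    Words already present (first whitespace-separated field of any line)
    are skipped. Returns the number of entries actually added.
    """
    text = dict_path.read_text(encoding="utf-8") if dict_path.exists() else ""
    existing = {line.split()[0] for line in text.splitlines() if line.split()}
    added = 0
    with dict_path.open("a", encoding="utf-8") as f:
        # Make sure new entries start on their own line.
        if text and not text.endswith("\n"):
            f.write("\n")
        for word, phones in new_entries.items():
            if word in existing:
                continue
            f.write(word + "\t" + " ".join(phones) + "\n")
            added += 1
    return added


# Example call (the path and the entry are assumptions, shown for illustration):
# dictionary = (Path.home() / "Documents" / "MFA" / "pretrained_models"
#               / "dictionary" / "english_us_arpa.dict")
# append_pronunciations(dictionary, {"whee": ["W", "IY1"]})
```

Back up the dictionary file before running anything like this, since a later re-download of the model may replace it.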

Hope that helps,

Michael
