Expected language model format


Daniel Wolf

Apr 21, 2022, 11:43:31 AM
to MFA Users
I'm trying to use mfa transcribe.

The documentation for this command states that language_model_path should be the "full path to [a] pre-trained language model", but it doesn't state the expected file format.

My language model is in ARPA format, that is, it's a text file with sections such as \data\, \1-grams:, and \2-grams:.
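To make clear what I mean by ARPA format, here is a minimal Python sketch that reads the n-gram counts from the \data\ header of such a file. The function name arpa_ngram_counts and the SAMPLE text are made up for illustration; none of this is part of MFA, and it only shows the general layout of the format, not my actual model.

```python
import re


def arpa_ngram_counts(text):
    """Return {order: count} from the ARPA header, or {} if there is none."""
    counts = {}
    in_header = False
    for line in text.splitlines():
        line = line.strip()
        if line == "\\data\\":
            in_header = True  # header starts at the \data\ marker
            continue
        if not in_header:
            continue
        match = re.fullmatch(r"ngram\s+(\d+)\s*=\s*(\d+)", line)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
        elif line.startswith("\\"):
            break  # first section marker such as \1-grams: ends the header
    return counts


# Hypothetical toy model, only to illustrate the layout of the format.
SAMPLE = """\\data\\
ngram 1=3
ngram 2=2

\\1-grams:
-0.5 <s> -0.3
-0.7 hello -0.3
-0.7 </s>

\\2-grams:
-0.2 <s> hello
-0.2 hello </s>

\\end\\
"""

print(arpa_ngram_counts(SAMPLE))
```

A plain-text file with no \data\ header yields an empty dict, which is a quick way to check that a file is at least shaped like an ARPA model.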

Running mfa transcribe with this language model fails with the error "Unknown archive format 'open-subtitles.lm'". So apparently MFA expects language model files to be some sort of archive.

I then found the documentation for mfa train_lm, which states that source_path should be the "full path to the source directory to train from, alternatively an ARPA format language model to convert for MFA use". That sounded promising, so I called this command with my language model file. This didn't work either, giving me the error "The specified corpus directory (./open-subtitles.lm) is not a directory." Given that an ARPA language model isn't a directory, this surprised me.

So I wonder:
  1. What is the expected file format for language models used with mfa transcribe?
  2. Am I right in assuming that mfa train_lm can be used to convert an ARPA language model to the expected format? If so, what is the correct syntax for doing so?