Expected language model format

Daniel Wolf

Apr 21, 2022, 11:43:31 AM

to MFA Users

I'm trying to use mfa transcribe.

The documentation for this command states that language_model_path should be the "full path to [a] pre-trained language model", but it doesn't state the expected file format.

My language model is in ARPA format, that is, it's a text file with the sections \data\, \1-grams:, \2-grams:, and \3-grams:.
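For reference, the ARPA layout described above can be sanity-checked with a short Python sketch. The function names `arpa_sections` and `looks_like_arpa` are hypothetical, and this is not how MFA loads models; it only looks for the standard section markers (`\data\`, `\N-grams:`, `\end\`) in a plain-text file.

```python
import re

# Section markers in an ARPA file: \data\, \1-grams:, \2-grams:, ..., \end\
SECTION = re.compile(r"^\\(data\\|end\\|\d+-grams:)$")


def arpa_sections(lines):
    """Return the ARPA section markers found in the given lines, in order."""
    return [line.strip() for line in lines if SECTION.match(line.strip())]


def looks_like_arpa(lines):
    """Rough check: first marker is the data header and an n-gram section exists."""
    sections = arpa_sections(lines)
    return (
        bool(sections)
        and sections[0] == "\\data\\"
        and any(s.endswith("-grams:") for s in sections)
    )


# Tiny hand-written bigram model, for illustration only.
sample = """\\data\\
ngram 1=3
ngram 2=2

\\1-grams:
-0.5 <s> -0.3
-0.7 hello -0.3
-0.7 </s>

\\2-grams:
-0.2 <s> hello
-0.2 hello </s>

\\end\\
""".splitlines()

print(arpa_sections(sample))
print(looks_like_arpa(sample))

# To check a real file (path is an assumption):
# with open("open-subtitles.lm", encoding="utf-8") as f:
#     print(looks_like_arpa(f))
```

A file that passes this check is plain ARPA text rather than an archive, which is consistent with the "Unknown archive format" error below.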

Running mfa transcribe with this language model throws the error "Unknown archive format 'open-subtitles.lm'". So apparently MFA expects language model files to be some sort of archive.

I then found the documentation for mfa train_lm, which states that source_path should be the "full path to the source directory to train from, alternatively an ARPA format language model to convert for MFA use". That sounded promising, so I called this command with my language model file. This didn't work either, giving me the error "The specified corpus directory (./open-subtitles.lm) is not a directory." Given that an ARPA language model isn't a directory, this surprised me.

So I wonder:

What is the expected file format for language models used with mfa transcribe?
Am I right in assuming that mfa train_lm can be used to convert an ARPA language model to the expected format? If so, what is the correct syntax for doing so?
