I'm trying to use mfa transcribe.
The documentation for this command states that language_model_path should be the "full path to [a] pre-trained language model", but it doesn't state the expected file format.
My language model is in ARPA format, that is, it's a text file with the sections \data\, \1-grams:, \2-grams:, and \3-grams:.
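To show what I mean by ARPA format, here is a minimal Python sketch that lists the section headers of such a file. The function name arpa_sections and the tiny sample model are made up for illustration; this is only a sanity check on the text layout and has nothing to do with how MFA itself loads language models.

```python
import re


def arpa_sections(text):
    """Return the ARPA section headers found in the text, in order of appearance."""
    headers = []
    for line in text.splitlines():
        line = line.strip()
        # Section headers are \data\, \N-grams: (one per n-gram order), and \end\.
        if line in ("\\data\\", "\\end\\") or re.fullmatch(r"\\\d+-grams:", line):
            headers.append(line)
    return headers


# Hypothetical two-word bigram model, only to illustrate the layout.
sample = """\\data\\
ngram 1=2
ngram 2=1

\\1-grams:
-0.3 hello -0.1
-0.3 world

\\2-grams:
-0.2 hello world

\\end\\
"""

print(arpa_sections(sample))
```

Running the same function on the contents of my own file lists the headers I described above, so I'm fairly confident the file itself is a well-formed ARPA model.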
Running mfa transcribe with this language model throws the error "Unknown archive format 'open-subtitles.lm'". So apparently MFA expects language model files to be some sort of archive.
I then found the documentation for mfa train_lm, which states that source_path should be the "full path to the source directory to train from, alternatively an ARPA format language model to convert for MFA use". That sounded promising, so I called this command with my language model file. This didn't work either, giving me the error "The specified corpus directory (./open-subtitles.lm) is not a directory." Given that an ARPA language model isn't a directory, this surprised me.
So I wonder:
- What is the expected file format for language models used with mfa transcribe?
- Am I right in assuming that mfa train_lm can be used to convert an ARPA language model to the expected format? If so, what is the correct syntax for doing so?