Coincidentally, I recently began looking into this for my own use. I decided the easiest course would probably be to adapt the excellent, open work done by Nick White for Ancient Greek. Unfortunately I'm not very far along yet, as one of the first steps is making sure I can correctly replicate the existing process for Ancient Greek on my own machine (the mftraining step in the grc repository seems to be taking quite some time).
Right now, running that should just build the following for you (from the same Perseus sources):
- training_text.txt
- lat.word.txt
- lat.freq.txt
- lat.unicharambigs
- lat.wordlist
Note that this is very preliminary: at this point I've only made trivial alterations, so that I can start figuring out what I need to clean up in the input/processing.
Note also that there's a modification here in tools/wordlistfromperseus.sh to strip <foreign> tags instead of skipping files with foreign words altogether. I think this would help Ancient Greek as well (though I don't yet know how much it will improve or alter overall accuracy). For Greek, this change results in the wordlist being 7202347 lines for me instead of 5605967, or a 28% increase in the size of the corpus. I originally did this with Saxon/XSLT, but the processing was slow, so I switched to Perl so I could apply a non-greedy regex substitution instead (which is much faster):
https://github.com/ryanfb/ancientgreekocr-grctraining/commit/069648af2e2b45e41fd7e4ff4390343b45765f77
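For illustration, here's a rough Python sketch of the kind of non-greedy substitution involved. This is my own hedged reconstruction, not the code from the commit above (which is in Perl), and it assumes the change removes each <foreign> element along with its contents; check the linked commit for the exact substitution used:

```python
import re

# Non-greedy match of a <foreign ...>...</foreign> element, tags and
# contents. re.DOTALL lets the element span multiple lines. The .*? is
# the important part: a greedy .* would swallow everything between the
# first opening tag and the last closing tag in the text.
FOREIGN_RE = re.compile(r'<foreign\b[^>]*>.*?</foreign>', re.DOTALL)

def strip_foreign(xml_text):
    """Return xml_text with <foreign> elements removed."""
    return FOREIGN_RE.sub('', xml_text)

sample = ('<p>arma virumque cano '
          '<foreign lang="greek">\u03bc\u1fc6\u03bd\u03b9\u03bd</foreign>'
          ' Troiae</p>')
print(strip_foreign(sample))
```

A regex is not a general XML parser, of course, but for flat, non-nested <foreign> elements like these it's a simple and fast approximation, which is presumably why the Perl version beat the Saxon/XSLT pipeline.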
-Ryan