Train language model for new language

Lane Schwartz

Apr 15, 2015, 3:42:28 PM

to tesser...@googlegroups.com

Hi,

I followed the instructions at https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 and successfully trained tesseract on a new language (ess). The results are OK, but I'd like to improve them.

My background is statistical machine translation, and from that context I would expect to be able to improve OCR quality through the use of character-level or word-level n-grams.
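To make concrete what I have in mind, here is a minimal sketch of rescoring candidate OCR outputs with a character bigram model. This is not something I know tesseract to do internally. The function names, the `^`/`$` boundary markers, the add-one smoothing, and the toy English word list are all assumptions made purely for illustration.

```python
import math
from collections import Counter


def train_char_bigrams(words):
    """Count character bigrams over a list of training words."""
    bigrams = Counter()
    contexts = Counter()
    vocab = set()
    for word in words:
        padded = "^" + word + "$"  # hypothetical start/end markers
        vocab.update(padded)
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts, len(vocab)


def log_prob(word, model):
    """Add-one smoothed log-probability of a word under the bigram model."""
    bigrams, contexts, vocab_size = model
    padded = "^" + word + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
    return total


def rescore(candidates, model):
    """Pick the candidate string the language model finds most likely."""
    return max(candidates, key=lambda w: log_prob(w, model))


# Toy data only: a real model would be trained on text in the target language.
model = train_char_bigrams(["banana", "bandana", "cabana"])
best = rescore(["bamama", "banana"], model)
```

Here `best` is the candidate whose character sequence looks most like the training text. A word-level model would work the same way, with tokens in place of characters.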

Is there a mechanism whereby I can plug in an n-gram language model for my new language so that tesseract will use it? I've seen some references to bigram-dawg, but I haven't had any luck finding instructions for that feature, or even a good description of what it is.

I may also look into dictionaries, but a full word list will likely be impractical for me, because the language I'm working with is morphologically rich and highly agglutinative, so the set of distinct word forms is very large. Is there a mechanism for providing a morpheme dictionary instead?

Thanks,

Lane
