Hi,
My background is in statistical machine translation, and from that perspective I would expect character-level or word-level n-grams to improve OCR quality.
Is there a mechanism for plugging in an n-gram language model for my new language so that Tesseract will use it? I've seen some references to bigram-dawg, but I haven't been able to find instructions for that feature, or even a good description of what it is.
I may also look into dictionaries, but that will likely be problematic for me, because the language I'm working with is morphologically rich and highly agglutinative. Is there a mechanism for providing a morpheme dictionary?
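To make the n-gram idea concrete, here is a minimal sketch of the kind of character-level rescoring I have in mind. It is not Tesseract code and uses no Tesseract API; the function names (train_char_bigrams, log_prob, rerank) and the toy training text are made up for illustration, and the smoothing is plain add-one.

```python
import math
from collections import Counter

def train_char_bigrams(text):
    """Count character bigrams in `text` (hypothetical helper, not a Tesseract API)."""
    padded = "^" + text + "$"            # ^ and $ mark start and end of the text
    bigrams = Counter(zip(padded, padded[1:]))
    contexts = Counter(padded[:-1])      # how often each character starts a bigram
    vocab = set(padded)
    return bigrams, contexts, vocab

def log_prob(candidate, model):
    """Log-probability of `candidate` under the bigram model, with add-one smoothing."""
    bigrams, contexts, vocab = model
    padded = "^" + candidate + "$"
    v = len(vocab)
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log((bigrams[(a, b)] + 1) / (contexts[a] + v))
    return total

def rerank(candidates, model):
    """Return the candidate string the language model considers most likely."""
    return max(candidates, key=lambda c: log_prob(c, model))

if __name__ == "__main__":
    model = train_char_bigrams("the cat sat on the mat the hat")
    # Two competing OCR readings of the same word image.
    print(rerank(["tbe", "the"], model))
```

In practice the candidates would be alternative readings from the recognizer, and the language model score would be combined with the recognizer's own confidence rather than used alone. Candidates of different lengths would also need some form of length normalization.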
Thanks,
Lane