Hey all,
I'm trying to build a tool to digitize images of recipes, and I've just started experimenting with Tesseract. The results seem reasonable, but I think they could be improved further by supplying a domain-specific language model. For example, I'm seeing "fish sauce" recognized as "iisir sauce" and "shrimp" recognized as "shrmp".
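To make the idea concrete, here is a minimal sketch of the kind of vocabulary-based correction I have in mind, done as a post-processing step on the OCR text. It is not how Tesseract's language model works; the word list, the `correct_text` name, and the 0.8 similarity cutoff are all made up for illustration, and it only uses Python's standard `difflib`.

```python
from difflib import get_close_matches

# Hypothetical domain vocabulary; a real one would come from a recipe corpus.
RECIPE_WORDS = ["fish", "sauce", "shrimp"]

def correct_text(text, vocab=RECIPE_WORDS, cutoff=0.8):
    """Replace each token with its closest vocabulary word, if one is similar enough."""
    corrected = []
    for token in text.split():
        # get_close_matches returns up to n candidates scoring at least `cutoff`.
        matches = get_close_matches(token.lower(), vocab, n=1, cutoff=cutoff)
        # Keep the original token when nothing in the vocabulary is close enough.
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)

print(correct_text("shrmp sauce"))  # prints "shrimp sauce"
print(correct_text("iisir sauce"))  # prints "iisir sauce" (too far from "fish")
```

This fixes "shrmp", but "iisir" is too far from "fish" to recover after the fact, which is why I think the vocabulary needs to influence recognition itself rather than be applied afterwards.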
Can someone point me to more information about the language model format? I saw the files with the "eng.cube" prefix in the language data, and I would like to know how to interpret them.
Also, is there a tool that shows the intermediate results of the process, for instance the output of layout analysis and the alternative word hypotheses?
Thanks
jia