NGraphs Models

Luciano Édipo

Aug 23, 2012, 4:51:56 PM

to ocr...@googlegroups.com

I am creating a language model using ocropus-ngraphs. Does the text used to generate the model need any pre-processing or preparation? Any pointers would be appreciated.

Tom

Aug 23, 2012, 9:28:30 PM

to ocr...@googlegroups.com

No, no preprocessing is normally required. Whatever text you give it is simply used to estimate the probabilities. Note that line breaks matter, since the model also models the start of each line. Furthermore, contexts longer than 3-4 characters may make the model too sparse (there is no back-off right now). The trickiest part of getting the language model to work is finding the right weights for characters, the language model, and whitespace (specified with command-line parameters to ocropus-ngraphs during matching). The right values are a tradeoff that depends on how well your documents match your corpus, on document quality, and on recognizer quality.
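
To make the points about line starts and sparsity concrete, here is a minimal Python sketch of a character n-gram model without back-off. It is not the ocropus-ngraphs implementation; the `START` marker, the function names, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

START = "\x02"  # hypothetical start-of-line marker, not taken from OCRopus


def train_char_ngrams(text, order):
    """Count next-character frequencies for each context of order - 1 characters."""
    counts = defaultdict(Counter)
    for line in text.splitlines():
        if not line:
            continue
        # Pad each line separately, so the start of a line is itself a context.
        padded = START * (order - 1) + line
        for i in range(order - 1, len(padded)):
            context = padded[i - order + 1:i]
            counts[context][padded[i]] += 1
    return counts


def probability(counts, context, char):
    """Maximum-likelihood estimate with no back-off: unseen contexts get 0."""
    seen = counts.get(context)
    if not seen:
        return 0.0
    return seen[char] / sum(seen.values())


corpus = "the cat\nthe dog\na cat\n"
model2 = train_char_ngrams(corpus, order=2)  # context = 1 character
model4 = train_char_ngrams(corpus, order=4)  # context = 3 characters

# Two of the three lines start with "t", so the line-start context reflects that.
print(probability(model2, START, "t"))
# A context never seen in training gets probability 0 because nothing backs off.
print(probability(model4, "dog", "s"))
# Longer contexts spread the same corpus over more distinct contexts.
print(len(model2), len(model4))
```

With a fixed corpus, raising the order increases the number of distinct contexts while the number of observations per context shrinks, which is the sparsity problem described above. Changing where the line breaks fall in the corpus changes the counts for the line-start context.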