Is this too ambitious?


maxim...@gmail.com

Jul 14, 2015, 4:43:23 PM
to tesser...@googlegroups.com
I would like to use knowledge of the page layout to greatly improve OCR accuracy. I am working with a large number of forms that are extremely repetitive in structure. Say I know that one particular field in the form holds the value for state/province, and another holds the value for city/town.

Is it too ambitious to attempt to improve the accuracy of tesseract by using this knowledge? For example, I could hypothetically identify the field that holds the state/province and classify its contents as one of 50 possible states. Then, given a list of the cities in every state, I could classify the contents of the city field by choosing the most likely city in that state.
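
A minimal sketch of the state-then-city idea, done as post-processing on the raw OCR text rather than inside tesseract. It uses approximate string matching from Python's standard `difflib`; the `CITIES_BY_STATE` table, the function names, and the 0.6 similarity cutoff are all assumptions made for illustration.

```python
import difflib

# Hypothetical lookup table; a real one would list every state and its cities.
CITIES_BY_STATE = {
    "Ohio": ["Columbus", "Cleveland", "Cincinnati"],
    "Texas": ["Houston", "Dallas", "Austin"],
}

def closest(raw, candidates, cutoff=0.6):
    """Return the candidate most similar to the raw OCR text, or None."""
    lowered = {c.lower(): c for c in candidates}
    hits = difflib.get_close_matches(
        raw.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None

def correct_state_and_city(raw_state, raw_city):
    """Pick the state first, then restrict the city search to that state."""
    state = closest(raw_state, CITIES_BY_STATE)
    if state is None:
        return None, None
    return state, closest(raw_city, CITIES_BY_STATE[state])

print(correct_state_and_city("0hio", "C1eveland"))
```

Restricting the city candidates to the chosen state keeps the candidate list short, so a badly garbled city name is less likely to snap to a similar-looking city elsewhere.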

This type of approach could hypothetically be generalized to many other types of very structured information, for example, letting tesseract know that a particular field is likely to contain a year or a phone number, or even potentially a name and choosing from a long list of names. 
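
The same kind of constraint can be sketched for typed fields such as years and phone numbers, again as post-processing on the OCR output. The character substitutions, the 1900-2100 year range, and the 10-digit phone format below are illustrative assumptions, not anything tesseract provides.

```python
import re

# Letters that OCR commonly confuses with digits (illustrative, not exhaustive).
DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}
)

def clean_year(raw, lo=1900, hi=2100):
    """Return a 4-digit year string if the field plausibly holds one, else None."""
    text = raw.strip().translate(DIGIT_FIXES)
    if re.fullmatch(r"\d{4}", text) and lo <= int(text) <= hi:
        return text
    return None

def clean_phone(raw):
    """Return a 10-digit phone number as NNN-NNN-NNNN, else None."""
    digits = re.sub(r"\D", "", raw.translate(DIGIT_FIXES))
    if len(digits) == 10:
        return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    return None

print(clean_year("2O15"), clean_phone("(555) 867-53O9"))
```

A field that fails validation returns `None`, which can be used to flag it for manual review instead of silently accepting a wrong value.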

Are these types of goals realistic? And if so, is the best way to get started to spend a long time with the source code, make modifications, and compile it myself? Thanks very much!

maxim...@gmail.com

Jul 14, 2015, 5:05:42 PM
to tesser...@googlegroups.com
Also, since 80-90% of the text on the page is a repeat of text that my program will have seen many times before, is there a way to ignore it, or to prevent tesseract from processing it beyond recognizing that it is 99.99% likely to be a repeat of previously seen words and characters? Thanks again! I'm trying to understand tesseract as fast as possible, but it is complicated.

maxim...@gmail.com

Jul 15, 2015, 1:11:15 AM
to tesser...@googlegroups.com
I think the UZN file format may be what I am looking for.
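
For reference, a small sketch of generating such a zone file from known field positions. It assumes the commonly described layout of one zone per line as `left top width height label`, measured in pixels from the top-left corner of the image, and that tesseract looks for a `.uzn` file sharing the image's base name when run with a non-automatic page segmentation mode such as `-psm 4`. Both points should be checked against the tesseract version in use. The `write_uzn` helper and the example coordinates are hypothetical.

```python
from pathlib import Path

def write_uzn(image_path, zones):
    """Write <image basename>.uzn next to the image.

    zones: iterable of (left, top, width, height, label) tuples in pixels.
    Labels should not contain whitespace (assumed one token per label).
    """
    uzn_path = Path(image_path).with_suffix(".uzn")
    lines = [f"{l} {t} {w} {h} {label}" for l, t, w, h, label in zones]
    uzn_path.write_text("\n".join(lines) + "\n")
    return uzn_path

# Example with made-up coordinates for two form fields:
# write_uzn("form.png", [(100, 200, 300, 40, "State"), (100, 260, 300, 40, "City")])
```

Because the forms share one layout, the zone list only needs to be defined once and can be written out for every scanned page.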

Tom Morris

Jul 16, 2015, 12:31:37 PM
to tesser...@googlegroups.com
Your goals don't sound unreasonable, but I'd suggest an approach that focuses on pre- and post-processing before diving in and hacking on tesseract itself. That will let you keep tracking improvements in base tesseract without having to worry about re-integrating your changes.

Tom

maxim...@gmail.com

Jul 16, 2015, 2:40:10 PM
to tesser...@googlegroups.com
Thanks very much for the feedback!