Improving OCR of Form

Travis DePriest

Jun 4, 2021, 3:40:20 PM

to tesseract-ocr

https://www.slideshare.net/EdwardOHalloran1/officer-evaluation-form-20160908

How do I go about improving the OCR of the form above? I have tried a lot of methods, such as erasing the lines, cropping out individual rows, etc., and none seems to improve Tesseract's OCR performance.

The biggest problem is that the text I need (the field) seems to do OK, but the surrounding identifier is sometimes recognized poorly, which makes extraction using regex difficult.

Jeremy Young

Jun 8, 2021, 4:09:26 AM

to tesseract-ocr

You won't like this, but ....

We had a similar problem and tackled it by doing an initial OCR run to locate the words, then running a really simple Mickey Mouse process to look for lines between the words, and then using the detected lines to identify regions, which we re-OCR'd one by one.
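For the "identify regions" step, something like the following might work as a starting point. It is only a rough sketch, not what we actually ran: it assumes the separator line positions have already been found, and the name `cells_from_lines` and the `min_size` threshold are made up for illustration. It just cuts the page into a grid of boxes at those positions.

```python
from typing import List, Sequence, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def cells_from_lines(
    width: int,
    height: int,
    h_lines: Sequence[int],
    v_lines: Sequence[int],
    min_size: int = 5,
) -> List[Box]:
    """Split a page into cells at the given separator positions.

    h_lines holds the y coordinates of horizontal rules and v_lines the
    x coordinates of vertical rules (both hypothetical inputs here).
    Cells narrower or shorter than min_size pixels are dropped, which
    discards the slivers left by rules that are several pixels thick.
    """
    ys = sorted(set([0, *h_lines, height]))
    xs = sorted(set([0, *v_lines, width]))
    cells: List[Box] = []
    for top, bottom in zip(ys, ys[1:]):
        for left, right in zip(xs, xs[1:]):
            if bottom - top >= min_size and right - left >= min_size:
                cells.append((left, top, right, bottom))
    return cells
```

Each box can then be cropped out and passed back to Tesseract on its own. A page segmentation mode meant for a single block or a single line of text (`--psm 6` or `--psm 7`) is probably worth trying on the crops, but that is a guess and not something tested on this form.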

Enjoy!

Jeremy Young

Jun 8, 2021, 4:11:18 AM

to tesseract-ocr

When I say Mickey Mouse, I mean looking for a run of black pixels in a line in the white space between the words.

Works ok for a binary image ...
