Improving OCR of Form


Travis DePriest

Jun 4, 2021, 3:40:20 PM
to tesseract-ocr


How do I go about improving the OCR of the form above? I have tried several methods, such as erasing the lines and cropping out individual rows, but none of them seems to improve Tesseract's OCR performance.

The biggest problem is that the text I need (the field) is usually recognized OK, but the surrounding identifier is sometimes recognized poorly, which makes extraction with regex difficult.

Jeremy Young

Jun 8, 2021, 4:09:26 AM
to tesseract-ocr

You won't like this, but ...
We had a similar problem and tackled it in three steps: an initial OCR run to locate the words, then a really simple Mickey Mouse process to look for lines between the words, and finally using the detected lines to identify regions, which we re-OCR'd one by one.
Enjoy!
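
A minimal sketch of the last step, not necessarily the process described above. It assumes the first OCR pass yields word boxes as (left, top, width, height) tuples (the fields pytesseract's image_to_data reports) and that the y-coordinates of horizontal lines are already known. The name split_into_bands and its parameters are hypothetical, and only horizontal lines are handled:

```python
from bisect import bisect_right


def split_into_bands(word_boxes, rule_ys):
    """Group word boxes into bands separated by horizontal lines.

    word_boxes: iterable of (left, top, width, height) from a first OCR pass.
    rule_ys: y-coordinates of detected horizontal lines, in any order.
    Returns one (left, top, right, bottom) crop box per non-empty band,
    ordered from top to bottom.
    """
    rule_ys = sorted(rule_ys)
    bands = {}
    for left, top, width, height in word_boxes:
        right, bottom = left + width, top + height
        # A word belongs to the band that contains its vertical centre.
        band = bisect_right(rule_ys, top + height / 2)
        if band in bands:
            l, t, r, b = bands[band]
            bands[band] = (min(l, left), min(t, top), max(r, right), max(b, bottom))
        else:
            bands[band] = (left, top, right, bottom)
    return [bands[i] for i in sorted(bands)]


# Hypothetical word boxes: two words above a line at y=30, one below it.
words = [(10, 10, 40, 12), (60, 11, 30, 12), (10, 40, 50, 12)]
print(split_into_bands(words, rule_ys=[30]))
```

Each returned box is in the (left, upper, right, lower) order that Pillow's Image.crop expects, so a crop could be passed back to Tesseract, for example with pytesseract.image_to_string(crop, config="--psm 7"). Whether single-line mode suits a given form is an assumption to check.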

Jeremy Young

Jun 8, 2021, 4:11:18 AM
to tesseract-ocr
When I say Mickey Mouse, I mean looking for a run of consecutive black pixels in a line in the white space between the words.
Works OK for a binary image ...
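
A rough sketch of that kind of check, under stated assumptions: the image is already binarized into rows of 0 (white) and 1 (black), and a form line shows up as a row containing a run of at least min_run black pixels. The function name and the min_run parameter are hypothetical. Unlike the description above, this simplified version scans every row instead of only the white space between words, so min_run has to be longer than any stroke of text:

```python
def find_horizontal_rules(binary, min_run):
    """Return the y-coordinates of horizontal lines in a binary image.

    binary: 2D sequence of rows holding 0 (white) or 1 (black).
    min_run: minimum number of consecutive black pixels that counts as a line.
    Adjacent matching rows (a thick line) are merged into their middle row.
    """
    def longest_run(row):
        best = run = 0
        for px in row:
            run = run + 1 if px else 0
            best = max(best, run)
        return best

    hits = [y for y, row in enumerate(binary) if longest_run(row) >= min_run]

    rules, group = [], []
    for y in hits:
        # A gap in the row indices closes the current thick line.
        if group and y != group[-1] + 1:
            rules.append((group[0] + group[-1]) // 2)
            group = []
        group.append(y)
    if group:
        rules.append((group[0] + group[-1]) // 2)
    return rules


# Tiny hypothetical image: text-like pixels, a 2-pixel-thick line, a thin line.
image = [
    [0] * 10,
    [1, 1, 0, 1, 0, 0, 1, 1, 0, 0],
    [1] * 10,
    [1] * 10,
    [0] * 10,
    [0] + [1] * 8 + [0],
]
print(find_horizontal_rules(image, min_run=8))
```

Vertical lines could be found the same way by running the function on the transposed image, for example list(zip(*image)).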
