Line Finding Problem in OCR


Tian Dong

Jun 16, 2016, 4:01:34 AM
to tesseract-ocr

In my OCR setup, Tesseract cannot identify text rows properly. Please see the attached box image below. (Blue squares are the boxes found by Tesseract; the red areas are the ones I marked as problematic.)

[Attached image: qq 20160615195248]

It seems that Tesseract is not able to find the baseline correctly when the row spacing is small and the image is slightly skewed: two characters in adjacent rows are mistakenly merged vertically. As a result, the OCR quality in "crowded" areas is really poor.

How can I improve the OCR quality in this situation? Are there any parameters that can be used here?

Bojidar Stanchev

Jun 16, 2016, 6:28:25 AM
to tesseract-ocr
On the topic: Tesseract does poorly when the lines are curved or tilted, so you should preprocess the image to straighten the lines first.
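
To illustrate the preprocessing idea, here is a rough sketch of one common way to estimate and undo a small tilt: the projection-profile method. It is not something Tesseract does for you here, and everything in it is an assumption for illustration: the image is a grayscale list of rows (0 = ink, 255 = paper), the function names `estimate_skew` and `deskew_by_shear` are made up, and a vertical shear stands in for a true rotation, which is only reasonable for tilts of a few degrees.

```python
import math


def _profile_score(points, angle_deg):
    """Score how sharply ink concentrates into rows after undoing a tilt.

    Each ink pixel is moved vertically by -x * tan(angle). When the trial
    angle matches the real tilt, every text line collapses into few rows,
    so the sum of squared row counts peaks.
    """
    t = math.tan(math.radians(angle_deg))
    counts = {}
    for x, y in points:
        row = round(y - x * t)
        counts[row] = counts.get(row, 0) + 1
    return sum(c * c for c in counts.values())


def estimate_skew(image, max_angle=5.0, step=0.25, ink_threshold=128):
    """Return the trial angle (degrees) with the sharpest row profile.

    A positive angle means the text drifts downward from left to right
    in image coordinates (y grows toward the bottom).
    """
    points = [(x, y)
              for y, row in enumerate(image)
              for x, value in enumerate(row)
              if value < ink_threshold]
    if not points:
        return 0.0
    steps = int(round(max_angle / step))
    best_angle, best_score = 0.0, -1
    for i in range(-steps, steps + 1):
        angle = i * step
        score = _profile_score(points, angle)
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


def deskew_by_shear(image, angle_deg, background=255):
    """Shift each pixel column vertically so tilted lines become level."""
    t = math.tan(math.radians(angle_deg))
    height, width = len(image), len(image[0])
    out = [[background] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_y = y + round(x * t)
            if 0 <= src_y < height:
                out[y][x] = image[src_y][x]
    return out
```

You would call `estimate_skew` on the binarized page, pass the result to `deskew_by_shear`, and give the straightened image to Tesseract. For real pages an image library's rotation routine would likely be faster and more accurate than this pure-Python loop.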

Anyway, as far as I can see, you probably have a recognition problem because the rectangles are too tight. If you drew those rectangles in code, you probably found the contours around the characters and called boundingRect() on them. That is fine, but after you cut out the symbols, add 3-4 rows and columns of white pixels on each side and then give the expanded image to Tesseract. When a character is cropped so tightly that it touches the border of the image, recognition drops significantly, and simply adding some "empty" space around the character or word makes the recognition rate much better.
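
A minimal sketch of that padding step, assuming the cropped symbol is a grayscale list of rows with 255 as white. The helper name `pad_with_white` and the default margin of 4 pixels are only illustrative choices. If you are already working with OpenCV images, `cv2.copyMakeBorder` with `cv2.BORDER_CONSTANT` and a white value should do the same job.

```python
def pad_with_white(image, margin=4, white=255):
    """Return a copy of a cropped glyph with a white border on every side.

    `image` is a list of equal-length rows of grayscale values.
    """
    padded_width = len(image[0]) + 2 * margin
    # Blank rows above the glyph.
    padded = [[white] * padded_width for _ in range(margin)]
    # Original rows with white columns added left and right.
    for row in image:
        padded.append([white] * margin + list(row) + [white] * margin)
    # Blank rows below the glyph.
    padded.extend([white] * padded_width for _ in range(margin))
    return padded
```

A 2x2 crop padded with `margin=3` becomes an 8x8 image with the original pixels in the middle, so the character no longer touches the image border.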