not detecting text in-between horizontal lines

Glen Rubin

Jun 17, 2014, 4:06:54 PM
to tesser...@googlegroups.com
Tesseract is failing to OCR text on my page that sits between two horizontal lines. For example, it would miss the following text:


___________________________________________________________

       This text is missed by Tesseract
____________________________________________________________

Any suggestions on how to overcome this? I was looking at ImageMagick scripts to get rid of the lines, but that seems rather involved.

Paul

Jun 17, 2014, 6:10:04 PM
to tesser...@googlegroups.com
In a preprocessing step you could run a connected-component analysis (https://en.wikipedia.org/wiki/Connected_component_labeling)
and then filter out all blobs whose width-to-height aspect ratio is larger than, say, 20 to 1. That should be quite efficient if the
lines are not skewed. Since Tesseract already uses Leptonica, you probably also want to use that library to find the connected components
(see conncomp.c).
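
To illustrate the idea, here is a minimal pure-Python sketch rather than Leptonica code. It assumes the page has already been binarized into a list of rows with 1 for ink and 0 for background. The function name remove_line_like_blobs and the max_aspect parameter are made up for this example. It collects 8-connected components with a flood fill and erases every component whose bounding box is more than 20 times wider than it is tall.

```python
from collections import deque


def remove_line_like_blobs(image, max_aspect=20.0):
    """Return a copy of a binary image with long horizontal blobs erased.

    image: list of equal-length rows, 1 = ink, 0 = background (assumed format).
    max_aspect: hypothetical threshold; blobs with width/height above it are removed.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]          # leave the input untouched
    seen = [[False] * width for _ in range(height)]

    for y0 in range(height):
        for x0 in range(width):
            if image[y0][x0] != 1 or seen[y0][x0]:
                continue

            # Breadth-first flood fill gathers one 8-connected component.
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))

            # Bounding-box aspect ratio of the component.
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            box_h = max(ys) - min(ys) + 1
            box_w = max(xs) - min(xs) + 1

            # Erase components that look like horizontal rules.
            if box_w / box_h > max_aspect:
                for y, x in pixels:
                    cleaned[y][x] = 0

    return cleaned
```

The cleaned image would then be written back out and passed to Tesseract. Two caveats: a skewed line has a taller bounding box, so its ratio drops and it may slip under the threshold, and text that touches a line is merged into the same component, so the filter cannot separate the two.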