Question about underlined text


Jurgis Pasukonis

Oct 26, 2014, 6:48:16 AM
to tesser...@googlegroups.com
Hi all,

I have a task to recognize printed timetables. I started experimenting with tesseract-OCR a week ago and I managed to train it to recognize the following kind of pictures perfectly:

I just used a single image containing all different digits, since the font is always the same.

Now, what I'm having problems with is the following - exactly the same font, just red and underlined.

What happens is that tesseract recognizes the whole word "06:05" including the underline as a single blob, and then of course it can't recognize what symbol it is. Funny thing, in some rare cases it does succeed (it ignores the underline, and then marks each symbol as a blob, and recognizes them correctly), and I can't figure out what it depends on. It somehow depends on the context - if I change the layout, keeping the text exactly the same, it would sometimes recognize it correctly, and sometimes not.
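
To illustrate the single-blob effect described above, here is a minimal sketch in plain Python. It is not Tesseract's segmentation code; it only counts 4-connected groups of dark pixels on a toy bitmap, under the assumption that the underline touches the bottom of the glyphs. The bitmaps and the name `count_blobs` are made up for the example.

```python
from collections import deque


def count_blobs(rows):
    """Count 4-connected groups of '#' pixels in a list of equal-length strings."""
    height, width = len(rows), len(rows[0])
    seen = set()
    blobs = 0
    for y in range(height):
        for x in range(width):
            if rows[y][x] != "#" or (y, x) in seen:
                continue
            # Found an unvisited dark pixel: flood-fill everything connected to it.
            blobs += 1
            seen.add((y, x))
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and rows[ny][nx] == "#" and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
    return blobs


# Three toy "glyphs" separated by blank columns (hypothetical data).
PLAIN = [
    "###.###.###",
    "#.#.#.#.#.#",
    "###.###.###",
    "...........",
]

# The same glyphs with a solid underline drawn directly beneath them.
UNDERLINED = PLAIN[:3] + ["###########"]

print(count_blobs(PLAIN))       # separate glyphs
print(count_blobs(UNDERLINED))  # glyphs joined by the underline
```

With a touching underline, the three glyphs collapse into one connected region, which is consistent with the whole word being reported as a single blob. If the underline were separated from the glyphs by even one blank row, this toy count would keep the glyphs apart, which may be related to why the result changes with layout.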

Perhaps some experts here could advise on how to go about solving this. Most importantly, how can I debug what's going on? My thoughts, and what I've been trying:

  • I tried including the red/underlined example in the training data as a "different font", but that doesn't help.
  • I've tried running with the options "-psm 5" and "-psm 6", and it does change the behaviour significantly, but neither works as it should. In any case, this suggests to me that the problem is in the way tesseract splits the text into blobs, not in the actual symbol recognition, and that the underlined text confuses it.
  • I've tried playing around with the underline recognition settings (e.g. textord_underline_threshold), but it made absolutely no difference.
  • I've tried to dig deeper into the architecture of tesseract (page segmentation, blob recognition, chopping), because it seems that the problem is in one of those steps, but I haven't yet found a good way to debug it. I tried using this (https://code.google.com/p/tesseract-ocr/wiki/ViewerDebugging), but it only tells me that the underlined text ends up as a single blob.
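
For anyone wanting to repeat the page segmentation and underline-threshold experiments from the list above systematically, here is a small sketch that only builds the command lines; it does not run them. The file names, the helper name `build_command`, and the threshold values are placeholders. It assumes the Tesseract 3.x command-line syntax (`-psm N` and `-c var=value`); newer releases spell the first option `--psm`, so the flag is a parameter.

```python
def build_command(image, output_base, psm, underline_threshold=None, psm_flag="-psm"):
    """Return an argv list for one tesseract run (assumed 3.x-style syntax)."""
    cmd = ["tesseract", image, output_base, psm_flag, str(psm)]
    if underline_threshold is not None:
        # -c sets a single config variable for this run.
        cmd += ["-c", "textord_underline_threshold=%s" % underline_threshold]
    return cmd


def sweep(image, psm_modes, thresholds):
    """Build one command per (psm, threshold) pair, each with its own output base."""
    commands = []
    for psm in psm_modes:
        for threshold in thresholds:
            output_base = "out_psm%s_ul%s" % (psm, threshold)
            commands.append(build_command(image, output_base, psm, threshold))
    return commands


# Placeholder image name and values, chosen only for illustration.
for cmd in sweep("timetable.tif", [5, 6], [0.3, 0.5, 0.7]):
    print(" ".join(cmd))
```

Each printed line could then be passed to `subprocess.run` and the resulting text files compared, which at least makes it easy to see which combinations change the output.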

Thanks a lot for any suggestions,
Jurgis

Felix Bolivar

Jul 13, 2016, 2:35:25 PM
to tesseract-ocr
I have a PDF with a yellow background, black text, and an e-mail address that is underlined and in blue.
I used convert (from the ImageMagick project) to save it as a TIFF file, changing the image to monochrome.
The result was a dirty TXT file, but the underlined text was recognized. Maybe it's not a solution, but it is a workaround.
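
To show roughly what a monochrome conversion does to colored text, here is a minimal sketch in plain Python. It is not ImageMagick's algorithm, and the exact convert options are not given above; it assumes a simple luma threshold, and the function name `to_monochrome`, the threshold of 128, and the sample colors are choices made for the example.

```python
def to_monochrome(pixels, threshold=128):
    """Map each (r, g, b) pixel to 0 (black) or 255 (white) using a luma threshold."""
    result = []
    for row in pixels:
        out_row = []
        for r, g, b in row:
            # Standard Rec. 601 luma weights for converting RGB to brightness.
            luma = 0.299 * r + 0.587 * g + 0.114 * b
            out_row.append(255 if luma >= threshold else 0)
        result.append(out_row)
    return result


YELLOW = (255, 255, 0)  # page background
BLACK = (0, 0, 0)       # body text
BLUE = (0, 0, 255)      # underlined e-mail address

# One row of sample pixels: background, body text, link text, background.
print(to_monochrome([[YELLOW, BLACK, BLUE, YELLOW]]))
```

Under this assumption, the bright yellow background becomes white while both the black text and the blue underlined text become black, so the OCR engine no longer sees a color difference between them.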