OCR Recognition for Underlined text


Gunasekaran Velu

Mar 5, 2016, 5:11:55 AM
to tesseract-ocr

Hi

>tesseract.exe Underline.png Underline -l eng -psm 1

Result: This is underline word @


Is it possible to OCR underlined text/words in an image, or does some image processing need to be applied to the image first?

I have attached a sample image.

Looking forward to your reply.


Regards
Guna
Underline.png

Tom Morris

Mar 5, 2016, 12:12:18 PM
to tesseract-ocr
On Saturday, March 5, 2016 at 5:11:55 AM UTC-5, Gunasekaran Velu wrote:

>tesseract.exe Underline.png Underline -l eng -psm 1

Result: This is underline word @

Is it possible to OCR underlined text/words in an image, or does some image processing need to be applied to the image first?

Attached sample image.

Tesseract knows how to recognize underlined text, as you can see from the fact that it got "underline" correct in your example. For some reason it's getting confused by the underlined word "test", perhaps because it's at the end of the line?

It could potentially represent a bug, but I'd try to recreate it with a less artificial example. Of course, pre-processing would improve the situation and removing underlines should be that hard to do.

Tom 

Gunasekaran Velu

Mar 6, 2016, 7:38:03 PM
to tesseract-ocr
Hi

The first image I sent was one I created myself in Paint.

I have now attached underlined text from a real document (cropped from the full image because it contains confidential data).

In this case, when I run the OCR, the underlined text is skipped completely by Tesseract.

Kindly advise.


Regards
Guna 
1.png
2.png
3.png

Gunasekaran Velu

Mar 9, 2016, 4:36:58 AM
to tesseract-ocr
Hi Tom

Is there any update regarding the underlined text problem?


Regards
Guna

Gunasekaran Velu

Apr 16, 2016, 5:06:12 AM
to tesseract-ocr
Hi Tom

Is it possible to use a config variable to handle images with underlined text?

Looking forward to your reply.


Regards
Guna


Tom Morris

Apr 16, 2016, 1:56:37 PM
to tesser...@googlegroups.com
There's a critical word missing from what I wrote and perhaps my English is a little ambiguous too, so let me try again:

It could potentially represent a bug, but, if I were you, I'd try to recreate it with a less artificial example and if you confirm that it's a real bug, file a bug report with all the details of your findings so that one of the developers can look at it. Of course, pre-processing would improve the situation and removing underlines should not be that hard to do.

The most direct route to success, in my opinion, is going to be pre-processing to remove the underlines. When you're working on this and testing the results, you should make sure that you work on representative images, not little tiny fragments of a few words. When Tesseract has normal page boundaries, multiple lines of text, etc, it has much more information available to it to figure out font size, line spacing, etc.

If you need help in figuring out how to do the line removal, there are tutorials available on the web, but any recipe is going to need tuning and experimentation to work best with your particular application.
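As a rough illustration of the kind of line removal meant here, the sketch below erases long horizontal runs of ink from an already binarized image. It is only one possible approach and not something taken from this thread: the function name `remove_long_horizontal_runs`, the `min_run` threshold, and the representation of the image as a list of rows of 0/1 values (1 = ink) are all assumptions made for the example. On a real scan you would first have to load and threshold the image with an image library, and `min_run` would need tuning so that it is longer than the widest horizontal stroke in any letter.

```python
def remove_long_horizontal_runs(binary, min_run):
    """Return a copy of `binary` with long horizontal ink runs erased.

    `binary` is a list of rows, each a list of 0/1 ints (1 = ink).
    Any run of consecutive 1s in a row that is at least `min_run`
    pixels long is treated as part of an underline and set to 0.
    Shorter runs (letter strokes) are left alone.
    """
    cleaned = [row[:] for row in binary]  # do not modify the input
    for row in cleaned:
        x = 0
        width = len(row)
        while x < width:
            if row[x] == 1:
                start = x
                # Walk to the end of this run of ink pixels.
                while x < width and row[x] == 1:
                    x += 1
                if x - start >= min_run:
                    row[start:x] = [0] * (x - start)
            else:
                x += 1
    return cleaned


# Tiny hypothetical "image": two rows of letter-like strokes,
# a blank row, then a 7-pixel underline.
page = [
    [1, 0, 1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
]
cleaned = remove_long_horizontal_runs(page, min_run=5)
```

Each row is processed independently, so an underline several pixels thick is removed row by row. The obvious weakness is that descenders (g, p, y) crossing the underline lose the pixels they share with it, and a skewed scan breaks the line into shorter runs, which is why any recipe like this needs experimentation on representative pages.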


If you've got additional questions, feel free to address them to the list rather than to me personally. I wasn't offering to help you debug this for free or to write the application for you.

Tom

Gunasekaran Velu

Apr 19, 2016, 9:29:55 PM
to tesseract-ocr
Thanks Tom.

Regards
Guna

Felix Bolivar

Jul 13, 2016, 2:35:08 PM
to tesseract-ocr
I will try your suggestion.

Thanks!