Using OCR to recognize HTTP URLs

maxm007

Nov 20, 2009, 9:10:26 AM

to tesseract-ocr

Hi,

I'm researching whether it is possible to use OCR to gather web
addresses from images. I've tried a tesseract online service and some
others and it seems OCR doesn't like web addresses.

Is it at all possible with current OCR technology to recognize the
following url from an image:
http://www.google.co.uk/search?source=ig&hl=en&rlz=&=&q=test&btnG=Google+Search&meta=lr%3D&aq=f&oq=

I would imagine the punctuation and the query string gibberish would
make dictionary matching more of a hindrance than a help. Turning off
the dictionary, on the other hand, would allow too many mistakes, and
one mistake is enough to make the URL useless.
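
Since a single wrong character breaks the link, one cheap safeguard is a structural check on whatever the OCR engine returns. The following is a minimal sketch using only the Python standard library; the function name `clean_ocr_url` is hypothetical, and the assumption that OCR tends to insert stray spaces is mine. It catches only structural damage (a misread scheme, a broken host name) and cannot notice a plausible-looking misread inside the query string.

```python
import re
from urllib.parse import urlsplit

# One DNS label: letters, digits and hyphens, not starting or ending with "-".
_LABEL = re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?")


def clean_ocr_url(ocr_text):
    """Return the whitespace-stripped URL if it is structurally plausible, else None."""
    # Assumption: stray spaces inside the text are OCR noise, so drop them all.
    candidate = "".join(ocr_text.split())
    try:
        parts = urlsplit(candidate)
        host = parts.hostname or ""
    except ValueError:
        return None
    if parts.scheme not in ("http", "https"):
        return None
    labels = host.split(".")
    if len(labels) < 2 or not all(_LABEL.fullmatch(label) for label in labels):
        return None
    return candidate
```

A result of None only says the text cannot be a working http link; a non-None result still needs to be compared against the image or fetched to confirm it.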

Would you be able to train tesseract to recognize hyperlinks? Could
you construct a dictionary that could help URL detection?

Your input would be greatly appreciated.

Max

SteveP

Nov 20, 2009, 2:38:05 PM

to tesseract-ocr

I have noticed that OCR results are better when underlining is removed
by preprocessing before OCR is attempted. Could you try an experiment
where you manually remove the underlining from the images using Paint
or something similar? (If you need info on how to automate removal of
underlining, post about that. If anybody in the forum has ideas about
this, please post those. I am interested in ideas myself.)
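
One simple idea for automating this, offered as a sketch rather than a tested solution: in a binarized, deskewed image of a single text line, an underline is a row of pixels that is almost completely filled across the width of the text, whereas rows through the letters have gaps. The function name `remove_underline_rows` and the `min_fill` threshold of 0.9 are assumptions for illustration. Note that blanking whole rows also cuts through descenders (g, p, q, y) where they cross the underline.

```python
def remove_underline_rows(image, min_fill=0.9):
    """Blank rows that are almost solid ink across the text width.

    image: list of rows, each a list of 0 (background) or 1 (ink).
    Returns a new image; the input is not modified.
    """
    # Horizontal extent of all ink, so page margins do not dilute the fill ratio.
    ink_columns = [x for row in image for x, value in enumerate(row) if value]
    if not ink_columns:
        return [row[:] for row in image]
    left, right = min(ink_columns), max(ink_columns)
    width = right - left + 1

    cleaned = []
    for row in image:
        fill = sum(row[left:right + 1]) / width
        if fill >= min_fill:
            cleaned.append([0] * len(row))  # treat as underline and erase it
        else:
            cleaned.append(row[:])
    return cleaned
```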

Also, URLs in web pages are usually what the Tesseract FAQ calls
"screen text", so if you have not already handled the small-font
issue, resizing your image so that the lower-case letters (such as
'x') are about 20 to 30 pixels high is recommended.
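
The arithmetic behind that resize can be sketched as follows. The function names are hypothetical, the default target of 25 pixels is simply the midpoint of the 20 to 30 range, and the measured height of a lower-case 'x' has to be read off your own image. The resulting size would then be passed to whatever image tool does the actual resampling.

```python
def scale_for_x_height(measured_px, target_px=25):
    """Factor by which to enlarge an image so a lower-case 'x' is target_px high."""
    if measured_px <= 0:
        raise ValueError("measured_px must be positive")
    return target_px / measured_px


def scaled_size(width, height, measured_px, target_px=25):
    """New (width, height) in pixels after applying the scale factor."""
    factor = scale_for_x_height(measured_px, target_px)
    return round(width * factor), round(height * factor)
```

For example, if the 'x' in a 400 by 20 pixel screenshot is 5 pixels high, the factor is 5 and the resized image would be 2000 by 100 pixels.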


Patrick Questembert

Nov 20, 2009, 2:57:19 PM

to tesser...@googlegroups.com

I am definitely interested in any solution available in source code to
remove underlines - this is on my to-do list. If nobody has a solution
to share, I'll be happy to post mine when I get around to coding it.

Thanks,
Patrick


Max Hillaert

Nov 21, 2009, 6:04:05 AM

to tesser...@googlegroups.com

Hi, thanks for replying.

Actually, the underlining was added by Google Mail. At a minimum, I would like OCR to accurately detect link characters without the underlining you see in the mail below.

I will try blowing up the image size though to see how that affects accuracy.

Btw, I'm using http://weocr.ocrgrid.org/ to test. It uses Tesseract, but maybe that service has tuned it for more general-purpose OCR.

Not knowing anything about Tesseract, I would guess it would need to deal well with dots and slashes instead of spaces, and with words that are not in the English dictionary. Could you optimize it for that?
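
One direction for this kind of tuning is to restrict recognition to the characters that can legally appear in a URL. Tesseract has a configuration variable named tessedit_char_whitelist for limiting the output character set; how it is supplied depends on the Tesseract version, and whether a hosted service exposes it is unknown. The sketch below only builds such a character set from the RFC 3986 character classes and checks a string against it. The names `URL_WHITELIST` and `outside_whitelist` are hypothetical.

```python
import string

# Character classes from RFC 3986 (URI generic syntax).
UNRESERVED = string.ascii_letters + string.digits + "-._~"
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="

# "%" is added because it introduces percent-encoded bytes such as %3D.
URL_WHITELIST = UNRESERVED + GEN_DELIMS + SUB_DELIMS + "%"


def outside_whitelist(text):
    """Return the sorted list of distinct characters in text that a URL cannot contain."""
    return sorted(set(text) - set(URL_WHITELIST))
```

A whitelist removes impossible characters from the output, but it does not resolve confusions between characters that are both allowed, such as "l" and "1" or "O" and "0".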
