trying to use tesseract and getting bad results

tomlei

Dec 20, 2009, 11:45:39 PM

to tesseract-ocr

I just installed tesseract for OCR, and on the first attempt it
failed to give me the right text (most of the words came out as weird
characters).

The picture is:
http://www.rentingtime.com/uploads/listing/l0033/0000033158/or48255.jpg

I ran it through some free online OCR websites and they can read it.

Can anybody explain what I am doing wrong, or how to improve tesseract's results?

SteveP

Dec 21, 2009, 1:33:08 PM

to tesseract-ocr

Here are some things to try to get better results:
1) Resize the image larger so that characters such as 'e' are at least 20
to 30 pixels high.
2) Threshold to remove noise (map all gray values above 130 or so
to 255).
3) I'm unsure what tesseract does with bullets; does anyone else know?
4) If this is a scanned image, rescan at 300 dpi.
5) I vaguely remember that JPEG is not the preferred format; PNG, BMP, and
TIFF work better with tesseract, if I remember correctly.
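
As a rough illustration of steps 1 and 2, here is a minimal Python sketch. It uses only the standard library and treats the image as rows of 8-bit gray values. The function names (upscale_factor, threshold_table, apply_table) and the default target height of 30 pixels are assumptions made for this example and are not part of tesseract. In practice an imaging library would do the resizing and could apply the same 256-entry table to each pixel.

```python
import math


def upscale_factor(char_height_px, target_height_px=30):
    """Whole-number factor needed so a character reaches the target height."""
    if char_height_px <= 0:
        raise ValueError("character height must be positive")
    # Never shrink: a factor below 1 is clamped to 1.
    return max(1, math.ceil(target_height_px / char_height_px))


def threshold_table(cutoff=130):
    """256-entry lookup table: gray values above the cutoff become 255 (white).

    Values at or below the cutoff are left unchanged, as described in step 2.
    """
    return [255 if value > cutoff else value for value in range(256)]


def apply_table(rows, table):
    """Apply a lookup table to an image given as a list of rows of gray values."""
    return [[table[value] for value in row] for row in rows]


if __name__ == "__main__":
    # Hypothetical numbers: an 'e' measured at 8 pixels high in the source image.
    print("enlarge by a factor of", upscale_factor(8))
    sample = [[10, 200], [130, 131]]
    print(apply_table(sample, threshold_table()))
```

The cutoff of 130 is only the starting point suggested above; a lighter or darker image will likely need a different value.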

See some of my other posts for additional details. Or search other
posts in this group.

tomlei

Dec 22, 2009, 4:50:29 PM

to tesseract-ocr

Thanks for the help.

After making the image bigger and then thresholding, I got much better
results.
But the reading is still incomplete. The file was not scanned; it was
downloaded from the web.
Changing the format didn't make a difference.

Are there any other ways you can think of to improve it?

Haiku Automation

Dec 29, 2009, 4:49:34 PM

to tesseract-ocr

You're net scraping to OCR and republish/migrate to another real estate
system... or so it seems. Your source images are 72 dpi, and
they are not going to give you squat for results. 150x150 dpi is the real
minimum you will need, and 200x200 or, better yet, 300x300 dpi is
best. I have scanned at up to 400+ dpi, and found that anything over 300 dpi
really doesn't give significantly improved results. With low-DPI images
you need to train the system much more tightly, so find a tool that
will do it with your expert help (mapping an image region to a
particular Unicode character).
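
To put numbers on the DPI figures above, here is a small Python sketch of the arithmetic. The function names and the 640x480 sample size are hypothetical, chosen only for illustration. Enlarging a 72 dpi web image this way only changes the pixel dimensions; it does not recover detail that a real 300 dpi scan would contain.

```python
def meets_minimum_dpi(dpi, minimum_dpi=150):
    """True if the resolution reaches the 150 dpi minimum mentioned above."""
    return dpi >= minimum_dpi


def resize_for_dpi(width_px, height_px, source_dpi=72, target_dpi=300):
    """Pixel size an image needs so it matches target_dpi at the same physical size."""
    if source_dpi <= 0 or target_dpi <= 0:
        raise ValueError("dpi values must be positive")
    # Multiply before dividing to keep the arithmetic as exact as possible.
    new_width = round(width_px * target_dpi / source_dpi)
    new_height = round(height_px * target_dpi / source_dpi)
    return new_width, new_height


if __name__ == "__main__":
    # Hypothetical 640x480 image saved at 72 dpi.
    print(meets_minimum_dpi(72))
    print(resize_for_dpi(640, 480))
```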
