Reading # from image only ~75% successful

Ben Schipper

Sep 29, 2017, 2:38:30 PM

to tesseract-ocr

I am attempting to read a fairly large 6-digit number from an image using Tesseract 3.02 on a Windows 7 machine.

I have been able to get slightly better results by resampling the image to 300 dpi using ImageMagick, but I am still only able to get ~75% accuracy.

I have tried some other ImageMagick options (-lat, -blur, -contrast-stretch), but they only seem to make it worse. (I am not a graphic designer, so most sources of image manipulation help are Greek to me.)

Since the image does not contain many dictionary words, I am using a config file to disable the dictionaries (https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#dictionaries-word-lists-and-patterns):

load_freq_dawg 0

load_system_dawg 0

load_punc_dawg 0
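
For anyone reproducing this, here is a minimal Python sketch of how such a config file might be written out and handed to the tesseract command line. The file names (nodict.cfg, 224882_300dpi.jpg, result) and the helper functions are made up for illustration, and it assumes the installed tesseract accepts a config file path as a trailing argument after the output base; it only builds the command and does not run it.

```python
from pathlib import Path

# The three settings from the post, one per line, as Tesseract config files expect.
CONFIG_TEXT = (
    "load_freq_dawg 0\n"
    "load_system_dawg 0\n"
    "load_punc_dawg 0\n"
)


def write_config(path):
    """Write the dictionary-disabling config to `path` and return it as a Path."""
    config_path = Path(path)
    config_path.write_text(CONFIG_TEXT)
    return config_path


def build_command(image, output_base, config_path):
    """Build the argument list: tesseract <image> <output base> <config file>.

    Assumption: the config file is passed as the last positional argument.
    """
    return ["tesseract", str(image), str(output_base), str(config_path)]


if __name__ == "__main__":
    cfg = write_config("nodict.cfg")  # hypothetical file name
    cmd = build_command("224882_300dpi.jpg", "result", cfg)  # hypothetical names
    print(" ".join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```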

Whitelisting digits only didn't help: it just returned more characters as digits, which made it harder to pick out the 6-digit number that I wanted.
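
To illustrate the extraction step without a whitelist, here is a rough Python sketch that searches the raw OCR text for runs of exactly six digits. The sample text and the function name are invented for illustration; real Tesseract output for these images will look different, and this does nothing to improve the recognition itself.

```python
import re

# Exactly six digits, not preceded or followed by another digit,
# so that 5-digit and 7-digit runs are not matched.
SIX_DIGITS = re.compile(r"(?<!\d)\d{6}(?!\d)")


def find_six_digit_numbers(ocr_text):
    """Return every standalone 6-digit run found in the OCR output."""
    return SIX_DIGITS.findall(ocr_text)


if __name__ == "__main__":
    # Made-up OCR output, not taken from the attached images.
    sample = "Meter No 224882\nkWh 12 345 1234567"
    print(find_six_digit_numbers(sample))  # ['224882']
```

If more than one candidate comes back, some further rule is still needed to choose between them; the regex alone cannot tell which number was the largest text in the image.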

Unfortunately the data that I am pulling from the image can be located in different regions of the image so I can't crop the image.

Image samples attached. The largest text is the # that I would like to extract in both cases.

224882.jpg

700848.jpg

Dmitri Silaev

Sep 29, 2017, 9:36:11 PM

to tesser...@googlegroups.com

Hi Ben,

What you want to achieve is not possible with Tesseract alone. At all. Nor with ABBYY or any other OCR engine, if you use them out of the box. Well, maybe something *might* be done if you combine it with some ImageMagick-fu, but I'm not sure, and only if you put some serious restrictions on the images. Maybe an interactive mobile app would let you disguise some of those restrictions in an unobtrusive manner.

I think what you really want is a system that can work with arbitrary images, not only with scanned paper pages, which are what a regular OCR system is designed for. It would need some special logic to detect, enhance and recognize text in "OCR-tough" conditions. I'm not even going to list here which conditions in your images make them tough for OCR. There are lots of them.

And yes, such systems exist. If you'd like to know more, just PM me; I'd be happy to help.

Best regards,
Dmitri Silaev
www.CustomOCR.com
