Any suggestions for getting Tesseract to OCR this image?

Amit Rao

Aug 20, 2015, 3:56:16 AM
to tesseract-ocr
Hi folks,

I am using the Tesseract iOS SDK to OCR parking stubs. The stubs come in two main formats. Tesseract does quite well on one of them, but the OCR text for the second is pretty much useless. I have attached an image that Tesseract is unable to OCR. If anyone has any success OCRing this image, I would really appreciate hearing about it. So far I have tried the following, but none of it improves the OCR results:

1. Cropping the image
2. Reducing the height and width of the image with same/different aspect ratio
3. Binarizing the image into black and white
4. Filtering the image to smooth it
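
For readers who want to reproduce step 3, here is a minimal sketch of binarization. The thread does not say which method was used, so Otsu's thresholding is an assumption chosen for illustration. The image is assumed to be a non-empty grid of 8-bit grey values held in plain Python lists, and the names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels."""
    # Histogram of grey levels.
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += hist[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the background class
        sum_bg += level * hist[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor).
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(pixels):
    """Map pixels at or below the Otsu threshold to 0 (black), the rest to 255 (white)."""
    threshold = otsu_threshold(pixels)
    return [[0 if value <= threshold else 255 for value in row] for row in pixels]
```

A global threshold like this tends to struggle when the print is faint or unevenly inked, which may be part of why binarizing alone did not help here.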

I haven't tried augmenting the training data set yet. The font seems to be pretty standard (Lucida), and my understanding is that augmenting the training data is not very useful unless the font is non-standard.

Your help/suggestions will be greatly appreciated. 

Thank you,
Amit Rao




New.jpg

Allistair

Aug 20, 2015, 4:34:25 AM
to tesser...@googlegroups.com
Which Lucida font do you think this is? None of the Lucida fonts I see in a Google Image search look anything like this.

You're right, this does not OCR well. In fact, if you crop out just one part to remove the surrounding noise, say 09:43 AM, then even with lots of margin Tesseract doesn't find anything it thinks looks like text in normal page segmentation.

The best I got (for the cropped out time) was:

39:43 HH

So 28% incorrect.

The definition of the 'M' is already quite eroded, which is not great.




Amit Rao

Aug 20, 2015, 5:52:12 AM
to tesseract-ocr
Thanks, Allistair. I was guessing that this font was similar to Lucida Console.

However, I don't know for certain what font this is, and I don't know of a tool that would identify it. The only text I am really interested in is the time ("HH:MM AM/PM"), but if I crop the image to include only the time, Tesseract still cannot read it, much as you reported. I cropped the image to include 09:43 AM and it reads it as @9243 Rh
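
Since only the "HH:MM AM/PM" field matters, one cheap safeguard is to reject OCR output that does not match that shape. The sketch below is an illustration only: the pattern assumes a 12-hour clock with a leading zero, inferred from the single sample "09:43 AM", and `is_valid_time` is a hypothetical helper name.

```python
import re

# Assumed format: two-digit hour 01-12, colon, two-digit minute, space, AM or PM.
TIME_PATTERN = re.compile(r"(0[1-9]|1[0-2]):([0-5][0-9]) (AM|PM)")


def is_valid_time(text):
    """Return True only if the whole string looks like 'HH:MM AM' or 'HH:MM PM'."""
    return TIME_PATTERN.fullmatch(text.strip()) is not None
```

With this check, "09:43 AM" is accepted, while the misreads reported in this thread ("@9243 Rh" and "39:43 HH") are rejected. It does not improve recognition, but it does keep bad readings from being used.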

If this is a font that Tesseract does not recognize, would it help to augment the training data set with images in this format and font?

Thanks,
amit

Allistair

Aug 20, 2015, 8:04:08 AM
to tesser...@googlegroups.com
The font does not look like that. Look at the shape of the 0, which has a slash through it in your image but not in Lucida, or at the shape of the M. I am not sure font training will do a lot here; I think the problem is more the quality of the edges in your image, since the dot matrix printing (or however it's printed) produces uncertain edges.

Perhaps others can chip in.

Allistair

Aug 20, 2015, 8:09:58 AM
to tesser...@googlegroups.com
So another thing you could try: I notice that everything is horizontally compressed. You could scale the image horizontally only, to stretch things out (as in the image I attach).

This would then make the problem similar to reading, e.g., digital clock text. There are a variety of threads on this group about LCD/clock-type reading that may reveal further things you could do from that point.
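
As a rough sketch of the horizontal-only scaling suggested above: the function below widens each row with nearest-neighbour sampling and leaves the height alone. It assumes the image is a non-empty grid of pixel values in plain Python lists. The name `stretch_horizontally` and the choice of nearest-neighbour sampling are illustrative assumptions, not what was used to produce the attachment.

```python
def stretch_horizontally(pixels, factor):
    """Widen every row by `factor` using nearest-neighbour sampling; height is unchanged."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    old_width = len(pixels[0])
    new_width = max(1, round(old_width * factor))
    stretched = []
    for row in pixels:
        # Each output column copies the nearest source column to its left.
        stretched.append(
            [row[min(old_width - 1, int(x / factor))] for x in range(new_width)]
        )
    return stretched
```

For example, `stretch_horizontally([[1, 2, 3]], 2)` doubles every column. A real image library would normally interpolate rather than copy columns, which gives smoother edges.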


2_1.jpg

Amit Rao

Aug 20, 2015, 2:59:00 PM
to tesseract-ocr
Thanks. I tried scaling the image horizontally with different widths and heights, and the best Tesseract could do for 09:43 AM was

ocr string = @9243 RH


I'll check out the threads on LCD/clock type reading. Thanks for the pointer. 


-amit

Allistair C

Aug 20, 2015, 3:02:02 PM
to tesser...@googlegroups.com
Try different psr too - I got close with psr 6


Amit Rao

Aug 20, 2015, 9:48:16 PM
to tesseract-ocr
psr? 

Allistair C

Aug 21, 2015, 3:08:34 AM
to tesser...@googlegroups.com
Sorry, psm: page segmentation mode.
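
For anyone trying this from the command line rather than the iOS SDK, here is a small sketch that builds a `tesseract` invocation with a chosen page segmentation mode. Tesseract 3.x spells the option `-psm` and later releases spell it `--psm`. Mode 6 treats the image as a single uniform block of text. The helper name `tesseract_command` and its `legacy_flag` parameter are hypothetical, and the sketch only builds the argument list without running anything.

```python
def tesseract_command(image_path, psm=6, legacy_flag=False):
    """Build an argument list that asks the tesseract CLI to print OCR text to stdout."""
    if psm < 0:
        raise ValueError("psm must be a non-negative mode number")
    # Tesseract 3.x uses "-psm"; newer releases use "--psm".
    flag = "-psm" if legacy_flag else "--psm"
    # "stdout" as the output base sends the recognised text to standard output.
    return ["tesseract", image_path, "stdout", flag, str(psm)]
```

The resulting list can be passed to `subprocess.run` on a machine with Tesseract installed. Mode 7 (a single text line) may also be worth trying on a cropped time field.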
