Recognition affected by blank space


AxB

Sep 5, 2015, 6:37:04 AM
to tesseract-ocr
Hello everyone.

I just started using Tesseract-OCR 3.02 to recognise numbers only.

The numbers themselves are probably in Futura Bold, styled in a particular manner (see images).

Using the "digits" parameter, Tesseract-OCR either gets the number perfectly or fails completely (returns a blank).

After quite a bit of testing, it appears that the "crop" of the image is what makes or breaks recognition. For instance:

When poorly cropped as above, with quite a bit of horizontal and vertical blank space, the engine always fails to return anything.


A crop like this, with some space for extra digits, fails in this particular example but succeeds at times.


A crop like this has so far always worked.


 
The problem is that I am capturing the image automatically and need to cover a range of at least 5-7 digits.

I would never need to crop as badly as the first example, but I do need more leeway than the last one allows.

Is there anything I could try to make something like the middle crop work better?

Thanks.

AxB

Sep 5, 2015, 2:08:15 PM
to tesseract-ocr
Sorry for the repeated post. I missed the message mentioning that first posts need to be approved, and after waiting around 12 hours I assumed that my first post hadn't gone through and wrote this one. I think it will be better for everyone if replies are kept here.

Upon more testing, I noticed that the crop can at times also affect the quality of the result. It is very rare, but I ran into a situation where "8" was recognised as "3". Once again, changing the box slightly allowed the engine to get it right.

It is really odd to me that, despite the text itself not changing at all, having more or less background area can make this much difference.
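
Since the tight crop is the one that has worked reliably so far, one thing that could be tried is trimming the automatically captured box down to the digits before running OCR. The following is only a rough sketch of that idea: it assumes the image is already in memory as rows of (r, g, b) tuples and that the background is one roughly uniform colour. The function names, the tolerance and the margin are all made up for illustration and are not part of Tesseract.

```python
def differs(rgb, background, tol):
    """True if a pixel is further than `tol` from the background colour."""
    return any(abs(c - b) > tol for c, b in zip(rgb, background))


def tight_crop(pixels, background, tol=40, margin=2):
    """Trim uniform background around the content, keeping a small margin.

    `pixels` is a list of rows, each row a list of (r, g, b) tuples.
    `background`, `tol` and `margin` are assumed values to tune per image.
    """
    width = len(pixels[0])
    # Rows and columns that contain at least one non-background pixel.
    rows = [y for y, row in enumerate(pixels)
            if any(differs(p, background, tol) for p in row)]
    cols = [x for x in range(width)
            if any(differs(row[x], background, tol) for row in pixels)]
    if not rows or not cols:
        return pixels  # nothing but background; leave the image alone

    top = max(rows[0] - margin, 0)
    bottom = min(rows[-1] + margin, len(pixels) - 1)
    left = max(cols[0] - margin, 0)
    right = min(cols[-1] + margin, width - 1)
    return [row[left:right + 1] for row in pixels[top:bottom + 1]]
```

With something like this the capture box could stay wide enough for 5-7 digits while the image handed to the engine stays close to the crop that works. Whether it fixes the "8" read as "3" case is untested.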

Tom Morris

Sep 10, 2015, 1:16:46 PM
to tesseract-ocr
Are you doing any pre-processing besides cropping? If those images are representative and the colors are constant, I'd replace the orange background with black and then invert the image, which gives black digits with no border on a white background. Also use the page segmentation mode for a single line of text.

Tom
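
The steps suggested above could be sketched roughly as follows. This is an illustration under several assumptions, not a tested pipeline: the image is taken to be in memory as rows of (r, g, b) tuples, the orange value and tolerance are guesses that would need to be matched to the real images, and the helper names are hypothetical. The command line uses the 3.02-style `-psm 7` option (treat the image as a single text line) together with the `digits` config; newer Tesseract releases spell the option `--psm`.

```python
ASSUMED_ORANGE = (255, 140, 0)  # guess; sample the real background colour


def is_orange(rgb, orange=ASSUMED_ORANGE, tol=60):
    """True if every channel is within `tol` of the assumed orange."""
    return all(abs(c - o) <= tol for c, o in zip(rgb, orange))


def digits_to_black_on_white(pixels, orange=ASSUMED_ORANGE, tol=60):
    """Replace the orange background with black, then invert the image.

    Light digits with a dark border on orange become black digits on a
    plain white background, with the border merged into the background.
    """
    out = []
    for row in pixels:
        new_row = []
        for rgb in row:
            if is_orange(rgb, orange, tol):
                rgb = (0, 0, 0)                          # orange -> black
            new_row.append(tuple(255 - c for c in rgb))  # invert
        out.append(new_row)
    return out


def tesseract_command(image_path, output_base):
    """Build a Tesseract 3.02 command: single text line, digits only."""
    return ["tesseract", image_path, output_base, "-psm", "7", "digits"]
```

The processed pixels would still have to be written to an image file before the command is run, for example with `subprocess.run(tesseract_command("cleaned.png", "result"))`; the file names here are placeholders.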