Problem with recognition of numbers 3 and 8


Federico C.

Feb 24, 2015, 10:31:36 AM
to tesser...@googlegroups.com
Hi, I'm having a problem with recognition of an invoice image: Tesseract reads most of the 8 characters as 3s.

Attached is the image I'm using.

I have tried different page segmentation modes (PSM) and some basic configuration options (resolution, not loading the DAWGs).

Any help is appreciated.

test1.tif

Dmitri Silaev

Feb 24, 2015, 11:20:48 AM
to tesser...@googlegroups.com
You need upscaling, then a bit of blurring and it should work.

For upscaling, I tried Lanczos with a factor of 3x, which eliminates most of the "8 vs. 3" errors. Keep in mind that your source TIFF is black and white (2 colors), so you have to save the upscaled result with more color depth, e.g. as a 24-bit PNG.

For blurring, I used FastStone Image Viewer's Blur with a parameter of 14. If you want to use ImageMagick instead, I don't know exactly how that parameter relates to the Gaussian blur sigma, so you will have to experiment.

After that, a standard Tesseract command line works well. At least there are no more "8 vs. 3" errors.
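
The two steps above could be sketched in Python roughly as follows. This is only an illustration and not the tools used in this reply: it assumes the Pillow library is installed, the function name `upscale_and_blur` is made up, and the blur radius is a placeholder that does not correspond to FastStone's value of 14.

```python
from PIL import Image, ImageFilter


def upscale_and_blur(img, factor=3, blur_radius=1.0):
    """Upscale a bilevel scan with Lanczos, then soften it slightly.

    blur_radius is an assumed starting value; tune it by experiment.
    """
    # Pillow always resizes mode "1" (bilevel) images with nearest-neighbour,
    # so convert to 24-bit RGB first to let Lanczos produce gray levels.
    rgb = img.convert("RGB")
    new_size = (rgb.width * factor, rgb.height * factor)
    upscaled = rgb.resize(new_size, Image.LANCZOS)
    # A light Gaussian blur smooths the jagged edges left by upscaling.
    return upscaled.filter(ImageFilter.GaussianBlur(radius=blur_radius))
```

A call such as `upscale_and_blur(Image.open("test1.tif")).save("test1_up.png")` would write a 24-bit PNG that can then be passed to Tesseract; the output file name is arbitrary.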

Best regards,
Dmitri Silaev
www.CustomOCR.com




Andy Brandt

Apr 9, 2015, 3:06:08 PM
to tesser...@googlegroups.com
I'm having a similar issue with a font that I've trained for numbers and a few symbols only. I've attached a sample of the numbers. In my case it is detecting 2s as 8s.

I tried using a Gaussian blur and it appears to help. It also appears that the results change depending on how much or how little blur I apply. Do you know why this is?

Do you know if it would also help to blur the images when training Tesseract?

Thanks!
Andy
txt.png

Art W Rhyno

Apr 10, 2015, 10:25:58 AM
to tesser...@googlegroups.com
> It is detecting 2's as 8's in my case

Hi Andy,

I am surprised the font training has not eliminated this, but if you are willing to dig into the Tesseract API, you can get the coordinates of the individual characters it recognizes. One possible trick is to flag the characters that tend to produce false positives and use their coordinates to reprocess just those characters with your own training file.
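
As a rough sketch of the flagging step, the snippet below parses character boxes in Tesseract's box-file layout (`symbol left bottom right top page`, with the origin at the bottom-left corner of the image) and picks out the characters that are known to be confused. It assumes the box text has already been obtained, for example from Tesseract's `makebox` config or a wrapper that exposes the same format. The names `CharBox`, `parse_boxes`, and `suspicious_boxes`, and the default set of confusable characters, are made up for illustration.

```python
from typing import NamedTuple


class CharBox(NamedTuple):
    char: str
    left: int
    top: int      # measured from the top of the image
    right: int
    bottom: int   # measured from the top of the image


def parse_boxes(box_text, image_height):
    """Parse box-file lines into boxes with a top-left origin."""
    boxes = []
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = (int(p) for p in parts[1:5])
        # Box files count y upward from the bottom edge; flip it so the
        # coordinates can be used directly as a crop rectangle.
        boxes.append(CharBox(parts[0], left, image_height - top,
                             right, image_height - bottom))
    return boxes


def suspicious_boxes(boxes, confusable=frozenset("238")):
    """Return the boxes whose recognized character is easily confused."""
    return [b for b in boxes if b.char in confusable]
```

Each flagged box gives a `(left, top, right, bottom)` rectangle that could be cropped from the page and run through Tesseract again with the custom training file.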

art
---
Art Rhyno
Systems Librarian
University of Windsor

Dmitri Silaev

Apr 11, 2015, 6:23:59 PM
to tesser...@googlegroups.com
Hi Andy,

Tesseract dislikes everything that does not resemble an ordinary scanned document. Features that might lead to poor results include: computer-generated (incl. screen-rendered, anti-aliased) characters, character shapes that are too jagged or too square, strokes that are too bold or too thin, a font size that is too big or too small, and so on. Your image has several of these at the same time: too-square shapes, too big a font size, and too-thin strokes.

Standard Tesseract language files (English is the most thoroughly developed) are trained using "ordinary" fonts and "ordinary" scanned (or synthesized but closely resembling scanned) images. Using a number of image operations, we will try to get the resulting image as close as possible to what Tesseract is used to:

1. "inet005.png" - your source image
2. "inet005_blur.png" - heavily blurred, to thicken the strokes and smooth the angular shapes.
3. "inet005_blur_ds.png" - downscaled by 4x. Wow! Downscaled, while everybody around always suggests upscaling... Upscaling is needed when Tess does not have enough shape contour information due to a small font size and/or computer-generated characters. Here, at stage #2, we already have long enough contours. So we use downscaling to pull the character sizes into Tess's standard working range and, at the same time, smooth the shapes even more than at stage #2.
4. "inet005_blur_ds_clamp.png" - threshold. In fact, it is not a threshold but a clamp of levels, so the image now has 29 colors rather than 2, just because I find the image looks nicer that way. This stage is needed because Tesseract's own binarization algorithm would otherwise choose a threshold that leaves the strokes thinner than we need. Here we adjust the levels so that Tess has no choice but to pick a threshold matching our needs.
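
For readers who want to script stages 2 to 4, here is a minimal sketch using the Pillow library. It is not the tool chain used above, and every number except the 4x downscale is an assumption: the blur radius and the `low`/`high` clamp bounds are placeholders to tune by eye, and the clamp window is placed in the light range on the assumption of dark text on a light background. The function name `prepare_for_tesseract` is made up.

```python
from PIL import Image, ImageFilter


def prepare_for_tesseract(img, blur_radius=3.0, downscale=4,
                          low=150, high=220):
    """Blur, downscale, then clamp levels (stages 2-4 of the list above)."""
    gray = img.convert("L")
    # Stage 2: a strong blur thickens thin strokes and rounds square corners.
    blurred = gray.filter(ImageFilter.GaussianBlur(radius=blur_radius))
    # Stage 3: shrink the oversized glyphs; this also smooths them further.
    size = (max(1, blurred.width // downscale),
            max(1, blurred.height // downscale))
    small = blurred.resize(size, Image.LANCZOS)

    # Stage 4: clamp levels. Values at or below `low` become black, values at
    # or above `high` become white, and the range between them is stretched
    # linearly, so any later binarization has to cut inside that window.
    def clamp(value):
        if value <= low:
            return 0
        if value >= high:
            return 255
        return round((value - low) * 255 / (high - low))

    return small.point(clamp)
```

Raising `low` and `high` turns more of the blurred gray fringe black, which makes dark strokes look thicker after binarization; lowering them has the opposite effect.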

Look at the result: nice, fleshy, sanely sized, smooth (well, almost) characters, which is something Tess likes to work with. No need to do your own training. And here's the reward: "inet005_blur_ds_clamp.png.txt", a perfect recognition result.

I used a command line with no additional configs, the default PSM, and the English data file.
inet005_blur_ds_clamp.png.txt
inet005.png
inet005_blur.png
inet005_blur_ds.png
inet005_blur_ds_clamp.png