Incorrect OCR of 4-digit number

Chris McClelland

Feb 26, 2022, 3:04:53 PM
to tesseract-ocr
Hi tesseract community!

I've found an interesting scenario where a simple 4-digit number cropped from a PDF (i.e., from a region rendered from a vector font, not from an embedded bitmap) is OCR'd incorrectly. I used ImageMagick to extract a .png from the source PDF, like this:

convert -density 1600 -trim input.pdf[42] -rotate 90 +repage -crop 600x720+900+3400 crop.png

...and then used tesseract to OCR it:

tesseract crop.png stdout --psm 6

The digits "1552" in the source image are OCR'd as "15562".

You can try for yourself like this:

tesseract 0swZuoU.png stdout --psm 6

The image as hosted on imgur is not bitwise-identical to crop.png, but the two are indistinguishable by eye. I can upload the original crop.png somewhere else if necessary.

I'm using the latest commit (30ebb31f) of the tesseract engine, and I tried with the latest commits (4767ea9 & e2aad9b) of both tessdata and tessdata_best.

Can I do anything to improve the OCR result in this sort of scenario?

Chris

Zdenko Podobny

Feb 27, 2022, 2:56:20 AM
to tesser...@googlegroups.com
tesseract fix_size.png -

0326
0939
1552
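2206

See the docs for an explanation:
https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md#rescaling
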
Zdenko


On Sat, Feb 26, 2022 at 21:04, Chris McClelland <proph...@gmail.com> wrote:
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/6d94d071-6161-4d21-8733-c5322ee71dd0n%40googlegroups.com.

Merlijn B.W. Wajer

Feb 27, 2022, 5:28:01 AM
to tesser...@googlegroups.com
Hi,

On 27/02/2022 08:55, Zdenko Podobny wrote:
> tesseract fix_size.png -
>
> 0326
> 0939
> 1552
> 2206
>
>
> See the docs for an explanation:
> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md#rescaling

Thanks for the suggestion, I'm also running into this problem in some
cases. Is it possible that this is also some kind of segmentation bug? I
wonder what Tesseract finds here in this clear image that causes it to
produce an extra character.

Regards,
Merlijn

Zdenko Podobny

Feb 27, 2022, 8:23:09 AM
to tesser...@googlegroups.com
I do not know. The upscaling trick has been around since version 3.x; the downscaling trick works from version 4.x onward.
Judging from Willus Dotkom's chart[1], I would guess there is some design decision behind this... But without an explanation from the original Google programmers, we can only guess, or find a bug ;-)
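
For illustration, the rescaling trick can be sketched in shell roughly like this. It is only a sketch under assumptions: `scale_percent` and `rescale_to_height` are made-up helper names, the current digit height has to be measured by hand, and the right target height depends on the data. The actual resizing is done by ImageMagick's `-resize` with a percentage argument.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of tesseract or ImageMagick.

# Print the integer percentage that maps a measured digit height (pixels)
# onto a target digit height, rounded to the nearest whole percent.
# Usage: scale_percent CURRENT_HEIGHT TARGET_HEIGHT
scale_percent() {
    current=$1
    target=$2
    echo $(( (target * 100 + current / 2) / current ))
}

# Resize an image so that digits CURRENT_HEIGHT px high come out
# roughly TARGET_HEIGHT px high.
# Usage: rescale_to_height in.png out.png CURRENT_HEIGHT TARGET_HEIGHT
rescale_to_height() {
    pct=$(scale_percent "$3" "$4")
    convert "$1" -resize "${pct}%" "$2"
}
```

With placeholder heights, usage would look like `rescale_to_height crop.png fix_size.png 120 30`, followed by `tesseract fix_size.png stdout --psm 6`. The 120 and 30 are example values, not measurements from this thread.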


On Sun, Feb 27, 2022 at 11:27, Merlijn B.W. Wajer <mer...@archive.org> wrote:

Chris McClelland

Feb 27, 2022, 8:36:37 AM
to tesseract-ocr
So I did a similar analysis to Willus (see link posted by Zdenko), downscaling the images to try a range of heights for digits. Unfortunately my result is not as nice as Willus's (where he finds that the error rate drops to zero for capital-letter heights of 30-33 pixels). In my case I have 336 images each containing a column of ~30 #### numbers (dataset T) and 336 images each containing a column of ~30 #.# numbers (dataset D).

The error rate for D seems to tend to zero for larger digit heights (i.e., more pixels). The most common error at smaller sizes is a dropped decimal point, e.g., input "1.2" read as "12". To eliminate those errors, I need digits about 92 pixels high.

The error rate for T is more complex. It has a broad trough in the digit-height range 20-48 pixels, with a perfect score at several isolated heights (20, 32 and 38), but no contiguous range of heights that is error-free throughout.
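
A sweep of this kind can be sketched in shell roughly as follows. This is an illustration, not the script used for the numbers in this post: `count_mismatches` and `sweep_heights` are made-up helper names, the ground-truth file (one expected number per line) and the measured digit height are assumed inputs, and an "error" here simply means an OCR output line that differs from the expected line.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of tesseract or ImageMagick.

# Print how many lines differ between an expected-text file and an OCR-output file.
# Lines are compared as strings, so "0326" and "326" count as different.
# Assumes the expected-text file is not empty.
count_mismatches() {
    awk '
        NR == FNR { want[FNR] = $0; n = FNR; next }   # first file: expected lines
        { got[FNR] = $0; if (FNR > n) n = FNR }       # second file: OCR lines
        END {
            for (i = 1; i <= n; i++) {
                if ((want[i] "") != (got[i] "")) bad++
            }
            print bad + 0
        }
    ' "$1" "$2"
}

# For each target digit height, rescale the image, OCR it, and report mismatches.
# Usage: sweep_heights image.png truth.txt CURRENT_HEIGHT "20 24 28 32"
sweep_heights() {
    tmp_img=$(mktemp) && tmp_txt=$(mktemp) || return 1
    for h in $4; do
        pct=$(( (h * 100 + $3 / 2) / $3 ))            # scale factor in percent, rounded
        convert "$1" -resize "${pct}%" "png:$tmp_img"
        # --psm 6 as in the original post; blank lines in the OCR output are dropped
        tesseract "$tmp_img" stdout --psm 6 2>/dev/null \
            | grep -v '^[[:space:]]*$' > "$tmp_txt"
        echo "$h $(count_mismatches "$2" "$tmp_txt")"
    done
    rm -f "$tmp_img" "$tmp_txt"
}
```

For example, `sweep_heights crop.png truth.txt 120 "20 32 38"` would print one "height mismatches" pair per line, where 120 stands in for the digit height measured in the source image.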

Perhaps I could train it myself? Is 336*30*4 ~ 40,000 digits of training data enough to get meaningful results with OCR?

Chris

Zdenko Podobny

Feb 27, 2022, 12:06:06 PM
to tesser...@googlegroups.com
my 2 cents:

First of all, create a public test case/repository focused on this problem, covering e.g. different font families, font sizes, short text (like 0swZuoU.png), long text, etc. It could be used for finding problems/bugs, evaluating possible solutions, and maybe (re)training. Synthetic data imitating real-world cases is fine for this.
I would suggest focusing on the most common fonts on each platform (e.g. on Windows: Arial, Times New Roman, Courier New, Calibri, Cambria, Consolas, Segoe UI; on Linux probably DejaVu, Liberation, Ubuntu; not sure about macOS and iOS ;-)).
I would suggest using a column or paragraph layout for the input images (e.g. to avoid problems with document layout analysis such as tables, headers, footers...).
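
A synthetic test case along these lines could be generated with something like the sketch below. The helper names `make_truth` and `render_column` are made up for illustration, the font name passed to `-font` must be one that the local ImageMagick installation actually knows, and the text is rendered with ImageMagick's `label:` pseudo-image, which is only one of several possible ways to do it.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of tesseract or ImageMagick.

# Print COUNT random zero-padded 4-digit numbers, one per line.
# SEED makes the output repeatable for a given awk implementation.
# Usage: make_truth COUNT SEED
make_truth() {
    awk -v n="$1" -v seed="$2" 'BEGIN {
        srand(seed)
        for (i = 0; i < n; i++) printf "%04d\n", int(rand() * 10000)
    }'
}

# Render a ground-truth file as a single-column image.
# The file should contain only plain digits and dots, because label:
# gives special meaning to characters such as % and backslash.
# Usage: render_column truth.txt FONT POINTSIZE out.png
render_column() {
    convert -font "$2" -pointsize "$3" label:"$(cat "$1")" "$4"
}
```

Usage might look like `make_truth 30 1 > truth.txt` followed by `render_column truth.txt DejaVu-Sans 48 column.png`, repeated over the fonts and sizes of interest. The font name here is only an example.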

Zdenko


On Sun, Feb 27, 2022 at 14:36, Chris McClelland <proph...@gmail.com> wrote:

Orsey Aehr

Feb 28, 2022, 8:01:53 AM
to tesseract-ocr
I found that this PR reduced errors by around 75% in my case: https://github.com/tesseract-ocr/tesseract/pull/3476