Tesseract mistakes letters for numbers

Eric Hodges

Jul 21, 2021, 2:07:15 PM

to tesseract-ocr

I need some help. I have a bunch of images of text like this:

They are all 200 dpi, black and white images. In over 50% of the cases, Tesseract confuses the "SI" at the front for digits. Most of them are "51", but some are "81" or "31".

I've tried tweaking all of the settings I can find, but none of them improve the results. I'm currently using a config file like this:

tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789

Interesting fact: If I cut off the digits and only send the alphas to Tesseract, it recognizes them correctly. Is there something in Tesseract that makes it less likely to mix letters and numbers in a single word?

Any suggestions?

Eric Hodges

Jul 21, 2021, 2:37:15 PM

to tesseract-ocr

Update:

I discovered the command line option:

-c load_number_dawg=0

That did not improve my results.
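For anyone reproducing this from a script: the whitelist and load_number_dawg settings above can also be passed per run with -c. The following is a minimal Python sketch, not the setup actually used here. The helper names build_command and run_ocr are hypothetical, the sketch assumes the tesseract executable is on the PATH, and it prints the result to stdout instead of writing an output file.

```python
import string
import subprocess

# Same character set as the config file above: A-Z followed by 0-9.
WHITELIST = string.ascii_uppercase + string.digits


def build_command(image_path, variables):
    """Return the argv list for one tesseract run that prints text to stdout.

    `variables` maps Tesseract config variable names to values; each one
    becomes a separate `-c NAME=VALUE` pair on the command line.
    """
    cmd = ["tesseract", image_path, "stdout"]
    for name, value in variables.items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


def run_ocr(image_path, variables):
    """Run tesseract once and return the recognized text (hypothetical helper)."""
    result = subprocess.run(
        build_command(image_path, variables),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    settings = {"tessedit_char_whitelist": WHITELIST, "load_number_dawg": 0}
    # Only prints the command; call run_ocr() to actually invoke tesseract.
    print(build_command("string.jpg", settings))
```

Keeping the command builder separate from the subprocess call makes it easy to log or compare the exact options used for each attempt.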

Ajinkya Bobade

Aug 12, 2021, 12:51:12 AM

to tesseract-ocr

Hello,

To do this, you will need to retrain Tesseract on top of the model that you currently use. That model was not trained on this specific font, so it can only approximate the characters. Take a few samples in the format that you need and retrain on top of the original weights. If you have more questions, feel free to email me.

Regards
Ajinkya
Creator of AI Scanner https://imagescanner-online.com/

zdenop

Aug 12, 2021, 2:34:52 AM

to tesseract-ocr

tesseract string.jpg -

Warning: Invalid resolution 0 dpi. Using 70 instead.

Estimating resolution as 558

SI312533

I use the language model from https://github.com/tesseract-ocr/tessdata and Tesseract 4.1.1.

leptonica-1.81.0 (May 22 2021, 16:14:25) [MSC v.1928 LIB Release x64]

libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.0.91) : libpng 1.6.37 : libtiff 4.2.0 : zlib 1.2.11 : libwebp 1.2.0 : libopenjp2 2.4.0

Found AVX2

Found AVX

Found FMA

Found SSE

On Wednesday, July 21, 2021 at 20:07:15 UTC+2, eho...@usdataworks.com wrote:

Eric Hodges

Aug 12, 2021, 9:25:54 AM

to tesser...@googlegroups.com

Thanks for your input, but we can't train Tesseract for any fonts. We are using it for mail that comes from thousands of sources. We have no control over which fonts are used.

We were able to improve results (from 8% success to 87%) by running Tesseract multiple times. One pass looked for letters, one for digits, one for punctuation. If we knew the format the word might take we could improve accuracy that way. But we found no good solution for mixed letters and digits when we don't know the format.
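To illustrate the known-format case, here is a minimal Python sketch of a format-aware cleanup step. It is not our production code. The function name apply_format, the pattern syntax ("A" for a letter position, "9" for a digit position), and the confusion tables are all assumptions made for illustration. The digit-to-letter entries come from the misreads reported earlier in this thread ("SI" read as "51", "81" or "31"); the letter-to-digit entries are guesses.

```python
# Hypothetical confusion tables. The digit-to-letter entries mirror the
# misreads reported in this thread; the letter-to-digit entries are assumed.
DIGIT_TO_LETTER = {"5": "S", "8": "S", "3": "S", "1": "I"}
LETTER_TO_DIGIT = {"S": "5", "I": "1", "O": "0"}


def apply_format(text, pattern):
    """Coerce OCR output to a known format.

    `pattern` uses "A" for a letter position and "9" for a digit position.
    Returns the corrected string, or None if the text cannot be made to fit.
    """
    if len(text) != len(pattern):
        return None
    corrected = []
    for ch, kind in zip(text, pattern):
        if kind == "A":
            # A digit in a letter slot is swapped for its look-alike letter.
            ch = DIGIT_TO_LETTER.get(ch, ch)
            if not ch.isalpha():
                return None
        elif kind == "9":
            # A letter in a digit slot is swapped for its look-alike digit.
            ch = LETTER_TO_DIGIT.get(ch, ch)
            if not ch.isdigit():
                return None
        else:
            raise ValueError(f"unknown pattern character: {kind!r}")
        corrected.append(ch)
    return "".join(corrected)


if __name__ == "__main__":
    # "AA999999" (two letters, six digits) is an assumed format for this sample.
    print(apply_format("51312533", "AA999999"))
```

With the assumed pattern "AA999999", the misreads "51312533" and "81312533" both come back as "SI312533". This only helps when the format is known in advance; it does nothing for the unknown-format case described above.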


