Tesseract mistakes letters for numbers

Eric Hodges

Jul 21, 2021, 2:07:15 PM
to tesseract-ocr
I need some help. I have a bunch of images of text like this:

sample_si.jpg
They are all 200 dpi, black and white images. In over 50% of the cases, Tesseract confuses the "SI" at the front for digits. Most of them are "51", but some are "81" or "31".

I've tried tweaking all of the settings I can find, but none of them improve the results. I'm currently using a config file like this:

tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789

Interesting fact: If I cut off the digits and only send the alphas to Tesseract, it recognizes them correctly. Is there something in Tesseract that makes it less likely to mix letters and numbers in a single word?

Any suggestions?

Eric Hodges

Jul 21, 2021, 2:37:15 PM
to tesseract-ocr
Update:

I discovered the command line option:

    -c load_number_dawg=0

That did not improve my results.
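
For anyone reproducing this setup, here is a minimal sketch of how the whitelist from the config file and the `load_number_dawg` setting could both be passed as `-c` options in a single invocation. The helper name `build_tesseract_cmd` is hypothetical, and `stdout` as the output base is an assumption about how the result is collected; the variable names and values are the ones quoted above.

```python
from typing import Dict, List


def build_tesseract_cmd(image_path: str, variables: Dict[str, str]) -> List[str]:
    """Build a tesseract command line that prints the recognized text to stdout.

    Each config variable is passed as its own `-c NAME=VALUE` pair.
    """
    cmd = ["tesseract", image_path, "stdout"]
    for name, value in variables.items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


# Settings taken from the posts above; the image name is the attached sample.
cmd = build_tesseract_cmd(
    "sample_si.jpg",
    {
        "tessedit_char_whitelist": "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789",
        "load_number_dawg": "0",
    },
)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, capture_output=True, text=True)` on a machine where the `tesseract` binary is installed.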

Ajinkya Bobade

Aug 12, 2021, 12:51:12 AM
to tesseract-ocr
Hello,

To do this you will need to retrain Tesseract on top of the model that you currently use. That model was not trained on this specific font, so it approximates the characters with the closest match it knows (here, digits). Take a few samples in the format that you need and fine-tune on top of the original weights. If you have more questions, feel free to email me.

Regards
Ajinkya
Creator of AI Scanner https://imagescanner-online.com/

zdenop

Aug 12, 2021, 2:34:52 AM
to tesseract-ocr
    tesseract string.jpg -
    Warning: Invalid resolution 0 dpi. Using 70 instead.
    Estimating resolution as 558
    SI312533

I used the language model from https://github.com/tesseract-ocr/tessdata with tesseract 4.1.1:

     leptonica-1.81.0 (May 22 2021, 16:14:25) [MSC v.1928 LIB Release x64]
      libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.0.91) : libpng 1.6.37 : libtiff 4.2.0 : zlib 1.2.11 : libwebp 1.2.0 : libopenjp2 2.4.0
     Found AVX2
     Found AVX
     Found FMA
     Found SSE
On Wednesday, July 21, 2021 at 20:07:15 UTC+2, eho...@usdataworks.com wrote:

Eric Hodges

Aug 12, 2021, 9:25:54 AM
to tesser...@googlegroups.com
Thanks for your input, but we can't train Tesseract for any fonts. We are using it for mail that comes from thousands of sources. We have no control over which fonts are used.

We were able to improve results (from 8% success to 87%) by running Tesseract multiple times. One pass looked for letters, one for digits, one for punctuation. If we knew the format the word might take we could improve accuracy that way. But we found no good solution for mixed letters and digits when we don't know the format.
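
A rough sketch of the two ideas in the paragraph above, not the actual production code: `run_passes` calls an OCR function once per character class with a different whitelist, and `fix_letter_prefix` shows how a known format could be used to undo the digit-for-letter confusions reported earlier in this thread ("51", "81", "31" for "SI"). The function names, the small punctuation set, and the "first two characters are letters" rule are all assumptions made for illustration. How the per-pass results were combined is not described here, so the sketch simply returns them side by side.

```python
import string
import subprocess
from typing import Callable, Dict

# One whitelist per pass (letters, digits, punctuation).
# The punctuation set is a hypothetical example, not taken from the thread.
PASS_WHITELISTS: Dict[str, str] = {
    "letters": string.ascii_uppercase,
    "digits": string.digits,
    "punctuation": ".,-/#",
}

# Digit -> letter confusions reported in this thread for a leading "SI".
DIGIT_TO_LETTER = {"5": "S", "8": "S", "3": "S", "1": "I"}


def run_tesseract(image_path: str, whitelist: str) -> str:
    """Run the tesseract CLI once, restricted to the given whitelist."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout",
         "-c", f"tessedit_char_whitelist={whitelist}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def run_passes(
    image_path: str,
    ocr: Callable[[str, str], str] = run_tesseract,
) -> Dict[str, str]:
    """Return the OCR text of each pass, keyed by pass name."""
    return {name: ocr(image_path, wl) for name, wl in PASS_WHITELISTS.items()}


def fix_letter_prefix(text: str, n_letters: int = 2) -> str:
    """Map confusable digits back to letters in the first n_letters positions.

    Only valid when the field is known to start with n_letters letters;
    everything after the prefix is left untouched.
    """
    prefix = "".join(DIGIT_TO_LETTER.get(ch, ch) for ch in text[:n_letters])
    return prefix + text[n_letters:]
```

The `ocr` parameter exists so the pass logic can be exercised with a stand-in function on a machine without Tesseract installed. The prefix fix only helps when the format is known in advance, which matches the limitation stated above for mixed letters and digits in an unknown format.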

--

Eric Hodges

Sr. Product Engineer