Short word detection recommendations

Jean-Marc Spaggiari

Apr 2, 2024, 8:46:27 AM

to tesseract-ocr

Hi,

I'm trying to OCR short words in the form of a letter, a space, and 4 digits.

I'm doing a lot of pre-processing to clean up the picture, and so far I end up with something like this:

My challenge is that tesseract only detects the digits. I tried all the possible PSM values with the same result. The leading C is always ignored.

This is the command line that I am running:

tesseract -c tessedit_char_whitelist=" 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ" output6.png stdout

I tried with tesseract 5.3.0 and tesseract 5.3.4-45-g87a15 with the same result.
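
For readers who want to repeat the "try every PSM" step systematically, here is a minimal Python sketch that rebuilds the command above for several page segmentation modes. The helper names (`build_command`, `sweep`) and the particular subset of PSM values are illustrative assumptions, not something taken from this thread; only the `--psm`, `-l` and `-c tessedit_char_whitelist=...` options come from the tesseract command line itself.

```python
import subprocess

# Whitelist copied from the command in the original post.
WHITELIST = " 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Assumed subset of page segmentation modes suited to one short line:
# 6 = uniform block, 7 = single line, 8 = single word,
# 11 = sparse text, 13 = raw line.
PSM_CANDIDATES = (6, 7, 8, 11, 13)


def build_command(image, psm, lang="eng", whitelist=WHITELIST):
    """Return the tesseract argument list for one PSM / language pair."""
    return [
        "tesseract", image, "stdout",
        "--psm", str(psm),
        "-l", lang,
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def sweep(image, psms=PSM_CANDIDATES, lang="eng"):
    """Run tesseract once per PSM and return {psm: recognized text}."""
    results = {}
    for psm in psms:
        proc = subprocess.run(
            build_command(image, psm, lang),
            capture_output=True, text=True, check=False,
        )
        results[psm] = proc.stdout.strip()
    return results
```

Calling `sweep("output6.png")` on a machine with tesseract installed would return one reading per mode, which makes it easy to see at a glance whether any mode keeps the leading letter.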

I'm looking for recommendations on what I can do better to help tesseract detect the leading C correctly.

Thanks,

JMS

René JM Clais

Apr 2, 2024, 9:49:49 AM

to tesser...@googlegroups.com

Hi Jean-Marc,

I tested your picture with the French language parameter, --psm 6 -l 'fra', and it works well.

With the English language, -l eng, the C is indeed dropped.

In fact, the C is read as a euro sign (€).

Hope it helps.
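
One possible way to check for this kind of substitution is to run tesseract without the whitelist and list any recognized characters that the whitelist would have rejected. The sketch below is an assumption about how that check could be done, not a description of how it was done in this thread; `diagnostic_command` and `unexpected_chars` are hypothetical helper names.

```python
# Characters allowed by the whitelist in the original command.
WHITELIST = set(" 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ")


def diagnostic_command(image, lang, psm=6):
    """Tesseract arguments with no whitelist, so every glyph stays visible."""
    return ["tesseract", image, "stdout", "--psm", str(psm), "-l", lang]


def unexpected_chars(text, allowed=WHITELIST):
    """Return characters of `text` outside `allowed`, in order of first use."""
    seen = []
    for ch in text.strip():
        if ch not in allowed and ch not in seen:
            seen.append(ch)
    return seen
```

Feeding the raw output of `diagnostic_command("output6.png", "eng")` and of the same command with `"fra"` into `unexpected_chars` would show whether a symbol such as € is being recognized in place of the letter.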

Best regards

René

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/cf6a3a25-732a-4214-8ce3-03a90a719c8dn%40googlegroups.com.

Jean-Marc Spaggiari

Apr 2, 2024, 10:12:01 AM

to tesseract-ocr

Oh, interesting! Thanks for the suggestion.

It does indeed pick up the C. However, when I do that, it confuses a 0 for a 9 on another example :(

I get C 9135 with the 'fra' option and 0135 without.

I built a small application to split the characters one by one and run them individually through tesseract, and I get C 0135 correctly. But it fails with other images. I'm wondering what's wrong with my input picture :-/
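
The splitting application itself was not posted, so the following is only a sketch of one common way to do it: cut a binarized image at columns that contain no ink. It assumes the image is already available as rows of 0/1 values (1 = ink) and that the characters do not touch; `split_columns` is a hypothetical name.

```python
def split_columns(binary):
    """Split a binary image (rows of 0/1, 1 = ink) at empty columns.

    Returns a list of (start, end) column ranges, with end exclusive,
    one per run of consecutive columns that contain ink.
    """
    if not binary:
        return []
    width = len(binary[0])
    # True for every column that has at least one ink pixel.
    ink = [any(row[x] for row in binary) for x in range(width)]

    spans, start = [], None
    for x, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = x                    # a glyph begins here
        elif not has_ink and start is not None:
            spans.append((start, x))     # the glyph ended at the previous column
            start = None
    if start is not None:
        spans.append((start, width))     # glyph touching the right edge
    return spans
```

Each column range could then be cropped, saved as its own image, and passed to tesseract with `--psm 10`, the mode that treats the image as a single character.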

JMS

René JM Clais

Apr 3, 2024, 11:35:56 AM

to tesser...@googlegroups.com

I processed your image with the same parameters and it works well (tesseract 5.3).

Jean-Marc Spaggiari

Apr 3, 2024, 12:37:49 PM

to tesseract-ocr

Thanks for giving it a try! I ended up generating 11 versions of the same picture with slightly different filtering, and it always ends up with one version that is totally readable. So for now I'm happy with the solution and the ideas provided here.
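
How the readable version is chosen among the 11 was not described. One possible approach, sketched here purely as an assumption, is to keep only the readings that match the expected "letter, space, 4 digits" shape and take the most frequent one; `pick_reading` is a hypothetical helper.

```python
import re
from collections import Counter

# Expected shape from the first post: one letter, a space, four digits.
PATTERN = re.compile(r"[A-Z] \d{4}")


def pick_reading(candidates):
    """Return the most common candidate matching PATTERN, or None."""
    valid = [c.strip() for c in candidates if PATTERN.fullmatch(c.strip())]
    if not valid:
        return None
    # most_common(1) yields [(reading, count)] for the top reading.
    return Counter(valid).most_common(1)[0][0]
```

The vote matters because the pattern alone cannot reject a misread such as C 9135, which has the right shape but the wrong digit; it only loses out if more variants agree on another reading.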

JMS
