Short word detection recommendations


Jean-Marc Spaggiari

Apr 2, 2024, 8:46:27 AM
to tesseract-ocr
Hi,

I'm trying to OCR short strings made of a letter, a space, and 4 digits.

I'm doing a lot of pre-processing to clean up the picture, and so far I end up with something like this:
output6.png
My challenge is that tesseract only detects the digits. I tried all the possible PSM values with the same result: the leading C is always ignored.

This is the command line that I am running:
tesseract -c tessedit_char_whitelist=" 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ" output6.png stdout

I tried with tesseract 5.3.0 and tesseract 5.3.4-45-g87a15 with the same result. 

I'm looking for recommendations on what I can do better to help tesseract detect the leading C correctly.

Thanks,

JMS

René JM Clais

Apr 2, 2024, 9:49:49 AM
to tesser...@googlegroups.com
Hi Jean-Marc,
I tested your picture with the French language parameter (--psm 6 -l fra) and it works well.
With the English language (-l eng), the C is indeed dropped.
In fact, the C is read as a euro sign (€).
Hope it helps.

Best regards
René
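
To compare the two language settings side by side, the command line can be assembled programmatically. This is only a sketch: the helper name `build_tesseract_command` is hypothetical, and keeping the whitelist from the original command alongside `-l fra` is an assumption, since the reply does not say whether it was used. Nothing is executed here; the function only returns the argument list.

```python
import shlex

# Whitelist copied from the command line in the original post.
WHITELIST = " 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def build_tesseract_command(image, lang="fra", psm=6, whitelist=WHITELIST):
    """Return the argv list for one tesseract run (hypothetical helper)."""
    return [
        "tesseract", image, "stdout",
        "-l", lang,                # language model: "fra" or "eng"
        "--psm", str(psm),         # page segmentation mode
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


if __name__ == "__main__":
    # Print one shell-quoted command per language so they can be run by hand.
    for lang in ("eng", "fra"):
        print(shlex.join(build_tesseract_command("output6.png", lang=lang)))
```

The returned list can be passed to `subprocess.run(..., capture_output=True, text=True)` if tesseract and the `fra` traineddata are installed.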






Jean-Marc Spaggiari

Apr 2, 2024, 10:12:01 AM
to tesseract-ocr
Oh, interesting! Thanks for the suggestion.

It does indeed pick up the C. However, when I do that, it confuses a 0 with a 9 on another example :(

output6.png

I get C 9135 with the 'fra' option and 0135 without. 

I built a small application to split the characters one by one and run them individually through tesseract, and with that I get C 0135 correctly. But it fails with other images. I'm wondering what's wrong with my input picture :-/

JMS
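
The splitting application mentioned above is not shown in the thread. One common way to separate characters in a clean, binarized image is a vertical projection: find runs of columns that contain ink, separated by empty columns. The sketch below assumes the image is already a list of rows of 0/1 values (1 = ink); the function name `split_columns` and the `min_width` parameter are hypothetical.

```python
def split_columns(binary, min_width=1):
    """Split a binary image (rows of 0/1, 1 = ink) into glyph column ranges.

    Returns a list of (start, end) column indices, with end exclusive.
    """
    if not binary:
        return []
    width = len(binary[0])
    # A column counts as inked if any row has a 1 in it.
    has_ink = [any(row[x] for row in binary) for x in range(width)]

    spans, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a glyph begins
        elif not ink and start is not None:
            if x - start >= min_width:     # drop specks narrower than min_width
                spans.append((start, x))
            start = None                   # the glyph ended at column x
    # Close a glyph that touches the right edge of the image.
    if start is not None and width - start >= min_width:
        spans.append((start, width))
    return spans
```

Each resulting crop could then be passed to tesseract with `--psm 10`, which treats the image as a single character. Touching or broken glyphs defeat this simple approach, which may be one reason a per-character split works on some images and fails on others.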

René JM Clais

Apr 3, 2024, 11:35:56 AM
to tesser...@googlegroups.com
I processed your image with the same parameters and it works well with tesseract 5.3.

Jean-Marc Spaggiari

Apr 3, 2024, 12:37:49 PM
to tesseract-ocr
Thanks for giving it a try! I ended up generating 11 versions of the same picture with slightly different filtering, and there is always at least one version that is fully readable. So for now I'm happy with the solution and the ideas provided here.

JMS
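
Choosing among the OCR results of several filtered versions can be automated by checking each result against the expected format (a letter, a space, 4 digits). The sketch below is an assumption about how that selection might be done, not the code used in this thread: the name `pick_readable` is hypothetical, and the majority vote among valid results is an added design choice.

```python
import re
from collections import Counter

# Expected format from the original post: one letter, one space, four digits.
PATTERN = re.compile(r"[A-Z] \d{4}")


def pick_readable(candidates):
    """Return the most frequent OCR result matching the expected format.

    `candidates` holds the raw text returned for each filtered variant.
    Returns None when no variant produced a well-formed result.
    """
    # Collapse whitespace runs and strip trailing newlines from OCR output.
    cleaned = [" ".join(c.split()) for c in candidates]
    valid = [c for c in cleaned if PATTERN.fullmatch(c)]
    if not valid:
        return None
    # Majority vote among well-formed results.
    return Counter(valid).most_common(1)[0][0]
```

For example, `pick_readable(["0135", "C 9135", "C 0135\n", "C 0135"])` discards the result missing its letter and prefers the reading that two variants agree on.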
