Whitelist is not accepting special characters

Shadya S.

Aug 27, 2023, 12:25:25 AM
to tesseract-ocr
I'm using Tesseract 5.3.1 on Windows to recognize text that includes special characters such as ñ, ü, and á. Most of these characters belong to the Latin script, so I've selected the Latin script model (-l Latin) on the command line.

In this image, the special characters are ñ, Ñ, á, and é.
text.png

The command line I'm using is

tesseract text.png stdout --psm 6 -l Latin -c tessedit_char_whitelist=aáeéiocfhklmnñtÑ

However, the output text is missing white spaces between words, and the special characters are being completely ignored, resulting in:

aoloaalcalmoo
okonioniachillalif


Do you know why Tesseract is not taking into account the characters I've declared in the whitelist? Maybe I'm not specifying the special characters correctly.

Any help is greatly appreciated.

Zdenko Podobny

Aug 27, 2023, 3:24:32 PM
to tesser...@googlegroups.com
IMO there is no need to use psm or a whitelist:

tesseract text.png - -l fast/script/Latin
Estimating resolution as 274
Ñato ñelo ñaña álca moño

Ñoko niño niña chillňa élif


For Windows I guess there could be a problem with UTF-8 in the terminal...

Zdenko
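
If the Windows terminal encoding is the suspect, one way to take it out of the picture is to keep the non-ASCII characters off the command line and off the console entirely. The following is only a minimal sketch, assuming Python 3 and a tesseract executable on PATH. The helper names (write_whitelist_config, build_command, run_ocr) and the file name whitelist.cfg are made up for illustration. It relies on Tesseract accepting config files (lines of "name value") after the options, and it has not been checked against the image from this thread, so it may or may not restore the missing characters.

```python
import subprocess


def write_whitelist_config(config_path, chars):
    """Write a Tesseract config file ("name value" per line) as UTF-8 without BOM."""
    with open(config_path, "w", encoding="utf-8", newline="\n") as f:
        f.write(f"tessedit_char_whitelist {chars}\n")


def build_command(image_path, lang="Latin", config_path=None, psm=None):
    """Build the argument list: tesseract IMAGE stdout [options] [configfile]."""
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    if config_path is not None:
        # Config files go last, after all options.
        cmd.append(config_path)
    return cmd


def run_ocr(image_path, lang="Latin", config_path=None, psm=None):
    """Run tesseract and decode its stdout as UTF-8, ignoring the console code page."""
    result = subprocess.run(
        build_command(image_path, lang, config_path, psm),
        capture_output=True,  # capture raw bytes instead of printing to the terminal
        check=True,           # raise CalledProcessError if tesseract fails
    )
    return result.stdout.decode("utf-8")
```

A possible way to use it: call write_whitelist_config("whitelist.cfg", "aáeéiocfhklmnñtÑ"), then run_ocr("text.png", config_path="whitelist.cfg", psm=6), and write the returned string to a file opened with encoding="utf-8" rather than printing it. Viewing that file in an editor shows what Tesseract actually produced, independent of how the terminal renders it.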


On Sun, Aug 27, 2023 at 6:25, Shadya S. <shadyas...@gmail.com> wrote: