
tesseract whitelist not working


Kyle Foley

Apr 1, 2025, 9:22:12 PM
to tesseract-ocr
I'm using Tesseract from Python, and OCR is proving difficult when the text mixes the Greek and Latin alphabets. I was hoping the whitelist feature would solve that problem, but it does not. When I pass the following whitelist,

αςερτυθιοπλκξηγφδσζχψωβνμΣΕΡΤΥΘΙΟΠΛΚΞΗΓΦΔΣΑΖΧΨΩΒΝΜΑΖΧΨΩΒΝΜABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890\/?<>{}[]()*&,;.:-+=|1234567890


I get reasonably good output for the Latin characters, but the Greek text is only roughly 75% accurate. For example, here is some of the output:



Contracted nouns and adjectives in -ους from -οος 63
Adjectives of material in -ots from -εος 64
Nouns in ts, -εως and -υς/-υ, -εως 65

The correct output in the second line should be -οῦς, not -ots.


However, even if the accuracy were 100%, that whitelist would not solve my problem because it does not include the diacritics. When I use a whitelist with diacritics, such as


ΑἉἊἍἋἌᾍᾈᾌᾎᾉAΒΔΗΉἩἨἮἯἬἫἭἪῌᾞᾟᾜᾘᾙῊἜἚἝἛἘἙΈΕΓΙῚἾἿἽἻἺἼἹἸΊIΚΧΞΛΜΝὩὨῼὭὫὬὪὯὮΩΏὉὈὊὋὌὍΟΌῸῺᾨᾩᾯᾮᾪᾫᾬᾭΠΦΨΡΣΤΘὝὛὙΎΥΖᾅᾳᾇᾄᾂᾀᾷᾆᾴᾲἇἆἂἄἅἃάᾶὰαἁἀααᾁᾃβδέὲἕἓἒἔἑἐεἠῆᾖἧᾔᾐᾑἥἣᾕἡἦῄῂῇᾗηήὴἤἢᾒᾓγϊιἰἶἴἲἱΐῒὶίἷἵἳῗιικχλμνὁᾦὀοῷὧωὠᾡὦῳῶὡᾠᾧῴῲὢὤὥὣᾤᾢὅὃὄὂόώὼᾣᾥπφψῤῥρςστθϋὗῧὐὑυῦὔὒύὺὓὕῢΰυυϝξζΑΖΧΨΩΒΝΜABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890\/?<>{}[]()*&,;.:-+=|1234567890 "

I get the output:

ΝΕΗΟΓΑΑΠΚ
Α
ΑΟΗΠΓΠΟΠ
ΑΟΕΠΓ
ΑΕΠΓΟ
ΑΠ
ἸΑΓΝΠΑΟΕΕ
ΡΟΡΟΠ
ΑΙΟΓΠΊ
ΠΟΙΠΕΟΓΠΓΕΠΟΏΡΒΡ
ΑΓ Ι
ΙΠΠΠΠΊΒΠ

I've tried locating the characters that are messing things up but there are too many.  But it is certainly not any of these characters: \/?<>{}[]()*&,;.:-+=|

The image I'm trying to scan is uploaded. Here is the exact Python code I'm using:

```
import pytesseract
# Raw string keeps the backslash literal; note that the whitelist ends with a space.
whitelist = r"ΑἉἊἍἋἌᾍᾈᾌᾎᾉAΒΔΗΉἩἨἮἯἬἫἭἪῌᾞᾟᾜᾘᾙῊἜἚἝἛἘἙΈΕΓΙῚἾἿἽἻἺἼἹἸΊIΚΧΞΛΜΝὩὨῼὭὫὬὪὯὮΩΏὉὈὊὋὌὍΟΌῸῺᾨᾩᾯᾮᾪᾫᾬᾭΠΦΨΡΣΤΘὝὛὙΎΥΖᾅᾳᾇᾄᾂᾀᾷᾆᾴᾲἇἆἂἄἅἃάᾶὰαἁἀααᾁᾃβδέὲἕἓἒἔἑἐεἠῆᾖἧᾔᾐᾑἥἣᾕἡἦῄῂῇᾗηήὴἤἢᾒᾓγϊιἰἶἴἲἱΐῒὶίἷἵἳῗιικχλμνὁᾦὀοῷὧωὠᾡὦῳῶὡᾠᾧῴῲὢὤὥὣᾤᾢὅὃὄὂόώὼᾣᾥπφψῤῥρςστθϋὗῧὐὑυῦὔὒύὺὓὕῢΰυυϝξζΑΖΧΨΩΒΝΜABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890\/?<>{}[]()*&,;.:-+=| "
custom_oem_psm_config = '--oem 3 --psm 6 -c tessedit_char_whitelist="{}"'.format(whitelist)
# img1 is the uploaded image, loaded beforehand.
str4 = pytesseract.image_to_string(img1, config=custom_oem_psm_config, lang='eng+ell')
print(str4)
```

I'm using pytesseract 0.3.13 and I have tesseract 5.3.8 installed.





Screenshot 2025-04-01 at 9.20.20 PM.png

Tom Morris

Apr 7, 2025, 2:21:20 PM
to tesseract-ocr
That looks like it's probably a character encoding issue with how pytesseract constructs/uses its command line. You might try putting the whitelist in a config file and passing that instead to work around the issue.
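
A minimal sketch of that workaround, assuming Tesseract reads config files as one "name value" pair per line and accepts a config file path after the other options. The helper names `write_whitelist_config` and `ocr_with_whitelist` and the file name `whitelist.cfg` are made up for illustration, and this has not been checked against the image in this thread.

```python
import os
import tempfile


def write_whitelist_config(chars, path):
    """Write a Tesseract config file that sets tessedit_char_whitelist."""
    # Assumption: one "name value" pair per line, file encoded as UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write("tessedit_char_whitelist " + chars + "\n")
    return path


def ocr_with_whitelist(image, chars, lang="eng+ell"):
    """Run OCR with the whitelist supplied through a config file."""
    import pytesseract  # imported here so the helper above works without it

    with tempfile.TemporaryDirectory() as tmp:
        cfg = write_whitelist_config(chars, os.path.join(tmp, "whitelist.cfg"))
        # The config file path is passed last, after the options, so the
        # whitelist characters never travel through the command line.
        config = '--oem 3 --psm 6 "{}"'.format(cfg)
        return pytesseract.image_to_string(image, lang=lang, config=config)
```

Calling `ocr_with_whitelist(img1, whitelist)` would then replace the `-c tessedit_char_whitelist=...` option from the original post. Whether this changes the output depends on whether the command line really is the cause.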

You don't mention what language model(s) you are using. If you are using eng+grc, you might try script/Latin+script/Greek to see if it improves things.
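
For reference, the posted code passes `lang='eng+ell'`. A small sketch of choosing between the combinations mentioned here, assuming the installed model names are available as a list of strings; `pick_lang` is a hypothetical helper, not part of pytesseract.

```python
def pick_lang(available,
              candidates=("script/Latin+script/Greek", "eng+grc", "eng+ell")):
    """Return the first candidate whose models are all installed, else None."""
    installed = set(available)
    for candidate in candidates:
        # A combination such as "eng+ell" needs every part to be installed.
        if all(part in installed for part in candidate.split("+")):
            return candidate
    return None
```

The list could come from `pytesseract.get_languages()`, and the result goes into the `lang=` argument of `image_to_string`. Whether the script models show up under the names `script/Latin` and `script/Greek` depends on how the tessdata directory is laid out, so treat those names as an assumption.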

Tom
