TESSDATA_PREFIX doesn't work with national character(s)

Jan-Erik Lärka

Aug 1, 2025, 7:32:37 AM
to tesseract-ocr
This and related problems seem to have been reported before in various forums, but not addressed.

Tesseract refuses to read any *.traineddata file when TESSDATA_PREFIX contains a national character.

A normal Windows user cannot produce any path other than what the keyboard can type, so UTF-8 encoding the string by hand is out of the question.

Tesseract interprets the national character and outputs a different one (ä -> õ), which indicates that the application converts between the Windows code page (win1252) and the DOS code page (ibm850). It accepts the same folder and files if one creates a symlink to that folder whose path contains no national character. I tested the national characters å, ä and ö (Å, Ä and Ö), but I guess more characters are affected, as seen in the related issue.
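
The ä -> õ substitution is what one would expect if a byte written under Windows-1252 is later read as code page 850. A minimal Python sketch of that reading follows. It only illustrates the byte-level effect and is not Tesseract's code; the helper name `reinterpret` is made up for this example.

```python
def reinterpret(text, written_as="cp1252", read_as="cp850"):
    """Encode text in one code page, then decode the same bytes in another."""
    return text.encode(written_as).decode(read_as)

# Byte 0xE4 is 'ä' in Windows-1252 but 'õ' in code page 850.
# Backslashes and ASCII letters have the same byte in both, so they survive.
print(reinterpret("C:\\bäst\\tessdata\\"))
```

Only the non-ASCII character changes, which matches the altered path in the error message.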
 
Steps 3 and 7 can be replaced by a direct call to tesseract.

How to reproduce: 
1) Start Command Line
2) SET TESSDATA_PREFIX=C:\bäst\tessdata\ 
3) "C:\Program Files\gs\gs10.05.1\bin\gswin64c.exe" -sDEVICE=ocr -r300 -dNOPAUSE -dQUIET -dBATCH -dFirstPage=1 -dLastPage=3 -o- "C:\bäst\test1.pdf" 
4) Output (notice the altered character): 
Error opening data file C:\Users\bõst\tessdata\eng.traineddata 
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. 
Failed loading language 'eng'
Tesseract couldn't load any languages!
**** Unable to open the initial device, quitting. 
5) Create a symlink to the same folder, so that it is reachable as C:\Symlink\tessdata
6) SET TESSDATA_PREFIX=C:\Symlink\tessdata\ 
7) "C:\Program Files\gs\gs10.05.1\bin\gswin64c.exe" -sDEVICE=ocr -r300 -dNOPAUSE -dQUIET -dBATCH -dFirstPage=1 -dLastPage=3 -o- "C:\bäst\test1.pdf" 
8) Output (note that the input path still contains a national character, ä):
Testdata
This is a test of Tesseract.
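
Steps 5 and 6 can also be scripted. The following Python sketch is one possible way to do it, not an official workaround: it creates an ASCII-only symlink to the real folder and builds an environment with TESSDATA_PREFIX pointing at the link. The function names `ascii_alias` and `env_with_prefix` are hypothetical, and the demo uses a scratch directory rather than the real C:\ paths. On Windows, `os.symlink` may require elevated rights or Developer Mode.

```python
import os
import tempfile

def ascii_alias(target_dir, link_path):
    """Create a directory symlink at link_path pointing to target_dir."""
    if not link_path.isascii():
        # The whole point of the alias is an ASCII-only prefix.
        raise ValueError("alias path must be ASCII-only")
    os.symlink(target_dir, link_path, target_is_directory=True)
    return link_path

def env_with_prefix(prefix):
    """Copy the current environment and set TESSDATA_PREFIX to prefix."""
    env = dict(os.environ)
    env["TESSDATA_PREFIX"] = prefix + os.sep
    return env

# Demo in a scratch directory (stand-ins for the folders in the steps above).
scratch = tempfile.mkdtemp()
real_dir = os.path.join(scratch, "bäst", "tessdata")
os.makedirs(real_dir)
link = ascii_alias(real_dir, os.path.join(scratch, "Symlink"))
env = env_with_prefix(link)
print(env["TESSDATA_PREFIX"])
```

The resulting `env` could then be passed to `subprocess.run(..., env=env)` when starting Ghostscript or tesseract.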

I tried UTF-8, but as the output message indicates, that gets reinterpreted as well, into something else: ├ñ in TESSDATA_PREFIX becomes +±.
I also tried U+00E4, but that was not it.
Should it be something like \u00e4, or perhaps \\u00e4, or even something else?
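
The ├ñ and +± observations are consistent with the same two code pages being mixed up. The UTF-8 encoding of ä is the two bytes 0xC3 0xA4, which code page 850 displays as ├ñ, and ñ is byte 0xF1 in Windows-1252, which code page 850 displays as ±. The Python sketch below checks just those two facts. How ├ turns into + is not shown; presumably that is Windows substituting a look-alike for a character missing from Windows-1252, which is an assumption here.

```python
# UTF-8 encodes 'ä' as two bytes; a code page 850 console shows them as '├ñ'.
utf8_bytes = "ä".encode("utf-8")
shown_in_cp850 = utf8_bytes.decode("cp850")
print(utf8_bytes.hex(), shown_in_cp850)

# 'ñ' stored as Windows-1252 (byte 0xF1) and displayed as code page 850.
second_pass = "ñ".encode("cp1252").decode("cp850")
print(second_pass)
```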

I get the same problem running tesseract directly, just as others have reported.
The UTF-8/Unicode support for paths needs some attention to produce the expected output.

It would be most welcome if the UTF-8 path conversion were removed altogether.

Note that Ghostscript itself, in the example above, handles the national character nicely.

//Jan-Erik

Nikola Smolenski

Aug 1, 2025, 8:44:44 PM
to tesser...@googlegroups.com
Out of curiosity, would it work if you try:

SET TESSDATA_PREFIX=C:\b„st\tessdata\



Jan-Erik Lärka

Aug 4, 2025, 3:08:12 AM
to tesseract-ocr

Note that the character appears as ä in the message, but in the command line window (DOS) everything has to be in code page 850.
So the mapping is somewhat off.
The interesting part is that the original example finds the path, but some other part of tesseract refuses to use it.

C:\Temp>set tessdata_prefix=C:\b„st\tessdata\

C:\Temp>@"C:\Program Files\gs\gs10.05.1\bin\gswin64c.exe" -sDEVICE=ocr -r300 -dNOPAUSE -dQUIET -dBATCH -dFirstPage=1 -dLastPage=1 -o- "C:\bäst\1.pdf"
Warning: TESSDATA_PREFIX C:\bäst\tessdata\ does not exist, ignore it
Error opening data file ./eng.traineddata

Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
**** Unable to open the initial device, quitting.
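
The transcript above fits the same pattern in reverse. The character „ is byte 0x84 in Windows-1252, and byte 0x84 is ä in code page 850, so a message printed with that byte looks correct in the console even though a lookup that treats the byte as Windows-1252 would be searching for a folder named b„st. A small Python check of the byte values follows; the interpretation of what Tesseract does with those bytes is an assumption.

```python
typed = "„"                       # U+201E, as entered in the SET command
raw = typed.encode("cp1252")      # the single byte a narrow-string API would see
print(raw.hex())                  # 84
print(raw.decode("cp850"))        # how a code page 850 console renders that byte
```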

Jan-Erik Lärka

Aug 4, 2025, 8:59:03 AM
to tesseract-ocr
The problem is that two places attempt to use TESSDATA_PREFIX, and they have conflicting requirements.
Tesseract itself checks TESSDATA_PREFIX to see whether the prefix directory exists; it does this in C++ using std::filesystem::exists().

On Windows the only way that *both* the verification and loading will work is if the prefix exists and is composed solely of 7-bit ASCII characters, because that is both a valid UTF-8 encoded string, and a valid OS-specific path.
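
The ASCII-only condition can be checked mechanically. The Python sketch below is an illustration, not Tesseract code, and the function name `is_safe_prefix` is hypothetical. Windows-1252 stands in for whatever ANSI code page the system uses, which is an assumption. It shows that an ASCII path has identical bytes under UTF-8 and Windows-1252, while a path containing ä does not.

```python
def is_safe_prefix(path):
    """True if path has the same bytes as UTF-8 and as Windows-1252."""
    try:
        return path.encode("utf-8") == path.encode("cp1252")
    except UnicodeEncodeError:
        # The character cannot be represented in Windows-1252 at all.
        return False

for candidate in ("C:\\Symlink\\tessdata\\", "C:\\bäst\\tessdata\\"):
    print(candidate, is_safe_prefix(candidate))
```

Any non-ASCII character takes at least two bytes in UTF-8 but one byte in Windows-1252, so the two encodings can only agree for pure ASCII input.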

The code related to this therefore needs a little TLC and massaging to allow national characters.