This and related problems seem to have been reported before in various forums, but not addressed.
Tesseract refuses to read any *.traineddata file when TESSDATA_PREFIX contains a national character.
A normal Windows user can only enter a path as the keyboard produces it, so manually UTF-8 encoding the string is not a realistic option.
Tesseract reinterprets the national character and outputs a different one (ä -> õ), which indicates that the application converts between the Windows code page (win1252) and the DOS code page (ibm850).
It accepts the same folder and files if one creates a symlink that points to the same folder and has no national character in its path.
I tested the national characters å, ä and ö (Å, Ä and Ö), but I guess more characters are affected, as seen in the related issue.
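The ä -> õ substitution can be reproduced outside Tesseract. The following is only a minimal sketch of the suspected code page mix-up, not Tesseract's actual code; the helper name `cp1252_seen_as_cp850` is made up for illustration. It encodes text as Windows-1252 and reads the resulting bytes back as code page 850:

```python
# Illustration only: assumes the path bytes are Windows-1252 and are
# later interpreted (or displayed) as code page 850.
def cp1252_seen_as_cp850(text: str) -> str:
    """Encode text as Windows-1252, then decode the same bytes as CP850."""
    return text.encode("cp1252").decode("cp850")


# Show what each tested national character turns into.
for ch in "åäöÅÄÖ":
    print(ch, "->", cp1252_seen_as_cp850(ch))

# The folder name from the reproduction steps.
print(cp1252_seen_as_cp850("bäst"))
```

For ä this gives õ, which matches the altered path in the error message.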
Steps 3 and 7 can be replaced by a direct call to tesseract instead.
How to reproduce:
1) Start Command Line
2) SET TESSDATA_PREFIX=C:\bäst\tessdata\
3) "C:\Program Files\gs\gs10.05.1\bin\gswin64c.exe" -sDEVICE=ocr -r300 -dNOPAUSE -dQUIET -dBATCH -dFirstPage=1 -dLastPage=3 -o- "C:\bäst\test1.pdf"
4) Output (notice the altered character):
Error opening data file C:\Users\bõst\tessdata\eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
**** Unable to open the initial device, quitting.
5) Create a symlink to the same folder so that the data is reachable as C:\Symlink\tessdata
6) SET TESSDATA_PREFIX=C:\Symlink\tessdata\
7) "C:\Program Files\gs\gs10.05.1\bin\gswin64c.exe" -sDEVICE=ocr -r300 -dNOPAUSE -dQUIET -dBATCH -dFirstPage=1 -dLastPage=3 -o- "C:\bäst\test1.pdf"
8) Output (notice that the PDF path still contains the national character ä):
Testdata
This is a test of Tesseract.
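For anyone who wants to script the symlink workaround from step 5, here is a rough sketch. The function name `make_ascii_alias` and the temporary folders are hypothetical stand-ins for C:\bäst and C:\Symlink. On Windows, creating symlinks normally requires an elevated prompt or Developer Mode.

```python
import os
import tempfile


def make_ascii_alias(real_dir: str, alias: str) -> str:
    """Create a directory symlink `alias` that points at `real_dir`."""
    os.symlink(real_dir, alias, target_is_directory=True)
    return alias


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Stand-in for C:\bäst\tessdata with an empty data file in it.
        real = os.path.join(root, "bäst")
        os.makedirs(os.path.join(real, "tessdata"))
        open(os.path.join(real, "tessdata", "eng.traineddata"), "wb").close()

        # Stand-in for C:\Symlink, whose name has no national character.
        alias = make_ascii_alias(real, os.path.join(root, "Symlink"))
        prefix = os.path.join(alias, "tessdata")

        # The same file is reachable through the alias.
        print(os.path.exists(os.path.join(prefix, "eng.traineddata")))
```

TESSDATA_PREFIX would then be set to the alias path, as in step 6.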
I tried UTF-8 as well, but as the output message indicates, that is also reinterpreted as something else: ├ñ in TESSDATA_PREFIX becomes +±.
I also tried writing the character as U+00E4, but that was not it either.
Should it be something like \u00e4, or perhaps \\u00e4, or something else entirely?
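The ├ñ seen in the console is consistent with the two UTF-8 bytes of ä (0xC3 0xA4) being displayed in code page 850. A small sketch to check this follows; the helper names are invented for illustration, and the explanation of the second step is a guess, not something confirmed from the Tesseract source:

```python
# Illustration only: reinterpret bytes under a different code page.
def utf8_seen_as_cp850(text: str) -> str:
    """Encode text as UTF-8, then decode the same bytes as CP850."""
    return text.encode("utf-8").decode("cp850")


def cp1252_seen_as_cp850(text: str) -> str:
    """Encode text as Windows-1252, then decode the same bytes as CP850."""
    return text.encode("cp1252").decode("cp850")


# UTF-8 encoded ä shown in a CP850 console.
print(utf8_seen_as_cp850("ä"))

# Assumed second pass: ñ goes through the same win1252/ibm850 mix-up.
print(cp1252_seen_as_cp850("ñ"))
```

The first call gives ├ñ and the second gives ±. The box-drawing character ├ has no Windows-1252 equivalent, so the + may be a substitute character chosen by Windows, but that part is an assumption.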
I get the same problem running tesseract directly, just as others have reported.
The UTF-8/Unicode support for paths needs some attention before it produces the expected output.
It would be most welcome if the UTF-8 path conversion was removed altogether.
Note that Ghostscript itself, in the example above, handles the national character nicely.
//Jan-Erik