Fine-tuning with an image containing multiple languages


Jacob Pedersen

Dec 16, 2022, 9:00:36 AM
to tesseract-ocr
Hi

Consider an image containing a mix of English and German text.

I extract wordstr boxes from it and correct the mistakes.

When I fine-tune the two languages, I get encoding errors for English, because its character set does not contain the German chars.

What is the correct approach here?

1. Ignore encoding errors? What effect does this have on the result?
2. Create two box files changing German words like 'Dänemark' to 'Danemark' for eng?
3. Remove German wordstr's from box file when fine tuning deu?
4. Add German chars to the English unicharset?
5. Something else?
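
For any of these options it may help to first see exactly which characters cause the encoding errors. Below is a minimal sketch, not an official Tesseract tool. It assumes that each wordstr line in the box file starts with "WordStr" and carries its transcription after a '#', and that each unicharset entry after the first (count) line starts with the character followed by a space. The function names are made up for illustration, and the check is per code point, so multi-character unicharset entries are only approximated.

```python
def wordstr_texts(box_lines):
    """Yield the transcription (text after '#') of each WordStr line."""
    for line in box_lines:
        if line.startswith("WordStr ") and "#" in line:
            yield line.split("#", 1)[1].rstrip("\n")


def unicharset_chars(unicharset_lines):
    """Collect the first field of every entry after the count line."""
    chars = set()
    for line in list(unicharset_lines)[1:]:
        first = line.rstrip("\n").split(" ")[0]
        # "NULL" is assumed to be the placeholder entry, not a real character.
        if first and first != "NULL":
            chars.add(first)
    return chars


def missing_chars(box_lines, unicharset_lines):
    """Return the sorted characters used in the box text but absent from the unicharset."""
    known = unicharset_chars(unicharset_lines)
    seen = set()
    for text in wordstr_texts(box_lines):
        seen.update(ch for ch in text if not ch.isspace())
    return sorted(seen - known)


if __name__ == "__main__":
    # Tiny in-memory stand-ins for a box file and an English-like unicharset.
    box = [
        "WordStr 10 20 300 60 0 #Dänemark and Denmark\n",
        "\t 301 20 302 60 0\n",
    ]
    charset = ["9\n", "NULL 0 Common 0\n"] + [f"{c} 3 0\n" for c in "Danemrkd"]
    print(missing_chars(box, charset))
```

With real files, the two arguments would be the lines read (as UTF-8) from the box file and from the unicharset extracted from eng.traineddata.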

/Jacob