Fine-tuning with an image containing multiple languages

Dec 16, 2022, 9:00:36 AM

to tesseract-ocr

Consider an image containing a mix of English and German text.

I am extracting WordStr boxes from it and fixing the mistakes.

When fine-tuning the two languages, I get encoding errors for English because its unicharset does not contain the German characters.

What is the correct approach here?

1. Ignore encoding errors? What effect does this have on the result?

2. Create two box files changing German words like 'Dänemark' to 'Danemark' for eng?

3. Remove the German WordStr entries from the box file when fine-tuning deu?

4. Add the German characters to the English unicharset?

5. Something else?

/Jacob
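
To make options 2 and 3 concrete, here is a minimal Python sketch, not an official Tesseract tool. It assumes WordStr box lines have the form `WordStr <left> <bottom> <right> <top> <page> #<text>` and that the set of characters the model can encode is already available as a Python set; the names `known_chars`, `wordstr_text`, `missing_chars` and `strip_diacritics` are hypothetical. It lists the characters that would trigger encoding errors and shows the kind of folding that turns 'Dänemark' into 'Danemark'.

```python
import string
import unicodedata


def wordstr_text(line):
    """Return the transcription of a WordStr box line, or None for other lines."""
    if not line.startswith("WordStr "):
        return None
    # Assumed layout: coordinates, then '#', then the ground-truth text.
    _, sep, text = line.partition("#")
    return text.rstrip("\n") if sep else None


def missing_chars(box_lines, known_chars):
    """Collect non-space characters in the transcriptions that are not in known_chars."""
    missing = set()
    for line in box_lines:
        text = wordstr_text(line)
        if text is not None:
            missing.update(
                ch for ch in text if not ch.isspace() and ch not in known_chars
            )
    return missing


def strip_diacritics(text):
    """Fold accented letters to their base letter (option 2 in the list above).

    Characters without a canonical decomposition, such as 'ß', are left unchanged.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Illustrative data only; the coordinates are made up.
sample = [
    "WordStr 10 20 300 60 0 #Dänemark and Denmark\n",
    "\t 300 20 301 60 0\n",
]
known_chars = set(string.ascii_letters)  # stand-in for the eng character set

print(missing_chars(sample, known_chars))
print(strip_diacritics("Dänemark"))
```

Running `missing_chars` on the corrected box files before training would at least show which characters the eng model cannot encode. Whether folding or dropping those lines hurts accuracy is exactly the open question above.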
