Hello! I am trying to use a custom traineddata file, but the OCR accuracy is poor, so I would like to ask for advice.
I have a specific font that I need to train for. I set the base model to kor (Korean), generated ground truth for that font from the kor training_text file, and trained on that data. The command I used for training is as follows:
TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=HDharmony START_MODEL=kor TESSDATA=../tesseract/tessdata MAX_ITERATIONS=1000
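For context, the ground truth follows the standard tesstrain layout: line images paired with .gt.txt transcriptions under data/<MODEL_NAME>-ground-truth/. A quick sanity check for unpaired files looks roughly like this (a sketch only; the directory name follows from MODEL_NAME=HDharmony above):

```python
# Sketch: verify that every line image in the tesstrain ground-truth
# directory has a matching .gt.txt transcription, and vice versa.
# This only checks file pairing, not transcription quality.
from pathlib import Path

gt_dir = Path("data/HDharmony-ground-truth")  # tesstrain default layout

images = {p.stem for p in gt_dir.glob("*.tif")} | {p.stem for p in gt_dir.glob("*.png")}
texts = {p.name.removesuffix(".gt.txt") for p in gt_dir.glob("*.gt.txt")}

print("images without transcriptions:", sorted(images - texts))
print("transcriptions without images:", sorted(texts - images))
```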
This gave me my custom.traineddata. I then tried to OCR my test.pdf again, this time with lang: List[str] = ["custom", "chi_sim", "eng"].
However, the accuracy is clearly worse than when I use the default traineddata and run OCR with lang: List[str] = ["kor", "chi_sim", "eng"].
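For reference, the comparison above is roughly equivalent to this minimal sketch (simplified to call pytesseract directly, which is an assumption about the wrapper; test_page.png stands in for a page image rendered from test.pdf, since Tesseract itself consumes images, and multiple languages are joined with '+'):

```python
# Sketch: run the same page through both language configurations.
# Assumes custom.traineddata sits in the tessdata directory alongside
# kor, chi_sim, and eng.
from typing import List

import pytesseract
from PIL import Image

page = Image.open("test_page.png")  # hypothetical page rendered from test.pdf

for langs in (["custom", "chi_sim", "eng"], ["kor", "chi_sim", "eng"]):
    text: str = pytesseract.image_to_string(page, lang="+".join(langs))
    print(langs, "->", text[:80].replace("\n", " "))
```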
What could be causing this?
My guess is that fine-tuning on a single font reduced the generality of the Korean model. How can I solve this?
Should I increase the number of iterations? Or would it be better to also fine-tune chi_sim and eng on the same font and run OCR with lang: List[str] = ["custom_kor", "custom_chi_sim", "custom_eng"]?
Or can I train on Korean, English, and Chinese at the same time and produce a single custom_total.traineddata?
I'm not sure which of these approaches is right, so I would really appreciate a detailed explanation.
Thank you.