Hello! I am trying to use a custom traineddata file, but the OCR accuracy is poor, so I would like to ask for advice.
I have a specific font that I need to train for. I set the base model to kor (Korean), generated ground truth for that font from the kor training_text file, and trained on that data. The command I used for training is as follows:
TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=HDharmony START_MODEL=kor TESSDATA=../tesseract/tessdata MAX_ITERATIONS=1000
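For context, the ground truth follows the standard tesstrain layout: line images paired with .gt.txt transcriptions under data/<MODEL_NAME>-ground-truth/. A quick sanity check for unpaired files looks roughly like this (a sketch only; the directory name follows from MODEL_NAME=HDharmony above):

```python
# Sketch: verify that every line image in the tesstrain ground-truth
# directory has a matching .gt.txt transcription, and vice versa.
# This only checks file pairing, not transcription quality.
from pathlib import Path

gt_dir = Path("data/HDharmony-ground-truth")  # tesstrain default layout

images = {p.stem for p in gt_dir.glob("*.tif")} | {p.stem for p in gt_dir.glob("*.png")}
texts = {p.name.removesuffix(".gt.txt") for p in gt_dir.glob("*.gt.txt")}

print("images without transcriptions:", sorted(images - texts))
print("transcriptions without images:", sorted(texts - images))
```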
This gave me my custom.traineddata. I then tried to OCR my test.pdf again, this time with lang: List[str] = ["custom", "chi_sim", "eng"].
However, the accuracy is clearly worse than when I use the default traineddata and run OCR with lang: List[str] = ["kor", "chi_sim", "eng"].
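For reference, the comparison above is roughly equivalent to this minimal sketch (simplified to call pytesseract directly, which is an assumption about the wrapper; test_page.png stands in for a page image rendered from test.pdf, since Tesseract itself consumes images, and multiple languages are joined with '+'):

```python
# Sketch: run the same page through both language configurations.
# Assumes custom.traineddata sits in the tessdata directory alongside
# kor, chi_sim, and eng.
from typing import List

import pytesseract
from PIL import Image

page = Image.open("test_page.png")  # hypothetical page rendered from test.pdf

for langs in (["custom", "chi_sim", "eng"], ["kor", "chi_sim", "eng"]):
    text: str = pytesseract.image_to_string(page, lang="+".join(langs))
    print(langs, "->", text[:80].replace("\n", " "))
```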
What could be causing this?
My guess is that fine-tuning on a single font reduced the generality of the Korean model. How can I solve this?
Should I increase the number of iterations? Or would it be better to also fine-tune chi_sim and eng on the same font and run OCR with lang: List[str] = ["custom_kor", "custom_chi_sim", "custom_eng"]?
Or can I train on Korean, English, and Chinese at the same time and produce a single custom_total.traineddata?
I'm not sure which of these approaches is right, so I would really appreciate a detailed explanation.
Thank you.