Tesseract training details

Chen

Nov 22, 2015, 1:36:51 PM

to tesseract-ocr

I am trying to generate a .traineddata file myself, and I have some questions about the training procedure.

The langdata and tessdata repositories are both on GitHub. Is there an official document describing how to convert langdata into the final .traineddata files? I don't mean the basic procedure on the wiki/TrainingTesseract page, but the exact steps needed to reproduce the official .traineddata files.

I guess the released .traineddata files were generated with tesstrain.sh, but I can't find the script parameters, such as the fonts used, for any language. The important text2image tool also takes a lot of parameters, and I assume the official releases could not have used a single set of parameters for all languages, though I'm not sure. Can anyone guide me on how to reproduce, or nearly reproduce, the official .traineddata files? Effort must have gone into tuning these parameters during training, and I don't want to reinvent the wheel.

BTW, the reason I want to generate the traineddata myself is that I only need to recognize a subset of a language's characters, so training a lighter package should greatly reduce recognition time. Thanks in advance.
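To illustrate the subset idea: one possible first step would be to cut the training text down to lines that use only the characters of interest, and train on that reduced text. This is only a rough sketch under my own assumptions. The function name and the sample lines are hypothetical, and nothing here is confirmed to be part of the official procedure.

```python
def filter_training_text(lines, allowed_chars):
    """Keep only non-empty lines whose non-whitespace characters
    are all in allowed_chars (hypothetical helper, not a Tesseract tool)."""
    allowed = set(allowed_chars)
    kept = []
    for line in lines:
        stripped = line.strip()
        # Skip empty lines and lines containing any character outside the subset.
        if stripped and all(ch in allowed or ch.isspace() for ch in stripped):
            kept.append(stripped)
    return kept


if __name__ == "__main__":
    # Made-up sample lines; a real run would read the language's training text file.
    sample = ["abc 123", "hello", "42 7", ""]
    for line in filter_training_text(sample, "0123456789"):
        print(line)
```

Whether training on such a reduced text actually gives a smaller and faster model is exactly what I would like to confirm.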

Regards,

Chen

Nov 22, 2015, 9:54:26 PM

to tesseract-ocr

I just noticed that language-specific.sh lists the valid fonts for each language. I think I should use all of the fonts listed for a given language.
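As a rough sketch of what that could look like, the helper below assembles a tesstrain.sh command line with an explicit font list. It assumes tesstrain.sh accepts --lang, --langdata_dir, --tessdata_dir, --output_dir, and a --fontlist flag followed by the font names as separate arguments, which is my reading of the script's usage text. The helper name, the directories, and the font names are placeholders, not the values used for the official files.

```python
import shlex


def build_tesstrain_command(lang, fonts, langdata_dir, tessdata_dir,
                            output_dir, script="tesstrain.sh"):
    """Return the argument list for a tesstrain.sh run (hypothetical helper)."""
    cmd = [
        script,
        "--lang", lang,
        "--langdata_dir", langdata_dir,
        "--tessdata_dir", tessdata_dir,
        "--output_dir", output_dir,
    ]
    if fonts:
        # Assumption: each font name is passed as its own argument after --fontlist.
        cmd.append("--fontlist")
        cmd.extend(fonts)
    return cmd


if __name__ == "__main__":
    # Placeholder fonts; the real list would be copied from language-specific.sh.
    cmd = build_tesstrain_command(
        "eng", ["Arial", "Times New Roman"],
        "langdata", "tessdata", "trained_output",
    )
    # Quote each argument so multi-word font names survive the shell.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The argument list can also be handed directly to subprocess.run, which avoids shell quoting problems with font names that contain spaces.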

Regards,

Chen

On Monday, November 23, 2015 at 2:36:51 AM UTC+8, Chen wrote:
