About training in Tesseract 4.0

an...@vitalify.jp

Feb 12, 2019, 1:46:52 AM

to tesseract-ocr

I have some questions about training in Tesseract 4.0.

1. Since we can't obtain the font file (it is not included in Tesseract's fonts), is there any way to do the training without the font file?

2. We are also training from images. When the same word appears in many images, is it necessary to make a box file for each image, or would it be more accurate with just one box file?

For example, I have a [0] in all the images, and I am declaring this [0] in many box files.

3. Is there any difference or priority in the lang setting?

For example, is there any difference between lang=jpn+eng and lang=eng+jpn?

Is the first language listed in lang treated as the top priority by default?

WidmoPL

Feb 18, 2019, 8:43:17 AM

to tesseract-ocr

Hello, I don't know the answer to 1.

About 2: I think each image file has to have its own box file. Even if all the image files contain the same value, each file is different and has its own name, etc.
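A rough way to check a training folder for this, assuming the common convention that each image sits next to a box file with the same base name (for example `label1.tif` and `label1.box`). The function name and the list of image extensions are made up for illustration; this is only a sketch, not part of Tesseract.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever the training set uses.
IMAGE_SUFFIXES = {".tif", ".tiff", ".png", ".jpg"}


def images_missing_box(train_dir):
    """Return the names of images in train_dir that have no same-named .box file."""
    missing = []
    for image in sorted(Path(train_dir).iterdir()):
        is_image = image.suffix.lower() in IMAGE_SUFFIXES
        if is_image and not image.with_suffix(".box").exists():
            missing.append(image.name)
    return missing
```

Calling `images_missing_box("path/to/training/images")` returns a list of image file names that still need a box file; an empty list means every image is paired.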

About 3: good question. I just checked it on 10 large image files, once with eng in first place and once with eng in second place. The contents of the output files were identical to the letter, so it seems that the order doesn't matter in this case.
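The same check can be repeated on other images with a small script. This is a minimal sketch, assuming the `tesseract` command-line program is on the PATH and the needed language data is installed; the helper names are hypothetical. It uses the standard CLI form `tesseract <image> stdout -l <langs>` and compares the text produced with the languages in both orders.

```python
import subprocess


def tesseract_cmd(image_path, langs):
    """Build a Tesseract CLI call that prints the recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", "+".join(langs)]


def ocr_text(image_path, langs):
    """Run Tesseract on one image and return the recognized text."""
    result = subprocess.run(
        tesseract_cmd(image_path, langs),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise if tesseract exits with an error
    )
    return result.stdout


def order_matters(image_path, langs):
    """Return True if reversing the language order changes the output for this image."""
    forward = ocr_text(image_path, langs)
    backward = ocr_text(image_path, list(reversed(langs)))
    return forward != backward
```

For example, `order_matters("label.png", ["jpn", "eng"])` compares `-l jpn+eng` against `-l eng+jpn` for that one image. A `False` result only says the order made no difference on that image, not in general.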

an...@vitalify.jp

Feb 18, 2019, 8:36:00 PM

to tesseract-ocr

Thanks very much for the answers.

I have one more question: if we resize the same image to different sizes for training, does that help OCR accuracy, or is it the same as using just one image?

WidmoPL

Feb 19, 2019, 6:46:31 AM

to tesseract-ocr

Good question. I checked it, and the results are interesting. When running OCR, higher magnification does give better reading quality, BUT less text is recognized. I checked some labels that I have to OCR; each label was checked at original size, 200%, 600%, and 1000%. In every image (3 of them) the results were the same: fewer errors but less text. Maybe I picked the wrong kind of images, and it should be checked again on other images.

So it definitely makes a difference for training, but from this I'm not sure whether training on resized images gives better or worse results. In my opinion, but I could be wrong, the smaller (not resized) picture would be better, as there are more errors in the text and Tesseract has something to learn. If you give Tesseract text that it already recognizes without error, it won't learn anything.

I took a small screenshot; see for yourself the results of reading at different sizes. Honestly, I didn't think there would be any difference at all. See attachment.
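To put numbers on "fewer errors but less text", one option is to compare each OCR output with a hand-typed transcription of the label. The sketch below is an assumption about how that could be measured, not the method used for the attachment: it uses Python's `difflib.SequenceMatcher` to count matching characters, then reports how much of the reference was found (coverage) and how much of the OCR output was correct (precision). The function name is hypothetical.

```python
from difflib import SequenceMatcher


def coverage_and_precision(reference, ocr_output):
    """Compare OCR output with a reference transcription, character by character.

    coverage:  share of the reference characters that were matched
    precision: share of the OCR output characters that were matched
    """
    matcher = SequenceMatcher(None, reference, ocr_output, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    coverage = matched / len(reference) if reference else 0.0
    precision = matched / len(ocr_output) if ocr_output else 0.0
    return coverage, precision
```

Running this on the output for each size (original, 200%, 600%, 1000%) against the same reference text would show whether precision goes up while coverage goes down as the image gets larger.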

sizetest.jpg

an...@vitalify.jp

Feb 20, 2019, 2:26:10 AM

to tesseract-ocr

Thanks, you're a great help. (bow)
