Own, Custom tessdata files for training

Rafał Błaczkowski

Jun 9, 2016, 5:23:07 AM
to tesseract-ocr
Hello all!

I have a big problem with tesseract-ocr.
I downloaded the usage example from the official page (net.sourceforge.tess4j.example) just to test how it works.
I also downloaded almost all of the tessdata files (I don't know what the difference between them is) and ran the Java example (using net.sourceforge.tess4j).
I tested it with a very simple TIFF file, and the results were not very good. Some words were recognized correctly, but others were misrecognized, for example BEST instead of DEST and DEF instead of DEP.

I understand that I need to train the OCR engine to recognize my images (font, size, etc.), but I don't know how to go about it. Is there any documentation on this?
I know that some files need to be put in the tessdata directory, but how do I create them?

I also downloaded jTessBoxEditor, loaded a demo image containing my text, and ran a training in the Trainer tab, but after the training nothing seemed to have been done...

Can somebody help me or tell me how to solve my problems?

Many thanks for considering my request!

Quan Nguyen

Jun 13, 2016, 10:32:02 PM
to tesseract-ocr
Images that appear readable to human eyes may not be so to computers. Therefore, image processing is most likely required before the OCR step.
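
As an illustration of what such preprocessing can look like, here is a minimal sketch in plain Java that converts an image to pure black and white with a fixed global threshold. This is only one simple technique and not necessarily the one that fits your images; the class name `OcrPreprocess`, the method `binarize`, and the threshold of 128 are assumptions made up for this example.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper for illustration only; not part of Tess4J or Tesseract.
class OcrPreprocess {

    /** Returns a 1-bit black-and-white copy of src using a fixed threshold (0-255). */
    static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the common luminance weights.
                int gray = (r * 299 + g * 587 + b * 114) / 1000;
                // Pixels darker than the threshold become black, the rest white.
                out.setRGB(x, y, gray < threshold ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: OcrPreprocess <input image> <output.png>");
            return;
        }
        BufferedImage src = ImageIO.read(new File(args[0]));
        if (src == null) {
            // ImageIO returns null when no installed reader understands the format.
            System.err.println("Could not decode " + args[0]);
            return;
        }
        // 128 is an arbitrary starting point; tune it for your scans.
        ImageIO.write(binarize(src, 128), "png", new File(args[1]));
    }
}
```

The cleaned-up PNG can then be passed to the OCR step in place of the original file. Whether this actually improves the DEST/BEST kind of confusion depends on the image, so treat it as an experiment.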

Sure, you can use jTessBoxEditor to train for your language. The generated .traineddata file will be placed in a tessdata folder, and you can use the Validate function to verify the resulting data.
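
To check from Java which trained languages a tessdata folder actually contains, something like the following sketch could be used. It relies only on the naming convention that each language is stored as `<lang>.traineddata`; the class name `TessdataLanguages` and the default folder name `tessdata` are assumptions for this example. The Tess4J calls are shown only as comments because they need the Tess4J jar on the classpath.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tess4J.
class TessdataLanguages {
    static final String SUFFIX = ".traineddata";

    /** Returns the sorted language codes found as <lang>.traineddata in the folder. */
    static List<String> list(File tessdataDir) {
        List<String> langs = new ArrayList<>();
        File[] files = tessdataDir.listFiles();
        if (files == null) {
            return langs; // folder missing or not readable
        }
        for (File f : files) {
            String name = f.getName();
            if (f.isFile() && name.endsWith(SUFFIX)) {
                langs.add(name.substring(0, name.length() - SUFFIX.length()));
            }
        }
        Collections.sort(langs);
        return langs;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : "tessdata");
        System.out.println("Languages found: " + list(dir));
        // With Tess4J available, one of these codes would be selected roughly like this:
        //   ITesseract ocr = new Tesseract();
        //   ocr.setDatapath(dir.getPath());
        //   ocr.setLanguage("eng");   // or the code of your own trained data
        //   String text = ocr.doOCR(new File("sample.tif"));
    }
}
```

If your own language code does not show up in the output, the generated file is probably in a different tessdata folder than the one your program points to.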

Rafał Błaczkowski

Jun 14, 2016, 3:14:53 AM
to tesseract-ocr
Thank you for your answer.
But actually I don't know how to use jTessBoxEditor to train my OCR and get a .traineddata file...
Could you tell me how to use it? Or do you know where I can find a tutorial for it? I couldn't find any...

Quan Nguyen

Jun 14, 2016, 7:11:03 PM
to tesseract-ocr
It automates the process outlined in the Tesseract Training wiki. Once you have read through it, using the tool is straightforward. You can practice with the sample source training files included.
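
For orientation, the manual process for Tesseract 3.x can be sketched as a list of command lines. The small Java program below only builds and prints those commands; it does not run them. The tool names and flags are given from memory of the 3.x training procedure and should be checked against the wiki for your version. The class name `TrainingSteps` and the values `eng`, `myfont`, and `0` are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; it prints commands, it does not execute them.
class TrainingSteps {

    /** Builds the command lines for one training image named <lang>.<font>.exp<n>.tif. */
    static List<List<String>> commands(String lang, String font, int exp) {
        String base = lang + "." + font + ".exp" + exp;
        List<List<String>> steps = new ArrayList<>();
        // 1. Generate a box file, which then has to be corrected by hand
        //    (this is the part a box editor helps with).
        steps.add(Arrays.asList("tesseract", base + ".tif", base, "batch.nochop", "makebox"));
        // 2. Produce the .tr feature file from the image and the corrected box file.
        steps.add(Arrays.asList("tesseract", base + ".tif", base, "box.train"));
        // 3. Extract the character set from the box file.
        steps.add(Arrays.asList("unicharset_extractor", base + ".box"));
        // 4. Cluster features; font_properties is a text file you write yourself.
        steps.add(Arrays.asList("mftraining", "-F", "font_properties", "-U", "unicharset",
                "-O", lang + ".unicharset", base + ".tr"));
        steps.add(Arrays.asList("cntraining", base + ".tr"));
        // 5. After prefixing the generated files (inttemp, normproto, pffmtable,
        //    shapetable) with "<lang>.", combine everything into <lang>.traineddata.
        steps.add(Arrays.asList("combine_tessdata", lang + "."));
        return steps;
    }

    public static void main(String[] args) {
        for (List<String> step : commands("eng", "myfont", 0)) {
            System.out.println(String.join(" ", step));
        }
    }
}
```

Reading the printed commands next to the wiki should make it clearer what the Trainer tab does at each stage and where the .traineddata file comes from.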