I created my traineddata by following these two guides:
Below I describe in detail every step I used.
I have called my test font hwdigitbig.
Here are the steps:
- Create 1 box file for each of my TIF files (each TIF holds samples for 1 digit):
tesseract eng.hwdigitbig.exp0.tif eng.hwdigitbig.exp0 batch.nochop makebox
tesseract eng.hwdigitbig.exp1.tif eng.hwdigitbig.exp1 batch.nochop makebox
...
tesseract eng.hwdigitbig.exp9.tif eng.hwdigitbig.exp9 batch.nochop makebox
- Open box files in jTessBoxEditor and fix incorrect values
- Also in jTessBoxEditor, split/merge invalid bounding boxes (I get many bad bounding boxes in those samples, some spanning 3 characters vertically; I guess I need to clean the images a bit)
- Train tesseract with the fixed box files for each digit
tesseract eng.hwdigitbig.exp0.tif eng.hwdigitbig.exp0.box nobatch box.train
...
tesseract eng.hwdigitbig.exp9.tif eng.hwdigitbig.exp9.box nobatch box.train
- Generate unicharset for all boxes together
unicharset_extractor eng.hwdigitbig.exp0.box eng.hwdigitbig.exp1.box eng.hwdigitbig.exp2.box eng.hwdigitbig.exp3.box eng.hwdigitbig.exp4.box eng.hwdigitbig.exp5.box eng.hwdigitbig.exp6.box eng.hwdigitbig.exp7.box eng.hwdigitbig.exp8.box eng.hwdigitbig.exp9.box
- Font properties file (the simplest font possible, no effects applied to it)
echo "hwdigitbig 0 0 0 0 0" > font_properties
- Clustering step (2 commands, all trained box files together on each command)
- Renaming generated files. The resulting files are:
eng.shapetable
eng.normproto
eng.inttemp
eng.pffmtable
- Generating traineddata
combine_tessdata eng
- The last step generates this file (137 KB)
eng.traineddata
- I then rename this file to my new test language name, which I'll call the same as my font
hwdigitbig.traineddata
So that concludes the steps I used.
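For reference, the repeated per-digit commands above can be written as loops. This is only a sketch: the names `run`, `make_boxes`, `train_boxes` and `extract_unicharset`, and the `DRY_RUN` switch, are made up for illustration and are not part of Tesseract. The clustering, renaming and `combine_tessdata` steps are left out, since the exact clustering commands are not listed above.

```shell
#!/bin/sh
# Sketch of the per-digit steps as loops. All helper names are hypothetical.
# DRY_RUN=1 (the default) only prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"
LANG_CODE="eng"
FONT="hwdigitbig"

# Print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: create one box file per TIF (exp0 .. exp9).
make_boxes() {
    for n in 0 1 2 3 4 5 6 7 8 9; do
        base="$LANG_CODE.$FONT.exp$n"
        run tesseract "$base.tif" "$base" batch.nochop makebox
    done
}

# Step 2 (after fixing the boxes by hand in jTessBoxEditor):
# train on each TIF/box pair, using the same arguments as in the steps above.
train_boxes() {
    for n in 0 1 2 3 4 5 6 7 8 9; do
        base="$LANG_CODE.$FONT.exp$n"
        run tesseract "$base.tif" "$base.box" nobatch box.train
    done
}

# Step 3: build the unicharset from all ten box files in one call.
extract_unicharset() {
    boxes=""
    for n in 0 1 2 3 4 5 6 7 8 9; do
        boxes="$boxes $LANG_CODE.$FONT.exp$n.box"
    done
    # $boxes is left unquoted on purpose so each file becomes its own argument.
    run unicharset_extractor $boxes
}
```

Calling `make_boxes`, then `train_boxes`, then `extract_unicharset` with the default `DRY_RUN=1` prints the same command lines as listed in the steps, which makes it easy to check them before running anything for real.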
The traineddata generated with the steps above is 137 KB, regardless of whether I use my big samples of 6000 characters per digit or the reduced files of 1000 samples per digit.
The OCR results are not satisfactory at all; in fact, even the default eng language gives better results for handwriting recognition.
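One way to double-check that the two sample sets really differ in size is to count the boxes in each box file, since a box file holds one line per character box. A minimal sketch, where `count_boxes` is a hypothetical helper name and not a Tesseract tool:

```shell
# count_boxes FILE...: print "<file> <number of non-empty lines>" per box file.
# Assumes the usual box file layout of one character box per line.
count_boxes() {
    for f in "$@"; do
        # grep -c . counts lines containing at least one character.
        n=$(grep -c . "$f")
        echo "$f $n"
    done
}
```

For example, `count_boxes eng.hwdigitbig.exp*.box` would list the number of boxes for each digit.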
Any ideas/suggestions?
Thank you very much!