I am new to Tesseract-OCR and need help training the engine to recognize Simplified Chinese text.
I just installed Tesseract 4.00Alpha on Windows 10:
$ tesseract --version
tesseract 4.00.00alpha
leptonica-1.74.1
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
I have 3 images containing a Simplified Chinese sentence at different sizes:
chi_sim.Microsoft_Yahei.exp1.tif (small)
chi_sim.Microsoft_Yahei.exp2.tif (medium)
chi_sim.Microsoft_Yahei.exp3.tif (large)
I ran Tesseract to recognize the text in the images using the commands below:
$ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp1.tif chi_sim.Microsoft_Yahei.exp1a
$ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp2.tif chi_sim.Microsoft_Yahei.exp2a
$ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp3.tif chi_sim.Microsoft_Yahei.exp3a
Tesseract recognized the text in the large image perfectly. It missed the final period in the medium image and failed to recognize a number of characters in the small image.
I'd like to train Tesseract to be able to recognize chi_sim.Microsoft_Yahei.exp1.tif and chi_sim.Microsoft_Yahei.exp2.tif. I created box files for both images as chi_sim.Microsoft_Yahei.exp1.box and chi_sim.Microsoft_Yahei.exp2.box using jTessBoxEditor.
The Windows version of Tesseract 4.0 I installed didn't come with tesstrain.sh, so I downloaded the source and extracted the training commands from the script. The documentation mentions LSTM, but I couldn't find any LSTM call in tesstrain.sh. Anyway, I ran the extracted commands as shown below ($TESS_LANG is the path to the langdata folder):
= Phase I: Generating training images =
$ unicharset_extractor -D ./chi_sim chi_sim.Microsoft_Yahei.exp1.box chi_sim.Microsoft_Yahei.exp2.box
= Phase UP: Generating unicharset and unichar properties files =
$ set_unicharset_properties -U ./chi_sim/unicharset -O ./chi_sim/chi_sim.unicharset -X ./chi_sim/chi_sim.xheights --script_dir=$TESS_LANG
= Phase D: Generating Dawg files =
$ wordlist2dawg -r 1 $TESS_LANG/chi_sim/chi_sim.wordlist ./chi_sim/chi_sim.word-dawg ./chi_sim/chi_sim.unicharset
= Phase E: Extracting features =
$ tesseract chi_sim.Microsoft_Yahei.exp2.tif chi_sim.Microsoft_Yahei.exp2 box.train $TESS_LANG/chi_sim/chi_sim.config
$ tesseract chi_sim.Microsoft_Yahei.exp1.tif chi_sim.Microsoft_Yahei.exp1 box.train $TESS_LANG/chi_sim/chi_sim.config
= Phase C: Clustering feature prototypes (cnTraining) =
$ cntraining -D ./chi_sim chi_sim.Microsoft_Yahei.exp1.tr chi_sim.Microsoft_Yahei.exp2.tr
= Phase M : Clustering microfeatures (mfTraining) =
$ mftraining -D ./chi_sim/ -U ./chi_sim/chi_sim.unicharset -O ./chi_sim/chi_sim.mfunicharset -F $TESS_LANG/font_properties -X ./chi_sim/chi_sim.xheights chi_sim.Microsoft_Yahei.exp1.tr chi_sim.Microsoft_Yahei.exp2.tr
= Making final traineddata file =
$ cp $TESS_LANG/chi_sim/chi_sim.config ./chi_sim/.
Add the prefix "chi_sim." to the files "inttemp", "normproto", "pffmtable", and "shapetable".
$ combine_tessdata ./chi_sim/chi_sim.
$ cp ./chi_sim/chi_sim.traineddata $TESSDATA_PREFIX/tessdata/chi_sim_1.traineddata
===================================
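For anyone reproducing this, the manual renaming step in the final phase (adding the "chi_sim." prefix to the four classifier files before running combine_tessdata) could be scripted roughly as follows. This is only a sketch: the function name add_lang_prefix and its two arguments are hypothetical, not part of Tesseract, and it assumes the four files sit directly in the training output directory.

```shell
# Hypothetical helper (not part of Tesseract): prefix the classifier files
# with the language code, e.g. inttemp -> chi_sim.inttemp.
add_lang_prefix() {
  dir="$1"    # training output directory, e.g. ./chi_sim
  lang="$2"   # language code used as the file prefix, e.g. chi_sim
  for f in inttemp normproto pffmtable shapetable; do
    # Skip a file that is missing rather than failing midway.
    if [ -f "$dir/$f" ]; then
      mv "$dir/$f" "$dir/$lang.$f"
    fi
  done
}

# Usage (assumed layout from the steps above):
#   add_lang_prefix ./chi_sim chi_sim
```

A file that is missing is silently skipped, so it is worth listing the directory afterwards to confirm that all four prefixed files are present.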
I reran Tesseract on the 3 images using the commands below:
$ tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp1.tif chi_sim.Microsoft_Yahei.exp1b
$ tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp2.tif chi_sim.Microsoft_Yahei.exp2b
$ tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp3.tif chi_sim.Microsoft_Yahei.exp3b
The large image still produces a perfect result. The medium image gives the same result as before, still missing the period. The small image actually returns a worse result, detecting the wrong number of words in the image.
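To check whether the retrained data changed the output for a given image, the result files from the two runs can be compared byte for byte. The sketch below is an illustration only: compare_runs is a hypothetical helper, and it assumes the result files are the ".txt" files Tesseract writes for each output base name.

```shell
# Hypothetical helper: report whether two OCR result files are identical.
compare_runs() {
  before="$1"   # result from the initial run
  after="$2"    # result from the run after training
  # cmp -s is silent and returns 0 only when the files match exactly.
  if cmp -s "$before" "$after"; then
    echo "unchanged"
  else
    echo "changed"
  fi
}

# Usage (file names assumed from the commands above):
#   compare_runs chi_sim.Microsoft_Yahei.exp1a.txt chi_sim.Microsoft_Yahei.exp1b.txt
```

When the helper reports "changed", running diff on the same two files shows which lines differ.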
I am attaching a zip file containing the images, the box files, and the results (.txt) returned from the initial runs and from the runs after training.
Are my training steps incorrect? What can I do to improve the quality of the OCR engine? Any suggestion will be much appreciated!