Need help training Simplified Chinese.

Clement

Jun 22, 2017, 2:40:22 AM

to tesseract-ocr

I am new to Tesseract-OCR and need help training the engine to recognize Simplified Chinese text.

I just installed Tesseract 4.00Alpha on Windows 10:

$ tesseract --version
tesseract 4.00.00alpha
leptonica-1.74.1
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0

I have 3 images, each containing a Simplified Chinese sentence, at different sizes:

chi_sim.Microsoft_Yahei.exp1.tif (small)
chi_sim.Microsoft_Yahei.exp2.tif (medium)
chi_sim.Microsoft_Yahei.exp3.tif (large)

I ran Tesseract to recognize the texts in the images using the commands below:

$ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp1.tif chi_sim.Microsoft_Yahei.exp1a
$ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp2.tif chi_sim.Microsoft_Yahei.exp2a
$ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp3.tif chi_sim.Microsoft_Yahei.exp3a

Tesseract recognized the text in the large image perfectly. It missed the final "period" symbol in the medium image and failed to recognize a number of characters in the small image.

I'd like to train Tesseract to be able to recognize chi_sim.Microsoft_Yahei.exp1.tif and chi_sim.Microsoft_Yahei.exp2.tif. I created box files for both images as chi_sim.Microsoft_Yahei.exp1.box and chi_sim.Microsoft_Yahei.exp2.box using jTessBoxEditor.

The Windows version of Tesseract 4.0 I installed didn't come with tesstrain.sh, so I downloaded the source and extracted the training commands from the script. The documentation mentions LSTM, but I couldn't find any LSTM call in tesstrain.sh. Anyway, I ran the extracted commands as shown below ($TESS_LANG is the path of the langdata folder):

= Phase I: Generating training images =
$ unicharset_extractor -D ./chi_sim chi_sim.Microsoft_Yahei.exp1.box chi_sim.Microsoft_Yahei.exp2.box

= Phase UP: Generating unicharset and unichar properties files =
$ set_unicharset_properties -U ./chi_sim/unicharset -O ./chi_sim/chi_sim.unicharset -X ./chi_sim/chi_sim.xheights --script_dir=$TESS_LANG

= Phase D: Generating Dawg files =
$ wordlist2dawg -r 1 $TESS_LANG/chi_sim/chi_sim.wordlist ./chi_sim/chi_sim.word-dawg ./chi_sim/chi_sim.unicharset

= Phase E: Extracting features =

$ tesseract chi_sim.Microsoft_Yahei.exp2.tif chi_sim.Microsoft_Yahei.exp2 box.train $TESS_LANG/chi_sim/chi_sim.config
$ tesseract chi_sim.Microsoft_Yahei.exp1.tif chi_sim.Microsoft_Yahei.exp1 box.train $TESS_LANG/chi_sim/chi_sim.config

= Phase C: Clustering feature prototypes (cnTraining) =
$ cntraining -D ./chi_sim chi_sim.Microsoft_Yahei.exp1.tr chi_sim.Microsoft_Yahei.exp2.tr

= Phase M : Clustering microfeatures (mfTraining) =
$ mftraining -D ./chi_sim/ -U ./chi_sim/chi_sim.unicharset -O ./chi_sim/chi_sim.mfunicharset -F $TESS_LANG/font_properties -X ./chi_sim/chi_sim.xheights chi_sim.Microsoft_Yahei.exp1.tr chi_sim.Microsoft_Yahei.exp2.tr

= Making final traineddata file =
$ cp $TESS_LANG/chi_sim/chi_sim.config ./chi_sim/.

Add the prefix "chi_sim." to the files "inttemp", "normproto", "pffmtable", and "shapetable" in ./chi_sim.
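The renaming step above can be sketched as a short shell function. The function name prefix_training_files is hypothetical; the four file names and the "chi_sim." prefix are the ones listed above, and the sketch assumes the files were written to the directory passed as the first argument.

```shell
# prefix_training_files DIR LANG
# Rename the clustering outputs in DIR so that they carry the language
# prefix, e.g. DIR/inttemp -> DIR/chi_sim.inttemp. A missing file is
# reported on stderr and skipped rather than treated as fatal.
prefix_training_files() {
  dir=$1
  lang=$2
  for name in inttemp normproto pffmtable shapetable; do
    if [ -f "$dir/$name" ]; then
      mv "$dir/$name" "$dir/$lang.$name"
    else
      echo "skipping $name: not found in $dir" >&2
    fi
  done
}
```

For the layout used in this post, the call would be: prefix_training_files ./chi_sim chi_sim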

$ combine_tessdata ./chi_sim/chi_sim.

$ cp ./chi_sim/chi_sim.traineddata $TESSDATA_PREFIX/tessdata/chi_sim_1.traineddata

===================================

I reran Tesseract on the 3 images using the commands below:

$ tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp1.tif chi_sim.Microsoft_Yahei.exp1b

$ tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp2.tif chi_sim.Microsoft_Yahei.exp2b

$ tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp3.tif chi_sim.Microsoft_Yahei.exp3b

The large image still produces a perfect result. The medium image gives the same result as before, missing the "period" symbol. The small image actually returns a worse result: Tesseract detects the wrong number of words in the image.

I am attaching a zip file containing the images, the box files, and the results (.txt) from the initial runs and from the runs after training.

Are my training steps incorrect? What can I do to improve the quality of the OCR engine? Any suggestion will be much appreciated!

chi_sim_training.zip

ShreeDevi Kumar

Jun 22, 2017, 4:27:34 AM

to tesser...@googlegroups.com

Your best bet for improving recognition is to preprocess the small and medium images by scaling them up to a larger size.

Please see https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
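As a rough sketch of that preprocessing step, the functions below enlarge an image before it is passed to tesseract. The use of ImageMagick's convert command, the function names, and the output naming scheme are all assumptions made for illustration; none of them come from this thread, and any tool that resamples images would serve.

```shell
# upscaled_name IMAGE PERCENT
# Print the output file name for an enlarged copy of IMAGE,
# e.g. scan.tif at 300 percent -> scan.x300.tif.
upscaled_name() {
  printf '%s.x%s.%s\n' "${1%.*}" "$2" "${1##*.}"
}

# upscale_image IMAGE PERCENT
# Enlarge IMAGE with ImageMagick (assumed to be installed) and print
# the name of the file that was written.
upscale_image() {
  out=$(upscaled_name "$1" "$2")
  convert "$1" -resize "$2%" "$out" && printf '%s\n' "$out"
}
```

A possible use, with 300 as an arbitrary scale factor to experiment with: big=$(upscale_image chi_sim.Microsoft_Yahei.exp1.tif 300) && tesseract -l chi_sim "$big" "${big%.tif}"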

Tesseract 4.00.00alpha currently has two different OCR engines in it. The legacy Tesseract engine is accessible with --oem 0 and the new LSTM engine is accessible with --oem 1.

The option --oem 2 uses both together, and --oem 3 uses whichever one has been defined as the default.

The training process that you followed builds a new model for the legacy engine, not LSTM.

If you look at the output of your first test, you will notice that there are spaces after each character in the OCRed text, which has been reported as an issue with the LSTM model. The legacy model does not add the extra spaces, but its accuracy is lower.
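To compare the two engines on the same image, something like the sketch below could be used. The function names are hypothetical, the mode descriptions are the ones given above, and it assumes a build recent enough to accept --oem and a traineddata file that contains both the legacy and the LSTM model.

```shell
# oem_label MODE
# Describe an --oem value using the meanings given in this reply.
oem_label() {
  case $1 in
    0) echo "legacy engine" ;;
    1) echo "LSTM engine" ;;
    2) echo "legacy and LSTM together" ;;
    3) echo "default engine" ;;
    *) echo "unknown mode" >&2; return 1 ;;
  esac
}

# compare_engines IMAGE LANG
# Run IMAGE through the legacy and the LSTM engine in turn, writing
# <stem>.oem0.txt and <stem>.oem1.txt for a side-by-side comparison.
compare_engines() {
  base=${1%.*}
  for mode in 0 1; do
    echo "running $(oem_label "$mode") (--oem $mode)" >&2
    tesseract "$1" "$base.oem$mode" -l "$2" --oem "$mode"
  done
}
```

For example: compare_engines chi_sim.Microsoft_Yahei.exp1.tif chi_sim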

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/94caecfe-698d-4724-bf28-a46579d1e21f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Clement

Jun 25, 2017, 10:48:19 AM

to tesseract-ocr

Thanks for your reply. I have another question related to the oem option you mentioned. Is it for the training command (tesstrain.sh) or the recognition command (tesseract)?

I installed Tesseract 4.00alpha on Linux. When I ran tesseract on an image, the output was in the old (3.x) format, without the extra spaces, but the recognition quality was poor. I have no other version of Tesseract installed on the same box.

I tried to specify the "--oem 1" option but it didn't work:
$ tesseract 001a3.png 001a3 -l chi_sim --oem 1
read_params_file: Can't open 1
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica

ShreeDevi Kumar

Jun 25, 2017, 10:53:34 AM

to tesser...@googlegroups.com

>> I installed Tesseract 4.00alpha on Linux.

How did you install it?

Did you use the latest code from GitHub?

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


shree

Jun 25, 2017, 10:58:54 AM

to tesseract-ocr

See https://github.com/tesseract-ocr/tesseract/pull/515 for when this option was implemented (after the https://github.com/tesseract-ocr/tesseract/releases/tag/4.00.00alpha tag).

You should install using the latest code on GitHub.
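One way to tell whether an installed build predates the option is to look for the error shown earlier in this thread, where the "1" in "--oem 1" appears to have been taken for a config file name. The helper below is only a sketch: the function name is hypothetical, and it matches nothing more than the message text quoted above.

```shell
# oem_unsupported
# Read tesseract's combined output on stdin and succeed if it contains
# the "read_params_file: Can't open" error reported in this thread.
oem_unsupported() {
  grep -q "read_params_file: Can't open"
}
```

A possible use: tesseract 001a3.png 001a3 -l chi_sim --oem 1 2>&1 | oem_unsupported && echo "this build does not accept --oem; rebuild from the latest code"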

Clement

Jun 27, 2017, 12:48:43 AM

to tesseract-ocr

I downloaded the alpha source code from the link below:

https://github.com/tesseract-ocr/tesseract/releases/tag/4.00.00alpha

I installed using the following commands:

$ ./autogen.sh

$ ./configure PKG_CONFIG_PATH=/usr/local/lib/pkgconfig LIBLEPT_HEADERSDIR=/usr/local/include LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib --bindir=/usr/local/sbin

$ sudo make install

$ make

$ make training

$ sudo make training-install

I also tried the dev version from Nov 24, 2016 but the behavior was the same:

https://github.com/tesseract-ocr/tesseract/releases/tag/4.00.00dev

Would you suggest I try again with the latest code?


ShreeDevi Kumar

Jun 27, 2017, 1:13:02 AM

to tesser...@googlegroups.com

On Tue, Jun 27, 2017 at 10:18 AM, Clement wrote:

Would you suggest I try again with the latest code?

Yes, please.

A number of fixes have been applied since those tags: over 500 commits to the master branch.

So if you want to try the LSTM engine, use the latest code and follow the instructions at https://github.com/tesseract-ocr/tesseract/wiki/Compiling-%E2%80%93-GitInstallation

Get the code with:

git clone https://github.com/tesseract-ocr/tesseract.git
