Incorrect segmentation of Chinese characters even after training a new model

wei ren

Sep 21, 2017, 6:55:39 AM

to tesseract-ocr

I am new to OCR and tesseract. Please forgive me if I ask some "stupid" questions.

I tried using tesseract 3.04.01 to recognize the Chinese characters in the two attached images and got absurd results, so I merged the two images into one and used the merged image, yueyue.title.exp0.tif, to train a new model. These are the steps I followed:

1. Create the box file.

$ tesseract yueyue.title.exp0.tif yueyue.title.exp0 -l chi_sim batch.nochop makebox

2. Correct the errors in the box file in jTessBoxEditor.

I fixed the segmentation errors and assigned the correct Chinese character to each box.

3. Train the new model.

$ tesseract yueyue.title.exp0.tif yueyue.title.exp0 nobatch box.train
$ unicharset_extractor yueyue.title.exp0.box

4. Create a font_properties file with the following content.

title 0 0 0 0 0

5. Clustering.

$ shapeclustering -F font_properties -U unicharset yueyue.title.exp0.tr
$ mftraining -F font_properties -U unicharset -O unicharset yueyue.title.exp0.tr
$ cntraining yueyue.title.exp0.tr

6. Prefix all the files with "title.".

$ mv unicharset title.unicharset
$ mv inttemp title.inttemp
$ mv pffmtable title.pffmtable
$ mv shapetable title.shapetable
$ mv normproto title.normproto

7. Put all the files together.

$ combine_tessdata title.

8. Copy the new model to the tesseract-ocr tessdata directory.

$ sudo cp title.traineddata /usr/share/tesseract-ocr/tessdata/
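
As a reference for step 4: each font_properties line has the form `<fontname> <italic> <bold> <fixed> <serif> <fraktur>`, with each flag set to 0 or 1, and the font name is expected to match the middle part of the training file name (`title` in `yueyue.title.exp0`). The following is a minimal sketch that writes the entry used above and checks its shape. The `awk` check is only an illustration and is not part of the Tesseract training tools.

```shell
# Write the font_properties entry for the font "title".
# Fields: <fontname> <italic> <bold> <fixed> <serif> <fraktur>; flags are 0 or 1.
printf '%s\n' 'title 0 0 0 0 0' > font_properties

# Illustrative sanity check: every line needs a name plus five 0/1 flags.
awk 'NF != 6 { bad = 1 }
     { for (i = 2; i <= NF; i++) if ($i != "0" && $i != "1") bad = 1 }
     END { exit bad }' font_properties && echo "font_properties looks well-formed"
```

Marking the font as bold or fixed pitch would mean changing the third or fourth field to 1.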

Then I ran the following command to recognize the Chinese characters in the merged training image again.

$ tesseract yueyue.title.exp0.tif stdout -l title

The expected result for both pages is "老妇人和母鸡", but the actual result for the first page is "老老老老妇人老妇母老鸡老" and the actual result for the second page is "老老妇人和母老鸡". I then generated a box file with the new model (also attached):

$ tesseract yueyue.title.exp0.tif yueyue.title.exp0 -l title batch.nochop makebox

Although tesseract now assigns only characters from the new model to the boxes, it still cannot find the correct segmentation. As you can see, three characters are each split into two boxes, even though I had merged each of those pairs into a single box when I corrected the box file used for training.
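
One way to see the extra segmentations at a glance is to count the boxes per page in the hand-corrected box file and in the regenerated one. The sketch below assumes the usual box file layout of `<char> <left> <bottom> <right> <top> <page>` per line. The file names `corrected.box` and `regenerated.box` are hypothetical: the makebox command above writes to yueyue.title.exp0.box, so a copy of the corrected file would have to be kept under another name first.

```shell
# Print the number of boxes on each page of a Tesseract box file.
# Each line has the form: <char> <left> <bottom> <right> <top> <page>
count_boxes_per_page() {
    awk '{ n[$NF]++ } END { for (p in n) print "page " p ": " n[p] " boxes" }' "$1" | sort
}

# Hypothetical file names; the comparison only runs if both files exist.
if [ -f corrected.box ] && [ -f regenerated.box ]; then
    echo "corrected:";   count_boxes_per_page corrected.box
    echo "regenerated:"; count_boxes_per_page regenerated.box
fi
```

A page with more boxes in the regenerated file than in the corrected one contains at least one character that was split.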

I have tried specifying the font as bold and/or fixed in font_properties, and it doesn't help. I have also tried various page segmentation modes, and that doesn't help either.

I have also attached the trained tessdata so you can easily reproduce the problem. Any hints or suggestions would be highly appreciated.

fa9c7c3b-51a9-4c41-b2be-fc01f0f07d68_x2_num0.tif

fa9c7c3b-51a9-4c41-b2be-fc01f0f07d68_x2_num1.tif

yueyue.title.exp0.box

title.traineddata

ShreeDevi Kumar

Sep 21, 2017, 7:01:26 AM

to tesser...@googlegroups.com

You will have much better results if you use the new version of tesseract from https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr

and the traineddata files from https://github.com/tesseract-ocr/tessdata_best

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/18590868-ba1e-457d-8953-b002987d497d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

wei ren

Sep 25, 2017, 3:31:38 PM

to tesseract-ocr

Thank you for the suggestion. I will give tesseract 4.0 a try. I hear that tesseract 4.0 uses an LSTM neural network, so its results should be much better, especially for Chinese, but it may be much slower. Is that true?

By the way, I have also tried tweaking the parameters of tesseract 3.05 and significantly improved the results with the following settings:

assume_fixed_pitch_char_segment  1
textord_use_cjk_fp_model         1
textord_old_xheight              1
textord_min_xheight             60
textord_noise_hfract           0.1
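
For anyone who wants to try these settings, one way to apply them is to save them in a config file and pass that file as the last argument to tesseract. This is a minimal sketch: the config file name `cjk_title` is hypothetical, and using the `-l title` model and the training image from the first post is an assumption, so substitute your own language and image. The same values can also be set one at a time with `-c name=value`.

```shell
# Save the parameters from this post in a config file (hypothetical name: cjk_title).
cat > cjk_title <<'EOF'
assume_fixed_pitch_char_segment  1
textord_use_cjk_fp_model         1
textord_old_xheight              1
textord_min_xheight             60
textord_noise_hfract           0.1
EOF

# Pass the config file as the last argument.
# Runs only if tesseract is installed and the image exists.
if command -v tesseract >/dev/null 2>&1 && [ -f yueyue.title.exp0.tif ]; then
    tesseract yueyue.title.exp0.tif stdout -l title ./cjk_title
fi
```

If the config file is not found by path, copying it into the `configs` directory under tessdata and passing only its name is the other common option.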

ShreeDevi Kumar

Sep 25, 2017, 11:19:59 PM

to tesser...@googlegroups.com

Use https://github.com/tesseract-ocr/tessdata_best if you are planning to retrain

Use https://github.com/tesseract-ocr/tessdata_fast if you want to OCR

See the wiki page for more details

https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#updated-data-files-for-version-400-september-15-2017

ShreeDevi

强华东

Nov 20, 2017, 7:59:18 PM

to tesseract-ocr

Thanks for your parameters. They work well.
On Tuesday, September 26, 2017 at 3:31:38 AM UTC+8, wei ren wrote: