The problems about training eng+chinese

易鑫

Mar 19, 2019, 6:49:38 AM
to tesseract-ocr
Hello, everyone:
    I want to recognize the characters in a table (you can find it in the attached file). Previously I only recognized English letters, and the results were pretty good, but now I want to recognize English letters plus Chinese characters, so I retrained the model. Here are my commands:

1) src/training/tesstrain.sh --fonts_dir /usr/share/fonts --training_text ../training_data/chi_sim_tuned.txt \
--langdata_dir ../langdata --tessdata_dir ./tessdata --lang chi_sim --linedata_only --noextract_font_properties  --exposures "0" \
--fontlist  "AR PL UKai CN" \
  "AR PL UKai HK" \
  "AR PL UKai TW" \
  "AR PL UKai TW MBE" \
  "AR PL UMing CN Light" \
  "AR PL UMing HK Light" \
  "AR PL UMing TW Light" \
  "AR PL UMing TW MBE Light" \
  "NSimSun" \
  "Noto Sans CJK JP" \
  "Noto Sans CJK JP Bold" \
  "Noto Sans CJK JP Heavy" \
  "Noto Sans CJK JP Light" \
  "Noto Sans CJK JP Medium" \
  "Noto Sans CJK JP Semi-Light" \
  "Noto Sans CJK JP Ultra-Light" \
  "Noto Sans CJK KR" \
  "Noto Sans CJK KR Bold" \
  "Noto Sans CJK KR Heavy" \
  "Noto Sans CJK KR Light" \
  "Noto Sans CJK KR Medium" \
  "Noto Sans CJK KR Semi-Light" \
  "Noto Sans CJK KR Ultra-Light" \
  "Noto Sans CJK SC" \
  "Noto Sans CJK SC Bold" \
  "Noto Sans CJK SC Heavy" \
  "Noto Sans CJK SC Light" \
  "Noto Sans CJK SC Medium" \
  "Noto Sans CJK SC Semi-Light" \
  "Noto Sans CJK SC Ultra-Light" \
  "Noto Sans CJK TC" \
  "Noto Sans CJK TC Bold" \
  "Noto Sans CJK TC Heavy" \
  "Noto Sans CJK TC Light" \
  "Noto Sans CJK TC Medium" \
  "Noto Sans CJK TC Semi-Light" \
  "Noto Sans CJK TC Ultra-Light" \
  "Noto Sans Mono CJK JP" \
  "Noto Sans Mono CJK JP Bold" \
  "Noto Sans Mono CJK KR" \
  "Noto Sans Mono CJK KR Bold" \
  "Noto Sans Mono CJK SC" \
  "Noto Sans Mono CJK SC Bold" \
  "Noto Sans Mono CJK TC" \
  "Noto Sans Mono CJK TC Bold" \
  "SimSun" \
  "WenQuanYi Zen Hei Medium" \
  "WenQuanYi Zen Hei Mono Medium" \
--output_dir ~/tesstutorial/chi_sim_train

2) mkdir -p ~/tesstutorial/chi_sim_tuned_from_chi_sim
3) combine_tessdata -e ../tessdata_best/chi_sim.traineddata ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim.lstm
4) lstmtraining --model_output ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_tuned \
--continue_from ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim.lstm \
--traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
--old_traineddata ../tessdata_best/chi_sim.traineddata \
--train_listfile ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt \
--max_iterations 10000
5) lstmtraining --stop_training \
--continue_from ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_tuned_checkpoint \
--traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
--model_output ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_tuned.traineddata

The result is not good. Strangest of all, it contains some Chinese characters that do not exist in the training_text file, which I really cannot understand. Can someone help me? Thanks a lot.

The training_text file and the result are also attached.

Sorry for my poor English.

test.xlsx
test.jpg
chi_sim_tuned.txt

Shree Devi Kumar

Mar 19, 2019, 10:01:54 AM
to tesser...@googlegroups.com
You are using a number of Japanese, Korean and Traditional Chinese fonts for training. Try without them.
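
One way to act on this advice is to filter the font list before passing it to --fontlist. The following is a minimal sketch, assuming that region tags in the font name (JP, KR, TC, TW, HK) reliably mark the fonts to drop, which may not hold for every font; `keep_simplified_fonts` is a hypothetical helper, not part of Tesseract:

```shell
#!/bin/sh
# Hypothetical helper: drop font names that look Japanese, Korean, or
# Traditional Chinese, judging only by region tags in the name.
keep_simplified_fonts() {
    grep -v -E ' (JP|KR|TC|TW|HK)( |$)'
}

# A few names taken from the font list in the original post.
printf '%s\n' \
    "AR PL UKai CN" \
    "AR PL UKai TW MBE" \
    "Noto Sans CJK JP Bold" \
    "Noto Sans CJK SC" \
    "Noto Sans Mono CJK TC" \
    "SimSun" | keep_simplified_fonts
```

For the sample input, only "AR PL UKai CN", "Noto Sans CJK SC" and "SimSun" remain. The surviving names still need to be checked by eye before training.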

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/bd740b98-3c0c-4216-88ba-0eb72cdcf3ee%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



易鑫

Mar 19, 2019, 9:24:16 PM
to tesseract-ocr
Thanks for your advice, I will try it.

On Tue, Mar 19, 2019 at 10:01 PM Shree Devi Kumar <shree...@gmail.com> wrote:

Shree Devi Kumar

Mar 19, 2019, 11:16:43 PM
to tesser...@googlegroups.com
Also, 10000 iterations for finetuning will lead to overfitting.

I tried using fewer fonts and adding a couple of English-only fonts that match the typeface in the image you shared. The output is better than with tessdata_best. I assume that you want to limit your unicharset to what is in your training_text (numbers, some English letters, and some Simplified Chinese characters). The image was pre-processed to black and white and deskewed.

I found that --psm 6 gives worse results for both tessdata_best and the fine-tuned model. The default page segmentation mode (--psm 3) is more accurate, although its output contains multiple blank lines for the extra columns it identifies.

See attached:


chi_sim_tuned.txt
chi_sim_best.txt
chi_sim.png
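
The blank lines that the default page segmentation mode produces can be removed in a post-processing step. This is a minimal sketch, assuming blank rows carry no information you need; `strip_blank_lines` is a hypothetical helper, and the tesseract invocation in the comment assumes the fine-tuned model was written to ~/tessdata_best as in this thread:

```shell
#!/bin/sh
# Hypothetical helper: drop lines that are empty or whitespace-only,
# such as the blank rows left for extra columns.
strip_blank_lines() {
    sed '/^[[:space:]]*$/d'
}

# With tesseract installed, the helper could post-process its stdout:
#   tesseract chi_sim.png stdout --tessdata-dir ~/tessdata_best \
#       -l chi_sim_tuned --psm 3 | strip_blank_lines
# The line below only demonstrates the filter on sample text.
printf 'row one\n\n   \nrow two\n' | strip_blank_lines
```

The sample call prints the two non-blank rows and nothing else.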

Shree Devi Kumar

Mar 19, 2019, 11:18:19 PM
to tesser...@googlegroups.com

~/tesseract/src/training/tesstrain.sh \
--fonts_dir ~/.fonts \
--training_text ~/langdata/chi_sim/chi_sim_tuned.txt \
--langdata_dir ~/langdata \
--tessdata_dir ~/tessdata \
--lang chi_sim --linedata_only \
--noextract_font_properties  \
--exposures "0" \
--workspace_dir ~/tmp \
--save_box_tiff \
--fontlist  \
"NSimSun" \
"Arial Unicode MS" \
"SimSun" \
"Merchant Copy" \
"Merchant Copy Doublesize" \
"Noto Sans CJK SC" \
"Noto Sans Mono CJK SC" \
--output_dir ~/tesstutorial/chi_sim_train


mkdir -p ~/tesstutorial/chi_sim_tuned_from_chi_sim

combine_tessdata -e ~/tessdata_best/chi_sim.traineddata ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim.lstm

~/tesseract/bin/src/training/lstmtraining \
--model_output ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_tuned \
--continue_from ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim.lstm \
--traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
--old_traineddata ~/tessdata_best/chi_sim.traineddata \
--train_listfile ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt \
--debug_interval -1 \
--max_iterations 3600

~/tesseract/bin/src/training/lstmtraining \
--stop_training \
--continue_from ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_tuned_checkpoint  \
--traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
--model_output ~/tessdata_best/chi_sim_tuned.traineddata

易鑫

Mar 20, 2019, 12:27:21 AM
to tesseract-ocr
Thank you very much for your reply; your result is pretty good.

You are right, I want to limit my unicharset.
I want to ask you a few questions:

1. What pre-processing did you do? Only binarization, rotation, and deskewing?

2. Your result, chi_sim_tuned.txt, also contains some characters that are not in the training_text file, such as "二" and "》:". Why?

3. How should I choose the max_iterations value? I usually start with a large number such as 10000 so that the model overfits, then reduce it gradually until the model is good.
  Is there a good method for choosing max_iterations?

On Wed, Mar 20, 2019 at 11:18 AM Shree Devi Kumar <shree...@gmail.com> wrote:

Shree Devi Kumar

Mar 20, 2019, 2:20:40 AM
to tesser...@googlegroups.com
On Wed, Mar 20, 2019 at 9:57 AM 易鑫 <yixinl...@gmail.com> wrote:
1. What pre-processing did you do? Only binarization, rotation, and deskewing?

I used IrfanView interactively: rotated the image to straighten the lines, converted it to a 2-color image, and changed the DPI to 300.
I didn't test with the original image. Tesseract also does binarization.

2. Your result, chi_sim_tuned.txt, also contains some characters that are not in the training_text file, such as "二" and "》:". Why?

I don't know. They are probably in the tessdata_best model and don't get fully overwritten during fine-tuning.
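
One way to investigate is to unpack the fine-tuned model and look for the unexpected characters in its unicharset. The sketch below assumes that `combine_tessdata -u` writes the LSTM unicharset to a file ending in `.lstm-unicharset` and that each entry starts with the character followed by a space; `in_unicharset` is a hypothetical helper, and the sample file holds placeholder entries rather than real model data:

```shell
#!/bin/sh
# Hypothetical helper: succeed if character $1 is the first field of an
# entry in unicharset file $2 (line 1 holds the entry count, so skip it).
in_unicharset() {
    awk -v ch="$1" 'NR > 1 && $1 == ch { found = 1 } END { exit !found }' "$2"
}

# With the training tools installed, the real file could come from:
#   combine_tessdata -u ~/tessdata_best/chi_sim_tuned.traineddata /tmp/chi_sim_tuned.
#   in_unicharset '二' /tmp/chi_sim_tuned.lstm-unicharset
# Stand-in unicharset with placeholder entries for demonstration.
sample=$(mktemp)
printf '%s\n' '3' 'NULL 0' '二 5' 'A 3' > "$sample"

if in_unicharset '二' "$sample"; then
    echo "present"
else
    echo "absent"
fi
rm -f "$sample"
```

If a character is present in the unpacked unicharset, the model can still emit it even though the training text never used it.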

3. How should I choose the max_iterations value? I usually start with a large number such as 10000 so that the model overfits, then reduce it gradually until the model is good.
  Is there a good method for choosing max_iterations?

Ray's recommendation is 400 iterations for fine-tuning to a font and 3600 for plus-minus tuning to add a character. You should check an eval set (different from the training set) around these numbers to find the minimum error.
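
The selection step can be sketched as follows, assuming you have already collected one eval error rate per checkpoint (for example by running lstmeval with --model, --traineddata and --eval_listfile on a held-out list). `best_iterations` is a hypothetical helper, and the numbers in the sample call are placeholders for illustration, not measurements:

```shell
#!/bin/sh
# Hypothetical helper: read "iterations error_rate" pairs on stdin and
# print the iteration count with the lowest error rate (first wins ties).
best_iterations() {
    awk 'NR == 1 || $2 + 0 < min { min = $2 + 0; best = $1 }
         END { if (NR > 0) print best }'
}

# Placeholder values only; replace them with your own eval results.
printf '%s\n' '400 2.9' '1200 1.8' '3600 1.4' '10000 2.3' | best_iterations
```

With these placeholder pairs the helper prints 3600. Error that rises again at higher iteration counts is the overfitting signal mentioned earlier in the thread.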


易鑫

Mar 20, 2019, 3:56:38 AM
to tesseract-ocr
Thank you very much.

On Wed, Mar 20, 2019 at 2:20 PM Shree Devi Kumar <shree...@gmail.com> wrote: