The problem of training eng + chi_sim


易鑫

Mar 25, 2019, 6:44:12 AM
to tesseract-ocr
Hello everyone,
  I have been working on training eng + chi_sim for several days, but one urgent issue confuses me. I asked this question before but did not get a helpful reply, so I am asking again. Sorry for disturbing you.

My steps are as follows:

src/training/tesstrain.sh --fonts_dir /usr/share/fonts \
  --training_text ../training_data/chi_sim_tuned.txt \
  --langdata_dir ../langdata --tessdata_dir ./tessdata \
  --lang chi_sim --linedata_only --noextract_font_properties \
  --exposures "0" \
  --workspace_dir ./share/workspace/tmp \
  --save_box_tiff \
  --fontlist "NSimSun" \
    "Times New Roman" \
    "Arial Unicode MS" \
    "SimSun" \
    "Merchant Copy" \
    "Merchant Copy Doublesize" \
    "Noto Sans CJK SC" \
    "Noto Sans Mono CJK SC" \
  --output_dir ~/tesstutorial/chi_sim_train \
  --overwrite


mkdir -p ~/tesstutorial/chi_sim_tuned_from_chi_sim 

 

combine_tessdata -e ../tessdata_best/chi_sim.traineddata ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim.lstm


lstmtraining --model_output ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_tuned \
--continue_from ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim.lstm \
--traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
--old_traineddata ../tessdata_best/chi_sim.traineddata \
--train_listfile ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt \
--max_iterations 3000

lstmtraining --stop_training \
--continue_from ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_tuned_checkpoint \
--traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
--model_output ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_tuned.traineddata

The training text file is attached.


What confuses me is that the result contains some characters that do not appear in the training text file. (Only the chi_sim characters have this problem; eng is OK.)

Can anyone help me? Thanks a lot.
I have also uploaded the image and the result file. Thanks in advance.
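
To pin down exactly which characters are unexpected, the OCR output can be compared against the training text. The following is only a rough sketch: it assumes both files are plain UTF-8 text, that the shell runs in a UTF-8 locale, and that the OCR result has been saved as a text file (the name result.txt in the usage note is hypothetical). The function names are illustrative.

```shell
#!/bin/sh
# Sketch: list characters that appear in an OCR result but not in the training text.
# Assumes UTF-8 files and a UTF-8 locale, so that "grep -o ." splits the text
# into characters rather than bytes.

# Print each distinct character of a file on its own line, sorted in byte order.
unique_chars() {
  grep -o . "$1" | LC_ALL=C sort -u
}

# Usage: extra_chars TRAINING_TEXT OCR_RESULT
extra_chars() {
  tmp_train=$(mktemp) && tmp_ocr=$(mktemp) || return 1
  unique_chars "$1" > "$tmp_train"
  unique_chars "$2" > "$tmp_ocr"
  # comm -13 keeps only the lines that are unique to the second file.
  LC_ALL=C comm -13 "$tmp_train" "$tmp_ocr"
  rm -f "$tmp_train" "$tmp_ocr"
}
```

For example, `extra_chars chi_sim_tuned.txt result.txt` would print one line per character that the model produced but that never occurs in the training text.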

Thank you.







chi_sim_tuned.txt
test.jpg
test.xlsx

Shree Devi Kumar

Mar 25, 2019, 8:39:47 AM
to tesser...@googlegroups.com
Try replacing a layer. You may need a larger training text and more iterations.

lstmtraining --model_output ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim_layer  \
--continue_from ~/tesstutorial/chi_sim_tuned_from_chi_sim/chi_sim.lstm \
--traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
--append_index 5 --net_spec '[Lfx192 O1c1]' \
--train_listfile ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt \
--max_iterations 30000
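
Once this run finishes, the checkpoint still has to be converted into a traineddata file before tesseract can load it, in the same way as the --stop_training step in the original post. A minimal sketch, assuming the checkpoint is written as chi_sim_layer_checkpoint (the --model_output prefix plus _checkpoint) and that chi_sim_layer.traineddata is just an illustrative output name. The run helper is hypothetical and only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch: convert the layer-replacement checkpoint into a usable traineddata file.
# Paths follow the commands in this thread; the output file name is an assumption.
OUT_DIR="$HOME/tesstutorial/chi_sim_tuned_from_chi_sim"
TRAIN_DIR="$HOME/tesstutorial/chi_sim_train"

# Hypothetical helper: print the command by default, execute it when RUN=1.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

finalize_model() {
  run lstmtraining --stop_training \
    --continue_from "$OUT_DIR/chi_sim_layer_checkpoint" \
    --traineddata "$TRAIN_DIR/chi_sim/chi_sim.traineddata" \
    --model_output "$OUT_DIR/chi_sim_layer.traineddata"
}
```

Calling `finalize_model` shows the command that would be run; exporting RUN=1 first executes it.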


Shree Devi Kumar

Mar 25, 2019, 11:43:38 AM
to tesser...@googlegroups.com
36000 iterations, error rate 0.1

OCR output attached
chi_sim_layer.txt

易鑫

Mar 25, 2019, 9:50:53 PM
to tesseract-ocr
Okay. Thank you very much.
But will overfitting happen with 36000 iterations?
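
One common way to look for overfitting is to evaluate the same checkpoint on the training data and on held-out data with lstmeval, then compare the two error rates; a much lower error on the training list suggests overfitting. This is a sketch only: the evaluation list path is hypothetical (a separate evaluation set would have to be generated first), and the run helper is an illustrative wrapper that prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch: evaluate one checkpoint on training data and on held-out data.
# EVAL_LIST is a hypothetical path; no evaluation set is mentioned in this thread.
OUT_DIR="$HOME/tesstutorial/chi_sim_tuned_from_chi_sim"
TRAIN_DIR="$HOME/tesstutorial/chi_sim_train"
EVAL_LIST="$HOME/tesstutorial/chi_sim_eval/chi_sim.training_files.txt"

# Hypothetical helper: print the command by default, execute it when RUN=1.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# Usage: eval_on LISTFILE
eval_on() {
  run lstmeval \
    --model "$OUT_DIR/chi_sim_layer_checkpoint" \
    --traineddata "$TRAIN_DIR/chi_sim/chi_sim.traineddata" \
    --eval_listfile "$1"
}

# Evaluate on the training list first, then on the held-out list.
compare_errors() {
  eval_on "$TRAIN_DIR/chi_sim.training_files.txt"
  eval_on "$EVAL_LIST"
}
```

Running `compare_errors` with RUN=1 exported evaluates both lists in turn, so the reported error rates can be compared side by side.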

On Mon, Mar 25, 2019 at 11:43 PM, Shree Devi Kumar <shree...@gmail.com> wrote:

易鑫

Mar 25, 2019, 10:20:53 PM
to tesseract-ocr
Also, how many lines of training text would be best? My character set has no more than 100 characters in total.

On Tue, Mar 26, 2019 at 9:50 AM, 易鑫 <yixinl...@gmail.com> wrote: