Can I fine-tune this way?

yixinl...@gmail.com

Apr 18, 2019, 5:19:20 AM

to tesseract-ocr

Hello everyone,

I used Tesseract 4.0 to train a chi_sim model, but the results are not as good as I expected, so I came up with the following approach to fine-tuning.

1. src/training/tesstrain.sh --fonts_dir /usr/share/fonts --training_text ../training_data/chi_sim_layer_training_text \
     --langdata_dir ../langdata_lstm --tessdata_dir ./tessdata --lang chi_sim --linedata_only --noextract_font_properties --exposures "0" \
     --maxpages 0 \
     --workspace_dir ~/share/workspace/tmp \
     --save_box_tiff \
     --fontlist "NSimSun" \
       "Times New Roman" \
       "Arial Unicode MS" \
       "SimSun" \
       "Noto Sans CJK SC" \
       "Noto Sans Mono CJK SC" \
     --output_dir ~/tesstutorial/chi_sim_train \
     --overwrite

2. mkdir -p ~/tesstutorial/chi_sim_layer_from_chi_sim

3. combine_tessdata -e ../tessdata_best/chi_sim.traineddata ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim.lstm

4. lstmtraining --model_output ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer \
     --continue_from ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim.lstm \
     --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
     --old_traineddata ../tessdata_best/chi_sim.traineddata \
     --append_index 5 --net_spec '[Lfx192 O1c1]' \
     --train_listfile ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt \
     --max_iterations 40000

5. lstmtraining --stop_training --continue_from ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer_checkpoint \
     --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata --model_output ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer.traineddata
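
Before starting step 4, it may help to confirm that the list file written in step 1 points at files that actually exist. The following is only a sketch: it assumes the file passed to --train_listfile holds one .lstmf path per line, and the helper name check_listfile is made up for illustration.

```shell
#!/bin/sh
# Sketch: sanity-check a training list file before running lstmtraining.
# Assumption: the list file contains one .lstmf path per line.
# check_listfile is a hypothetical helper, not part of Tesseract.

# check_listfile FILE
# Prints every listed path that does not exist; returns 1 if any is missing.
check_listfile() {
    listfile=$1
    if [ ! -s "$listfile" ]; then
        echo "empty or missing list file: $listfile" >&2
        return 1
    fi
    bad=0
    while IFS= read -r path; do
        [ -n "$path" ] || continue        # skip blank lines
        if [ ! -f "$path" ]; then
            echo "missing: $path"
            bad=1
        fi
    done < "$listfile"
    return "$bad"
}
```

For example: check_listfile ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt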

The steps above are the normal way. I then want to continue fine-tuning from the chi_sim_layer.traineddata obtained in step 5.

For that I would use OCR-D's ocrd-train (https://github.com/OCR-D/ocrd-train).

6. Prepare ground-truth files (both the .tif images and the .txt transcriptions).

7. Modify the Makefile in ocrd-train to suit my needs.

8. make training
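
For step 6, a quick check that every image has a transcription can save a failed training run. This is a minimal sketch under one assumption that is not stated in this thread: that each line image NAME.tif is paired with NAME.gt.txt in the same directory, which is the naming ocrd-train is assumed to expect (check its README). The helper name check_ground_truth is hypothetical.

```shell
#!/bin/sh
# Sketch: list line images that lack a matching transcription.
# Assumption: NAME.tif is paired with NAME.gt.txt in the same directory.
# check_ground_truth is a hypothetical helper, not part of ocrd-train.

# check_ground_truth DIR
# Prints every expected .gt.txt that is missing or empty;
# returns 1 if any are found, 0 otherwise.
check_ground_truth() {
    dir=$1
    missing=0
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue         # glob matched nothing
        txt="${img%.tif}.gt.txt"
        if [ ! -s "$txt" ]; then
            echo "missing or empty: $txt"
            missing=1
        fi
    done
    return "$missing"
}
```

It could be chained with step 8, for example: check_ground_truth data/ground-truth && make training (the directory name is also an assumption).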

Can I do it this way? Please let me know whether it is feasible.

Thank you in advance. Sorry for my poor English.

易鑫

Apr 18, 2019, 10:00:06 PM

to tesseract-ocr

Is anybody here? Could someone help me? Thanks a lot.

On Thu, Apr 18, 2019 at 5:19 PM, <yixinl...@gmail.com> wrote:

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/3a2c1647-4fd3-4766-88e3-379ccf4221dd%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

suraa syss

Apr 19, 2019, 7:07:19 AM

to tesseract-ocr

You want to prepare the unicharset before LSTM training.

易鑫

Apr 21, 2019, 8:34:41 PM

to tesseract-ocr

No, I want to fine-tune using actual images.

On Fri, Apr 19, 2019 at 7:07 PM, suraa syss <sura...@gmail.com> wrote:


Shanshan Wang

Apr 22, 2019, 8:34:46 AM

to tesser...@googlegroups.com

Why not just use ocrd-train for the fine-tuning? Just set START_MODEL to chi_sim.


易鑫

Apr 23, 2019, 9:53:27 PM

to tesseract-ocr

>Why not just use ocrd for fine tune training? Just set up your START_MODEL as chi_sim.

Because I have already trained a chi_sim model with Tesseract-OCR, and I don't have many sample images.
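
One way the two suggestions might be combined is to point START_MODEL at the self-trained model instead of the stock chi_sim. The sketch below rests on assumptions not confirmed in this thread: that ocrd-train's Makefile reads START_MODEL, MODEL_NAME and TESSDATA, and that it looks for $TESSDATA/$START_MODEL.traineddata. The helper name stage_start_model and the _finetuned suffix are made up for illustration. It only copies the file and prints the make command rather than running it.

```shell
#!/bin/sh
# Sketch: make a self-trained model available as a start model.
# Assumptions: the ocrd-train Makefile honours START_MODEL, MODEL_NAME
# and TESSDATA, and reads $TESSDATA/$START_MODEL.traineddata.
# stage_start_model and the "_finetuned" suffix are hypothetical.

# stage_start_model SRC TESSDATA NAME
# Copies SRC to TESSDATA/NAME.traineddata and prints the make command to run.
stage_start_model() {
    src=$1
    tessdata=$2
    name=$3
    if [ ! -f "$src" ]; then
        echo "no such model: $src" >&2
        return 1
    fi
    mkdir -p "$tessdata" || return 1
    cp "$src" "$tessdata/$name.traineddata" || return 1
    echo "make training MODEL_NAME=${name}_finetuned START_MODEL=$name TESSDATA=$tessdata"
}
```

For example: stage_start_model ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer.traineddata ./tessdata chi_sim_layer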

On Mon, Apr 22, 2019 at 8:34 PM, Shanshan Wang <coo...@gmail.com> wrote:

