Hello everyone,
I used Tesseract 4.0 to train a chi_sim model, but the result is not as good as I expected, so I came up with the following approach for fine-tuning.
1. src/training/tesstrain.sh --fonts_dir /usr/share/fonts --training_text ../training_data/chi_sim_layer_training_text \
--langdata_dir ../langdata_lstm --tessdata_dir ./tessdata --lang chi_sim --linedata_only --noextract_font_properties --exposures "0" \
--maxpages 0 \
--workspace_dir ~/share/workspace/tmp \
--save_box_tiff \
--fontlist "NSimSun" \
"Times New Roman" \
"Arial Unicode MS" \
"SimSun" \
"Noto Sans CJK SC" \
"Noto Sans Mono CJK SC" \
--output_dir ~/tesstutorial/chi_sim_train \
--overwrite
2. mkdir -p ~/tesstutorial/chi_sim_layer_from_chi_sim
3. combine_tessdata -e ../tessdata_best/chi_sim.traineddata ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim.lstm
4. lstmtraining --model_output ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer \
--continue_from ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim.lstm \
--traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
--old_traineddata ../tessdata_best/chi_sim.traineddata \
--append_index 5 --net_spec '[Lfx192 O1c1]' \
--train_listfile ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt \
--max_iterations 40000
5. lstmtraining --stop_training --continue_from ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer_checkpoint \
--traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata --model_output ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer.traineddata
The steps above are the normal way. I then continue fine-tuning from the chi_sim_layer.traineddata obtained in step 5:
6. Prepare ground-truth files (both tif and txt files).
7. Modify the OCR-D Makefile to suit my needs.
8. make training
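For step 6, a small check like the one below could confirm that every line image has a transcription before "make training" is run. It is only a sketch: the function name missing_transcriptions is hypothetical, and it assumes the pairs are named NAME.tif and NAME.gt.txt, which may differ from the actual layout.

```shell
#!/usr/bin/env bash
# Print every .tif line image in a directory that has no matching
# transcription file.
# Assumption: pairs are named NAME.tif / NAME.gt.txt; adjust the suffix
# if the ground-truth files use a different naming scheme.
missing_transcriptions() {
    local dir="$1" tif
    for tif in "$dir"/*.tif; do
        # If the directory holds no .tif files, the glob stays literal; skip it.
        [ -e "$tif" ] || continue
        # Report the image when its transcription is absent.
        [ -f "${tif%.tif}.gt.txt" ] || printf '%s\n' "$tif"
    done
}
```

Calling missing_transcriptions with the ground-truth directory (for example data/ground-truth, an assumed path) prints nothing when every image is paired.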
Is this approach feasible? Could you please check it?
Thank you in advance. Sorry for my poor English.