fine tune Tesseract


Ibr

Oct 24, 2017, 10:52:02 AM
to tesseract-ocr
Hi,
I have the latest version of Tesseract with Leptonica 1.74.4. I ran the command:
training/lstmtraining --model_output /home/ibr/latest_leptonica_4/lstmf_old_jpn/jpn \
  --continue_from /home/ibr/latest_leptonica_4/jpn_tune/extracted/jpn.lstm \
  --traineddata /home/ibr/latest_leptonica_4/lstmf_jpn_lep4/jpn/jpn.traineddata \
  --old_traineddata /home/ibr/latest_leptonica_4/jpn_tune/original_traineddata/jpn.traineddata \
  --train_listfile /home/ibr/latest_leptonica_4/jpn_tune/jpn.training_files.txt \
  --max_iterations 18000
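For reference, the jpn.lstm file passed to --continue_from above is normally extracted from the starter traineddata beforehand. A sketch, assuming the directory layout from the paths above (the target directory must already exist):

```shell
# Extract the LSTM component from the starter traineddata so it can be
# used as the --continue_from starting point for fine tuning.
training/combine_tessdata -e \
  /home/ibr/latest_leptonica_4/jpn_tune/original_traineddata/jpn.traineddata \
  /home/ibr/latest_leptonica_4/jpn_tune/extracted/jpn.lstm
```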

then the command:

training/lstmtraining --stop_training \
  --continue_from /path/to/fine_tune/results/lang_checkpoint \
  --traineddata /path/to/starter_traineddata/lang.traineddata \
  --model_output /path/to/new/tuned_lang.traineddata/lang.traineddata
to create the traineddata file. Yet I found that the accuracy of the official "best" traineddata is better than what I got. I saw on this OCR group, and in some comments on GitHub, that too many iterations don't give the best results, so I was wondering: what is the optimal number of iterations to get the best results?
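One way to find a good stopping point, rather than guessing --max_iterations up front, is to evaluate the intermediate checkpoints against a held-out set and keep the one with the lowest error. A sketch with lstmeval, assuming an eval list file (jpn.eval_files.txt is a hypothetical name; the other paths follow the command above):

```shell
# Report the character and word error rates of a training checkpoint
# against a held-out evaluation set (paths are illustrative).
training/lstmeval \
  --model /home/ibr/latest_leptonica_4/lstmf_old_jpn/jpn_checkpoint \
  --traineddata /home/ibr/latest_leptonica_4/lstmf_jpn_lep4/jpn/jpn.traineddata \
  --eval_listfile /home/ibr/latest_leptonica_4/jpn_tune/jpn.eval_files.txt
```

If the evaluation error stops improving (or worsens) at later checkpoints, stop there instead of training to the full iteration count.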

The first command above is from "Fine Tune for ± a few characters"; what are the commands for "Fine Tuning for Impact"? I tried the command:

training/lstmtraining --model_output ~/tesstutorial/impact_from_small/impact \
  --continue_from ~/tesstutorial/engoutput/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 1200
but what are the next commands?

Thanks 




Andrew J

Oct 24, 2017, 1:34:20 PM
to tesseract-ocr
You'll need something like this:

training/lstmtraining --stop_training \
  --continue_from ./trained/base_checkpoint \
  --traineddata ./trained/eng/eng.traineddata \
  --model_output ./trained/engoutput/eng.traineddata

To "finish" the training
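Once the finished eng.traineddata exists, it can be tried out directly by pointing tesseract at the output directory (the image name here is a placeholder):

```shell
# Run recognition with the newly built model; --tessdata-dir must point
# at the directory that contains eng.traineddata.
tesseract test_image.png output --tessdata-dir ./trained/engoutput -l eng
```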

Ibr

Oct 25, 2017, 10:30:58 AM
to tesseract-ocr
Correct, thanks.

wangdon...@gmail.com

Oct 26, 2017, 9:35:26 AM
to tesseract-ocr


On Tuesday, October 24, 2017 at 10:52:02 PM UTC+8, Ibr wrote:
I have the same problem: after each round of training, the model's recognition accuracy is no better than "best". My training_text comes from GitHub's langdata, with my own training text appended; the language is chi_sim.
Also, I don't know the difference between "chi_sim" and "chi_sim_vert".

Ibr

Oct 29, 2017, 5:37:57 AM
to tesseract-ocr
Hi,
What is the command that you have been using to create the training model?
I created a model based on fine tuning, not on the ± a few characters approach, but I haven't actually checked its accuracy yet. In theory, fine tuning should give better accuracy, because it builds on top of the "best" model without changing anything, but I haven't verified that in practice.

Anyway, as for the second thing, which is chi_sim_vert: I ran into that when I made a trained model for Japanese. Anything with "vert" is a sub-language of Chinese or Japanese, I think for writing in vertical mode. When tuning against LSTMF files you will notice errors regarding "vert". If you want to solve this problem, there is a file in tesseract/training called language-specific.sh; search for the vert entries and delete them, it won't affect the traineddata.
Refer to this link for more about the "vert" issue
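A rough sketch of that edit (inspect the file first; the exact lines mentioning the vert sub-languages may differ between Tesseract versions, and deleting them here is the poster's workaround, not an official fix):

```shell
# Locate the *_vert entries before touching anything.
grep -n vert training/language-specific.sh

# Remove the lines mentioning the vert sub-languages, keeping a backup
# of the original script as language-specific.sh.bak.
sed -i.bak '/_vert/d' training/language-specific.sh
```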