How to use fine tuning for training?


易鑫

Jan 28, 2019, 11:07:30 PM
to tesseract-ocr
Hello everyone,

I want to recognize the characters in a table; a sample of the table is in the attached file. The table contains only 15 different characters: "0123456789-.LQX".

So I think fine tuning is a good approach for this recognition task.

Here are my steps:

1.  src/training/tesstrain.sh --fonts_dir /usr/share/fonts --training_text ../training_data/part.txt \
--langdata_dir ../langdata --tessdata_dir ./tessdata --lang eng --linedata_only --noextract_font_properties --output_dir ~/tesstutorial/engtest


part.txt is also attached.

2.  mkdir -p ~/tesstutorial/engtuned_from_eng
3. lstmtraining --model_output ~/tesstutorial/engtuned_from_eng/engtuned --continue_from ~/tesstutorial/engtuned_from_eng/eng.lstm \
--traineddata ../tessdata/eng.traineddata --train_listfile ~/tesstutorial/engtest/eng.training_files.txt --max_iterations 400

4. combine_tessdata -o ./tessdata/eng_new.traineddata \
~/tesstutorial/engtuned_from_eng/eng.lstm \
~/tesstutorial/engtest/eng.lstm-number-dawg \
~/tesstutorial/engtest/eng.lstm-punc-dawg \
~/tesstutorial/engtest/eng.lstm-word-dawg


But when I execute the 3rd step, there is an error:
Continuing from /home/yixin/tesstutorial/engtuned_from_eng/eng.lstm
Loaded 298/298 pages (1-298) of document /home/yixin/tesstutorial/engtest/eng.Arial_Bold.exp0.lstmf
Loaded 297/297 pages (1-297) of document /home/yixin/tesstutorial/engtest/eng.Century_Schoolbook_L_Medium.exp0.lstmf
Loaded 294/294 pages (1-294) of document /home/yixin/tesstutorial/engtest/eng.Arial.exp0.lstmf
Loaded 293/293 pages (1-293) of document /home/yixin/tesstutorial/engtest/eng.Courier_New_Bold.exp0.lstmf
Loaded 302/302 pages (1-302) of document /home/yixin/tesstutorial/engtest/eng.Century_Schoolbook_L_Bold_Italic.exp0.lstmf
Loaded 301/301 pages (1-301) of document /home/yixin/tesstutorial/engtest/eng.Arial_Italic.exp0.lstmf
Loaded 301/301 pages (1-301) of document /home/yixin/tesstutorial/engtest/eng.Century_Schoolbook_L_Bold.exp0.lstmf
Loaded 302/302 pages (1-302) of document /home/yixin/tesstutorial/engtest/eng.Century_Schoolbook_L_Italic.exp0.lstmf
Loaded 302/302 pages (1-302) of document /home/yixin/tesstutorial/engtest/eng.Arial_Bold_Italic.exp0.lstmf
Loaded 296/296 pages (1-296) of document /home/yixin/tesstutorial/engtest/eng.Courier_New_Bold_Italic.exp0.lstmf
!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 249
Segmentation fault (core dumped)

This is the related code.
248 void WeightMatrix::MatrixDotVector(const int8_t* u, double* v) const {
249   assert(int_mode_);
250   if (IntSimdMatrix::intSimdMatrix) {
251     IntSimdMatrix::intSimdMatrix->matrixDotVectorFunction(
252       wi_.dim1(), wi_.dim2(), &shaped_w_[0], &scales_[0], u, v);
253   } else {
254     IntSimdMatrix::MatrixDotVector(wi_, scales_, u, v);
255   }
256 }


I am new to LSTM training. Is my method suitable for recognizing only 15 different characters, or is there a better way to approach this problem? And how can I fix the assert error?

Thank you in advance.

Sorry for my poor English.




table_sample.png
part.txt

Shree Devi Kumar

Jan 28, 2019, 11:30:28 PM
to tesser...@googlegroups.com
combine_tessdata -o ./tessdata/eng_new.traineddata \ 
~/tesstutorial/engtuned_from_eng/eng.lstm \ 

You need to extract eng.lstm from the tessdata_best eng.traineddata.
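
For readers following along, here is a minimal sketch of what that extraction and the retried step 3 might look like. The `!int_mode_` assertion is consistent with continuing from an integer model, which cannot be trained further, whereas the tessdata_best models are float. The paths in `BEST` and `WORK` and the `run` helper are assumptions for illustration, not part of the original posts; the script only prints the commands unless `RUN=1` is set. The last command follows the `--stop_training` form from the training wiki rather than step 4 above.

```shell
#!/bin/sh
# Assumed locations (hypothetical; adjust to your own setup).
BEST="${BEST:-$HOME/tessdata_best/eng.traineddata}"  # float model from tessdata_best
WORK="${WORK:-$HOME/tesstutorial}"

# Hypothetical helper: print each command, execute it only when RUN=1.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# 1. Extract the LSTM component from the tessdata_best model.
run mkdir -p "$WORK/engtuned_from_eng"
run combine_tessdata -e "$BEST" "$WORK/engtuned_from_eng/eng.lstm"

# 2. Retry step 3, continuing from the extracted float model.
run lstmtraining \
  --model_output "$WORK/engtuned_from_eng/engtuned" \
  --continue_from "$WORK/engtuned_from_eng/eng.lstm" \
  --traineddata "$BEST" \
  --train_listfile "$WORK/engtest/eng.training_files.txt" \
  --max_iterations 400

# 3. Convert the training checkpoint into a usable traineddata file.
run lstmtraining --stop_training \
  --continue_from "$WORK/engtuned_from_eng/engtuned_checkpoint" \
  --traineddata "$BEST" \
  --model_output "$WORK/engtuned_from_eng/eng_new.traineddata"
```

Run it once as a dry run to check the paths, then again with `RUN=1` to execute the commands.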


易鑫

Jan 28, 2019, 11:42:16 PM
to tesser...@googlegroups.com
Thank you. I will try.
By the way, is my method feasible? I read the wiki, but I do not quite understand "Fine Tuning for ± a few characters". It seems that "Fine Tuning for ± a few characters" could meet my needs.



Shree Devi Kumar <shree...@gmail.com> wrote on Tue, Jan 29, 2019 at 12:30 PM: