How to use fine tuning for training?

易鑫

Jan 28, 2019, 11:07:30 PM

to tesseract-ocr

Hello, everyone:

I want to recognize the characters in a table; a sample of the table is attached. It contains only 15 different characters: "0123456789-.LQX".

So I think fine tuning is a good approach for this recognition task.

Here are my steps:

1. src/training/tesstrain.sh --fonts_dir /usr/share/fonts --training_text ../training_data/part.txt \
   --langdata_dir ../langdata --tessdata_dir ./tessdata --lang eng --linedata_only --noextract_font_properties --output_dir ~/tesstutorial/engtest

part.txt is also attached.

2. mkdir -p ~/tesstutorial/engtuned_from_eng

3. lstmtraining --model_output ~/tesstutorial/engtuned_from_eng/engtuned --continue_from ~/tesstutorial/engtuned_from_eng/eng.lstm \
   --traineddata ../tessdata/eng.traineddata --train_listfile ~/tesstutorial/engtest/eng.training_files.txt --max_iterations 400

4. combine_tessdata -o ./tessdata/eng_new.traineddata \
   ~/tesstutorial/engtuned_from_eng/eng.lstm \
   ~/tesstutorial/engtest/eng.lstm-number-dawg \
   ~/tesstutorial/engtest/eng.lstm-punc-dawg \
   ~/tesstutorial/engtest/eng.lstm-word-dawg

But when I execute the 3rd step, I get an error:

Continuing from /home/yixin/tesstutorial/engtuned_from_eng/eng.lstm

Loaded 298/298 pages (1-298) of document /home/yixin/tesstutorial/engtest/eng.Arial_Bold.exp0.lstmf

Loaded 297/297 pages (1-297) of document /home/yixin/tesstutorial/engtest/eng.Century_Schoolbook_L_Medium.exp0.lstmf

Loaded 294/294 pages (1-294) of document /home/yixin/tesstutorial/engtest/eng.Arial.exp0.lstmf

Loaded 293/293 pages (1-293) of document /home/yixin/tesstutorial/engtest/eng.Courier_New_Bold.exp0.lstmf

Loaded 302/302 pages (1-302) of document /home/yixin/tesstutorial/engtest/eng.Century_Schoolbook_L_Bold_Italic.exp0.lstmf

Loaded 301/301 pages (1-301) of document /home/yixin/tesstutorial/engtest/eng.Arial_Italic.exp0.lstmf

Loaded 301/301 pages (1-301) of document /home/yixin/tesstutorial/engtest/eng.Century_Schoolbook_L_Bold.exp0.lstmf

Loaded 302/302 pages (1-302) of document /home/yixin/tesstutorial/engtest/eng.Century_Schoolbook_L_Italic.exp0.lstmf

Loaded 302/302 pages (1-302) of document /home/yixin/tesstutorial/engtest/eng.Arial_Bold_Italic.exp0.lstmf

Loaded 296/296 pages (1-296) of document /home/yixin/tesstutorial/engtest/eng.Courier_New_Bold_Italic.exp0.lstmf

!int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 249

Segmentation fault (core dumped)

This is the related code (line numbers as in weightmatrix.cpp):

248 void WeightMatrix::MatrixDotVector(const int8_t* u, double* v) const {
249   assert(int_mode_);
250   if (IntSimdMatrix::intSimdMatrix) {
251     IntSimdMatrix::intSimdMatrix->matrixDotVectorFunction(
252         wi_.dim1(), wi_.dim2(), &shaped_w_[0], &scales_[0], u, v);
253   } else {
254     IntSimdMatrix::MatrixDotVector(wi_, scales_, u, v);
255   }
256 }

I am new to LSTM training. Is my method suitable for recognizing only 15 different characters, or is there a better way to solve this problem? And how can I fix the assert error?

Thank you in advance.

Sorry for my poor English.

table_sample.png

part.txt

Shree Devi Kumar

Jan 28, 2019, 11:30:28 PM

to tesser...@googlegroups.com

combine_tessdata -o ./tessdata/eng_new.traineddata \

~/tesstutorial/engtuned_from_eng/eng.lstm \

You need to extract eng.lstm from the tessdata_best eng.traineddata.
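
A minimal sketch of that extraction step, assuming the tessdata_best English model has been downloaded to a hypothetical location (`~/tessdata_best/eng.traineddata`; adjust to your setup). `combine_tessdata -e` unpacks a single component from a traineddata file, chosen by the extension of the output path. The `!int_mode_` assertion is most likely a sign that the model being continued from is an integer model, which cannot be trained further; the float models in tessdata_best can.

```shell
#!/bin/sh
# Hypothetical paths: change BEST_TRAINEDDATA to wherever tessdata_best lives.
BEST_TRAINEDDATA="${BEST_TRAINEDDATA:-$HOME/tessdata_best/eng.traineddata}"
OUT_DIR="${OUT_DIR:-$HOME/tesstutorial/engtuned_from_eng}"
LSTM_OUT="$OUT_DIR/eng.lstm"

if command -v combine_tessdata >/dev/null 2>&1 && [ -f "$BEST_TRAINEDDATA" ]; then
  mkdir -p "$OUT_DIR"
  # -e extracts the component matching the output file's extension (.lstm here).
  combine_tessdata -e "$BEST_TRAINEDDATA" "$LSTM_OUT"
else
  echo "combine_tessdata or $BEST_TRAINEDDATA not found; skipping extraction" >&2
fi
```

The extracted `eng.lstm` is then the file to pass to `--continue_from` in step 3.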

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/d74d5f9a-31ae-4e64-b18b-59d687f02799%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

易鑫

Jan 28, 2019, 11:42:16 PM

to tesser...@googlegroups.com

Thank you. I will try.

By the way, is my method feasible? I read the wiki, but I do not quite understand "Fine Tuning for ± a few characters". It seems that "Fine Tuning for ± a few characters" could meet my needs.
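
For reference, a hedged sketch of the plain fine-tuning route rather than the "± a few characters" variant. The 15 characters "0123456789-.LQX" are all ordinary English-model characters, so continuing from the float model with an unchanged character set may be enough. The sketch assumes eng.lstm has already been extracted from tessdata_best as suggested in the reply, that `BEST_DIR` (a hypothetical path) holds the tessdata_best eng.traineddata, and that lstmtraining writes its checkpoint as `<model_output>_checkpoint`. The other paths are the ones from the first post.

```shell
#!/bin/sh
BEST_DIR="${BEST_DIR:-$HOME/tessdata_best}"            # hypothetical location
TUNE_DIR="$HOME/tesstutorial/engtuned_from_eng"
LIST_FILE="$HOME/tesstutorial/engtest/eng.training_files.txt"
CHECKPOINT="$TUNE_DIR/engtuned_checkpoint"              # assumed checkpoint name

if command -v lstmtraining >/dev/null 2>&1 \
   && [ -f "$TUNE_DIR/eng.lstm" ] && [ -f "$LIST_FILE" ]; then
  # Continue training the float model on the new line data.
  lstmtraining --model_output "$TUNE_DIR/engtuned" \
    --continue_from "$TUNE_DIR/eng.lstm" \
    --traineddata "$BEST_DIR/eng.traineddata" \
    --train_listfile "$LIST_FILE" \
    --max_iterations 400 &&
  # Convert the training checkpoint into a usable traineddata file.
  lstmtraining --stop_training \
    --continue_from "$CHECKPOINT" \
    --traineddata "$BEST_DIR/eng.traineddata" \
    --model_output "$TUNE_DIR/eng_new.traineddata"
else
  echo "lstmtraining or input files not found; skipping" >&2
fi
```

The second call replaces the manual `combine_tessdata -o` step: it packs the fine-tuned network back into a complete traineddata file.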

On Tue, Jan 29, 2019 at 12:30 PM, Shree Devi Kumar <shree...@gmail.com> wrote:
