Yes, lstmeval is manual but easy to automate. I use a script like this:
./train.sh $NAME 100
./train.sh $NAME 300
./train.sh $NAME 400
./train.sh $NAME 500
./train.sh $NAME 750
./train.sh $NAME 1000
./train.sh $NAME 1200
...
It runs a series of short trainings, saves each model into a folder, and runs lstmeval on it. At the end I get a report like this:
ext1-g_100: Eval Char error rate=1.4585826, Word error rate=13.347458
ext1-g_300: Eval Char error rate=0.97829078, Word error rate=8.4745763
ext1-g_400: Eval Char error rate=0.75069704, Word error rate=7.6271186
ext1-g_500: Eval Char error rate=0.68842175, Word error rate=7.2033898
ext1-g_750: Eval Char error rate=0.63577665, Word error rate=6.779661
ext1-g_1000: Eval Char error rate=0.50223788, Word error rate=5.0847458
ext1-g_1200: Eval Char error rate=0.47848338, Word error rate=5.5084746
ext1-g_1400: Eval Char error rate=0.50223788, Word error rate=5.9322034
ext1-g_1600: Eval Char error rate=0.47848338, Word error rate=5.0847458
ext1-g_1800: Eval Char error rate=0.42583829, Word error rate=4.6610169
ext1-g_2000: Eval Char error rate=0.4264803, Word error rate=4.2372881
ext1-g_2250: Eval Char error rate=0.44124661, Word error rate=5.0847458
ext1-g_2500: Eval Char error rate=0.42134419, Word error rate=4.2372881
ext1-g_3000: Eval Char error rate=0.42583829, Word error rate=3.9548023
ext1-g_3500: Eval Char error rate=0.3545748, Word error rate=2.9661017
ext1-g_4000: Eval Char error rate=0.42070218, Word error rate=2.9661017
ext1-g_4500: Eval Char error rate=0.38218138, Word error rate=2.9661017
ext1-g_5000: Eval Char error rate=0.42070218, Word error rate=3.3898305
ext1-g_5500: Eval Char error rate=0.37768728, Word error rate=2.1186441
ext1-g_6000: Eval Char error rate=0.38731748, Word error rate=2.5423729
ext1-g_6500: Eval Char error rate=0.34879668, Word error rate=2.1186441
ext1-g_7000: Eval Char error rate=0.40529386, Word error rate=2.6836158
and I can choose which model to use. Here I would pick the 3500 or the 6500 model; I usually prefer an earlier one so as not to risk overfitting. I could also train a little longer (8000, 9000, ...) to see if it improves further, but it is already oscillating around a certain value.
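Picking the winner from such a report can itself be scripted. A minimal sketch, assuming the report lines have exactly the format shown above (the function name and the report file are my own choices, not part of any tool):

```shell
#!/bin/sh
# Print the name of the model with the lowest character error rate from a
# report whose lines look like:
#   ext1-g_3500: Eval Char error rate=0.3545748, Word error rate=2.9661017
best_model() {
  # Reduce each line to "name cer", sort numerically by cer, keep the best.
  sed 's/: Eval Char error rate=/ /; s/,.*//' "$1" |
    sort -k2 -g | head -n1 | cut -d' ' -f1
}
```

On the report above this would print ext1-g_6500, the checkpoint with the lowest character error rate.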
One note: the evaluation score is only a reference unless you have a lot of real-world data. If you are using synthetic data, it will likely differ from real-world data, so it is important not to overfit to it.
You can improve the script by adding a loop that stops when the improvement over the best result stays below a threshold for a few evaluations. I found no real advantage in doing this, as the training is quite fast and I have no problem letting it run while I do something else.
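That stopping check could look something like this sketch (the helper name, threshold, and patience values are my own illustrative choices, not part of Tesseract): it reads one character error rate per evaluation and signals a stop once the best value has not improved by more than the threshold for a few rounds in a row.

```shell
#!/bin/sh
# Read one CER per line on stdin; print "stop" once the best CER has not
# improved by more than thr for patience consecutive evaluations,
# otherwise print "continue". thr and patience are arbitrary examples.
should_stop() {
  awk -v thr=0.01 -v patience=3 '
    {
      if (best == "" || best - $1 > thr) { best = $1; since = 0 }
      else since++
    }
    END { print (since >= patience ? "stop" : "continue") }'
}
```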
Lorenzo