Understanding lstmeval and using it on pretrained models for comparison


Arno Loo

Jun 27, 2019, 4:16:18 PM
to tesseract-ocr
Hello,

I just finished my first training with Tesseract 4.0 and I ran lstmeval on the generated model, which I named mod01.
I used this command line:
lstmeval --model data/checkpoints/mod01_checkpoint --traineddata ./usr/share/tessdata/mod01.traineddata --eval_listfile data/list.eval

It worked fine and it gave me a character error rate and a word error rate. Now I would like to know if my training improved Tesseract's accuracy on my specific documents. So I wanted to run the evaluation on the same dataset, but with the model I started the training from: the English one provided in Tesseract's GitHub repo, eng.traineddata. I tried:
lstmeval --traineddata ./usr/share/tessdata/eng.traineddata --eval_listfile data/list.eval
But it did not work, because I did not provide any --model.

This showed me that my understanding of Tesseract was not correct.
Since downloading a new lang.traineddata is enough to use Tesseract with that language, I thought the whole model was contained in the traineddata file. What is this --model argument, then?
My research on the web told me to put the last checkpoint of my training there, but without explaining why.

Is it possible, then, to run lstmeval on a pretrained model like eng.traineddata?

Thank you!

Shree Devi Kumar

Jun 27, 2019, 5:17:46 PM
to tesser...@googlegroups.com
See https://github.com/tesseract-ocr/tesseract/blob/master/doc/lstmeval.1.asc

When using a checkpoint, you also need to give the starter traineddata file that was used for training.

Or give the final traineddata file as the model.

So, if after training you have converted the checkpoint to a traineddata file, you can use that as the model. Similarly for the original traineddata.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/5f762b56-f7b0-4438-a8cb-cbab94304341%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shree Devi Kumar

Jun 27, 2019, 5:21:44 PM
to tesser...@googlegroups.com
training/lstmeval --model ~/tesstutorial/engoutput/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt

training/lstmeval --model tessdata/best/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt

Arno Loo

Jun 28, 2019, 7:51:49 AM
to tesseract-ocr
OK!
Thanks, Shree



Arno Loo

Jun 28, 2019, 3:17:30 PM
to tesseract-ocr
I am continuing to experiment and trying to understand what matters, and I have a few questions after some research in Tesseract's wiki.

During training, we can see this kind of information:
At iteration 100/100/100, Mean rms=4.514%, delta=19.089%, char train=96.314%, word train=100%, skip ratio=0%,  New best char error = 96.314 wrote checkpoint.

- 100/100/100: what do these 3 numbers at the beginning mean when they are different? (Which they often are, unlike in my example.)
- Mean rms I know well, it is the Root Mean Square error. But what error metric is used? Usually it is some kind of distance; the Levenshtein distance is often appropriate for OCR tasks, but the "%" wouldn't be there if it were.
- delta I don't know.
- char train must be the percentage of wrong character predictions during training.
- word train must be the percentage of wrong word predictions during training.
- skip ratio is, I think, the percentage of samples skipped for any reason (invalid data or something).

Can anyone help me understand them, please?

Also, I do not see any evaluation error during training, which would be really helpful to avoid overfitting. The only way I know to follow the evaluation error during training would be to run lstmeval on each checkpoint, but I think there must be a better way? Otherwise the --eval_listfile argument of lstmtraining would be useless, but I can't find out how it is used.
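For what it's worth, the checkpoint-by-checkpoint workaround I had in mind would look something like this (just a sketch using my own example paths; it only builds the lstmeval command lines, and the subprocess call is left commented out):

```python
import glob
import subprocess  # only needed if you uncomment the run() call below

# Sketch: build an lstmeval command for every checkpoint in a directory.
# The default paths mirror the ones from my earlier command and are examples.
def eval_commands(ckpt_dir="data/checkpoints",
                  traineddata="./usr/share/tessdata/mod01.traineddata",
                  eval_list="data/list.eval"):
    for ckpt in sorted(glob.glob(f"{ckpt_dir}/*checkpoint")):
        yield ["lstmeval", "--model", ckpt,
               "--traineddata", traineddata,
               "--eval_listfile", eval_list]

for cmd in eval_commands():
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually evaluate
```

This just loops over whatever checkpoints exist; it does not answer how lstmtraining itself uses --eval_listfile.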

Thank you :)


Shree Devi Kumar

Jun 28, 2019, 3:46:11 PM
to tesser...@googlegroups.com
If you train long enough, you will see eval-related messages, e.g.:

Line 668: At iteration 9695/15400/15404, Mean rms=0.645%, delta=2.451%, char train=8.273%, word train=18.648%, skip ratio=0%,  New worst char error = 8.273At iteration 6102, stage 0, Eval Char error rate=12.403291, Word error rate=24.582963 wrote checkpoint.
Line 821: At iteration 12562/22400/22405, Mean rms=0.579%, delta=2.053%, char train=7.1%, word train=16.53%, skip ratio=0%,  New worst char error = 7.1At iteration 8689, stage 1, Eval Char error rate=12.493766, Word error rate=23.64956 wrote checkpoint.
Line 1009: At iteration 15946/31100/31106, Mean rms=0.525%, delta=1.768%, char train=5.69%, word train=15.172%, skip ratio=0%,  New worst char error = 5.69At iteration 11557, stage 1, Eval Char error rate=7.7101831, Word error rate=19.192454 wrote checkpoint.
Line 1183: At iteration 18897/39200/39207, Mean rms=0.502%, delta=1.551%, char train=5.08%, word train=14.304%, skip ratio=0.1%,  New worst char error = 5.08At iteration 14912, stage 1, Eval Char error rate=6.8221366, Word error rate=18.226883 wrote checkpoint.
Line 1413: At iteration 22667/50200/50210, Mean rms=0.433%, delta=1.197%, char train=3.977%, word train=11.758%, skip ratio=0%,  New best char error = 3.977At iteration 17869, stage 1, Eval Char error rate=5.7822909, Word error rate=16.036021 wrote best model:/home/ubuntu/tesstutorial/IASTENG_LAYER/IASTENG_LAYER3.977_22667.checkpoint wrote checkpoint.
Line 1606: At iteration 25738/59300/59312, Mean rms=0.466%, delta=1.399%, char train=4.48%, word train=12.999%, skip ratio=0.1%,  New worst char error = 4.48At iteration 19199, stage 1, Eval Char error rate=5.8820906, Word error rate=16.435243 wrote checkpoint.
Line 1791: At iteration 28593/68200/68212, Mean rms=0.412%, delta=1.016%, char train=3.424%, word train=10.999%, skip ratio=0%,  New worst char error = 3.424At iteration 24127, stage 1, Eval Char error rate=4.4509122, Word error rate=13.741829 wrote checkpoint.
Line 1924: At iteration 30533/74500/74513, Mean rms=0.399%, delta=1.078%, char train=3.749%, word train=10.475%, skip ratio=0%,  New worst char error = 3.749At iteration 27583, stage 1, Eval Char error rate=4.3155356, Word error rate=13.993133 wrote checkpoint.
Line 2112: At iteration 33286/83400/83416, Mean rms=0.381%, delta=0.947%, char train=3.051%, word train=10.002%, skip ratio=0%,  New best char error = 3.051At iteration 29521, stage 1, Eval Char error rate=4.3376752, Word error rate=13.312631 wrote checkpoint.
Line 2308: At iteration 36028/92600/92619, Mean rms=0.408%, delta=1.106%, char train=3.788%, word train=11.206%, skip ratio=0%,  New worst char error = 3.788At iteration 31215, stage 1, Eval Char error rate=3.9168943, Word error rate=12.539135 wrote checkpoint.
Line 2425: At iteration 37731/98200/98220, Mean rms=0.411%, delta=1.101%, char train=3.824%, word train=11.042%, skip ratio=0%,  New worst char error = 3.824At iteration 34699, stage 1, Eval Char error rate=3.7448292, Word error rate=12.167938 wrote checkpoint.
Line 2555: At iteration 39621/104500/104520, Mean rms=0.368%, delta=0.848%, char train=2.771%, word train=9.92%, skip ratio=0%,  New best char error = 2.771At iteration 36028, stage 1, Eval Char error rate=3.8032456, Word error rate=12.157691 wrote best model:/home/ubuntu/tesstutorial/IASTENG_LAYER/IASTENG_LAYER2.771_39621.checkpoint wrote checkpoint.
Line 2693: At iteration 41440/110800/110823, Mean rms=0.358%, delta=0.865%, char train=2.847%, word train=9.352%, skip ratio=0.1%,  New worst char error = 2.847At iteration 37814, stage 1, Eval Char error rate=3.8059549, Word error rate=12.294499 wrote checkpoint.


Arno Loo

Jul 19, 2019, 2:15:28 PM
to tesseract-ocr
I went and tried to understand the source code as well as I could, and although I did not find all the answers, I did find some (for tesseract 4.0.0-beta.3).
At iteration 14615/695400/698614, Mean rms=0.158%, delta=0.295%, char train=1.882%, word train=2.285%, skip ratio=0.4%,  wrote checkpoint.

In the above example,
14615 : learning_iteration
695400 : training_iteration
698614 : sample_iteration

sample_iteration: "Index into training sample set. (sample_iteration >= training_iteration)." It is how many times a training file has been passed into the learning process.
training_iteration: "Number of actual backward training steps used." It is how many times a training file has been SUCCESSFULLY passed into the learning process.

So every time you get an error ("Image too large to learn!!", "Encoding of string failed!", "Deserialize header failed"), the sample_iteration increments but not the training_iteration.
Actually you have 1 - (695400 / 698614) ≈ 0.46%, which is the skip ratio shown in the log (0.4%): the proportion of files that have been skipped because of an error.
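That arithmetic can be checked directly (a small sketch using the counters from the log line above; exactly how Tesseract rounds the reported percentage is my assumption):

```python
# Counters from the log line:
# "At iteration 14615/695400/698614, ... skip ratio=0.4%"
training_iteration = 695400   # successful backward training steps
sample_iteration = 698614     # samples fed in, including skipped ones

skipped = sample_iteration - training_iteration
skip_ratio = skipped / sample_iteration
print(f"{skip_ratio:.2%}")    # about 0.46%, close to the 0.4% reported in the log
```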

learning_iteration: "Number of iterations that yielded a non-zero delta error and thus provided significant learning. (learning_iteration <= training_iteration). learning_iteration_ is used to measure rate of learning progress."
So it uses the delta value to assess whether the iteration has been useful.

What is good to know is that when you specify a maximum number of iterations for the training process, it uses the middle number (training_iteration) to know when to stop. But when it writes a checkpoint, the checkpoint name uses the smallest number (learning_iteration), along with the char train rate. So a checkpoint name is the concatenation of model_name, char_train and learning_iteration.
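For example, a checkpoint name from Shree's log above can be split back into those three parts (a sketch; the regex is my own guess at the naming pattern, not code from Tesseract):

```python
import re

# Decompose a checkpoint name into model_name, char_train and learning_iteration.
# The example name is taken from a log line earlier in this thread.
name = "IASTENG_LAYER2.771_39621.checkpoint"

m = re.fullmatch(
    r"(?P<model>.+?)(?P<char_train>\d+\.\d+)_(?P<learning_iter>\d+)\.checkpoint",
    name)
print(m.group("model"))          # IASTENG_LAYER
print(m.group("char_train"))     # 2.771
print(m.group("learning_iter"))  # 39621
```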

------

But there are still a lot of things I do not understand, and one of them is actually causing me an issue: even with a lot of iterations (475k), I still do not see any log message with the error on the evaluation set.
At iteration 61235/475300/475526, Mean rms=0.521%, delta=2.073%, char train=9.379%, word train=9.669%, skip ratio=0.1%,  New worst char error = 9.379 wrote checkpoint.




Shree Devi Kumar

Jul 19, 2019, 3:43:56 PM
to tesser...@googlegroups.com
Very well written. You may want to update the wiki pages with the info too.


ElGato ElMago

Jul 22, 2019, 1:43:42 AM
to tesseract-ocr
Yes. This is a very good write-up and helpful to trainers.


Shree Devi Kumar

Jul 22, 2019, 5:12:41 AM
to tesseract-ocr
>But there are still a lot of things I do not understand. And one of them is actually causing me an issue : even with a lot of iterations (475k) I still do not see any log message with the error on the evaluation set.
At iteration 61235/475300/475526, Mean rms=0.521%, delta=2.073%, char train=9.379%, word train=9.669%, skip ratio=0.1%,  New worst char error = 9.379 wrote checkpoint.

Please search your log file for `Eval Char error rate`. You will see messages similar to the following:

Line 581: At iteration 11780/22300/22300, Mean rms=0.444%, delta=0.337%, char train=1.106%, word train=4.349%, skip ratio=0%,  New worst char error = 1.106At iteration 1000, stage 0, Eval Char error rate=96.849627, Word error rate=99.99418 wrote checkpoint.
Line 695: At iteration 13602/27900/27900, Mean rms=0.402%, delta=0.256%, char train=0.822%, word train=3.451%, skip ratio=0%,  New best char error = 0.822At iteration 10735, stage 1, Eval Char error rate=98.134002, Word error rate=99.994139 wrote best model:./manipuri_layer_Bengali/layer0.822_13602.checkpoint wrote checkpoint.
Line 780: At iteration 14802/31900/31900, Mean rms=0.386%, delta=0.249%, char train=0.807%, word train=3.359%, skip ratio=0%,  New worst char error = 0.807At iteration 12029, stage 1, Eval Char error rate=1.305587, Word error rate=4.4057044 wrote checkpoint.
Line 975: At iteration 17515/41300/41300, Mean rms=0.38%, delta=0.237%, char train=0.769%, word train=3.152%, skip ratio=0%,  New worst char error = 0.769At iteration 13789, stage 1, Eval Char error rate=1.2493521, Word error rate=4.2435136 wrote checkpoint.
Line 1353: At iteration 22136/59200/59201, Mean rms=0.346%, delta=0.201%, char train=0.647%, word train=2.701%, skip ratio=0%,  New worst char error = 0.647At iteration 16507, stage 1, Eval Char error rate=1.2372596, Word error rate=4.2810121 wrote checkpoint.
Line 1480: At iteration 23655/65400/65401, Mean rms=0.338%, delta=0.2%, char train=0.623%, word train=2.495%, skip ratio=0%,  New worst char error = 0.623At iteration 21133, stage 1, Eval Char error rate=1.0239981, Word error rate=3.4448664 wrote checkpoint.
Line 1579: At iteration 24812/70200/70201, Mean rms=0.337%, delta=0.196%, char train=0.641%, word train=2.505%, skip ratio=0%,  New worst char error = 0.641At iteration 22640, stage 1, Eval Char error rate=0.92986846, Word error rate=3.2734983 wrote checkpoint.
Line 1742: At iteration 26593/78100/78101, Mean rms=0.326%, delta=0.178%, char train=0.565%, word train=2.307%, skip ratio=0%,  New worst char error = 0.565At iteration 23791, stage 1, Eval Char error rate=0.96613336, Word error rate=3.3449609 wrote checkpoint.
Line 1967: At iteration 29056/89300/89301, Mean rms=0.332%, delta=0.207%, char train=0.705%, word train=2.625%, skip ratio=0%,  New worst char error = 0.705At iteration 25585, stage 1, Eval Char error rate=0.9651644, Word error rate=3.1872666 wrote checkpoint.
Line 2153: At iteration 30993/98500/98501, Mean rms=0.301%, delta=0.158%, char train=0.475%, word train=2.025%, skip ratio=0%,  New best char error = 0.475At iteration 27017, stage 1, Eval Char error rate=0.86442927, Word error rate=3.0246162 wrote best model:./manipuri_layer_Bengali/layer0.475_30993.checkpoint wrote checkpoint.
Line 2246: At iteration 32014/103100/103101, Mean rms=0.321%, delta=0.2%, char train=0.589%, word train=2.385%, skip ratio=0%,  New worst char error = 0.589At iteration 29079, stage 1, Eval Char error rate=0.91884141, Word error rate=3.2192128 wrote checkpoint.
Line 2377: At iteration 33395/109600/109601, Mean rms=0.323%, delta=0.193%, char train=0.617%, word train=2.572%, skip ratio=0%,  New worst char error = 0.617At iteration 30993, stage 1, Eval Char error rate=0.81090954, Word error rate=2.861102 wrote checkpoint.
Line 2517: At iteration 34767/116500/116501, Mean rms=0.302%, delta=0.197%, char train=0.641%, word train=2.263%, skip ratio=0%,  New worst char error = 0.641At iteration 32139, stage 1, Eval Char error rate=0.85321202, Word error rate=3.0337451 
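If you want to extract those numbers programmatically instead of reading the log by eye, something like this sketch works on lines of that shape (the regex is my example, not part of Tesseract):

```python
import re

# Pull the evaluation error rates out of one training log line.
# The sample line is copied (shortened) from the excerpt above.
line = ("At iteration 14802/31900/31900, Mean rms=0.386%, delta=0.249%, "
        "char train=0.807%, word train=3.359%, skip ratio=0%,  "
        "New worst char error = 0.807At iteration 12029, stage 1, "
        "Eval Char error rate=1.305587, Word error rate=4.4057044 wrote checkpoint.")

m = re.search(r"Eval Char error rate=([\d.]+), Word error rate=([\d.]+)", line)
if m:
    char_err, word_err = map(float, m.groups())
    print(char_err, word_err)  # 1.305587 4.4057044
```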


Arno Loo

Jul 22, 2019, 7:58:08 AM
to tesseract-ocr
Oh right, there are some. Thanks, Shree!