lstmeval --model data/checkpoints/mod01_checkpoint --traineddata ./usr/share/tessdata/mod01.traineddata --eval_listfile data/list.eval
lstmeval --traineddata ./usr/share/tessdata/eng.traineddata --eval_listfile data/list.eval
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/5f762b56-f7b0-4438-a8cb-cbab94304341%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
training/lstmeval --model ~/tesstutorial/engoutput/base_checkpoint \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
training/lstmeval --model tessdata/best/eng.traineddata \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
See https://github.com/tesseract-ocr/tesseract/blob/master/doc/lstmeval.1.asc
When using a checkpoint, you also need to pass the starter traineddata file that was used for training. Alternatively, give a final traineddata file as the model. So if, after training, you have converted the checkpoint to a traineddata file, you can use that as the model. The same applies to the original traineddata.
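As a rough sketch of the "convert the checkpoint, then use it as the model" route described above: the function below first finalizes a checkpoint with `lstmtraining --stop_training` and then passes the resulting file to `lstmeval` as `--model`. The function names, the `DRY_RUN` switch and all paths are hypothetical placeholders for illustration, not part of Tesseract.

```shell
#!/bin/sh
# Sketch: turn a training checkpoint into a standalone traineddata file,
# then evaluate it. Function names and paths are illustrative assumptions.

# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

finalize_and_eval() {
    checkpoint=$1    # e.g. data/checkpoints/mod01_checkpoint
    starter=$2       # traineddata the training started from
    output=$3        # traineddata file to write
    listfile=$4      # evaluation list, e.g. data/list.eval

    # Convert the checkpoint into a final traineddata file.
    run lstmtraining --stop_training \
        --continue_from "$checkpoint" \
        --traineddata "$starter" \
        --model_output "$output" || return 1

    # A final traineddata file can be passed directly as --model.
    run lstmeval --model "$output" --eval_listfile "$listfile"
}
```

Setting `DRY_RUN=1` before calling `finalize_and_eval` only prints the two commands, which is a cheap way to check the paths before running the real tools.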
On Thu, 27 Jun 2019, 21:46 Arno Loo, <arno....@gmail.com> wrote:
Hello,
I just finished my first training of Tesseract 4.0 and ran lstmeval on the generated model, which I named mod01. I used this command line:
lstmeval --model data/checkpoints/mod01_checkpoint --traineddata ./usr/share/tessdata/mod01.traineddata --eval_listfile data/list.eval
It worked fine and gave me a character error rate and a word error rate. Now I would like to know whether my training improved Tesseract's accuracy on my specific documents. So I wanted to run the evaluation on the same dataset, but with the model I started the training from: the English model provided in Tesseract's GitHub repo, eng.traineddata. I tried:
lstmeval --traineddata ./usr/share/tessdata/eng.traineddata --eval_listfile data/list.eval
But it did not work because I did not provide any --model. This showed me that my understanding of Tesseract was not correct. Since downloading a new lang.traineddata is enough to use Tesseract with that language, I thought the whole model was contained in the traineddata file. What is this --model argument, then? My research on the web told me to put the last checkpoint of my training there, but without explaining why. Is it possible, then, to run lstmeval on a pretrained model like eng.traineddata?
Thank you!
At iteration 100/100/100, Mean rms=4.514%, delta=19.089%, char train=96.314%, word train=100%, skip ratio=0%, New best char error = 96.314 wrote checkpoint.
At iteration 14615/695400/698614, Mean rms=0.158%, delta=0.295%, char train=1.882%, word train=2.285%, skip ratio=0.4%, wrote checkpoint.
14615 : learning_iteration
695400 : training_iteration
698614 : sample_iteration
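To make the mapping above concrete, here is a minimal sketch that splits one progress line into named fields with `sed`. The function name `parse_progress_line` is hypothetical, the counter labels are the ones given above, and the pattern assumes the log format shown in this thread.

```shell
#!/bin/sh
# Sketch: extract the three iteration counters and the training error
# rates from one lstmtraining progress line (format as quoted above).
parse_progress_line() {
    printf '%s\n' "$1" | sed -n \
        -e 's|^At iteration \([0-9]*\)/\([0-9]*\)/\([0-9]*\),.*char train=\([0-9.]*\)%, word train=\([0-9.]*\)%.*|learning_iteration=\1 training_iteration=\2 sample_iteration=\3 char_train=\4 word_train=\5|p'
}

line='At iteration 14615/695400/698614, Mean rms=0.158%, delta=0.295%, char train=1.882%, word train=2.285%, skip ratio=0.4%, wrote checkpoint.'
parse_progress_line "$line"
# prints: learning_iteration=14615 training_iteration=695400 sample_iteration=698614 char_train=1.882 word_train=2.285
```

Lines that do not match the pattern produce no output, so the function can be applied to a whole training log one line at a time.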
At iteration 61235/475300/475526, Mean rms=0.521%, delta=2.073%, char train=9.379%, word train=9.669%, skip ratio=0.1%, New worst char error = 9.379 wrote checkpoint.
Your best source for documentation is the source code. See
On Fri, Jun 28, 2019 at 8:47 PM Arno Loo <arno....@gmail.com> wrote:
I am continuing to experiment and trying to understand what seems important, and I have a few questions after searching Tesseract's wiki. During training we can see this kind of information:
At iteration 100/100/100, Mean rms=4.514%, delta=19.089%, char train=96.314%, word train=100%, skip ratio=0%, New best char error = 96.314 wrote checkpoint.
- 100/100/100: What do these three numbers at the beginning mean when they differ? They often do, unlike in my example.
- Mean rms: I know this well, it is the root mean square error. But which error metric is used? Usually it is some kind of distance; the Levenshtein distance is often appropriate for OCR tasks, but the "%" wouldn't be there if it were.
- delta: I don't know.
- char train: must be the percentage of wrong character predictions during training.
- word train: must be the percentage of wrong word predictions during training.
- skip ratio: I think this is the percentage of samples skipped for any reason (invalid data or something).
Can anyone help me understand them, please?
Also, I do not see any evaluation error during training, which would be really helpful for avoiding overfitting. The only way I know to follow the evaluation error during training would be to run lstmeval on each checkpoint, but I think there must be a better way? Otherwise the --eval_listfile argument would be useless in lstmtraining, but I can't find out how it is used.
Thank you :)
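The per-checkpoint approach mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the checkpoints sit in one directory and end in `.checkpoint`, and the function name and the `LSTMEVAL` override are hypothetical conveniences. The `lstmeval` flags are the ones already used in this thread.

```shell
#!/bin/sh
# Sketch: run lstmeval on every checkpoint in a directory, using the same
# starter traineddata and evaluation list each time.
# Assumption: checkpoint files are named *.checkpoint.
eval_all_checkpoints() {
    ckpt_dir=$1      # directory holding the checkpoint files
    traineddata=$2   # starter traineddata used for training
    listfile=$3      # evaluation list, e.g. data/list.eval

    for ckpt in "$ckpt_dir"/*.checkpoint; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$ckpt" ] || continue
        echo "== $ckpt"
        # LSTMEVAL may be overridden, e.g. for a dry run.
        ${LSTMEVAL:-lstmeval} --model "$ckpt" \
            --traineddata "$traineddata" \
            --eval_listfile "$listfile" 2>&1
    done
}
```

A call such as `eval_all_checkpoints data/checkpoints path/to/starter.traineddata data/list.eval` would then print one evaluation report per checkpoint, which can be compared against the training error to spot overfitting.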
On Thursday, 27 June 2019 at 19:17:46 UTC+2, shree wrote:
See https://github.com/tesseract-ocr/tesseract/blob/master/doc/lstmeval.1.asc
When using a checkpoint, you also need to pass the starter traineddata file that was used for training. Alternatively, give a final traineddata file as the model. So if, after training, you have converted the checkpoint to a traineddata file, you can use that as the model. The same applies to the original traineddata.