lstmeval gives a perfect result but tesseract fails

Julien Jemine

May 31, 2018, 7:13:43 AM
to tesseract-ocr
Hi,

I've trained an LSTM model for a custom language from scratch as explained here.

The language only has about 100 words and 17 characters, so it's pretty simple.

When I run lstmeval on my model, I get a perfect match:
[icm@u16-offcao-07] train1$ lstmeval --model /home/icm/share/tessdata/iqi.traineddata --eval_listfile iqitrain2/iqi.training_files.txt --verbosity 2
Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Arial.exp0.lstmf
Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Calibri.exp0.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
OCR  :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
Truth:6CUEN 6 CU EN
OCR  :6CUEN 6 CU EN
Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Lucida_Sans_Typewriter_Semi-Condensed.exp0.lstmf
Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
OCR  :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
Truth:6CUEN 6 CU EN
OCR  :6CUEN 6 CU EN
Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Verdana.exp0.lstmf
Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
OCR  :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
Truth:6CUEN 6 CU EN
OCR  :6CUEN 6 CU EN
Truth:6CUEN 6 CU EN
OCR  :6CUEN 6 CU EN
Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
OCR  :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0

However, when I put my iqi.traineddata file in my tessdata folder and run tesseract on the same .tif files, I get errors:
[icm@u16-offcao-07] train1$ tesseract iqitrain2/iqi.training_img.txt stdout -l iqi
Page 0 : /home/icm/train1/iqitrain2/iqi.Arial.exp0.tif
6CFEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEN
6CUEN 1 CU EN
Page 1 : /home/icm/train1/iqitrain2/iqi.Calibri.exp0.tif

6CM 10FEEN 0 6 FEE 13CUEN 11 6 FE EEN 1116
6UEN 16 FE
Page 2 : /home/icm/train1/iqitrain2/iqi.Lucida_Sans_Typewriter_Semi-Condensed.exp0.tif

6TM 13CUEN 13 1 EN 11CUE 11 CU EN 12B 11 16
6 6 CU EN
Page 3 : /home/icm/train1/iqitrain2/iqi.Verdana.exp0.tif

ASTM 103UEEN 13 1CU EN 13CUEN 13 6 FE EEN 11 16
6CUEN 6 CU EN
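
To put a number on the gap between the two runs, here is a minimal Python sketch that scores one tesseract output line against the truth line from the lstmeval log. It uses plain Levenshtein edit distance, which is an assumption on my part and not necessarily how lstmeval computes its own figures, so the percentages are only comparable with each other. The function names are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # drop an item from ref
                            curr[j - 1] + 1,      # insert an item from hyp
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(truth, ocr):
    """Character edits as a percentage of the truth length."""
    return 100.0 * edit_distance(truth, ocr) / max(len(truth), 1)


def word_error_rate(truth, ocr):
    """Word edits as a percentage of the number of truth words."""
    truth_words = truth.split()
    return 100.0 * edit_distance(truth_words, ocr.split()) / max(len(truth_words), 1)


# Truth line from the lstmeval log, and the first line tesseract printed
# for the Verdana page.
truth = "ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16"
ocr = "ASTM 103UEEN 13 1CU EN 13CUEN 13 6 FE EEN 11 16"

print(f"CER={char_error_rate(truth, ocr):.2f}%  WER={word_error_rate(truth, ocr):.2f}%")
```

Running the same two functions over every page would show whether the errors are spread evenly or concentrated in particular fonts or lines.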


Now the really frustrating part: I have the opposite phenomenon with the "eng" language! (with eng.traineddata taken from tessdata_best)
lstmeval gives me a few errors (Eval Char error rate=2.4665552, Word error rate=16.666667)
tesseract gives me the right answer! (But the images are generated with tesstrain.sh and very common fonts, so that's probably to be expected.)

Am I doing something wrong?
What's going on here?

ShreeDevi Kumar

May 31, 2018, 11:25:47 AM
to tesser...@googlegroups.com
>I've trained a LSTM model for a custom language from scratch as explained here.

>The language only has about 100 words and 17 characters, so it's pretty simple.

For such a small model, try to build the legacy version rather than LSTM.

$tesstrain_dir/tesstrain.sh \
   --lang $Lang \
   --exposures "0" \
   --fonts_dir $fonts_dir \
   --fontlist $fonts_for_training \
   --langdata_dir $langdata_dir \
   --tessdata_dir  $tessdata_dir \
   --training_text $langdata_dir/$Lang/$Lang.training_text \
   --output_dir $train_output_dir



ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/67286720-c624-4239-a812-3c76d7603cf1%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Julien Jemine

Jun 1, 2018, 2:53:27 AM
to tesseract-ocr
Hi Shree,

Thanks for your answer. 
If you don't mind, could you explain why it'd be better?

ShreeDevi Kumar

Jun 1, 2018, 4:30:25 AM
to tesser...@googlegroups.com
From what I understand from the documentation provided by Ray Smith regarding LSTM training, the models have been trained on hundreds of thousands of lines and  hundreds of fonts. The network spec used for training from scratch will therefore be optimized for such large models.

You seem to have a different requirement, hence I suggested building the legacy tesseract model.

You can experiment and see if it is better.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Julien Jemine

Jun 1, 2018, 7:09:40 AM
to tesser...@googlegroups.com
>You can experiment and see if it is better.

I think I'll do just that, thanks for the idea.

Claudiu Saftoiu

Jul 12, 2019, 7:03:04 AM
to tesseract-ocr
Did you ever resolve the difference between the two commands? I am having the same issue: LSTM training gives 0 error, but running tesseract on the same data gives errors.

Abstract

Jul 12, 2019, 8:00:16 AM
to tesseract-ocr
I find this question very interesting too. I am also trying to get good results with digits-only or Cyrillic-capital-letters-only recognition, and what I see looks really strange. After training, lstmeval reports near-perfect results (it stops at a 0,0001% rate after several hours of work), but when I run standard recognition the result is far from what I expected (actual quality on handwritten text is ~60%).

I tested this with my own tool, which helps me draw boxes and assembles training images from scanned pages, and with another tool that tests the training results using the same box and template files.
 
On Thursday, May 31, 2018 at 14:13:43 UTC+3, Julien Jemine wrote: