Getting error while creating .lstm files


Harathi Surya

Jun 21, 2018, 11:50:14 PM
to tesseract-ocr
Hi,

I am trying to create .lstm files to fine-tune Tesseract 4.0.0 so that it recognizes new characters such as ±.
What I tried:
I added text containing the plus-minus symbol to eng.training_text in langdata.
Then I ran the following command:

src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only --noextract_font_properties --langdata_dir ../langdata   --tessdata_dir ./tessdata --output_dir ~/tesstutorial/trainplusminus

I am getting the following error:
ERROR: /tmp/tmp.3qWucNlYrH/eng/eng.Arial.exp0.box does not exist or is not readable

The error repeats for every font.

Could you suggest why this error occurs and how to fix it?

Thanks in advance
Harathi

Shree Devi Kumar

Jun 21, 2018, 11:53:26 PM
to tesser...@googlegroups.com
Look at src/training/language_specific.sh.

The default font list for English is taken from that file, and you probably don't have those fonts installed.

Use fonts that are available on your system.
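
To see which font names are actually installed, something like the following may help. It is only a sketch: it assumes fontconfig's fc-list is installed, and find_fonts is a hypothetical helper name, not part of Tesseract. The names it prints are the ones to try with --fontlist.

```shell
#!/bin/sh
# find_fonts PATTERN: read font family names on stdin and print the ones
# that contain PATTERN, ignoring case, without duplicates.
# (Hypothetical helper for illustration; not part of Tesseract.)
find_fonts() {
  grep -iF -- "$1" | sort -u
}

# Typical use, assuming fontconfig's fc-list is available:
#   fc-list : family | find_fonts dejavu
```

Running text2image --list_available_fonts --fonts_dir /usr/share/fonts should list the fonts as the training tools themselves see them.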

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ae8f7849-8d9a-4799-be3d-47dc67fcddc2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



Harathi Surya

Jun 22, 2018, 1:11:24 PM
to tesseract-ocr
Hi Shree,

Thank you for the suggestion.
I checked which fonts are available on my system, and 'dejavu' is one of them, so I tried that font first with the following command:

src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only   --noextract_font_properties --langdata_dir ../langdata   --tessdata_dir ./tessdata --output_dir ~/tesstutorial/trainplusminus --fontlist 'dejavu'

But it still throws an error:

'''
Could not find font named dejavu.
Pango suggested font FreeMono.
Please correct --font arg.
+ [[ 0 -gt 0 ]]
+ check_file_readable /tmp/tmp.ZyN6KA8ZSJ/eng/eng.dejavu.exp0.box /tmp/tmp.ZyN6KA8ZSJ/eng/eng.dejavu.exp0.tif
+ for file in '$@'
+ [[ ! -r /tmp/tmp.ZyN6KA8ZSJ/eng/eng.dejavu.exp0.box ]]
+ err_exit '/tmp/tmp.ZyN6KA8ZSJ/eng/eng.dejavu.exp0.box does not exist or is not readable'
+ echo -e 'ERROR: /tmp/tmp.ZyN6KA8ZSJ/eng/eng.dejavu.exp0.box' does not exist or is not readable
+ tee -a /tmp/tmp.ZyN6KA8ZSJ/eng/tesstrain.log
ERROR: /tmp/tmp.ZyN6KA8ZSJ/eng/eng.dejavu.exp0.box does not exist or is not readable
+ exit 1
'''

Could you please suggest how to fix this?

Thanks,
Harathi

Harathi Surya

Jun 22, 2018, 2:43:24 PM
to tesseract-ocr
Hi Shree,

Thanks for your help. I resolved the error by changing the --fontlist argument to 'DejaVu Sans'. The model is now training with the new character.

Thanks,
Harathi


Harathi Surya

Jun 22, 2018, 3:06:35 PM
to tesseract-ocr
Hi,

I am facing one more problem.
I have successfully fine-tuned Tesseract for the new character.

But when I tested it with the following command:

'tesseract test.png out -l eng'

the output was not satisfactory. I trained Tesseract for the '±' character, but the output is the same before and after fine-tuning. I trained the model for 3600 iterations, and the final loss is 0.0013.

Please find attached the input image and the output text I am getting.
Can anyone please help me with this?

Thanks,
Harathi

Attachments: out.txt, 2018-06-21 (5).png

Shree Devi Kumar

Jun 22, 2018, 3:14:35 PM
to tesser...@googlegroups.com
Did you run the eval as given in the tutorial?

Did you stop training and create a new traineddata file?

Are you using the new traineddata file for testing?
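
One way to rule out the old model being picked up is to pass the model directory explicitly instead of relying on TESSDATA_PREFIX. A minimal sketch, assuming the new eng.traineddata was written to a directory of your choice; run_with_model and the example paths are hypothetical, not part of Tesseract.

```shell
#!/bin/sh
# run_with_model DIR LANG IMAGE OUTBASE
# Run tesseract with the LANG.traineddata found in DIR, refusing to start
# if that file is missing so an old model is never used silently.
# (Hypothetical helper; DIR is whatever directory holds the new model.)
run_with_model() {
  dir=$1
  lang=$2
  if [ ! -r "$dir/$lang.traineddata" ]; then
    echo "ERROR: $dir/$lang.traineddata not found or not readable" >&2
    return 1
  fi
  tesseract "$3" "$4" --tessdata-dir "$dir" -l "$lang"
}

# Example (paths are assumptions):
#   run_with_model "$HOME/tesstutorial/trainplusminus" eng test.png out
```

If the result is identical to what the stock model gives, the next thing to check is whether the traineddata in that directory really is the one produced after training stopped.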


Harathi Surya

Jun 22, 2018, 3:25:33 PM
to tesseract-ocr
Hi Shree,

Thanks for your reply.
I replaced the --fontlist argument 'Impact Condensed' with 'DejaVu Sans' to create the evalplusminus folder.

Then I ran the lstmeval command and got this output:
At iteration 0, stage 0, Eval Char error rate=0.024610566, Word error rate=0.086171938

Do I need to create a new traineddata file?

I have traineddata files in '/local/share/tessdata' for the old data and in 'tesstutorial/trainplusminus/eng', which was created for the new data.

Do I need to set TESSDATA_PREFIX='tesstutorial/trainplusminus/eng' instead of '/local/share/tessdata'?

Please guide me

Thanks in advance,
Harathi




Shree Devi Kumar

Jun 22, 2018, 3:29:20 PM
to tesser...@googlegroups.com


Harathi Surya

Jun 22, 2018, 3:41:25 PM
to tesseract-ocr
Thanks Shree,

I followed the instructions and ran the following command:

src/training/lstmtraining --stop_training   --continue_from ~/tesstutorial/trainplusminus/plusminus_checkpoint   --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata   --model_output ~/tesstutorial/trainplusminus/eng.traineddata

Then I changed TESSDATA_PREFIX to '/tesstutorial/trainplusminus' and tested the model with the image I attached in the previous email. The output changed a little, but it is still not what I expected: the '±' symbol is recognized as '+'. Please find the attached output file.
Would training for more epochs improve this?

Thanks,
Harathi

Attachment: out (1).txt

Harathi Surya

Jun 22, 2018, 3:43:26 PM
to tesseract-ocr
Sorry, I uploaded the wrong file by mistake. Please find attached the output I got.

Thanks,
Harathi
Attachment: out.txt

Shree Devi Kumar

Jun 22, 2018, 3:50:06 PM
to tesser...@googlegroups.com
The tutorial was written by Ray Smith. I haven't tested the plus-minus example as given.

Check whether the fonts you are using have the plus-minus sign.

Using a single font is for the Impact part of the tutorial, with 400 iterations.

For plus-minus you need to use the larger list of fonts.
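
To check which installed fonts contain the plus-minus sign (U+00B1), fontconfig can filter by character coverage. This is a sketch that assumes fc-list is installed; charset_query and fonts_covering are hypothetical helper names, not part of Tesseract.

```shell
#!/bin/sh
# charset_query HEX: build the fontconfig pattern matching fonts that
# cover the Unicode code point U+HEX.
charset_query() {
  printf ':charset=%s' "$1"
}

# fonts_covering HEX: list the families of installed fonts covering U+HEX.
# Requires fontconfig's fc-list.
fonts_covering() {
  fc-list "$(charset_query "$1")" family | sort -u
}

# Plus-minus is U+00B1:
#   fonts_covering 00b1
```

A font in the training list that does not show up in this output probably cannot render ±, so it is unlikely to contribute useful samples of that character.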


Harathi Surya

Jun 22, 2018, 4:02:10 PM
to tesseract-ocr
OK Shree,
I think the only font that my system and the Tesseract default font list have in common is 'DejaVu Sans'. That is why I gave only one font. I will install more fonts on my system and try again.

Thanks for your help,
Harathi
