Train more fonts on the trained fas model in Tesseract

reza

May 14, 2018, 1:45:15 PM
to tesseract-ocr
Hi,
I tested Tesseract 4 beta on Persian and the results were good, but I think the model needs more training on more fonts and texts.
How can we train the existing Tesseract 4 beta Persian model on more fonts and texts?

One last question: how can we apply a dictionary to correct words that were recognized with errors?

Thanks

ShreeDevi Kumar

May 14, 2018, 1:48:11 PM
to tesser...@googlegroups.com

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/4011df1b-a0cc-46bc-ba9f-e6d6b7f62d64%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

reza

May 15, 2018, 5:43:34 AM
to tesseract-ocr
Thanks for your reply.
I read that, but I am confused. Could you send me a bash file for fine tuning for Impact?
Thanks

reza

May 15, 2018, 7:30:59 AM
to tesseract-ocr
I used the attached finetune.sh file, but it raised an error. Could you help me?

Thanks
 
###### MAKING TRAINING DATA ######

=== Starting training for language 'eng'
[Tue, May 15, 2018 11:42:36 AM] /c/Program Files (x86)/Tesseract-OCR/text2image --fonts_dir=C:WindowsFonts --font=Arial --outputbase=/tmp/font_tmp.CpgpM0lbxD/sample_text.txt --text=/tmp/font_tmp.CpgpM0lbxD/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.CpgpM0lbxD
Rendered page 0 to file C:/Users/asus/AppData/Local/Temp/font_tmp.CpgpM0lbxD/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using Arial
Rendering using Corbel
[Tue, May 15, 2018 11:42:37 AM] /c/Program Files (x86)/Tesseract-OCR/text2image --fontconfig_tmpdir=/tmp/font_tmp.CpgpM0lbxD --fonts_dir=C:WindowsFonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.6m4B2TUln1/eng/eng.Arial.exp0 --max_pages=3 --font=Arial --text=./langdata/eng/eng.training_text
[Tue, May 15, 2018 11:42:37 AM] /c/Program Files (x86)/Tesseract-OCR/text2image --fontconfig_tmpdir=/tmp/font_tmp.CpgpM0lbxD --fonts_dir=C:WindowsFonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0 --max_pages=3 --font=Corbel --text=./langdata/eng/eng.training_text
Stripped 2 unrenderable words
Rendered page 0 to file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Arial.exp0.tif
Stripped 1 unrenderable words
Rendered page 1 to file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Arial.exp0.tif
Stripped 2 unrenderable words
Rendered page 0 to file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0.tif
Stripped 1 unrenderable words
Rendered page 1 to file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Tue, May 15, 2018 11:42:39 AM] /c/Program Files (x86)/Tesseract-OCR/unicharset_extractor --output_unicharset /tmp/tmp.6m4B2TUln1/eng/eng.unicharset --norm_mode 1 /tmp/tmp.6m4B2TUln1/eng/eng.Arial.exp0.box /tmp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0.box
Extracting unicharset from box file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Arial.exp0.box
Extracting unicharset from box file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0.box
ICU ERROR: U_FILE_ACCESS_ERRORERROR: /tmp/tmp.6m4B2TUln1/eng/eng.unicharset does not exist or is not readable
###### MAKING EVAL DATA ######

=== Starting training for language 'eng'
[Tue, May 15, 2018 11:42:40 AM] /c/Program Files (x86)/Tesseract-OCR/text2image --fonts_dir=C:WindowsFonts --font=Calibri --outputbase=/tmp/font_tmp.n0qq4iJk4q/sample_text.txt --text=/tmp/font_tmp.n0qq4iJk4q/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.n0qq4iJk4q
Rendered page 0 to file C:/Users/asus/AppData/Local/Temp/font_tmp.n0qq4iJk4q/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using Calibri
[Tue, May 15, 2018 11:42:40 AM] /c/Program Files (x86)/Tesseract-OCR/text2image --fontconfig_tmpdir=/tmp/font_tmp.n0qq4iJk4q --fonts_dir=C:WindowsFonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.h0l64TAxEq/eng/eng.Calibri.exp0 --max_pages=3 --font=Calibri --text=./langdata/eng/eng.training_text
Stripped 2 unrenderable words
Rendered page 0 to file C:/Users/asus/AppData/Local/Temp/tmp.h0l64TAxEq/eng/eng.Calibri.exp0.tif
Stripped 1 unrenderable words
Rendered page 1 to file C:/Users/asus/AppData/Local/Temp/tmp.h0l64TAxEq/eng/eng.Calibri.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Tue, May 15, 2018 11:42:42 AM] /c/Program Files (x86)/Tesseract-OCR/unicharset_extractor --output_unicharset /tmp/tmp.h0l64TAxEq/eng/eng.unicharset --norm_mode 1 /tmp/tmp.h0l64TAxEq/eng/eng.Calibri.exp0.box
Extracting unicharset from box file C:/Users/asus/AppData/Local/Temp/tmp.h0l64TAxEq/eng/eng.Calibri.exp0.box
ICU ERROR: U_FILE_ACCESS_ERRORERROR: /tmp/tmp.h0l64TAxEq/eng/eng.unicharset does not exist or is not readable
#### combine_tessdata to extract lstm model from previous trained set ####
Extracting tessdata components from ./tessdata_best/eng.traineddata
Wrote ./trained_plus_chars/eng.lstm
Version string:4.00.00alpha:eng:synth20170629
17:lstm:size=401636, offset=192
18:lstm-punc-dawg:size=4322, offset=401828
19:lstm-word-dawg:size=3694794, offset=406150
20:lstm-number-dawg:size=4738, offset=4100944
21:lstm-unicharset:size=6360, offset=4105682
22:lstm-recoder:size=1012, offset=4112042
23:version:size=30, offset=4113054
#### training from previous optimum  #####
finetune.sh: line 119: 11664 Segmentation fault      lstmtraining --model_output $train_output_dir/pluschars --continue_from $train_output_dir/$Lang.lstm --old_traineddata $tessdata_dir/$Lang.traineddata --traineddata $train_output_dir/$Lang/$Lang.traineddata --max_iterations $MaxIterations --debug_interval -1 --eval_listfile $eval_output_dir/$Lang.training_files.txt --train_listfile $train_output_dir/$Lang.training_files.txt
#### Building final trained file ./trained_plus_chars/eng_NEW.traineddata d####
finetune.sh: line 130: 11320 Segmentation fault      lstmtraining --stop_training --continue_from $train_output_dir/pluschars_checkpoint --traineddata $train_output_dir/$Lang/$Lang.traineddata --model_output $final_trained_data_file
finetune.sh

ShreeDevi Kumar

May 15, 2018, 8:42:20 AM
to tesser...@googlegroups.com
Which OS are you running it on?

Which version of Tesseract?

> ICU ERROR: U_FILE_ACCESS_ERRORERROR: /tmp/tmp.6m4B2TUln1/eng/eng.unicharset does not exist or is not readable

Which version of the ICU library?
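
One detail in the log may be worth checking separately from the ICU error: every text2image call shows `--fonts_dir=C:WindowsFonts`. That is what bash produces when a Windows path such as `C:\Windows\Fonts` is written without quotes, because each backslash is treated as an escape character and dropped. This is only a guess at how the script set the path, since the contents of finetune.sh are not shown here, and it may be unrelated to the unicharset failure. A minimal demonstration of the quoting behaviour (the variable names are made up for illustration):

```shell
#!/usr/bin/env bash
# Unquoted: bash consumes each backslash as an escape character.
unquoted=C:\Windows\Fonts
# Single-quoted: the backslashes are kept literally.
quoted='C:\Windows\Fonts'
# Forward slashes sidestep the escaping question altogether.
forward='C:/Windows/Fonts'

printf '%s\n' "$unquoted" "$quoted" "$forward"
```

The first printed line matches the value seen in the log, which is why the quoting of the fonts directory in the script is worth a look.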

reza

May 15, 2018, 9:14:25 AM
to tesseract-ocr
Thanks for the reply.
Tesseract 4 beta
Windows 10

ShreeDevi Kumar

May 15, 2018, 9:16:27 AM
to tesser...@googlegroups.com
Please use the latest Windows binaries from https://github.com/UB-Mannheim/tesseract/wiki provided by @stweil.

How do you run the bash script on Windows 10?

@stweil I have not tried training on Windows. Do you have feedback from others who have tried it?

reza

May 15, 2018, 10:29:39 AM
to tesseract-ocr
I tested it on Ubuntu, and that raised an error too.

Could you help me and send me a new bash file for fine-tuning with new fonts?

I put the "eng.traineddata" file in the tessdata_best folder,
and "eng.training_text" and "eng.traineddata" in langdata\eng.

Is that correct and sufficient, or are more files needed?

Thanks

ShreeDevi Kumar

May 15, 2018, 1:05:10 PM
to tesser...@googlegroups.com
I will try to put together complete steps.

I am doing a test run for training persian.

Are the following fonts ok for it?

  '55_Sarchia_Kurdish' \
  '56_Sarchia_Kurdish_Bold Bold' \
  'Amiri' \
  'Arabic Typesetting' \
  'Arial' \
  'Arial Unicode MS' \
  'B Nazanin' \
  'B Nazanin Bold' \
  'Calibri' \
  'Courier New' \
  'Microsoft Sans Serif' \
  'Scheherazade' \
  'Tahoma' \
  'Times New Roman,' \
  'Traditional Arabic' \

reza

May 15, 2018, 1:17:07 PM
to tesseract-ocr
Hi again,
Thanks for your reply.

I need more fonts, for example:
B Koodak
B Lotus
B Titr
B Zar
B Yekan
Iran Nastaliq

If needed, should I send the .ttf files for those fonts?

Thanks

ShreeDevi Kumar

May 18, 2018, 4:19:54 PM
to tesser...@googlegroups.com
I have posted a couple of test models for Farsi at https://github.com/Shreeshrii/tessdata_shreetest

These have not been trained on text with diacritics, as the normalization and training process was giving errors on the combining marks.

Please give them a try and see if they provide better recognition for numbers and text without combining marks.

FYI, I do not know the Persian language so it is difficult for me to gauge if results are ok or not.

reza

May 19, 2018, 3:54:32 AM
to tesseract-ocr
Hi ShreeDevi,

Thanks.

I tested the two models that you provided. The accuracy was about 98% on samples without noise, but about 80% on scanned samples or captured images.
They still did not work on different fonts, though.
Could you send all the files needed for training the models? I want to fine-tune the model with more fonts and diacritics.

Best regards

ShreeDevi Kumar

May 19, 2018, 5:43:44 AM
to tesser...@googlegroups.com
Hi Reza,

Attached are two scripts and one log file. You will need to change the directories in the scripts.

finetune.sh and finetune log file are for a sample finetuning for eng. By changing the language code you can run it for fas.
You can use that as a test.

plus_fas.sh is for plusminus type of finetuning for fas. It merges the existing unicharset with the unicharset extracted from the training_text.

You will need to update the training_text file in langdata/fas.
Optionally, you can also review and update the wordlist, numbers and punc files.

The scripts should run if you give correct directory names. 
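
For readers without the attachments, the three steps that appear in the finetune.sh log earlier in this thread can be sketched roughly as follows. This is not the attached script: the flags are copied from that log, while the directory defaults, the `eval_output_dir` value, the iteration count, and the `run` and `finetune_steps` helper names are placeholders chosen for illustration. It assumes the training and eval data (the "MAKING TRAINING DATA" and "MAKING EVAL DATA" phases) have already been generated.

```shell
#!/usr/bin/env bash
# Sketch of the three steps visible in the finetune.sh log earlier in this thread.
# Directory names and the iteration count are placeholders, not the script's real values.
set -euo pipefail

Lang=${Lang:-fas}
tessdata_dir=${tessdata_dir:-./tessdata_best}
train_output_dir=${train_output_dir:-./trained_plus_chars}
eval_output_dir=${eval_output_dir:-./eval}     # hypothetical location
MaxIterations=${MaxIterations:-3000}            # hypothetical value
final_trained_data_file="$train_output_dir/${Lang}_NEW.traineddata"
DRY_RUN=${DRY_RUN:-1}   # 1 = print the commands instead of running them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

finetune_steps() {
  # 1. Extract the LSTM network from the existing model.
  run combine_tessdata -e "$tessdata_dir/$Lang.traineddata" \
    "$train_output_dir/$Lang.lstm"

  # 2. Continue training from that network on the new training data.
  run lstmtraining \
    --model_output "$train_output_dir/pluschars" \
    --continue_from "$train_output_dir/$Lang.lstm" \
    --old_traineddata "$tessdata_dir/$Lang.traineddata" \
    --traineddata "$train_output_dir/$Lang/$Lang.traineddata" \
    --max_iterations "$MaxIterations" \
    --debug_interval -1 \
    --eval_listfile "$eval_output_dir/$Lang.training_files.txt" \
    --train_listfile "$train_output_dir/$Lang.training_files.txt"

  # 3. Convert the last checkpoint into a traineddata file.
  run lstmtraining --stop_training \
    --continue_from "$train_output_dir/pluschars_checkpoint" \
    --traineddata "$train_output_dir/$Lang/$Lang.traineddata" \
    --model_output "$final_trained_data_file"
}

finetune_steps
```

Run as is, it only prints the three commands; setting `DRY_RUN=0` executes them once the paths point at real files and the output directory exists.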

finetune.log.txt
finetune.sh
plus_fas.sh

reza

May 19, 2018, 7:16:32 AM
to tesseract-ocr
Thanks for your reply.
I will test these as soon as possible.

One weakness of Tesseract shows up when we want to OCR multiple languages. For example, if an image contains both Persian and English text, Tesseract cannot recognize it as well as it recognizes a single language.

Do you have a solution for this?

PS: I use this command: "tesseract input.png out -l fas+eng"

kislay...@imageinfosystems.com

Oct 16, 2018, 8:19:55 AM
to tesseract-ocr
Hello all,

I want to train Tesseract 4.0 alpha for a new font. Is there anyone who can help me with this?

Soumik Ranjan Dasgupta

Oct 16, 2018, 8:27:17 AM
to tesser...@googlegroups.com

--
Regards,
Soumik Ranjan Dasgupta

kislay bajpai

Oct 16, 2018, 10:10:57 AM
to tesseract-ocr
Hello,

Thanks for the prompt reply. I want to train Tesseract 4.0 alpha for the E13B font. How can I train it? Please share what you know.

Soumik Ranjan Dasgupta

Oct 17, 2018, 2:48:26 PM
to tesser...@googlegroups.com
You'll need to install the fonts on your system and add them to font_properties and language_specific.sh, whether you are fine-tuning or training from scratch. For further details, please see https://github.com/tesseract-ocr/tesseract/issues/1672.

kislay bajpai

Oct 17, 2018, 4:03:26 PM
to tesser...@googlegroups.com
Okay, thanks for the reply. I will look into how to do that.

vivek....@teknowmics.co.in

Oct 24, 2018, 11:34:14 AM
to tesseract-ocr
'Add the same in font_properties and language_specific.sh' ? Can you please elaborate? Thank you

Soumik Ranjan Dasgupta

Oct 24, 2018, 11:41:16 AM
to tesser...@googlegroups.com
Please see tesseract/src/training/language_specific.sh
You need to add the fonts under the respective category after installation. 
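
Before editing language_specific.sh, it can help to confirm that the installed fonts are actually visible to the training tools. The sketch below checks a font listing for the names you plan to add. The `WANTED_FONTS` array and the `check_fonts` function are hypothetical names, and the font names are simply examples mentioned earlier in this thread. The match is a plain substring test, so "B Nazanin" also matches "B Nazanin Bold".

```shell
#!/usr/bin/env bash
# Fonts to look for; replace with the fonts you intend to train on.
WANTED_FONTS=("B Nazanin" "B Titr")

# Reads a font listing on stdin and reports which wanted fonts appear in it.
check_fonts() {
  local listing font
  listing=$(cat)
  for font in "${WANTED_FONTS[@]}"; do
    if grep -qF -- "$font" <<<"$listing"; then
      echo "found:   $font"
    else
      echo "missing: $font"
    fi
  done
}

# Typical use (the fonts directory is a placeholder):
#   text2image --list_available_fonts --fonts_dir=/usr/share/fonts 2>&1 | check_fonts
```

Any font reported as missing is not yet visible to text2image, so listing it in language_specific.sh would not help until it is installed.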

vivek....@teknowmics.co.in

Oct 24, 2018, 12:09:34 PM
to tesseract-ocr
training/lstmtraining --model_output /path/to/output [--max_image_MB 6000] \
--continue_from /path/to/existing/model \
--traineddata /path/to/original/traineddata \
[--perfect_sample_delay 0] [--debug_interval 0] \
[--max_iterations 0] [--target_error_rate 0.01] \
--train_listfile /path/to/list/of/filenames.txt

In this command, what should be passed to the argument continue_from and traineddata? I'm a bit confused.

Shree Devi Kumar

Oct 24, 2018, 4:59:21 PM
to tesser...@googlegroups.com
See the wiki page on training 4.0 and follow the tutorial. 
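
As a rough orientation, going by the finetune.sh log earlier in this thread: `--continue_from` takes the LSTM network extracted from an existing traineddata file with combine_tessdata (or a checkpoint from an earlier run), and `--traineddata` takes the traineddata file itself. The sketch below only prints the two commands for the simple case where the character set does not change. All paths and the `print_commands` helper are placeholders for illustration.

```shell
#!/usr/bin/env bash
# Placeholder paths; adjust them to your setup.
base_traineddata=${base_traineddata:-./tessdata_best/fas.traineddata}
work_dir=${work_dir:-./finetune_out}
list_file=${list_file:-./train/fas.training_files.txt}

# --continue_from: the LSTM network extracted from the base traineddata.
extracted_lstm="$work_dir/$(basename "$base_traineddata" .traineddata).lstm"

print_commands() {
  # Step 1: extract the network that training will continue from.
  printf 'combine_tessdata -e %s %s\n' "$base_traineddata" "$extracted_lstm"
  # Step 2: --traineddata points at the original traineddata file.
  printf 'lstmtraining --model_output %s --continue_from %s --traineddata %s --train_listfile %s\n' \
    "$work_dir/finetuned" "$extracted_lstm" "$base_traineddata" "$list_file"
}

print_commands
```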

Vinod Gattani

Oct 25, 2018, 4:26:58 AM
to tesser...@googlegroups.com