Train more fonts on the existing fas model in Tesseract

reza

May 14, 2018, 1:45:15 PM

to tesseract-ocr

Hi,

I tested Tesseract 4 beta on Persian and the results were good, but I think the model needs more training on more fonts and texts.

How can we train the existing Tesseract 4 beta Persian model on more fonts and texts?

My last question: how can we apply a dictionary to correct words that were OCRed with errors?

Thanks

ShreeDevi Kumar

May 14, 2018, 1:48:11 PM

to tesser...@googlegroups.com

Please see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/4011df1b-a0cc-46bc-ba9f-e6d6b7f62d64%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

reza

May 15, 2018, 5:43:34 AM

to tesseract-ocr

Thanks for your reply.

I read that, but I am confused. Could you send me a bash script for fine-tuning for impact?

Thanks


reza

May 15, 2018, 7:30:59 AM

to tesseract-ocr

I used the attached finetune.sh file, but it raised errors. Could you help me?

Thanks

###### MAKING TRAINING DATA ######

=== Starting training for language 'eng'
[Tue, May 15, 2018 11:42:36 AM] /c/Program Files (x86)/Tesseract-OCR/text2image --fonts_dir=C:WindowsFonts --font=Arial --outputbase=/tmp/font_tmp.CpgpM0lbxD/sample_text.txt --text=/tmp/font_tmp.CpgpM0lbxD/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.CpgpM0lbxD
Rendered page 0 to file C:/Users/asus/AppData/Local/Temp/font_tmp.CpgpM0lbxD/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using Arial
Rendering using Corbel
[Tue, May 15, 2018 11:42:37 AM] /c/Program Files (x86)/Tesseract-OCR/text2image --fontconfig_tmpdir=/tmp/font_tmp.CpgpM0lbxD --fonts_dir=C:WindowsFonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.6m4B2TUln1/eng/eng.Arial.exp0 --max_pages=3 --font=Arial --text=./langdata/eng/eng.training_text
[Tue, May 15, 2018 11:42:37 AM] /c/Program Files (x86)/Tesseract-OCR/text2image --fontconfig_tmpdir=/tmp/font_tmp.CpgpM0lbxD --fonts_dir=C:WindowsFonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0 --max_pages=3 --font=Corbel --text=./langdata/eng/eng.training_text
Stripped 2 unrenderable words
Rendered page 0 to file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Arial.exp0.tif
Stripped 1 unrenderable words
Rendered page 1 to file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Arial.exp0.tif
Stripped 2 unrenderable words
Rendered page 0 to file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0.tif
Stripped 1 unrenderable words
Rendered page 1 to file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Tue, May 15, 2018 11:42:39 AM] /c/Program Files (x86)/Tesseract-OCR/unicharset_extractor --output_unicharset /tmp/tmp.6m4B2TUln1/eng/eng.unicharset --norm_mode 1 /tmp/tmp.6m4B2TUln1/eng/eng.Arial.exp0.box /tmp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0.box
Extracting unicharset from box file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Arial.exp0.box
Extracting unicharset from box file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0.box
ICU ERROR: U_FILE_ACCESS_ERRORERROR: /tmp/tmp.6m4B2TUln1/eng/eng.unicharset does not exist or is not readable
###### MAKING EVAL DATA ######

=== Starting training for language 'eng'
[Tue, May 15, 2018 11:42:40 AM] /c/Program Files (x86)/Tesseract-OCR/text2image --fonts_dir=C:WindowsFonts --font=Calibri --outputbase=/tmp/font_tmp.n0qq4iJk4q/sample_text.txt --text=/tmp/font_tmp.n0qq4iJk4q/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.n0qq4iJk4q
Rendered page 0 to file C:/Users/asus/AppData/Local/Temp/font_tmp.n0qq4iJk4q/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using Calibri
[Tue, May 15, 2018 11:42:40 AM] /c/Program Files (x86)/Tesseract-OCR/text2image --fontconfig_tmpdir=/tmp/font_tmp.n0qq4iJk4q --fonts_dir=C:WindowsFonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.h0l64TAxEq/eng/eng.Calibri.exp0 --max_pages=3 --font=Calibri --text=./langdata/eng/eng.training_text
Stripped 2 unrenderable words
Rendered page 0 to file C:/Users/asus/AppData/Local/Temp/tmp.h0l64TAxEq/eng/eng.Calibri.exp0.tif
Stripped 1 unrenderable words
Rendered page 1 to file C:/Users/asus/AppData/Local/Temp/tmp.h0l64TAxEq/eng/eng.Calibri.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Tue, May 15, 2018 11:42:42 AM] /c/Program Files (x86)/Tesseract-OCR/unicharset_extractor --output_unicharset /tmp/tmp.h0l64TAxEq/eng/eng.unicharset --norm_mode 1 /tmp/tmp.h0l64TAxEq/eng/eng.Calibri.exp0.box
Extracting unicharset from box file C:/Users/asus/AppData/Local/Temp/tmp.h0l64TAxEq/eng/eng.Calibri.exp0.box
ICU ERROR: U_FILE_ACCESS_ERRORERROR: /tmp/tmp.h0l64TAxEq/eng/eng.unicharset does not exist or is not readable
#### combine_tessdata to extract lstm model from previous trained set ####
Extracting tessdata components from ./tessdata_best/eng.traineddata
Wrote ./trained_plus_chars/eng.lstm
Version string:4.00.00alpha:eng:synth20170629
17:lstm:size=401636, offset=192
18:lstm-punc-dawg:size=4322, offset=401828
19:lstm-word-dawg:size=3694794, offset=406150
20:lstm-number-dawg:size=4738, offset=4100944
21:lstm-unicharset:size=6360, offset=4105682
22:lstm-recoder:size=1012, offset=4112042
23:version:size=30, offset=4113054
#### training from previous optimum #####
finetune.sh: line 119: 11664 Segmentation fault lstmtraining --model_output $train_output_dir/pluschars --continue_from $train_output_dir/$Lang.lstm --old_traineddata $tessdata_dir/$Lang.traineddata --traineddata $train_output_dir/$Lang/$Lang.traineddata --max_iterations $MaxIterations --debug_interval -1 --eval_listfile $eval_output_dir/$Lang.training_files.txt --train_listfile $train_output_dir/$Lang.training_files.txt
#### Building final trained file ./trained_plus_chars/eng_NEW.traineddata d####
finetune.sh: line 130: 11320 Segmentation fault lstmtraining --stop_training --continue_from $train_output_dir/pluschars_checkpoint --traineddata $train_output_dir/$Lang/$Lang.traineddata --model_output $final_trained_data_file

finetune.sh

ShreeDevi Kumar

May 15, 2018, 8:42:20 AM

to tesser...@googlegroups.com

Which OS are you running it on?

Which version of Tesseract?

> ICU ERROR: U_FILE_ACCESS_ERRORERROR: /tmp/tmp.6m4B2TUln1/eng/eng.unicharset does not exist or is not readable

Which version of the ICU library?

ShreeDevi

reza

May 15, 2018, 9:14:25 AM

to tesseract-ocr

Thanks for the reply.

Tesseract 4 beta

Windows 10

ShreeDevi Kumar

May 15, 2018, 9:16:27 AM

to tesser...@googlegroups.com

Please use the latest Windows binaries from https://github.com/UB-Mannheim/tesseract/wiki provided by @stweil.

How do you run a bash script on Windows 10?

@stweil, I have not tried training on Windows. Do you have feedback from others who have tried it?

ShreeDevi

On Tue, May 15, 2018 at 2:41 PM, reza <reza...@gmail.com> wrote:

windows 10
tesseract 4 alpha


reza

May 15, 2018, 10:29:39 AM

to tesseract-ocr

I tested it on Ubuntu, and that raised an error too.

Could you help me and send me a new bash script for fine-tuning with new fonts?

I put the "eng.traineddata" file in the tessdata_best folder

and "eng.training_text" and "eng.traineddata" in langdata\eng.

Is that correct and sufficient, or are more files needed?

Thanks
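
For readers checking the same layout, here is a minimal sketch that tests only for the two input files that appear in the finetune.sh log earlier in this thread (`./langdata/<lang>/<lang>.training_text` and `./tessdata_best/<lang>.traineddata`). The function name `check_layout` is made up for illustration, and the training scripts may well need further langdata files beyond these two.

```shell
#!/usr/bin/env bash
# Sketch: verify the input files that the log in this thread shows being read.
# Directory names mirror the log; adjust them to your own setup.

# check_layout LANG LANGDATA_DIR TESSDATA_DIR
# Prints each missing file and returns non-zero if any is absent.
check_layout() {
  local lang="$1" langdata_dir="$2" tessdata_dir="$3"
  local missing=0 f
  for f in \
    "$langdata_dir/$lang/$lang.training_text" \
    "$tessdata_dir/$lang.traineddata"; do
    if [ ! -r "$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Example call (paths are assumptions).
check_layout eng ./langdata ./tessdata_best || echo "layout incomplete"
```

Passing this check does not guarantee that training will run; it only rules out the most basic path mistakes.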

ShreeDevi Kumar

May 15, 2018, 1:05:10 PM

to tesser...@googlegroups.com

I will try to put together complete steps.

I am doing a test run for training Persian.

Are the following fonts OK for it?

'55_Sarchia_Kurdish' \
'56_Sarchia_Kurdish_Bold Bold' \
'Amiri' \
'Arabic Typesetting' \
'Arial' \
'Arial Unicode MS' \
'B Nazanin' \
'B Nazanin Bold' \
'Calibri' \
'Courier New' \
'Microsoft Sans Serif' \
'Scheherazade' \
'Tahoma' \
'Times New Roman,' \
'Traditional Arabic' \

ShreeDevi

reza

May 15, 2018, 1:17:07 PM

to tesseract-ocr

Hi again,

Thanks for your reply.

I need more fonts, for example:

B Koodak
B Lotus
B Titr
B Zar
B Yekan
Iran Nastaliq

If needed, should I send the .ttf files for those fonts?

Thanks


ShreeDevi Kumar

May 18, 2018, 4:19:54 PM

to tesser...@googlegroups.com

I have posted a couple of test models for Farsi at https://github.com/Shreeshrii/tessdata_shreetest

These have not been trained on text with diacritics, as the normalization and training process was giving errors on the combining marks.

Please give them a try and see if they provide better recognition for numbers and text without combining marks.

FYI, I do not know the Persian language so it is difficult for me to gauge if results are ok or not.

ShreeDevi

reza

May 19, 2018, 3:54:32 AM

to tesseract-ocr

Hi ShreeDevi,

Thanks.

I tested the two models that you provided. Accuracy on noise-free samples was about 98%, but on scanned samples or captured images it was about 80%.

However, they still did not work on different fonts.

Could you send all the files needed for training models? I want to fine-tune the model with more fonts and diacritics.

Best regards

ShreeDevi Kumar

May 19, 2018, 5:43:44 AM

to tesser...@googlegroups.com

Hi Reza,

Attached are two scripts and one log file. You will need to change the directories in the scripts.

finetune.sh and the finetune log file are for a sample fine-tuning run for eng. By changing the language code you can run it for fas.

You can use that as a test.

plus_fas.sh is for plusminus-type fine-tuning for fas. It merges the existing unicharset with the unicharset extracted from the training_text.

You will need to update the training_text file in langdata/fas.

Optionally, you can also review and update the wordlist, numbers and punc files.

The scripts should run if you give correct directory names.

ShreeDevi

finetune.log.txt

finetune.sh

plus_fas.sh

reza

May 19, 2018, 7:16:32 AM

to tesseract-ocr

Thanks for your reply.

I will test these as soon as possible.

One weakness of Tesseract is OCR of multiple languages. For example, if an image contains both Persian and English text, Tesseract cannot recognize it as well as it recognizes text in a single language.

Do you have a solution for this?

PS: I use this command: "tesseract input.png out -l fas+eng"
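
One cheap experiment for mixed Persian and English pages is to run the same image with both language orders and compare the outputs, since, as far as I know, the language listed first is treated as the primary one and the order can change the result. The sketch below reuses the image name and `-l` usage from the command above; the helper `join_langs` and the output base names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: run one image with both language orders and keep both outputs.
# Whether either order helps on a given page is something to verify by eye.

# join_langs L1 L2 ...: joins language codes with '+', the form -l expects.
join_langs() {
  local IFS='+'
  echo "$*"
}

image=input.png
for order in "$(join_langs fas eng)" "$(join_langs eng fas)"; do
  if command -v tesseract >/dev/null 2>&1 && [ -r "$image" ]; then
    # The output base includes the order, e.g. out_fas+eng.txt
    tesseract "$image" "out_$order" -l "$order"
  else
    echo "would run: tesseract $image out_$order -l $order"
  fi
done
```

This does not fix mixed-script recognition; it only makes it easy to see whether the order matters for a particular document.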

kislay...@imageinfosystems.com

Oct 16, 2018, 8:19:55 AM

to tesseract-ocr

Hello all,

I want to train Tesseract 4.0 alpha for a new font. Is there anyone who can help me with this?

Soumik Ranjan Dasgupta

Oct 16, 2018, 8:27:17 AM

to tesser...@googlegroups.com

Please see https://github.com/tesseract-ocr/tesseract/wiki/Fonts#fonts-for-tesseract-training.


--

Regards,

Soumik Ranjan Dasgupta

kislay bajpai

Oct 16, 2018, 10:10:57 AM

to tesseract-ocr

Hello,

Thanks for the prompt reply. I want to train Tesseract 4.0 alpha for the E13B font. How can I train it? Please share what you know.

Soumik Ranjan Dasgupta

Oct 17, 2018, 2:48:26 PM

to tesser...@googlegroups.com

You'll need to install the fonts on your system and add them to font_properties and language_specific.sh, whether you are fine-tuning or training from scratch. For further details please see https://github.com/tesseract-ocr/tesseract/issues/1672.


kislay bajpai

Oct 17, 2018, 4:03:26 PM

to tesser...@googlegroups.com

Okay, thanks for reply. I will see how to do so.


vivek....@teknowmics.co.in

Oct 24, 2018, 11:34:14 AM

to tesseract-ocr

'Add the same in font_properties and language_specific.sh' ? Can you please elaborate? Thank you

Soumik Ranjan Dasgupta

Oct 24, 2018, 11:41:16 AM

to tesser...@googlegroups.com

Please see tesseract/src/training/language_specific.sh

You need to add the fonts under the respective category after installation.
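
Before editing language_specific.sh, it can help to confirm that the installed fonts are actually visible to text2image under the exact names you plan to list. The sketch below assumes a Linux fonts directory and uses a few font names mentioned earlier in this thread as examples; `font_in_list` is a hypothetical helper that does a plain case-insensitive substring match.

```shell
#!/usr/bin/env bash
# Sketch: check that fonts are visible to text2image before listing them in
# language_specific.sh. The fonts directory and font names are assumptions.

fonts_dir=/usr/share/fonts
wanted_fonts=("B Nazanin" "B Titr" "Iran Nastaliq")

# font_in_list NAME: reads a font listing on stdin and succeeds if NAME
# occurs in it (case-insensitive fixed-string match, so "B Nazanin" also
# matches "B Nazanin Bold").
font_in_list() {
  grep -qiF -- "$1"
}

if command -v text2image >/dev/null 2>&1; then
  available="$(text2image --fonts_dir "$fonts_dir" --list_available_fonts 2>&1 || true)"
  for font in "${wanted_fonts[@]}"; do
    if printf '%s\n' "$available" | font_in_list "$font"; then
      echo "found:   $font"
    else
      echo "missing: $font"
    fi
  done
else
  echo "text2image not found; install the Tesseract training tools first"
fi
```

The names printed by `--list_available_fonts` are the ones to copy into the font list, including any style suffix.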


vivek....@teknowmics.co.in

Oct 24, 2018, 12:09:34 PM

to tesseract-ocr

training/lstmtraining --model_output /path/to/output [--max_image_MB 6000] \
--continue_from /path/to/existing/model \
--traineddata /path/to/original/traineddata \
[--perfect_sample_delay 0] [--debug_interval 0] \
[--max_iterations 0] [--target_error_rate 0.01] \
--train_listfile /path/to/list/of/filenames.txt

In this command, what should be passed to the continue_from and traineddata arguments? I'm a bit confused.

Shree Devi Kumar

Oct 24, 2018, 4:59:21 PM

to tesser...@googlegroups.com

See the wiki page on training 4.0 and follow the tutorial.
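
As a rough illustration of how the arguments relate, the dry-run sketch below mirrors the three steps visible in the finetune.sh log earlier in this thread: `--continue_from` points at the `.lstm` file extracted from an existing traineddata, and `--traineddata` points at the traineddata that holds the character set for the run. When the character set is unchanged, that is the original traineddata, as in the synopsis above; in the log, where characters are added, it is a newly built one and the original is passed as `--old_traineddata`. The directory names follow the log, the iteration count is a placeholder, and the commands are only printed, not executed.

```shell
#!/usr/bin/env bash
# Sketch of the fine-tuning sequence seen in the finetune.sh log in this
# thread. Commands are only printed (dry run); drop "echo" to execute them.
# Directory names follow the log; the iteration count is a placeholder.

lang=eng
tessdata_dir=./tessdata_best     # holds the released $lang.traineddata
out_dir=./trained_plus_chars     # working and output directory
max_iterations=400               # hypothetical value

# 1. Extract the LSTM model from the existing traineddata.
#    The resulting .lstm file is what --continue_from points at.
extract_cmd() {
  echo combine_tessdata -e "$tessdata_dir/$lang.traineddata" "$out_dir/$lang.lstm"
}

# 2. Continue training from that model.
#    --traineddata is the traineddata built for the new training run,
#    --old_traineddata is the released one the model came from.
train_cmd() {
  echo lstmtraining \
    --model_output "$out_dir/pluschars" \
    --continue_from "$out_dir/$lang.lstm" \
    --old_traineddata "$tessdata_dir/$lang.traineddata" \
    --traineddata "$out_dir/$lang/$lang.traineddata" \
    --train_listfile "$out_dir/$lang.training_files.txt" \
    --max_iterations "$max_iterations"
}

# 3. Convert the final checkpoint into a usable traineddata file.
finalize_cmd() {
  echo lstmtraining --stop_training \
    --continue_from "$out_dir/pluschars_checkpoint" \
    --traineddata "$out_dir/$lang/$lang.traineddata" \
    --model_output "$out_dir/${lang}_NEW.traineddata"
}

extract_cmd
train_cmd
finalize_cmd
```

The training list file and the new traineddata have to be generated beforehand by the training-data step; this sketch does not cover that part.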


Vinod Gattani

Oct 25, 2018, 4:26:58 AM

to tesser...@googlegroups.com

You can look at this repo:

https://github.com/Shreeshrii/tessdata_ocrb

Use finetune-ocrb.sh.
