Training Tesseract 4 from Scratch

Shobhit Kapil

Apr 3, 2019, 2:46:36 PM

to tesseract-ocr

Hi Team,

I am not at all familiar with training Tesseract 4. Is there a way to learn how to train it?

Even after reading the documentation, I cannot work out where to start or what to start with.

Thanks,

Shobhit

Kristóf Horváth

Apr 4, 2019, 7:56:02 AM

to tesseract-ocr

Try this guide to learn more about Tesseract's LSTM training capabilities: link to the guide. Yes, the official documentation is kind of garbage, so try out different guides and articles; eventually you will get it. As far as I know there are not many up-to-date guides, so try the one I linked (it's in wiki format).

Titi

Apr 4, 2019, 12:01:55 PM

to tesseract-ocr

I'm reading the guide Kristóf Horváth shared, but I feel so dull :(
If anyone gets to shout "eureka!", please reply here, step by step. (If it's me, I will reply, for sure.)
Thanks.

Kristóf Horváth

Apr 4, 2019, 12:09:28 PM

to tesseract-ocr

You are not dull. I wrote that guide on my own while I was learning how to train Tesseract, and in my experience nobody will hand you a single step-by-step guide. Once you know what the available technology can do, most of it is intuition; you just have to keep working at it until it is done.

Zdenko Podobny

Apr 4, 2019, 12:22:26 PM

to tesser...@googlegroups.com

Kristof,

Do you use some plugin in Google Docs?

It looks like you use Markdown formatting, but it is not very readable (at least for me) ;-)

Zdenko

On Thu, Apr 4, 2019 at 9:56, Kristóf Horváth <vazzz...@gmail.com> wrote:

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/fc607a42-008e-456a-8964-0a0858c70008%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Kristóf Horváth

Apr 4, 2019, 12:24:43 PM

to tesseract-ocr

I made this for a wiki and don't have the time to format it, but if you format it you can submit your version and I can accept it.


Shree Devi Kumar

Apr 4, 2019, 1:40:04 PM

to tesser...@googlegroups.com

#=== CHECK THAT TESSERACT AND TRAINING TOOLS ARE INSTALLED

tesseract -v

text2image -v

unicharset_extractor -v

set_unicharset_properties -v

combine_lang_model -v

lstmtraining -v

lstmeval -v
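As a hedged sketch (not part of the original script), the version checks above can be wrapped in a loop that reports every missing tool at once. The function name check_tools is hypothetical; it relies only on the POSIX `command -v` builtin, and the tool names are the ones listed above.

```shell
# check_tools: hypothetical helper, not part of Tesseract.
# Prints each command that is not on PATH and returns non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Tool names taken from the checks above.
check_tools tesseract text2image unicharset_extractor \
    set_unicharset_properties combine_lang_model lstmtraining lstmeval \
    || echo "Install the missing training tools before continuing." >&2
```

This only confirms that the binaries are on PATH; the `-v` calls above are still the way to see which version is installed.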

#=== MAKE DIRECTORIES AND DOWNLOAD REQUIRED FILES

mkdir -p ~/tessscratch

cd ~/tessscratch

wget -O lstm.train https://raw.githubusercontent.com/tesseract-ocr/tesseract/master/tessdata/configs/lstm.train

wget -O radical-stroke.txt https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/master/radical-stroke.txt

mkdir -p mylangdata

mkdir -p mylangdata/foo

#=== CREATE YOUR TRAINING TEXT FOR NEW LANGUAGE foo.

#=== FOR TRAINING FROM SCRATCH, IT SHOULD BE THOUSANDS OF LINES.

#=== HERE A COPY OF ENGLISH TRAINING TEXT (72 LINES) IS MADE AS AN ILLUSTRATION.

wget -O mylangdata/foo/foo.training_text https://raw.githubusercontent.com/tesseract-ocr/langdata/master/eng/eng.training_text
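Since training from scratch needs thousands of lines, a quick size check on the training text may save a wasted run. The sketch below is an illustration only: count_text_lines is a hypothetical helper, and the threshold of 1000 is an assumed placeholder, not a figure from this thread.

```shell
# count_text_lines: hypothetical helper that counts non-blank lines in a file.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the status clean.
count_text_lines() {
    grep -c '[^[:space:]]' "$1" || true
}

text=mylangdata/foo/foo.training_text   # path used by the script above
min_lines=1000                          # assumed threshold; tune for your data
if [ -f "$text" ]; then
    n=$(count_text_lines "$text")
    if [ "$n" -lt "$min_lines" ]; then
        echo "warning: $text has only $n non-blank lines" >&2
    fi
fi
```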

#=== MAKE BOX/TIFF PAIRS USING TRAINING TEXT AND TWO FONTS.

text2image --strip_unrenderable_words --leading=32 --xsize=3600 --char_spacing=0.0 --exposure=0 --max_pages=0 \
--fonts_dir=/usr/share/fonts \
--font="Arial Unicode MS" \
--text=mylangdata/foo/foo.training_text \
--outputbase=foo.Arial.exp0

text2image --strip_unrenderable_words --leading=32 --xsize=3600 --char_spacing=0.0 --exposure=0 --max_pages=0 \
--fonts_dir=/usr/share/fonts \
--font="Courier New" \
--text=mylangdata/foo/foo.training_text \
--outputbase=foo.Courier.exp0
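text2image can only render fonts it finds under --fonts_dir, so it may help to confirm that both fonts are visible before rendering. The sketch below uses text2image's `--list_available_fonts` option. The helper font_listed is hypothetical, and it assumes the listing prints one font per line with a leading "N: " index; adjust the pattern if your build prints something else.

```shell
# font_listed: hypothetical helper. Reads a font listing on stdin and succeeds
# if one entry, after stripping an assumed "  N: " index prefix, equals $1.
font_listed() {
    awk -v want="$1" '
        { sub(/^[ \t]*[0-9]+:[ \t]*/, "") }
        $0 == want { found = 1 }
        END { exit !found }
    '
}

if command -v text2image >/dev/null 2>&1; then
    for font in "Arial Unicode MS" "Courier New"; do
        # The listing may go to stderr, so both streams are captured.
        if ! text2image --list_available_fonts --fonts_dir=/usr/share/fonts 2>&1 \
                | font_listed "$font"; then
            echo "font not found: $font" >&2
        fi
    done
fi
```

The comparison is exact on purpose: a substring match would accept "Arial" when only "Arial Unicode MS" is installed.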

#=== EXTRACT UNICHARSET & SET PROPERTIES FROM BOX FILES.

unicharset_extractor --output_unicharset foo.unicharset --norm_mode 1 foo.Arial.exp0.box foo.Courier.exp0.box

set_unicharset_properties -U foo.unicharset -O foo.unicharset -X foo.xheights --script_dir=.

#=== CREATE LSTMF FILES.

tesseract foo.Arial.exp0.tif foo.Arial.exp0 --psm 6 lstm.train

tesseract foo.Courier.exp0.tif foo.Courier.exp0 --psm 6 lstm.train

ls -1 *.lstmf > foo.training_files.txt
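If the previous step produced no .lstmf files, the `ls` redirect above still creates an empty list file, and training fails later with a less obvious message. A minimal sketch of a check, assuming the paths in the list are relative to the current directory (check_listfile is a hypothetical name):

```shell
# check_listfile: hypothetical helper. Succeeds only if the list file is
# non-empty and every path named in it exists.
check_listfile() {
    listfile=$1
    [ -s "$listfile" ] || { echo "empty or missing: $listfile" >&2; return 1; }
    status=0
    while IFS= read -r path; do
        [ -f "$path" ] || { echo "not found: $path" >&2; status=1; }
    done < "$listfile"
    return "$status"
}

if [ -e foo.training_files.txt ]; then
    check_listfile foo.training_files.txt \
        || echo "fix foo.training_files.txt before training" >&2
fi
```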

#=== CREATE STARTER TRAINEDDATA

mkdir -p fooscratch

combine_lang_model \
--input_unicharset foo.unicharset \
--script_dir . \
--output_dir fooscratch \
--lang foo

#=== RUN LSTM TRAINING -

#=== hundreds of thousands of iterations may be needed for real training_text.

lstmtraining \
--model_output fooscratch/LAYER \
--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
--learning_rate 20e-4 \
--traineddata fooscratch/foo/foo.traineddata \
--train_listfile foo.training_files.txt \
--debug_interval -1 \
--max_iterations 100

lstmtraining \
--model_output fooscratch/LAYER \
--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
--learning_rate 20e-4 \
--traineddata fooscratch/foo/foo.traineddata \
--train_listfile foo.training_files.txt \
--debug_interval 0 \
--max_iterations 5000
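The script stops once training has written checkpoints. As a hedged follow-up that is not part of the original post, the checkpoint can be evaluated with lstmeval and converted into a usable traineddata file with `lstmtraining --stop_training`. The sketch assumes the latest checkpoint is written next to --model_output with a "_checkpoint" suffix; checkpoint_for and the output name foo.traineddata in the current directory are illustrative choices.

```shell
# Assumption: lstmtraining writes its latest checkpoint as
# "<model_output>_checkpoint" (fooscratch/LAYER -> fooscratch/LAYER_checkpoint).
checkpoint_for() {
    printf '%s_checkpoint\n' "$1"
}

ckpt=$(checkpoint_for fooscratch/LAYER)
if command -v lstmtraining >/dev/null 2>&1 && [ -f "$ckpt" ]; then
    # Evaluating on the training list only measures fit to the training data;
    # use a separate held-out list for a meaningful error rate.
    lstmeval --model "$ckpt" \
        --traineddata fooscratch/foo/foo.traineddata \
        --eval_listfile foo.training_files.txt

    # Convert the checkpoint into a traineddata file (hypothetical output name).
    lstmtraining --stop_training \
        --continue_from "$ckpt" \
        --traineddata fooscratch/foo/foo.traineddata \
        --model_output foo.traineddata
fi
```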


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Shree Devi Kumar

Apr 4, 2019, 1:51:57 PM

to tesser...@googlegroups.com

If you want to use your own images, you don't need to run text2image with the training text and fonts.

Instead, supply your own list of box and tif files in the next step.
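When using your own images, every .tif needs a .box file with the same base name. The sketch below checks that pairing and then runs the same LSTMF step as the script, under the assumption that the files sit in the current directory next to lstm.train. unpaired_images is a hypothetical helper.

```shell
# unpaired_images: hypothetical helper. Prints every .tif in directory $1
# that has no .box file with the same base name.
unpaired_images() {
    for tif in "$1"/*.tif; do
        [ -e "$tif" ] || continue        # the glob matched nothing
        [ -f "${tif%.tif}.box" ] || printf '%s\n' "$tif"
    done
}

# With complete pairs, the LSTMF step from the script applies unchanged.
if command -v tesseract >/dev/null 2>&1 && [ -f lstm.train ] \
        && [ -z "$(unpaired_images .)" ]; then
    for tif in ./*.tif; do
        [ -e "$tif" ] || continue
        tesseract "$tif" "${tif%.tif}" --psm 6 lstm.train
    done
fi
```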

Shobhit Kapil

Apr 5, 2019, 11:33:13 AM

to tesseract-ocr

Hi,

Before starting the training process, I would like to know a bit more about it.

1. My files are not very clear and contain different kinds of noise. Will training help in such cases?

2. Characters are not being read correctly: most of the time 5 is read as S, Z as 2, and i as !. Will training fix this?

Thanks,

Shobhit

Trong

Apr 7, 2019, 8:51:56 AM

to tesseract-ocr

Hi,

I tried to train and got this error:

mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110
Illegal instruction (core dumped)

Is there a problem with my environment?

OS: Ubuntu 18.04 64-bit. Other details are below.

Thanks,

titi@Ubun18:~/tessscratch$ ls -1 *.lstmf > vie.training_files.txt
titi@Ubun18:~/tessscratch$ lstmtraining \
> --model_output  viescratch/LAYER \
> --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
> --learning_rate 20e-4 \
> --traineddata  viescratch/vie/vie.traineddata \
> --train_listfile  vie.training_files.txt   \
> --debug_interval -1 \
> --max_iterations 100
mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110
Illegal instruction (core dumped)
titi@Ubun18:~/tessscratch$ tesseract -v
tesseract 4.1.0-rc1-223-g3e71
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found SSE
titi@Ubun18:~/tessscratch$ text2image -v
Using CAIRO_FONT_TYPE_FT.
4.1.0-rc1-223-g3e71
titi@Ubun18:~/tessscratch$ unicharset_extractor -v
4.1.0-rc1-223-g3e71
titi@Ubun18:~/tessscratch$ set_unicharset_properties -v
4.1.0-rc1-223-g3e71
titi@Ubun18:~/tessscratch$ combine_lang_model -v
4.1.0-rc1-223-g3e71
titi@Ubun18:~/tessscratch$ lstmtraining -v
4.1.0-rc1-223-g3e71
titi@Ubun18:~/tessscratch$ lstmeval -v
4.1.0-rc1-223-g3e71
titi@Ubun18:~/tessscratch$

Shree Devi Kumar

Apr 7, 2019, 8:55:55 AM

to tesser...@googlegroups.com

mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

Your traineddata file path is incorrect, or the file does not exist.
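A pre-flight check on the --traineddata path turns this assert into a readable message. This is a minimal sketch, assuming the starter file was written by combine_lang_model to <output_dir>/<lang>/<lang>.traineddata as in the script earlier in the thread; require_traineddata is a hypothetical name.

```shell
# require_traineddata: hypothetical pre-flight check for the --traineddata path.
require_traineddata() {
    if [ ! -s "$1" ]; then
        echo "starter traineddata not found or empty: $1" >&2
        echo "re-run combine_lang_model, or fix the --traineddata path" >&2
        return 1
    fi
}

# Path layout assumed from the script: <output_dir>/<lang>/<lang>.traineddata
lang=vie
require_traineddata "${lang}scratch/$lang/$lang.traineddata" \
    || echo "not starting lstmtraining" >&2
```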


Trong

Apr 7, 2019, 9:38:10 AM

to tesseract-ocr

Thank you, Shree. I placed my traineddata file in my directory, and it works.

Thank you very much!

shree

Apr 8, 2019, 5:14:52 AM

to tesseract-ocr

The script given is a simple example which will work for English and other Latin script languages.

Check for errors and review output files at every stage.

If you are training for Indic scripts, you need to use --norm_mode 2 and --pass_through_recoder. RTL languages need further modifications.

#=== EXTRACT UNICHARSET & SET PROPERTIES FROM BOX FILES.

unicharset_extractor --output_unicharset foo.unicharset --norm_mode 2 *.box

#=== CREATE STARTER TRAINEDDATA

mkdir -p fooscratch

combine_lang_model \
--input_unicharset foo.unicharset \
--script_dir . \
--output_dir fooscratch \
--pass_through_recoder \
--lang foo
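The advice to check for errors at every stage can be sketched as a small wrapper that stops reporting success as soon as a step fails. run_step is a hypothetical helper, not a Tesseract tool; it simply runs the command it is given and reports the exit status.

```shell
# run_step: hypothetical wrapper. Runs a command, prints "ok: <label>" on
# success, and on failure reports the exit status and returns it.
run_step() {
    label=$1
    shift
    if "$@"; then
        echo "ok: $label"
    else
        rc=$?
        echo "FAILED ($rc): $label" >&2
        return "$rc"
    fi
}

# Usage with a step from the script above (commented out so that the
# sketch has no side effects):
#   run_step "starter traineddata" combine_lang_model \
#       --input_unicharset foo.unicharset --script_dir . \
#       --output_dir fooscratch --pass_through_recoder --lang foo
```

An exit status of zero does not guarantee useful output, so the generated files are still worth reviewing by hand after each step.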


Shobhit Kapil

Apr 9, 2019, 9:58:33 AM

to tesseract-ocr

Hi Shree,

Could you please share your feedback on the points below?

Before starting the training process, I would like to know a bit more about it.

1. My files are not very clear and contain different kinds of noise. Will training help in such cases?

2. Characters are not being read correctly: most of the time 5 is read as S, Z as 2, and i as !. Will training fix this?

What is the best way to overcome these character misreadings?

Thanks,

Shobhit


Shobhit Kapil

Apr 10, 2019, 5:12:04 PM

to tesseract-ocr

Hi Shree,

Please share your input for the following questions....

Thanks,

Shobhit


yoganand

Apr 19, 2019, 7:59:17 AM

to tesseract-ocr

I'm trying to train my Tesseract 4. I started by installing Cygwin and got through the setup, but the steps you gave for ocrd-train fail when I try to compile Leptonica and Tesseract. The steps are a bit high-level for me. Running 'make leptonica' in Cygwin gives a 'no rule to make target' error. I tried this after downloading ocrd-train and updating the Tesseract and Leptonica versions. Can you explain what I need to do for the part below?

I used the same documentation given above and had difficulty executing the steps below. Can someone please help?

"Now that you copied ocrd-train into your setup (in my case Cygwin) you still have to connect these tools to Tesseract and Leptonica. To do so I simply used commands in Makefile to compile Tesseract and Leptonica inside my ocrd-train folder, but I am sure with a little editing you can make it so that the tools use your specific folder structure. To compile Tesseract and Leptonica use the following commands (order counts, Tesseract will give you an error without Leptonica) <code>make leptonica</code> and <code>make tesseract</code>. These commands will compile versions of Tesseract and Leptonica, version can be set as a variable inside Makefile."


Kristóf Horváth

Apr 19, 2019, 9:00:10 AM

to tesseract-ocr

What I meant there is that you have to run the commands from the ocrd-train directory, because that is where the Makefile is.
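A "no rule to make target" error usually means make ran in a directory whose Makefile does not define that target. As a rough sketch, the check below confirms the Makefile and both targets before building. The checkout location ($HOME/ocrd-train, overridable through the hypothetical OCRD_TRAIN_DIR variable) and the helper has_make_target are assumptions for illustration.

```shell
# has_make_target: hypothetical helper. Succeeds if Makefile $1 contains a
# line starting with "<target>:" for target $2 (simple rules only).
has_make_target() {
    grep -Eq "^$2[[:space:]]*:" "$1"
}

ocrd_dir=${OCRD_TRAIN_DIR:-$HOME/ocrd-train}   # assumed checkout location
if [ -f "$ocrd_dir/Makefile" ]; then
    (
        cd "$ocrd_dir" || exit 1
        if has_make_target Makefile leptonica \
                && has_make_target Makefile tesseract; then
            # Order matters: Tesseract needs Leptonica to be built first.
            make leptonica && make tesseract
        else
            echo "Makefile has no leptonica/tesseract targets" >&2
        fi
    ) || echo "build failed in $ocrd_dir" >&2
else
    echo "no Makefile in $ocrd_dir; run make from the ocrd-train checkout" >&2
fi
```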

Reddy, Yoganand

Apr 19, 2019, 9:58:59 AM

to tesser...@googlegroups.com

I have the same problem, and I think many others are facing it too. Can someone step up and make the documentation a bit clearer?


--

Thanks,

Yoganand
