train tesseract OCR 4.0

Saurabh Srivastav

Mar 3, 2017, 2:09:39 AM
to tesseract-ocr
How do I train Tesseract 4.0? Please help me.

thanks,
Saurabh Srivastav
Screenshot from 2017-03-03 12-15-12.png

ShreeDevi Kumar

Mar 3, 2017, 2:23:31 AM
to tesser...@googlegroups.com
The warning in your screenshot means that your image does not have resolution info. Your OCR output file should still have been created.
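
If the image carries no resolution metadata, one option is to tell Tesseract the resolution on the command line. The sketch below assumes a Tesseract build that accepts the --dpi option (newer 4.x builds do, as far as I know; the alpha build discussed in this thread may not). The helper name ocr_with_dpi, the TESSERACT_BIN variable and the default of 300 are illustrative assumptions, not part of Tesseract.

```shell
# Hypothetical helper: run OCR with an explicit resolution.
# TESSERACT_BIN can be overridden (for example in tests); it defaults to `tesseract`.
TESSERACT_BIN="${TESSERACT_BIN:-tesseract}"

ocr_with_dpi() {
    local image="$1"
    local dpi="${2:-300}"   # 300 is an assumed default, not a value from this thread
    # The output base is the image name without its extension (scan.png -> scan.txt)
    "$TESSERACT_BIN" "$image" "${image%.*}" --dpi "$dpi"
}

# Usage: ocr_with_dpi scan.png 300
```

Choose the value that matches how the image was actually scanned or photographed; a wrong value can hurt recognition as much as a missing one.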


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/f1782fd1-97a1-40db-8ba0-f003052f39ae%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Saurabh Srivastav

Mar 22, 2017, 2:01:18 PM
to tesseract-ocr
Thank you, Shree, for your valuable reply. I have now created box files for a particular image and trained on it, but I am still missing something. Could you please tell me what I have to do after creating the box file for that image so that Tesseract reads the characters from it?

thanks and regards.



ShreeDevi Kumar

Mar 23, 2017, 7:54:59 AM
to tesser...@googlegroups.com
To read characters from an image, it is not necessary to train Tesseract. Just use an appropriate traineddata file.

Training is required only for a new language or font, or some similar special circumstance.

Read the wiki for documentation.
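
As a minimal sketch of "just use an appropriate traineddata": list the languages your installation can see, then pass one of them with -l. The function names and the TESSERACT_BIN variable are made up for illustration; --list-langs and -l are standard tesseract options.

```shell
# TESSERACT_BIN can be overridden (for example in tests); it defaults to `tesseract`.
TESSERACT_BIN="${TESSERACT_BIN:-tesseract}"

# Show the languages (traineddata files) this installation can find.
list_ocr_languages() {
    "$TESSERACT_BIN" --list-langs
}

# Recognize one image with the given language; the text goes to <image-basename>.txt.
ocr_with_language() {
    local image="$1"
    local lang="${2:-eng}"   # falls back to English when no language is given
    "$TESSERACT_BIN" "$image" "${image%.*}" -l "$lang"
}

# Usage: list_ocr_languages
#        ocr_with_language receipt.png eng
```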



ShreeDevi

Saurabh Srivastav

Apr 3, 2017, 9:40:05 AM
to tesseract-ocr
Hello Shree! Thank you for your help.
Could you please show me how to write a bash script for Tesseract?

ShreeDevi Kumar

Apr 3, 2017, 10:41:33 AM
to tesser...@googlegroups.com
Saurabh,

It depends on what you want to do with the bash script.

Here is a sample script I used to compare results from different traineddata files by looping through a set of image files. Look up the bash commands to figure out what they do!

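#!/bin/bash
# Echo each command as it is read and as it is executed (useful for debugging).
set -vx
export TESSDATA_PREFIX=/mnt/c/Users/User/shree/tesseract-ocr

# Loop over the glob directly instead of parsing `ls`, and quote every
# expansion so that file names containing spaces are handled correctly.
for img_file in *.jpeg; do
    [ -e "$img_file" ] || continue   # no .jpeg files: the pattern stays unexpanded
    time tesseract "$img_file" "${img_file%.*}-ssd" -l ssd
    time tesseract "$img_file" "${img_file%.*}-ssdsmall" --psm 6 --oem 1 -l ssdsmall
    time tesseract "$img_file" "${img_file%.*}-eng" --psm 6 --oem 1 -l eng
done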


ShreeDevi

Saurabh Srivastav

Apr 3, 2017, 11:38:11 AM
to tesseract-ocr
Shree,

Actually, I want a bash script that runs tesseract and stores the output files in a folder.

Kindly help me write this kind of bash script.


Thank you.
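
One way such a script could look, as a minimal sketch rather than a definitive answer: create the output folder, then give tesseract an output base inside that folder for each image. The function name ocr_to_dir, the TESSERACT_BIN variable and the choice of -l eng are assumptions made for illustration.

```shell
# TESSERACT_BIN can be overridden (for example in tests); it defaults to `tesseract`.
TESSERACT_BIN="${TESSERACT_BIN:-tesseract}"

# ocr_to_dir OUTPUT_DIR IMAGE...
ocr_to_dir() {
    local out_dir="$1"
    shift
    mkdir -p "$out_dir" || return 1
    local image name
    for image in "$@"; do
        [ -f "$image" ] || continue          # skip missing files and unmatched globs
        name="$(basename "$image")"
        # tesseract appends .txt to the output base by itself
        "$TESSERACT_BIN" "$image" "$out_dir/${name%.*}" -l eng
    done
}

# Usage: ocr_to_dir ./ocr-output *.jpg
```

With this layout, photo1.jpg would end up as ocr-output/photo1.txt, so the images and their text files stay in separate folders.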

srn...@gmail.com

Apr 4, 2017, 8:36:52 AM
to tesseract-ocr
Hello ShreeDevi,

https://medium.com/apegroup-texts/training-tesseract-for-labels-receipts-and-such-690f452e8f79

The link above is a full-fledged tutorial on using and training Tesseract 3.0.

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

The GitHub wiki page for 4.00, however, could be more detailed. Can you please clarify the points below?

1) How should I train Tesseract if I do not know in advance which fonts will appear in my image files?

2) The GitHub tutorial says the clustering steps (mftraining, cntraining, shapeclustering) should be skipped. Is that correct?

3) I want to generate a traineddata file and use it together with the tessdata that is already present, without replacing it. How can I do that?


Can you please explain how to achieve these steps?


Thank You.








srn...@gmail.com

Apr 4, 2017, 8:47:47 AM
to tesseract-ocr
Hello ShreeDevi,

Can you elaborate on the LSTM step, which is new in Tesseract 4.0, and on the new steps I need to follow to train Tesseract 4?

Thank you





srn...@gmail.com

Apr 4, 2017, 8:48:24 AM
to tesseract-ocr
Are you making any progress, Saurabh?

ShreeDevi Kumar

Apr 4, 2017, 8:53:33 AM
to tesser...@googlegroups.com

srn...@gmail.com

Apr 4, 2017, 9:31:08 AM
to tesseract-ocr
I am trying to train Tesseract 4 and I am getting the following error.

Command used:

mkdir -p /home/p/Documents/T/engoutput
/home/p/Documents/T/tesseract-master/training/lstmtraining -U /home/p/Documents/T/img_frm_3/unicharset \
  --script_dir /home/p/Documents/T/TESS_4_ALPHA/langdata-master --debug_interval 100 \
  --train_listfile /home/p/Documents/T/TESS_4_ALPHA/langdata-master/eng/eng.training_files \
  --eval_listfile /home/p/Documents/T/TESS_4_ALPHA/langdata-master/eng/eng.training_files \
  --max_iterations 5000 &>/home/p/Documents/T/basetrain.log

Command used to watch the log:

tail -f basetrain.log
Failed to load list of training filenames from /home/p/Documents/T/TESS_4_ALPHA/langdata-master/eng/eng.training_files
tail: basetrain.log: file truncated

Error I am getting:

Failed to load list of training filenames from /home/p/Documents/T/TESS_4_ALPHA/langdata-master/eng/eng.training_files

ShreeDevi Kumar

Apr 4, 2017, 10:28:44 AM
to tesser...@googlegroups.com
tesstrain.sh generates a file called eng.training_files.txt.

Your command gives the file name without the .txt extension.

Check the name of the generated file and use that.

I have also found that editing that file by hand leads to errors.
- excuse the brevity, sent from mobile
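
A quick pre-flight check along these lines can catch a wrong list-file name before lstmtraining is started. This is only a sketch: check_listfile is a made-up helper, and it assumes the list file contains one .lstmf path per line, which is how I understand tesstrain.sh writes it.

```shell
# check_listfile LISTFILE
# Returns 0 only if the list file is readable and every file it names is readable.
check_listfile() {
    local listfile="$1"
    if [ ! -r "$listfile" ]; then
        echo "list file not found: $listfile" >&2
        return 1
    fi
    local status=0 entry
    # Assumption: one training file path per line
    while IFS= read -r entry || [ -n "$entry" ]; do
        [ -n "$entry" ] || continue          # ignore blank lines
        if [ ! -r "$entry" ]; then
            echo "missing training file: $entry" >&2
            status=1
        fi
    done < "$listfile"
    return "$status"
}

# Usage: check_listfile ~/tesstutorial/engtrain/eng.training_files.txt
```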


Saurabh Srivastav

Apr 4, 2017, 11:57:06 AM
to tesseract-ocr
Yes, I trained Tesseract for an English font and got it to read the characters from the image.
thanks,
Saurabh Srivastav

Saurabh Srivastav

Apr 4, 2017, 12:08:24 PM
to tesseract-ocr
Thank you Shree,
you always help me.

I still have one problem, though. I wrote a bash script that finds all images with the .jpg extension and names each output file after its image.
I would like the script to also pick up images with other extensions, such as .jpg, .jpeg and .png. Is that possible? If it is, please help me out.


Thank you Shree,

srn...@gmail.com

Apr 4, 2017, 3:24:26 PM
to tesseract-ocr
Can you please share some of your experience in this thread, as there are no posts on training Tesseract 4?

1) Is there any way to add a newly trained data file alongside the old traineddata file, without replacing the old file?
2) If we do not know which fonts will appear in our images, how should we go about training Tesseract?
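
On point 1, one approach that avoids replacing anything is to install the new file under its own language name and load both models together with -l eng+NAME. Note that this does not merge the two files; Tesseract simply loads both. The sketch below uses a made-up helper, install_traineddata, and "myfont" is a placeholder name.

```shell
# install_traineddata SRC_FILE NEW_LANG TESSDATA_DIR
# Copies SRC_FILE to TESSDATA_DIR/NEW_LANG.traineddata and refuses to overwrite.
install_traineddata() {
    local src="$1" new_lang="$2" tessdata_dir="$3"
    local dest="$tessdata_dir/$new_lang.traineddata"
    if [ -e "$dest" ]; then
        echo "refusing to overwrite $dest" >&2
        return 1
    fi
    cp "$src" "$dest"
}

# Usage (paths and names are placeholders):
#   install_traineddata ./myfont.traineddata myfont /usr/local/share/tessdata
#   tesseract image.png out -l eng+myfont
```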

ShreeDevi Kumar

Apr 4, 2017, 11:37:40 PM
to tesser...@googlegroups.com

srn...@gmail.com

Apr 5, 2017, 4:25:21 AM
to tesseract-ocr
Following your advice, I tried two approaches, and I am stuck at the LSTM step in both:

Training

Command used:

/home/p/Documents/T/tesseract-master/training/lstmtraining -U /home/p/Documents/T/img_frm_3/eng.unicharset \
  --script_dir /home/p/Documents/T/TESS_4_ALPHA/langdata-master --debug_interval 100 \
  --net_spec '[1,36,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]' \
  --model_output /home/p/Documents/T/ \
  --train_listfile /home/p/Documents/T/img_frm_3/eng.ArialBold.exp0.txt \
  --eval_listfile /home/p/Documents/T/img_frm_3/eng.ArialBold.exp0.txt \
  --max_iterations 5000 &>/home/p/Documents/T/basetrain.log

tail -f basetrain.log
The error I get is:


Deserialize header failed: BnO. 005 SUBHISHIs TOWN CENTRE
Deserialize header failed: MOKILA SHAKARPALLY
Deserialize header failed: PHONE: 040-8989898989
Load of page 0 failed!
Load of images failed!!
Deserialize header failed: TIN: 8989898989
Deserialize header failed: Station 1D: 01 Time: 03:26:46 PM
Deserialize header failed: CASHIER ID:; 3001 Date: 21-02-2017
Deserialize header failed: (null)
Deserialize header failed: (null)








Fine tuning:

Command used:

/home/plianto/Documents/Tvat/tesseract-master/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --training_text /home/plianto/Documents/Tvat/img_frm_3/eng.ArialBold.exp0.txt \
  --langdata_dir /home/plianto/Documents/Tvat/TESS_4_ALPHA/langdata-master  --tessdata_dir /usr/share/tesseract-ocr/tessdata \
  --fontlist "Arial Bold" \
  --output_dir /home/plianto/Documents/Tvat/engoutput/

error:

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=/usr/share/tesseract-ocr/tessdata
[Wed Apr 5 13:53:05 IST 2017] /usr/local/bin/tesseract /tmp/tmp.KTk3WgBTWk/eng/eng.Arial_Bold.exp0.tif /tmp/tmp.KTk3WgBTWk/eng/eng.Arial_Bold.exp0 lstm.train
read_params_file: Can't open lstm.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
ERROR: /tmp/tmp.KTk3WgBTWk/eng/eng.Arial_Bold.exp0.lstmf does not exist or is not readable

ShreeDevi Kumar

Apr 5, 2017, 4:29:05 AM
to tesser...@googlegroups.com
4.0 is alpha software. Please use an older, released version.


- excuse the brevity, sent from mobile

ShreeDevi Kumar

Apr 5, 2017, 4:29:56 AM
to tesser...@googlegroups.com
You do not have the lstm.train config file.
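
To see whether the config file is where Tesseract will look for it, a check like the following may help. It is a sketch under assumptions: find_tess_config is a made-up helper, and the two candidate locations (PREFIX/configs and PREFIX/tessdata/configs) reflect my understanding that configs live in a configs folder inside the tessdata directory, with TESSDATA_PREFIX pointing either at tessdata itself or at its parent depending on the version.

```shell
# find_tess_config NAME [PREFIX]
# Prints the path of the config file if it is found; PREFIX defaults to $TESSDATA_PREFIX.
find_tess_config() {
    local name="$1"
    local prefix="${2:-${TESSDATA_PREFIX:-}}"
    local candidate
    # Assumed layouts: PREFIX is the tessdata directory, or its parent
    for candidate in "$prefix/configs/$name" "$prefix/tessdata/configs/$name"; do
        if [ -r "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "config '$name' not found under $prefix" >&2
    return 1
}

# Usage: find_tess_config lstm.train /usr/share/tesseract-ocr/tessdata
```

If nothing is found, copying lstm.train from the tessdata/configs folder of the Tesseract source tree into your tessdata configs folder, or pointing --tessdata_dir at a directory that already has it, are the two fixes I would try.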


- excuse the brevity, sent from mobile

srn...@gmail.com

Apr 5, 2017, 4:32:06 AM
to tesseract-ocr

Overview of Training Process

The overall training process is similar to training 3.04. Conceptually the steps are the same:

  1. Prepare training text.
  2. Render text to image + box file. (Or create hand-made box files for existing image data.)
  3. Make unicharset file.
  4. Optionally make dictionary data.
  5. Run tesseract to process image + box file to make training data set.
  6. Run training on training data set.
  7. Combine data files.

The key differences are:

  • The boxes only need to be at the textline level. It is thus far easier to make training data from existing image data.
  • The .tr files are replaced by .lstmf data files.
  • Fonts can and should be mixed freely instead of being separate.
  • The clustering steps (mftraining, cntraining, shapeclustering) are replaced with a single slow lstmtraining step.
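
As a rough map from the list above to commands, the sketch below only prints the two commands that appear elsewhere in this thread: tesstrain.sh with --linedata_only for steps 1 to 5, and lstmtraining for step 6. The flags are copied from those 4.00 alpha commands and may differ in later releases. The helper name and its four arguments are placeholders, and step 7 (combining data files) is deliberately left out because the thread does not show a command for it.

```shell
# print_training_commands LANG LANGDATA_DIR TESSDATA_DIR OUTPUT_DIR
# Prints the commands without running them. All four arguments are placeholders.
print_training_commands() {
    local lang="$1" langdata="$2" tessdata="$3" out="$4"

    # Steps 1-5: render the training text, build the unicharset, write .lstmf files
    printf '%s\n' \
        "tesstrain.sh --lang $lang --linedata_only \\" \
        "  --langdata_dir $langdata --tessdata_dir $tessdata \\" \
        "  --output_dir $out"

    # Step 6: the single lstmtraining run that replaces the clustering tools.
    # The net_spec is copied from this thread and is tied to that unicharset.
    # The training list is reused for evaluation here, as in one command in the
    # thread; a separate evaluation set is preferable.
    printf '%s\n' \
        "lstmtraining -U $out/$lang.unicharset --script_dir $langdata \\" \
        "  --net_spec '[1,36,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]' \\" \
        "  --model_output $out/base \\" \
        "  --train_listfile $out/$lang.training_files.txt \\" \
        "  --eval_listfile $out/$lang.training_files.txt \\" \
        "  --max_iterations 5000"
}

# Usage: print_training_commands eng ../langdata ./tessdata ~/tesstutorial/engtrain
```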



Hello ShreeDevi,


I request you to guide me by elaborating on the steps marked above, as I am not able to find the relevant instructions for them.


The steps I am following give me the errors shown in my previous reply.


Please guide me.







srn...@gmail.com

Apr 5, 2017, 4:35:19 AM
to tesseract-ocr
You can use *.* to match the files, but be careful that only image files are present, because * matches every file in the directory, not just images.
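
A safer alternative to *.* is to list the image extensions explicitly. The following is a minimal sketch; the function name ocr_images_in and the TESSERACT_BIN variable are made up for illustration.

```shell
# TESSERACT_BIN can be overridden (for example in tests); it defaults to `tesseract`.
TESSERACT_BIN="${TESSERACT_BIN:-tesseract}"

# ocr_images_in DIR : run OCR on every .jpg, .jpeg and .png file in DIR.
ocr_images_in() {
    local dir="${1:-.}"
    local image
    for image in "$dir"/*.jpg "$dir"/*.jpeg "$dir"/*.png; do
        # An unmatched pattern stays as literal text, so skip anything that is not a file
        [ -f "$image" ] || continue
        "$TESSERACT_BIN" "$image" "${image%.*}"
    done
}

# Usage: ocr_images_in /path/to/images
```

Because the output name is the image name without its extension, two images that share a base name (photo.jpg and photo.png) would write to the same photo.txt, and the second would overwrite the first.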

1) I request you to help me with the questions I posted today.
2) Please also guide me on how to generate the LSTM training files for the images I have to use,
and please explain the procedure you followed.

srn...@gmail.com

Apr 5, 2017, 4:37:03 AM
to tesseract-ocr
Please tell me how I can get the lstm.train config file. I need to work with Tesseract 4 only; I do not have any other option.

srn...@gmail.com

Apr 5, 2017, 7:20:08 AM
to tesseract-ocr
Hello ShreeDevi,

I solved the lstm.train error; I had given the wrong path.

mkdir -p ~/tesstutorial/engoutput
training/lstmtraining -U ~/tesstutorial/engtrain/eng.unicharset \
  --script_dir ../langdata --debug_interval 100 \
  --net_spec '[1,36,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]' \
  --model_output ~/tesstutorial/engoutput/base \
  --train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log



1) Can you please tell me how to generate a unicharset file for my image files after generating box files with Tesseract?
2) Please also clarify the --net_spec parameter and what input should be given to it.

Thanks







Saurabh Srivastav

Apr 10, 2017, 5:26:06 AM
to tesseract-ocr
Hello srn,
can you please let me know about your progress?

srn...@gmail.com

Apr 12, 2017, 6:09:01 AM
to tesseract-ocr
I am able to train Tesseract with the fine-tuning technique using training text (not images), and I want to know how to train Tesseract with images and box files. I am confused because when I run this command:

tesseract ara.arial.exp4.tif ara.arial.exp4 nobatch box.train

.tr files are produced (my Tesseract is the 4.0 alpha version).
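
For comparison, the tesstrain.sh log earlier in this thread shows tesseract being run with the lstm.train config instead of box.train, and the training overview says .tr files are replaced by .lstmf files. A hedged sketch of doing the same by hand for one image follows; make_lstmf and TESSERACT_BIN are made-up names, and it assumes the box file sits next to the image and is in the format the LSTM trainer expects.

```shell
# TESSERACT_BIN can be overridden (for example in tests); it defaults to `tesseract`.
TESSERACT_BIN="${TESSERACT_BIN:-tesseract}"

# make_lstmf IMAGE : expects a .box file with the same base name next to IMAGE.
make_lstmf() {
    local image="$1"
    local base="${image%.*}"
    if [ ! -r "$base.box" ]; then
        echo "no box file for $image" >&2
        return 1
    fi
    # Same argument shape as the call in the tesstrain.sh log: image, output base, config
    "$TESSERACT_BIN" "$image" "$base" lstm.train
}

# Usage: make_lstmf ara.arial.exp4.tif
```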

I will post my tutorial or experiences this weekend.

Can you please give an overview of how to train Tesseract with some (blurred) images, and what changes I need to make in the link

Saurabh Srivastav

Apr 25, 2017, 5:07:27 AM
to tesseract-ocr
Edit your box files so they contain the correct data, then make a traineddata file and copy it to /usr/local/share/tessdata

kislay bajpai

Oct 16, 2018, 8:33:53 AM
to tesseract-ocr
Hello Shree,

I am confused about how to train Tesseract 4.0 alpha for a new font (E 13B). Please help me with it.

Shree Devi Kumar

Oct 16, 2018, 9:40:48 AM
to tesser...@googlegroups.com
Please do not use Tesseract 4.0 alpha. There have been many changes since then.

Use the latest code from GitHub, which is 4.0.0-rc3, or install from Alex's PPA or from UB Mannheim (for Windows).

Please read the wiki pages about training Tesseract 4 for a new font; see the fine-tuning example for the Impact font.


kislay bajpai

Oct 22, 2018, 6:59:58 AM
to tesser...@googlegroups.com
Hello,

Sorry to disturb you. I am very new to Tesseract and have no idea how to train it.
Please help me out. I am in big trouble.

Version - Tesseract 4.0 alpha
OS - Ubuntu 16.04 or RHEL 7.3 (I can use either one)



--
Thanks and regards
Kislay Bajpai

Shree Devi Kumar

Oct 22, 2018, 12:42:17 PM
to tesser...@googlegroups.com

saman ukh

Feb 22, 2020, 10:02:27 AM
to tesseract-ocr
Hello all,

I am using Tesseract 4.0, which uses LSTM.
I have searched a lot for how to train new characters, but I have found training difficult.
I am trying to extend the Arabic traineddata by adding a few new characters.
Can anyone help me with this?
What are the steps, and where should I start?