Creating traineddata from box files


Renan Neri Pereira

May 27, 2020, 12:06:08 PM
to tesseract-ocr
Hello Guys,

I want to train Tesseract OCR to recognize some documents. I have some images and box files, but I don't know how to generate traineddata from them. I find the tutorial on training from box files hard to follow.

Can anyone help me with that?

Thanks

Piyush Chandra

May 28, 2020, 1:04:03 AM
to tesseract-ocr
Hi,

Hope the information below helps :)

Creating the trained data file own.traineddata:

Create box files: tesseract /path/to/image.tif path/and/nameof/boxfile/image lstmbox

Create unicharset file: unicharset_extractor --norm_mode 1 --output_unicharset ./output/folder/own.unicharset /path/to/image1.box /path/to/image2.box /path/to/imageX.box

Create starter traineddata (aka recoder): combine_lang_model --input_unicharset ./out/own.unicharset --script_dir ./out --words ./out/eng.wordlist.txt --numbers ./out/eng.numbers.txt --puncs ./out/eng.punc.txt --output_dir ./out --lang own

Create training files (for each image): tesseract /path/to/image1.tif /path/to/image1.exp0 --psm 6 lstm.train

Train: lstmtraining --traineddata ./out/own/own.traineddata --model_output ./output/own --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c110]" --train_listfile ./eng_ltsm/eng.training_files.txt --eval_listfile ./eng_ltsm/eng.training_files.txt --max_iterations 100

Create Final traineddata: lstmtraining --stop_training --continue_from ./output/own_checkpoint --traineddata ./out/own/own.traineddata --model_output ./output/own.traineddata
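
The steps above can be chained in one script. The following is a minimal dry-run sketch, not a tested recipe: the image paths, the `out` directory and the `plan` array are hypothetical names chosen for illustration, the option values are copied from the post, and the commands are only collected and printed, so nothing here requires Tesseract to be installed. It assumes the corrected box files sit next to the images.

```shell
#!/usr/bin/env bash
# Dry run: collects and prints the commands from the steps above; nothing is executed.
set -euo pipefail

images=(./data/image1.tif ./data/image2.tif)   # hypothetical input images
out=./out                                      # hypothetical working directory
name=own                                       # name of the new model
net_spec='[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c110]'

# Corrected box files are expected next to the images (image1.box, ...).
boxes=("${images[@]/%.tif/.box}")
lstmf=("${images[@]/%.tif/.lstmf}")
list="$out/$name.training_files.txt"

plan=()
plan+=("unicharset_extractor --norm_mode 1 --output_unicharset $out/$name.unicharset ${boxes[*]}")
plan+=("combine_lang_model --input_unicharset $out/$name.unicharset --script_dir $out --output_dir $out --lang $name")
for img in "${images[@]}"; do
  # One .lstmf file per image; the output base is the image path without .tif
  plan+=("tesseract $img ${img%.tif} --psm 6 lstm.train")
done
plan+=("printf '%s\\n' ${lstmf[*]} > $list")
plan+=("lstmtraining --traineddata $out/$name/$name.traineddata --model_output ./output/$name --net_spec \"$net_spec\" --train_listfile $list --eval_listfile $list --max_iterations 100")
plan+=("lstmtraining --stop_training --continue_from ./output/${name}_checkpoint --traineddata $out/$name/$name.traineddata --model_output ./output/$name.traineddata")

printf '%s\n' "${plan[@]}"
```

Printing the plan first makes it easy to check every path before anything is run; each printed line can then be pasted into a terminal one at a time.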

Владимир Калачихин

May 28, 2020, 5:42:04 AM
to tesseract-ocr
Hi!

On Thursday, May 28, 2020 at 8:04:03 UTC+3, Piyush Chandra wrote:
Hope below information helps: :)


Some questions, please:

Are "--words...", "--numbers..." and "--puncs..." required?
Why is "--net_spec..." needed?


Piyush Chandra

May 28, 2020, 7:46:10 AM
to tesseract-ocr
Are "--words...", "--numbers..." and "--puncs" required? => No, they are optional.
Read about --net_spec here: https://tesseract-ocr.github.io/tessdoc/VGSLSpecs
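
For readers who do not want to open the spec page right away, the `--net_spec` string used earlier in this thread can be split into its layers. The descriptions in the comments are one reading of the VGSL documentation linked above and should be checked against it; the variable names are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

net_spec='[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c110]'

# Strip the surrounding brackets, then split on spaces: one layer per element.
inner="${net_spec#\[}"
inner="${inner%\]}"
read -r -a layers <<< "$inner"

# Reading of each element (to be verified against the VGSL spec):
#   1,36,0,1  input: batch 1, height 36, variable width (0), depth 1 (greyscale)
#   Ct3,3,16  3x3 convolution with 16 outputs, tanh non-linearity
#   Mp3,3     3x3 max-pooling
#   Lfys48    forward LSTM along y with 48 outputs, summarising the y dimension
#   Lfx96     forward LSTM along x with 96 outputs
#   Lrx96     reverse LSTM along x with 96 outputs
#   Lfx256    forward LSTM along x with 256 outputs
#   O1c110    1-d output layer trained with CTC, nominally 110 classes
for layer in "${layers[@]}"; do
  printf '%s\n' "$layer"
done
```

The class count at the end is the part most likely to differ between models, since it relates to the size of the unicharset.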

Владимир Калачихин

May 28, 2020, 7:54:05 AM
to tesseract-ocr

On Thursday, May 28, 2020 at 14:46:10 UTC+3, Piyush Chandra wrote:
Yes, but why use a custom network configuration for a common task?

And which network configuration is well suited to training on math symbols?


Владимир Калачихин

May 28, 2020, 8:21:47 AM
to tesseract-ocr
Hi!
Another question:
On Thursday, May 28, 2020 at 8:04:03 UTC+3, Piyush Chandra wrote:

Create box files: tesseract /path/to/image.tif path/and/nameof/boxfile/image lstmbox

Does Tesseract recognize the image at this step? What if it does that badly?
Can I specify what text is in the image, as was possible with Tesseract 3?

Shree Devi Kumar

May 28, 2020, 9:36:14 AM
to tesseract-ocr
>Create box files: tesseract /path/to/image.tif path/and/nameof/boxfile/image lstmbox

Alternately you can use wordstrbox config file.

In both cases, if you are generating box files from images, the box files need to be corrected before proceeding with training.





--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/7bd32ea2-3af3-44e0-8c54-753ca6dd1f90%40googlegroups.com.


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Владимир Калачихин

May 28, 2020, 11:19:07 AM
to tesseract-ocr

On Thursday, May 28, 2020 at 16:36:14 UTC+3, shree wrote:

Alternately you can use wordstrbox config file.

What is the "wordstrbox config file"?

Shree Devi Kumar

May 28, 2020, 11:21:31 AM
to tesseract-ocr
lstmbox creates character level box files.

Wordstrbox creates line level box files.

If using wordstrbox, please use the groundtruth text for creating unicharset instead of the box files.
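
To make the difference concrete, here is a sketch of what the two kinds of box file look like and how the line text can be pulled out of a WordStr file. The sample coordinates and text are invented, and the layout reflects one understanding of the Tesseract training documentation (character boxes as `symbol left bottom right top page`, line boxes as `WordStr left bottom right top page #text`, followed by a tab entry marking the end of the line), so treat it as an assumption to verify against the files your own Tesseract version produces.

```shell
#!/usr/bin/env bash
set -euo pipefail
work="$(mktemp -d)"   # scratch directory so the example is self-contained

# Character-level (lstmbox-style) sample: one entry per symbol.
cat > "$work/char.box" <<'EOF'
T 36 92 53 116 0
h 53 92 66 116 0
e 66 92 78 108 0
EOF

# Line-level (wordstrbox-style) sample: one WordStr entry per text line,
# followed by a tab entry that marks the end of the line.
printf 'WordStr 36 92 582 116 0 #The quick brown fox\n' > "$work/line.box"
printf '\t 583 92 587 116 0\n' >> "$work/line.box"

# Extract the text after '#' from every WordStr entry.
line_text="$(sed -n 's/^WordStr [0-9 ]*#//p' "$work/line.box")"
char_count="$(wc -l < "$work/char.box")"

echo "line text: $line_text"
echo "character boxes: $char_count"
```

With line-level files, correcting the box file mostly means replacing the text after `#` with the ground truth for that line, which is why the unicharset is better built from the ground-truth text.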


Владимир Калачихин

May 28, 2020, 12:25:06 PM
to tesseract-ocr

I don't quite understand you.
Could you give an example of using tesseract to create a wordstrbox file, and of using combine_lang_model with ground-truth text?





Shree Devi Kumar

May 29, 2020, 8:02:22 AM
to tesseract-ocr
On Thu, May 28, 2020 at 9:55 PM Владимир Калачихин <v.kala...@gmail.com> wrote:

I don't quite understand you.
Could you give an example of using tesseract to create a wordstrbox file, and of using combine_lang_model with ground-truth text?

Starting from images and their ground truth, the steps for English would be similar to the following.

Input Files

myfile1.png
myfile1.gt.txt
myfile2.png
myfile2.gt.txt

## Create unicharset from all groundtruth files
unicharset_extractor --output_unicharset myfile.unicharset --norm_mode 1 myfile*.gt.txt

## Create starter traineddata using above unicharset
combine_lang_model --input_unicharset myfile.unicharset --script_dir ../langdata  --output_dir ../tesstutorial/mylang --lang mylang 

## Create wordstrbox
tesseract  myfile1.png  myfile1 --psm 6 wordstrbox
tesseract  myfile2.png  myfile2 --psm 6 wordstrbox

## Manually correct wordstrbox files using the ground truth
## You can use jtessboxeditor to verify the correctness of boxes

## Create lstmf file from png and corrected box files
tesseract  myfile1.png  myfile1 --psm 6 lstm.train
tesseract  myfile2.png  myfile2 --psm 6 lstm.train

## Create list of lstmf files to use for training
ls -1 *.lstmf > mylang.training_files.txt
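
Before running the steps above over many files, it can help to confirm that every image has a matching ground-truth file. A minimal sketch, assuming the `NAME.png` / `NAME.gt.txt` naming from the example; the temporary directory and the empty dummy files are only there so the snippet runs on its own, and the `missing` array is a hypothetical name.

```shell
#!/usr/bin/env bash
set -euo pipefail
shopt -s nullglob   # an empty directory yields an empty loop, not a literal '*.png'

data="$(mktemp -d)"  # stand-in for the directory holding the training images
touch "$data/myfile1.png" "$data/myfile1.gt.txt" "$data/myfile2.png"

missing=()
for img in "$data"/*.png; do
  gt="${img%.png}.gt.txt"
  # Record the expected ground-truth file name when it does not exist
  [ -f "$gt" ] || missing+=("$(basename "$gt")")
done

if [ "${#missing[@]}" -gt 0 ]; then
  printf 'missing ground truth: %s\n' "${missing[@]}"
fi
```

In a real run, `data` would point at the directory with the images, and the three `touch`ed files would be dropped.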

 





Владимир Калачихин

May 31, 2020, 9:02:04 AM
to tesseract-ocr
Hi !

I still don't understand.

On Friday, May 29, 2020 at 15:02:22 UTC+3, shree wrote:
Input Files

myfile1.png
myfile1.gt.txt


Is "myfile1.png" the picture with the training text?
What is "myfile1.gt.txt"?


Shree Devi Kumar

May 31, 2020, 9:11:40 AM
to tesseract-ocr
What I described was for the case where you have images and their ground truth. gt.txt is the ground truth: the expected correct output for that image.

If you want to train from training text and fonts, then the method is different.


Владимир Калачихин

May 31, 2020, 9:15:11 AM
to tesseract-ocr
OK, I want to train from training text and fonts.
What method should I use?

Shree Devi Kumar

May 31, 2020, 12:16:55 PM
to tesseract-ocr
Use tesstrain.sh or tesstrain.py 


Владимир Калачихин

May 31, 2020, 2:41:48 PM
to tesseract-ocr
On Sunday, May 31, 2020 at 19:16:55 UTC+3, shree wrote:
Use tesstrain.sh or tesstrain.py

On Sun, May 31, 2020 at 6:45 PM Владимир Калачихин <v.kala...@gmail.com> wrote:
OK, I want to train from training text and fonts.
What method should I use?


I thought you knew that you can't train Tesseract for a custom language with these tools.


 

Shree Devi Kumar

Jun 1, 2020, 4:23:39 AM
to tesseract-ocr
So, modify the info given by Piyush Chandra earlier in this thread. The paths need to be based on where you have the files.

### create tif and box using fonts and training text
text2image  --fonts_dir=/home/ubuntu/.fonts --outputbase=/mylang.myfont.exp0 --max_pages=0  --font=myfont  --text=../langdata/mylang/mylang.training_text

### create unicharset from training_text
unicharset_extractor --norm_mode 1 --output_unicharset ./output/folder/own.unicharset  ../langdata/mylang/mylang.training_text

### Create starter traineddata (aka recoder):
combine_lang_model --input_unicharset ./out/own.unicharset --script_dir ./langdata --output_dir ./out --lang mylang

### Create training files (for each image): 
tesseract /mylang.myfont.exp0.tif  /mylang.myfont.exp0  --psm 6 lstm.train

### Create list of lstmf files
ls -1 /mylang.*.lstmf > mylang.training_files.txt

### Train: 
lstmtraining --traineddata ./out/mylang/mylang.traineddata --model_output ./output/mylang --train_listfile mylang.training_files.txt --max_iterations 100

### Create final traineddata:

lstmtraining --stop_training --continue_from ./output/mylang_checkpoint --traineddata ./out/mylang/mylang.traineddata --model_output ./output/mylang.traineddata



Владимир Калачихин

Jun 1, 2020, 10:46:48 AM
to tesseract-ocr
Hi!
On Monday, June 1, 2020 at 11:23:39 UTC+3, shree wrote:

### create tif and box using fonts and training text
text2image  --fonts_dir=/home/ubuntu/.fonts --outputbase=/mylang.myfont.exp0 --max_pages=0  --font=myfont  --text=../langdata/mylang/mylang.training_text

I do this for each font. For some fonts it runs fine, but for others it fails with the message "'--text' option is missing!". What does this mean?


 
### create unicharset from training_text
unicharset_extractor --norm_mode 1 --output_unicharset ./output/folder/own.unicharset  ../langdata/mylang/mylang.training_text

This prints "Bad box coordinates in boxfile string!", but the unicharset file is still created.



### Create starter traineddata (aka recoder):
combine_lang_model --input_unicharset ./out/own.unicharset --script_dir ./langdata --output_dir ./out --lang mylang


This failed with "Failed to load script unicharset from:./langdata/Latin.unicharset".

Of course I don't have Latin.unicharset: I want my own unicharset!
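
One possible cause of the "'--text' option is missing!" message, offered as an assumption since the thread does not confirm it: a font name containing spaces that is passed unquoted, so that the remainder of the command line, including `--text`, is no longer read as options. The sketch below builds the text2image call for each font with the name kept as a single argument; the font names are hypothetical, the other option values are copied from the recipe, and the commands are only printed, not executed.

```shell
#!/usr/bin/env bash
set -euo pipefail

fonts=("Arial" "Times New Roman")              # hypothetical font names
text=../langdata/mylang/mylang.training_text   # path taken from the recipe

cmds=()
for font in "${fonts[@]}"; do
  base="mylang.${font// /_}.exp0"              # spaces replaced for the file name
  cmd=(text2image --fonts_dir=/home/ubuntu/.fonts "--outputbase=$base"
       --max_pages=0 "--font=$font" "--text=$text")
  # %q prints each argument the way it would have to be typed in a shell
  line="$(printf '%q ' "${cmd[@]}")"
  cmds+=("$line")
  echo "$line"
  # "${cmd[@]}"   # uncomment to actually run text2image
done
```

Keeping the command in an array means the font name stays one argument no matter how many spaces it contains.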



Shree Devi Kumar

Jun 1, 2020, 12:36:07 PM
to tesseract-ocr
>Failed to load script unicharset from:./langdata/Latin.unicharset

This file is for the Latin script, not the Latin language.
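
The file named in the error is looked up in the directory passed as `--script_dir`. A small check along these lines can be run before combine_lang_model. It assumes, based on a reading of the training documentation rather than on anything stated in this thread, that the directory should contain `radical-stroke.txt` as well as the script-level `Latin.unicharset`; the temporary directory stands in for a real langdata checkout, and the `required` and `absent` names are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

script_dir="$(mktemp -d)"               # stand-in for ./langdata
touch "$script_dir/radical-stroke.txt"  # dummy file so the example has one hit

required=(radical-stroke.txt Latin.unicharset)   # assumed list, adjust per script
absent=()
for f in "${required[@]}"; do
  [ -f "$script_dir/$f" ] || absent+=("$f")
done

if [ "${#absent[@]}" -gt 0 ]; then
  printf 'not found in %s: %s\n' "$script_dir" "${absent[*]}"
fi
```

For text in another script, the corresponding script-level unicharset would replace `Latin.unicharset` in the list.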


Shree Devi Kumar

Jun 1, 2020, 12:37:25 PM
to tesseract-ocr

Владимир Калачихин

Jun 2, 2020, 7:42:42 AM
to tesseract-ocr
On Monday, June 1, 2020 at 19:37:25 UTC+3, shree wrote:
You don't understand. I don't want to train new fonts for an existing language. I want a new language.

Владимир Калачихин

Jun 2, 2020, 8:46:55 AM
to tesseract-ocr
On Monday, June 1, 2020 at 19:36:07 UTC+3, shree wrote:

This is for Latin script not Latin language.


OK, I did that, and some of the following steps.
At the step

### Train:
lstmtraining .....

I got:
"
Must specify an input layer as the first layer, not !!
Failed to create network from spec:
"

Obviously, something is missing. What?


Piyush Chandra

Jun 4, 2020, 12:13:58 PM
to tesseract-ocr
This is what is missing: --net_spec. See the line below, which I mentioned before.

lstmtraining --traineddata ./out/own/own.traineddata --model_output ./output/own --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c110]" --train_listfile ./eng_ltsm/eng.training_files.txt --eval_listfile ./eng_ltsm/eng.training_files.txt --max_iterations 100


Владимир Калачихин

Jun 22, 2020, 6:42:44 AM
to tesseract-ocr
I have returned to this task.

On Thursday, June 4, 2020 at 19:13:58 UTC+3, Piyush Chandra wrote:
This is what is missing: --net_spec. See the line below, which I mentioned before.

lstmtraining --traineddata ./out/own/own.traineddata --model_output ./output/own --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c110]" --train_listfile ./eng_ltsm/eng.training_files.txt --eval_listfile ./eng_ltsm/eng.training_files.txt --max_iterations 100


OK, I added --net_spec and ran this step. It finished and looks right.

After this, I ran the last step from Shree's recipe:

###Create Final traineddata:

lstmtraining --stop_training --continue_from ./output/ mylang  _checkpoint --traineddata ./out/mylang /mylang.traineddata --model_output ./output/mylang.traineddata

 
It fails with the message
"Must provide a --traineddata see training wiki"
and nothing happens.
Of course, --traineddata ./out/mylang /mylang.traineddata is present and was used in the previous steps.

What's wrong with traineddata?
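
A possible explanation, stated as an assumption because the thread ends without an answer: the command as quoted in this message has spaces inside its paths (`./output/ mylang  _checkpoint`, `./out/mylang /mylang.traineddata`). The shell splits those into separate arguments, so the value that reaches `--traineddata` may not be the intended path, or the option may not be read at all. Building the paths from variables avoids this; the variable names are hypothetical and the sketch only prints the resulting command.

```shell
#!/usr/bin/env bash
set -euo pipefail

lang_name=mylang
out_dir=./out        # where combine_lang_model wrote the starter traineddata
model_dir=./output   # where lstmtraining wrote its checkpoints

# Each path is a single array element, so no stray spaces can split it.
cmd=(lstmtraining --stop_training
     --continue_from "$model_dir/${lang_name}_checkpoint"
     --traineddata "$out_dir/$lang_name/$lang_name.traineddata"
     --model_output "$model_dir/$lang_name.traineddata")

printf '%s ' "${cmd[@]}"; printf '\n'
# "${cmd[@]}"   # uncomment to run once the paths match your setup
```

If the printed line shows each path as one unbroken word, the same arguments can be run directly.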