Creating traineddata from box files

Renan Neri Pereira

May 27, 2020, 12:06:08 PM

to tesseract-ocr

Hello Guys,

I'm trying to train Tesseract OCR to recognize some documents. I have some images and box files, but I don't know how to generate traineddata from them. I think the tutorial for training from box files is not very good.

Can anyone help me with that?

Thanks

Piyush Chandra

May 28, 2020, 1:04:03 AM

to tesseract-ocr

Hi,

Hope the information below helps :)

Creating the trained data file own.traineddata:

Create box files: tesseract /path/to/image.tif path/and/nameof/boxfile/image lstmbox

Create unicharset file: unicharset_extractor --norm_mode 1 --output_unicharset ./output/folder/own.unicharset /path/to/image1.box /path/to/image2.box /path/to/imageX.box

Create starter traineddata (aka recoder): combine_lang_model --input_unicharset ./out/own.unicharset --script_dir ./out --words ./out/eng.wordlist.txt --numbers ./out/eng.numbers.txt --puncs ./out/eng.punc.txt --output_dir ./out --lang own

Create training files (for each image): tesseract /path/to/image1.tif /path/to/image1.exp0 --psm 6 lstm.train

Train: lstmtraining --traineddata ./out/own/own.traineddata --model_output ./output/own --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c110]" --train_listfile ./eng_ltsm/eng.training_files.txt --eval_listfile ./eng_ltsm/eng.training_files.txt --max_iterations 100

Create Final traineddata: lstmtraining --stop_training --continue_from ./output/own_checkpoint --traineddata ./out/own/own.traineddata --model_output ./output/own.traineddata
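
The "Create training files (for each image)" step has to be repeated for every image. A minimal sketch of that loop follows, assuming all .tif images sit in one directory next to their corrected .box files and that the paths contain no spaces. The function name print_train_commands and the directory layout are illustrative, not something from this thread. It only prints the commands, so they can be reviewed before being run.

```shell
#!/bin/sh
# Hypothetical helper: print one "tesseract ... lstm.train" command per .tif
# file in the given directory. Nothing is executed here.
print_train_commands() {
    dir=$1
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue          # skip when the glob matched nothing
        base=${img%.tif}                   # output base: image path without extension
        printf 'tesseract %s %s --psm 6 lstm.train\n' "$img" "$base"
    done
}
```

Once the printed commands look right, they could be run with something like: print_train_commands ./images | sh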

Владимир Калачихин

May 28, 2020, 5:42:04 AM

to tesseract-ocr

Hi!

On Thursday, May 28, 2020 at 8:04:03 UTC+3, Piyush Chandra wrote:

Hope the information below helps :)

Please, some questions:

Are "--words...", "--numbers..." and "--puncs" required?

Why is "--net_spec..." needed?

Piyush Chandra

May 28, 2020, 7:46:10 AM

to tesseract-ocr

Are "--words...", "--numbers..." and "--puncs" required? => No, they are optional.

Read about --net_spec here: https://tesseract-ocr.github.io/tessdoc/VGSLSpecs

Владимир Калачихин

May 28, 2020, 7:54:05 AM

to tesseract-ocr

On Thursday, May 28, 2020 at 14:46:10 UTC+3, Piyush Chandra wrote:

Read about --net_spec here: https://tesseract-ocr.github.io/tessdoc/VGSLSpecs

Yes, but why use a custom net configuration for a common task?

And which net configuration is well suited for training on math symbols?

Владимир Калачихин

May 28, 2020, 8:21:47 AM

to tesseract-ocr

Hi!

Another question:

On Thursday, May 28, 2020 at 8:04:03 UTC+3, Piyush Chandra wrote:

Create box files: tesseract /path/to/image.tif path/and/nameof/boxfile/image lstmbox

At this step, does tesseract recognize the image? What if it does so badly?

Can I specify what text is in the image, as was possible with tesseract 3?

Shree Devi Kumar

May 28, 2020, 9:36:14 AM

to tesseract-ocr

>Create box files: tesseract /path/to/image.tif path/and/nameof/boxfile/image lstmbox

Alternatively, you can use the wordstrbox config file.

In both cases, if you are generating box files from images, the box files need to be corrected before proceeding to training.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/7bd32ea2-3af3-44e0-8c54-753ca6dd1f90%40googlegroups.com.


Владимир Калачихин

May 28, 2020, 11:19:07 AM

to tesseract-ocr

On Thursday, May 28, 2020 at 16:36:14 UTC+3, shree wrote:

Alternatively, you can use the wordstrbox config file.

What is the "wordstrbox config file"?

Shree Devi Kumar

May 28, 2020, 11:21:31 AM

to tesseract-ocr

lstmbox creates character-level box files.

wordstrbox creates line-level box files.

If using wordstrbox, please use the groundtruth text for creating the unicharset instead of the box files.


Владимир Калачихин

May 28, 2020, 12:25:06 PM

to tesseract-ocr

I don't quite understand you.

Could you give us an example of using tesseract to create wordstrbox files, and of using combine_lang_model with groundtruth text?


Shree Devi Kumar

May 29, 2020, 8:02:22 AM

to tesseract-ocr

On Thu, May 28, 2020 at 9:55 PM Владимир Калачихин <v.kala...@gmail.com> wrote:

I don't quite understand you.
Could you give us an example of using tesseract to create wordstrbox files, and of using combine_lang_model with groundtruth text?

When starting from images and their groundtruth, the steps would be similar to the following (for English).

Input Files

myfile1.png

myfile1.gt.txt

myfile2.png

myfile2.gt.txt

## Create unicharset from all groundtruth files

unicharset_extractor --output_unicharset myfile.unicharset --norm_mode 1 myfile*.gt.txt

## Create starter traineddata using above unicharset

combine_lang_model --input_unicharset myfile.unicharset --script_dir ../langdata --output_dir ../tesstutorial/mylang --lang mylang

## Create wordstrbox

tesseract myfile1.png myfile1 --psm 6 wordstrbox

tesseract myfile2.png myfile2 --psm 6 wordstrbox

## Manually correct wordstrbox files using the ground truth

## You can use jtessboxeditor to verify the correctness of boxes

## Create lstmf file from png and corrected box files

tesseract myfile1.png myfile1 --psm 6 lstm.train

tesseract myfile2.png myfile2 --psm 6 lstm.train

## Create list of lstmf files to use for training

ls -1 *.lstmf > mylang.training_files.txt
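
Since every image in this recipe needs both a groundtruth file and a corrected box file, a quick pairing check before running lstm.train may save a failed run. The sketch below assumes the naming shown above (myfile1.png, myfile1.gt.txt, myfile1.box) with all files in one directory. The function name report_missing is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list every .png in a directory that lacks a matching
# .gt.txt (groundtruth) or .box (corrected box) file.
report_missing() {
    dir=$1
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue          # skip when the glob matched nothing
        base=${img%.png}
        [ -e "$base.gt.txt" ] || printf 'missing groundtruth: %s.gt.txt\n' "$base"
        [ -e "$base.box" ]    || printf 'missing box file: %s.box\n' "$base"
    done
}
```

Running report_missing . in the directory with the input files prints nothing when every image has both companions.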


Владимир Калачихин

May 31, 2020, 9:02:04 AM

to tesseract-ocr

Hi !

I still don't understand.

On Friday, May 29, 2020 at 15:02:22 UTC+3, shree wrote:

Input Files

myfile1.png
myfile1.gt.txt

Is "myfile1.png" the picture with the training text?

What is "myfile1.gt.txt"?

Shree Devi Kumar

May 31, 2020, 9:11:40 AM

to tesseract-ocr

What I mentioned was for the case where you have images and their groundtruth. gt.txt is the groundtruth: the expected correct output from that image.

If you want to train from training text and fonts, then the method is different.


Владимир Калачихин

May 31, 2020, 9:15:11 AM

to tesseract-ocr

OK, I want to train from training text and fonts.

What method should I use?

Shree Devi Kumar

May 31, 2020, 12:16:55 PM

to tesseract-ocr

Use tesstrain.sh or tesstrain.py


Владимир Калачихин

May 31, 2020, 2:41:48 PM

to tesseract-ocr

On Sunday, May 31, 2020 at 19:16:55 UTC+3, shree wrote:

Use tesstrain.sh or tesstrain.py

I thought you knew that you can't train tesseract for a custom language with these tools.

Shree Devi Kumar

Jun 1, 2020, 4:23:39 AM

to tesseract-ocr

So, modify the info given by Piyush Chandra earlier in this thread. The paths need to be based on where you have the files.

### create tif and box using fonts and training text

text2image --fonts_dir=/home/ubuntu/.fonts --outputbase=/mylang.myfont.exp0 --max_pages=0 --font=myfont --text=../langdata/mylang/mylang.training_text

### create unicharset from training_text

unicharset_extractor --norm_mode 1 --output_unicharset ./output/folder/own.unicharset ../langdata/mylang/mylang.training_text

### Create starter traineddata (aka recoder):

combine_lang_model --input_unicharset ./out/own.unicharset --script_dir ./langdata --output_dir ./out --lang mylang

### Create training files (for each image):

tesseract /mylang.myfont.exp0.tif /mylang.myfont.exp0 --psm 6 lstm.train

### Create list of lstmf files

ls -1 /mylang.*.lstmf > mylang.training_files.txt

### Train:

lstmtraining --traineddata ./out/mylang/mylang.traineddata --model_output ./output/mylang --train_listfile mylang.training_files.txt --max_iterations 100

### Create Final traineddata:

lstmtraining --stop_training --continue_from ./output/mylang_checkpoint --traineddata ./out/mylang/mylang.traineddata --model_output ./output/mylang.traineddata
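
The training step reads its .lstmf paths from the file passed to --train_listfile, so a stale or mistyped entry in mylang.training_files.txt is easy to miss. A small sanity check, offered as a sketch only: the function name check_listfile is hypothetical, and it assumes the list file holds one path per line.

```shell
#!/bin/sh
# Hypothetical helper: print every entry of a training list file that does not
# point to an existing, non-empty file. Prints nothing when all entries are fine.
check_listfile() {
    while IFS= read -r path; do
        [ -n "$path" ] || continue         # ignore blank lines
        [ -s "$path" ] || printf 'not found or empty: %s\n' "$path"
    done < "$1"
}
```

For the recipe above this would be called as: check_listfile mylang.training_files.txt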


Владимир Калачихин

Jun 1, 2020, 10:46:48 AM

to tesseract-ocr

Hi!
On Monday, June 1, 2020 at 11:23:39 UTC+3, shree wrote:

### create tif and box using fonts and training text
text2image --fonts_dir=/home/ubuntu/.fonts --outputbase=/mylang.myfont.exp0 --max_pages=0 --font=myfont --text=../langdata/mylang/mylang.training_text

I do this for each font. For some fonts it runs OK, but for others it fails with the message "'--text' option is missing!". What does this mean?

### create unicharset from training_text
unicharset_extractor --norm_mode 1 --output_unicharset ./output/folder/own.unicharset ../langdata/mylang/mylang.training_text

This prints "Bad box coordinates in boxfile string!", but the unicharset file is still created.

### Create starter traineddata (aka recoder):
combine_lang_model --input_unicharset ./out/own.unicharset --script_dir ./langdata --output_dir ./out --lang mylang

This failed with "Failed to load script unicharset from:./langdata/Latin.unicharset".

Of course I don't have Latin.unicharset - I want my own unicharset!

Shree Devi Kumar

Jun 1, 2020, 12:36:07 PM

to tesseract-ocr

>Failed to load script unicharset from:./langdata/Latin.unicharset

This is for the Latin script, not the Latin language.

wget the file from https://github.com/tesseract-ocr/langdata_lstm/blob/master/Latin.unicharset
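
A note on the wget step, offered as a sketch: a GitHub link containing /blob/ normally serves an HTML page rather than the file itself, so wget on the link above may save HTML instead of the unicharset. The helper below (hypothetical name blob_to_raw) rewrites such a link to the raw.githubusercontent.com form. The branch name is taken from the link as given and is assumed to still exist.

```shell
#!/bin/sh
# Hypothetical helper: turn https://github.com/OWNER/REPO/blob/BRANCH/PATH
# into https://raw.githubusercontent.com/OWNER/REPO/BRANCH/PATH.
blob_to_raw() {
    printf '%s\n' "$1" |
        sed -E 's#^https://github\.com/([^/]+)/([^/]+)/blob/#https://raw.githubusercontent.com/\1/\2/#'
}

# Usage (needs network access; the target matches the --script_dir used earlier):
# wget -O ./langdata/Latin.unicharset \
#     "$(blob_to_raw https://github.com/tesseract-ocr/langdata_lstm/blob/master/Latin.unicharset)"
```

Opening the downloaded file in a text editor is a quick way to confirm it is a unicharset and not an HTML page.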


Shree Devi Kumar

Jun 1, 2020, 12:37:25 PM

to tesseract-ocr

You may find this repo useful

https://github.com/UYousafzai/easy_train_tesseract

Владимир Калачихин

Jun 2, 2020, 7:42:42 AM

to tesseract-ocr

On Monday, June 1, 2020 at 19:37:25 UTC+3, shree wrote:

You may find this repo useful

https://github.com/UYousafzai/easy_train_tesseract

You don't understand. I don't want to train new fonts for an existing language. I want a new language.

Владимир Калачихин

Jun 2, 2020, 8:46:55 AM

to tesseract-ocr

On Monday, June 1, 2020 at 19:36:07 UTC+3, shree wrote:

This is for Latin script not Latin language.
wget the file from https://github.com/tesseract-ocr/langdata_lstm/blob/master/Latin.unicharset

OK, I did that, and some of the next steps.

At the step

### Train:
lstmtraining .....

I got:

"

Must specify an input layer as the first layer, not !!
Failed to create network from spec:

"

Obviously, something is missing. What?

Piyush Chandra

Jun 4, 2020, 12:13:58 PM

to tesseract-ocr

This is what is missing: --net_spec. Check the line below, which I mentioned before.

lstmtraining --traineddata ./out/own/own.traineddata --model_output ./output/own --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c110]" --train_listfile ./eng_ltsm/eng.training_files.txt --eval_listfile ./eng_ltsm/eng.training_files.txt --max_iterations 100
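
For reference, a hedged reading of the --net_spec string in the command above, based on the VGSL documentation linked earlier in this thread rather than on anything stated here. Because the spec contains spaces, it has to reach lstmtraining as a single quoted argument. count_args is a hypothetical helper used only to show that.

```shell
#!/bin/sh
# The spec from the command above, annotated per the VGSL documentation:
#   [1,36,0,1  input: batch 1, height 36, variable width (0), depth 1 (greyscale)
#   Ct3,3,16   3x3 convolution, 16 outputs, tanh
#   Mp3,3      3x3 max-pooling
#   Lfys48     LSTM, forward over y, summarising, 48 outputs
#   Lfx96      LSTM, forward over x, 96 outputs
#   Lrx96      LSTM, reverse over x, 96 outputs
#   Lfx256     LSTM, forward over x, 256 outputs
#   O1c110]    1-d output trained with CTC, 110 classes
NET_SPEC='[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c110]'

# Hypothetical helper: report how many arguments it received.
count_args() { echo "$#"; }

# Quoted, the whole spec counts as one argument after the flag name.
count_args --net_spec "$NET_SPEC"
```

Keeping the spec in a single-quoted variable avoids retyping it and keeps it one argument when passed as --net_spec "$NET_SPEC".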

Владимир Калачихин

Jun 22, 2020, 6:42:44 AM

to tesseract-ocr

I have returned to this task.

On Thursday, June 4, 2020 at 19:13:58 UTC+3, Piyush Chandra wrote:

This is what is missing : --net_spec . Check the line below that I mentioned before.

lstmtraining --traineddata ./out/own/own.traineddata --model_output ./output/own --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c110]" --train_listfile ./eng_ltsm/eng.training_files.txt --eval_listfile ./eng_ltsm/eng.training_files.txt --max_iterations 100

OK, I added --net_spec and ran this step. It finished and looks right.

After this, I ran the last step from Shree's recipe:

###Create Final traineddata:

lstmtraining --stop_training --continue_from ./output/ mylang _checkpoint --traineddata ./out/mylang /mylang.traineddata --model_output ./output/mylang.traineddata

It gave the message

"Must provide a --traineddata see training wiki"

and nothing happened.

Of course, --traineddata ./out/mylang /mylang.traineddata is present, and the file was used in the previous steps.

What's wrong with traineddata?
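
One thing that may be worth checking, though nothing in this thread confirms it as the cause: the command as quoted in this message has spaces inside "./output/ mylang _checkpoint" and "./out/mylang /mylang.traineddata". A shell splits arguments on unquoted spaces, so each of those paths reaches lstmtraining as several separate words instead of one path. The sketch below uses a hypothetical helper, show_words, to make the splitting visible.

```shell
#!/bin/sh
# Hypothetical helper: print each argument on its own line, in brackets.
show_words() { printf '[%s]\n' "$@"; }

# As quoted above: the path is split into two words.
show_words --traineddata ./out/mylang /mylang.traineddata

# With the space removed: the path arrives as a single word.
show_words --traineddata ./out/mylang/mylang.traineddata
```

If the extra words are the problem, removing the spaces so that each path is a single word would be the first thing to try.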
