building on cygwin with training data


Marco Atzeri

Jul 28, 2015, 6:47:36 PM
to tesser...@googlegroups.com
Hi,
I just completed the build of tesseract-ocr-3.04.00
including the training portion.

Attached is the patch I used, together with

configure LIBS="$(pkg-config --libs icu-i18n)"

to correctly include the icu dependency.
From what I see, the additional steps

make training
make training-install

are only installing these additional files

/usr/bin/ambiguous_words.exe
/usr/bin/classifier_tester.exe
/usr/bin/cntraining.exe
/usr/bin/combine_tessdata.exe
/usr/bin/dawg2wordlist.exe
/usr/bin/mftraining.exe
/usr/bin/set_unicharset_properties.exe
/usr/bin/shapeclustering.exe
/usr/bin/text2image.exe
/usr/bin/unicharset_extractor.exe
/usr/bin/wordlist2dawg.exe

full list attached.
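Putting the pieces together, the full build flow looks roughly like this. Only the configure LIBS setting and the two training targets come from this message; the rest is the standard autotools sequence:

```shell
# Build tesseract 3.04.00 with the training tools on Cygwin.
# The LIBS override pulls in the ICU dependency via pkg-config.
./configure LIBS="$(pkg-config --libs icu-i18n)"
make && make install          # core library and tesseract binary
make training                 # build the training tools
make training-install         # install the files listed above
```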

Questions:
- anything missing?
- which portion of
https://github.com/tesseract-ocr/langdata
would you like to see in a training data package?

The current split is available at:
https://cygwin.com/packages/x86_64/tesseract-ocr/tesseract-ocr-3.04.00-1
https://cygwin.com/packages/x86_64/tesseract-ocr-devel/tesseract-ocr-devel-3.04.00-1
https://cygwin.com/packages/x86_64/libtesseract-ocr_3/libtesseract-ocr_3-3.04.00-1

Only the English language is installed by default, and it also contains the
osd data:
https://cygwin.com/packages/x86_64/tesseract-ocr-eng/tesseract-ocr-eng-3.04-1

Others :
tesseract-ocr-deu/
tesseract-ocr-fra/
tesseract-ocr-ita/
tesseract-ocr-nld/
tesseract-ocr-por/
tesseract-ocr-spa/
tesseract-ocr-vie/


Regards
Marco

tesseract-ocr-3.04.00-2.src.patch
list-training.txt

ShreeDevi Kumar

Jul 29, 2015, 5:41:39 AM
to tesser...@googlegroups.com
Marco,

Thanks for building the training tools for cygwin. Until now, only the additional binaries have been shipped as part of the tesseract package.

With Tesseract 3.04.00, additional scripts are provided to help with training. Google has also provided the language data that can be used for training different languages and building the traineddata files. Hence my request to include these.

Not all users will be interested in training for a new language or trying to improve an existing traineddata, so in my opinion it may be better to package these separately.

Given the above, the following are my suggestions (from a user's perspective). I hope they will provide the impetus for developers and other packagers to give their feedback too.

1. Package the training tools separately.

2. Modify the way tessdata is packaged (both as part of training tools as well as tesseract-ocr-core). 

Instead of packaging under ./usr/share/tessdata, I suggest adding another directory level above tessdata and providing it as ./usr/share/tesseract/tessdata. This would allow all tesseract-related files to be kept under the tesseract directory.

3. Include the training tools exe files as well as the following training bash scripts in the ./usr/bin directory.
./usr/bin/tesstrain.sh
./usr/bin/tesstrain_utils.sh
./usr/bin/language-specific.sh

Alternatively, the training scripts could be kept under ./usr/share/tesseract/training/

4. Provide the TIFF files from the testing directory for easy testing of the install and for example usage. It may be useful in the future to add samples for non-Latin scripts too.
./usr/share/tesseract/testing/phototest.tif
./usr/share/tesseract/testing/eurotext.tif

5. Regarding langdata, the readme says
"To re-create the training of a single language, lang, you need the following:
  • All the data in the lang directory.
  • The corresponding unicharset/xheights files for the script(s) used by lang.
  • All the remaining non-lang-specific files in the top-level directory, such as font_properties."
5.1 So, I would suggest that the training tools by default include the langdata for English (similar to the packaging for tesseract-ocr itself).
5.2 Include ALL the files in the top-level directory, including the unicharset/xheights files for ALL the scripts.
5.3 Package or link to the language data for different languages, which is available in separate subfolders.

The file list would then look similar to the following:

./usr/share/tesseract/tessdata/configs/ambigs.train
./usr/share/tesseract/tessdata/configs/api_config
./usr/share/tesseract/tessdata/configs/bigram
./usr/share/tesseract/tessdata/configs/box.train
./usr/share/tesseract/tessdata/configs/box.train.stderr
./usr/share/tesseract/tessdata/configs/digits
./usr/share/tesseract/tessdata/configs/hocr
./usr/share/tesseract/tessdata/configs/inter
./usr/share/tesseract/tessdata/configs/kannada
./usr/share/tesseract/tessdata/configs/linebox
./usr/share/tesseract/tessdata/configs/logfile
./usr/share/tesseract/tessdata/configs/makebox
./usr/share/tesseract/tessdata/configs/pdf
./usr/share/tesseract/tessdata/configs/quiet
./usr/share/tesseract/tessdata/configs/rebox
./usr/share/tesseract/tessdata/configs/strokewidth
./usr/share/tesseract/tessdata/configs/unlv
./usr/share/tesseract/tessdata/pdf.ttf
./usr/share/tesseract/tessdata/tessconfigs/batch
./usr/share/tesseract/tessdata/tessconfigs/batch.nochop
./usr/share/tesseract/tessdata/tessconfigs/matdemo
./usr/share/tesseract/tessdata/tessconfigs/msdemo
./usr/share/tesseract/tessdata/tessconfigs/nobatch
./usr/share/tesseract/tessdata/tessconfigs/segdemo

./usr/share/tesseract/testing/phototest.tif
./usr/share/tesseract/testing/eurotext.tif

./usr/share/tesseract/training/tesstrain.sh
./usr/share/tesseract/training/tesstrain_utils.sh
./usr/share/tesseract/training/language-specific.sh

./usr/share/tesseract/training/langdata/Arabic.unicharset
./usr/share/tesseract/training/langdata/Arabic.xheights
./usr/share/tesseract/training/langdata/Armenian.unicharset
./usr/share/tesseract/training/langdata/Armenian.xheights
./usr/share/tesseract/training/langdata/Bengali.unicharset
./usr/share/tesseract/training/langdata/Bengali.xheights
./usr/share/tesseract/training/langdata/Bopomofo.unicharset
./usr/share/tesseract/training/langdata/Bopomofo.xheights
./usr/share/tesseract/training/langdata/Canadian_Aboriginal.unicharset
./usr/share/tesseract/training/langdata/Canadian_Aboriginal.xheights
./usr/share/tesseract/training/langdata/Cherokee.unicharset
./usr/share/tesseract/training/langdata/Cherokee.xheights
./usr/share/tesseract/training/langdata/Common.unicharset
./usr/share/tesseract/training/langdata/Cyrillic.unicharset
./usr/share/tesseract/training/langdata/Cyrillic.xheights
./usr/share/tesseract/training/langdata/Devanagari.unicharset
./usr/share/tesseract/training/langdata/Devanagari.xheights
./usr/share/tesseract/training/langdata/Ethiopic.unicharset
./usr/share/tesseract/training/langdata/Ethiopic.xheights
./usr/share/tesseract/training/langdata/Georgian.unicharset
./usr/share/tesseract/training/langdata/Georgian.xheights
./usr/share/tesseract/training/langdata/Greek.unicharset
./usr/share/tesseract/training/langdata/Greek.xheights
./usr/share/tesseract/training/langdata/Gujarati.unicharset
./usr/share/tesseract/training/langdata/Gujarati.xheights
./usr/share/tesseract/training/langdata/Gurmukhi.unicharset
./usr/share/tesseract/training/langdata/Gurmukhi.xheights
./usr/share/tesseract/training/langdata/Han.unicharset
./usr/share/tesseract/training/langdata/Han.xheights
./usr/share/tesseract/training/langdata/Hangul.unicharset
./usr/share/tesseract/training/langdata/Hangul.xheights
./usr/share/tesseract/training/langdata/Hebrew.unicharset
./usr/share/tesseract/training/langdata/Hebrew.xheights
./usr/share/tesseract/training/langdata/Hiragana.unicharset
./usr/share/tesseract/training/langdata/Hiragana.xheights
./usr/share/tesseract/training/langdata/Kannada.unicharset
./usr/share/tesseract/training/langdata/Kannada.xheights
./usr/share/tesseract/training/langdata/Katakana.unicharset
./usr/share/tesseract/training/langdata/Katakana.xheights
./usr/share/tesseract/training/langdata/Khmer.unicharset
./usr/share/tesseract/training/langdata/Khmer.xheights
./usr/share/tesseract/training/langdata/Lao.unicharset
./usr/share/tesseract/training/langdata/Lao.xheights
./usr/share/tesseract/training/langdata/Latin.unicharset
./usr/share/tesseract/training/langdata/Latin.xheights
./usr/share/tesseract/training/langdata/Malayalam.unicharset
./usr/share/tesseract/training/langdata/Malayalam.xheights
./usr/share/tesseract/training/langdata/Myanmar.unicharset
./usr/share/tesseract/training/langdata/Myanmar.xheights
./usr/share/tesseract/training/langdata/Ogham.unicharset
./usr/share/tesseract/training/langdata/Ogham.xheights
./usr/share/tesseract/training/langdata/Oriya.unicharset
./usr/share/tesseract/training/langdata/Oriya.xheights
./usr/share/tesseract/training/langdata/Runic.unicharset
./usr/share/tesseract/training/langdata/Runic.xheights
./usr/share/tesseract/training/langdata/Sinhala.unicharset
./usr/share/tesseract/training/langdata/Sinhala.xheights
./usr/share/tesseract/training/langdata/Syriac.unicharset
./usr/share/tesseract/training/langdata/Syriac.xheights
./usr/share/tesseract/training/langdata/Tamil.unicharset
./usr/share/tesseract/training/langdata/Tamil.xheights
./usr/share/tesseract/training/langdata/Telugu.unicharset
./usr/share/tesseract/training/langdata/Telugu.xheights
./usr/share/tesseract/training/langdata/Thai.unicharset
./usr/share/tesseract/training/langdata/Thai.xheights
./usr/share/tesseract/training/langdata/Tibetan.unicharset
./usr/share/tesseract/training/langdata/common.punc
./usr/share/tesseract/training/langdata/common.unicharambigs
./usr/share/tesseract/training/langdata/font_properties
./usr/share/tesseract/training/langdata/forbidden_characters_default

./usr/share/tesseract/training/langdata/eng/desired_characters
./usr/share/tesseract/training/langdata/eng/eng.cube-unicharset
./usr/share/tesseract/training/langdata/eng/eng.cube-word-dawg
./usr/share/tesseract/training/langdata/eng/eng.numbers
./usr/share/tesseract/training/langdata/eng/eng.punc
./usr/share/tesseract/training/langdata/eng/eng.training_text
./usr/share/tesseract/training/langdata/eng/eng.training_text.bigram_freqs
./usr/share/tesseract/training/langdata/eng/eng.training_text.unigram_freqs
./usr/share/tesseract/training/langdata/eng/eng.unicharambigs
./usr/share/tesseract/training/langdata/eng/eng.word.bigrams
./usr/share/tesseract/training/langdata/eng/eng.wordlist
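For context, the training scripts above are typically invoked roughly as follows. This is only a sketch: the flag names are taken from the 3.04 tesstrain.sh, the paths assume the layout proposed above, and the output directory is a made-up example, so check the script itself before relying on any of them:

```shell
# Hypothetical invocation under the proposed layout; verify flag names
# against the installed tesstrain.sh before use.
/usr/share/tesseract/training/tesstrain.sh \
    --lang eng \
    --langdata_dir /usr/share/tesseract/training/langdata \
    --tessdata_dir /usr/share/tesseract/tessdata \
    --output_dir "$HOME/tesstrain-output"
```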




ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Marco

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/55B80674.4070709%40gmail.com.
For more options, visit https://groups.google.com/d/optout.

Marco Atzeri

Aug 2, 2015, 3:23:25 AM
to tesser...@googlegroups.com
On 7/29/2015 11:40 AM, ShreeDevi Kumar wrote:
> ​Marco,
>
> Thanks for building the training tools for cygwin. Till now just the
> additional binaries have been shipped as part of the tesseract package.
>
> With Tesseract 3.04.00 t​here are additional scripts provided to help
> with training. Google has also provided the language data which can be
> used for training different languages and building the traineddata
> files. Hence my request to include these.
>
> Not all users will be interested in training for a new language or
> trying to improve an existing traineddata, so in my opinion, it maybe
> better to package these separately.

Hi ShreeDevi
I am uploading 3.04.00-2.

The training tools are in the new package
tesseract-training-util

while the training language files are split between
tesseract-training-core
tesseract-training-{lang}

I have not changed the previous data structure,
just added an additional level:
/usr/share/tessdata/training

and the two test files are in
/usr/share/tessdata/testing/eurotext.tif
/usr/share/tessdata/testing/phototest.tif


$ cygcheck -l tesseract-training-util
/usr/bin/ambiguous_words.exe
/usr/bin/classifier_tester.exe
/usr/bin/cntraining.exe
/usr/bin/combine_tessdata.exe
/usr/bin/dawg2wordlist.exe
/usr/bin/mftraining.exe
/usr/bin/set_unicharset_properties.exe
/usr/bin/shapeclustering.exe
/usr/bin/text2image.exe
/usr/bin/unicharset_extractor.exe
/usr/bin/wordlist2dawg.exe
/usr/bin/language-specific.sh
/usr/bin/tesstrain.sh
/usr/bin/tesstrain_utils.sh

$ cygcheck -l tesseract-training-core
/usr/share/tessdata/training/Arabic.unicharset
/usr/share/tessdata/training/Arabic.xheights
...
/usr/share/tessdata/training/Cherokee.xheights
/usr/share/tessdata/training/common.punc
/usr/share/tessdata/training/common.unicharambigs
/usr/share/tessdata/training/Common.unicharset
/usr/share/tessdata/training/Cyrillic.unicharset
...
/usr/share/tessdata/training/Ethiopic.xheights
/usr/share/tessdata/training/font_properties
/usr/share/tessdata/training/forbidden_characters_default
/usr/share/tessdata/training/Georgian.unicharset
...
/usr/share/tessdata/training/Tibetan.unicharset

$ cygcheck -l tesseract-training-eng
/usr/share/tessdata/training/eng/desired_characters
/usr/share/tessdata/training/eng/eng.cube-unicharset
/usr/share/tessdata/training/eng/eng.cube-word-dawg
/usr/share/tessdata/training/eng/eng.numbers
/usr/share/tessdata/training/eng/eng.punc
/usr/share/tessdata/training/eng/eng.training_text
/usr/share/tessdata/training/eng/eng.training_text.bigram_freqs
/usr/share/tessdata/training/eng/eng.training_text.unigram_freqs
/usr/share/tessdata/training/eng/eng.unicharambigs
/usr/share/tessdata/training/eng/eng.word.bigrams
/usr/share/tessdata/training/eng/eng.wordlist

Regards
Marco

ShreeDevi Kumar

Aug 2, 2015, 4:31:56 AM
to tesser...@googlegroups.com, tesser...@googlegroups.com
+ tesseract-dev google group

Thank you, Marco. I will download the training tools packages and give them a try.

In future updates to the tesseract package, may I suggest packaging more languages from 'tessdata' - https://github.com/tesseract-ocr/tessdata

especially the ones that have multiple files for the language, such as ara, hin, etc.

The languages that have just one traineddata file can easily be downloaded as a zip from the 'raw' link. It would be very helpful to have a single tar/zip for the others.

Thanks so much for packaging 3.04.00 for cygwin.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Marco


Marco Atzeri

Aug 2, 2015, 5:55:16 AM
to tesser...@googlegroups.com
On 8/2/2015 10:31 AM, ShreeDevi Kumar wrote:
> + tesseract-dev google group
>
> Thank you, Marco. I will download the training tools packages and and
> give it a try.
>
> In future updates to the tesseract package, may I suggest packaging of
> more languages from 'tessdata' - https://github.com/tesseract-ocr/tessdata
>
> specially the ones which have multiple files for the language such as
> ara, hin etc.
>
> The languages that have just one file for traineddata can be downloaded
> easily as a zip from the 'raw' link. It would be very helpful to have a
> single tar/zip for the others.
>

All the language data in tessdata together is > 1 GB,
so I assume very few will need all of it,
and most will not appreciate a single file of
346 MB (compressed with xz).

Maybe a script to list/download/update from
https://github.com/tesseract-ocr/tessdata
would be more useful.
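Such a helper could be sketched like this. The raw-content URL pattern is my assumption about GitHub's layout, not something stated in this thread, and fetching of course needs network access:

```shell
# Sketch of a helper to fetch traineddata files from the
# tesseract-ocr/tessdata GitHub repository.
RAW_BASE=https://raw.githubusercontent.com/tesseract-ocr/tessdata/master

# Print the assumed raw-download URL for a language's traineddata file.
traineddata_url() {
    echo "$RAW_BASE/$1.traineddata"
}

# Fetch one or more languages into /usr/share/tessdata, e.g.
#   fetch_langs guj rus
fetch_langs() {
    for lang in "$@"; do
        wget -P /usr/share/tessdata "$(traineddata_url "$lang")"
    done
}
```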

Question:
why does tessdata include files other than traineddata?

$ ls -s1 rus*
1.0K rus.cube.fold
1.0K rus.cube.lm
892K rus.cube.nn
1.0K rus.cube.params
15M rus.cube.size
6.8M rus.cube.word-freq
16M rus.traineddata

From the wiki I had the impression that
traineddata should include all the other files inside it.

Are all the files for a language needed, or only
{lang}.traineddata?


Langdata includes a different set of files

$ ls -s1 rus*
total 22M
1.0K desired_characters
8.0K rus.cube-unicharset
1.3M rus.cube-word-dawg
4.0K rus.numbers
8.0K rus.punc
16K rus.training_text
96K rus.training_text.bigram_freqs
4.0K rus.training_text.unigram_freqs
8.0K rus.unicharambigs
11M rus.word.bigrams
11M rus.wordlist

Is there a description of the different types of data?


Marco

ShreeDevi Kumar

Aug 2, 2015, 6:12:28 AM
to tesser...@googlegroups.com, tesser...@googlegroups.com, Ray Smith
On Sun, Aug 2, 2015 at 3:25 PM, Marco Atzeri <marco....@gmail.com> wrote:
> On 8/2/2015 10:31 AM, ShreeDevi Kumar wrote:
>> + tesseract-dev google group
>>
>> Thank you, Marco. I will download the training tools packages and and
>> give it a try.
>>
>> In future updates to the tesseract package, may I suggest packaging of
>> more languages from 'tessdata' - https://github.com/tesseract-ocr/tessdata
>>
>> specially the ones which have multiple files for the language such as
>> ara, hin etc.
>>
>> The languages that have just one file for traineddata can be downloaded
>> easily as a zip from the 'raw' link. It would be very helpful to have a
>> single tar/zip for the others.


> all the languages data in tessdata are > 1GB
> so I assume very few will need all,
> and most will not appreciate a single file of
> 346M (compressed with xz )

You are right. What I meant was that for languages with just one file, e.g. guj, users can download it using https://github.com/tesseract-ocr/tessdata/blob/master/guj.traineddata?raw=true

But there is no easy way to download the multiple files for hin.* from the same GitHub directory.

> May be a script to list/download/update from
> https://github.com/tesseract-ocr/tessdata
> will be more useful.

Yes, that is a good idea.

> Question:
> why tessdata includes other files than traineddata ?
>
> $ ls -s1 rus*
> 1.0K rus.cube.fold
> 1.0K rus.cube.lm
> 892K rus.cube.nn
> 1.0K rus.cube.params
> 15M rus.cube.size
> 6.8M rus.cube.word-freq
> 16M rus.traineddata
>
> From the wiki I had the impression that
> traineddata should include all the others file inside.
>
> Are all the files for a language needed or only the
> {lang}.traineddata ?

I think some of the cube files are required during recognition.
Ray or other developers can offer a more complete answer.


> Langdata includes a different set of files
>
> $ ls -s1 rus*
> total 22M
> 1.0K desired_characters
> 8.0K rus.cube-unicharset
> 1.3M rus.cube-word-dawg
> 4.0K rus.numbers
> 8.0K rus.punc
> 16K rus.training_text
> 96K rus.training_text.bigram_freqs
> 4.0K rus.training_text.unigram_freqs
> 8.0K rus.unicharambigs
> 11M rus.word.bigrams
> 11M rus.wordlist

Langdata files are required only by those who want to train for that particular language, maybe in an effort to improve the traineddata provided by Google or to customize it to their needs.

 

> There is a description of the different type of data ?
>
> Marco
