Error when executing combine_lang_model script

Shandigutt

Sep 3, 2018, 5:11:50 PM

to tesseract-ocr

Hi,

I'm in the process of training Tesseract for a new language, following the training guidelines on the Tesseract wiki.

After building Tesseract from source and installing it, I first created my own langdata set.

Then I created training data and eval data using the tesstrain.sh script.

Then I tried to create a starter traineddata file using the combine_lang_model script. I used the following command:

./build/src/training/combine_lang_model --input_unicharset ../training/sintrain/sin/sin.unicharset --script_dir ../langdata --words ../langdata/sin/sin.wordlist --puncs ../langdata/sin/sin.punc --numbers ../langdata/sin/sin.numbers --output_dir ../training/combined_sin --version_str 1.0 --lang sin

In the above command I pointed to the langdata I created myself for the word list, punctuation, and numbers, and to the unicharset file that was generated when creating the training data. But I got the following error output:

Loaded unicharset of size 90 from file ../training/sintrain/sin/sin.unicharset

Setting unichar properties

Setting script properties

Warning: properties incomplete for index 4 = ී

Warning: properties incomplete for index 6 = ි

Warning: properties incomplete for index 11 = ු

Warning: properties incomplete for index 15 = ්‌

Warning: properties incomplete for index 30 = ූ

Warning: properties incomplete for index 44 = ්‍ර

Warning: properties incomplete for index 79 = ්‍ය

Warning: properties incomplete for index 82 = ක්‍

Warning: properties incomplete for index 89 = ර්‍

Error writing unicharset!!

Can somebody help me with this?

Thanks
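One thing that can be ruled out cheaply before rerunning the command is a path problem. The sketch below is an assumption, not a diagnosis confirmed anywhere in this thread: it only checks that the input files named on the command line exist and that the directory given as --output_dir is already present. The function name preflight is hypothetical.

```python
from pathlib import Path


def preflight(input_files, output_dir):
    """Return a list of path problems found (empty list if none).

    Assumption: a missing input file or a missing output directory is one
    possible reason a tool cannot write its results. This thread does not
    confirm that it is the cause of "Error writing unicharset!!".
    """
    problems = []
    for name in input_files:
        # Each input (unicharset, wordlist, punc, numbers) must be a real file.
        if not Path(name).is_file():
            problems.append(f"missing input file: {name}")
    # The output directory is checked but never created here.
    if not Path(output_dir).is_dir():
        problems.append(f"output directory does not exist: {output_dir}")
    return problems
```

For the command above, the inputs would be the sin.unicharset, sin.wordlist, sin.punc, and sin.numbers paths, and the output directory would be ../training/combined_sin.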

Shandigutt

Sep 3, 2018, 5:19:51 PM

to tesseract-ocr

Adding more details to my query:

My tesseract version:

tesseract 4.0.0-beta.4-74-gd8237

leptonica-1.77.0

libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11

Found SSE

My OS details:

tharaka@tharaka-laptop-ubuntu:/tmp/sin-2018-09-01.E4T$ lsb_release -a

No LSB modules are available.

Distributor ID: Ubuntu

Description: Ubuntu 18.04.1 LTS

Release: 18.04

Codename: bionic

Thanks

Shree Devi Kumar

Sep 3, 2018, 11:25:37 PM

to tesser...@googlegroups.com

> Then I tried to create a starter traineddata file using the combine_lang_model script. I used the following command:

When you run tesstrain.sh, it creates the starter traineddata using the combine_lang_model script.

See below for messages from a small test run.

+ /home/ubuntu/tesseract/src/training/tesstrain.sh --fonts_dir ../.fonts --lang sin --linedata_only --noextract_font_properties --langdata_dir ../langdata_lstm --tessdata_dir ../tessdata_best --fontlist FreeSerif --training_text ../langdata_lstm/sin/sin.training_text --workspace_dir /home/ubuntu/tmp/ --save_box_tiff --maxpages 1 --output_dir ../tesstutorial/sintest

=== Starting training for language 'sin'

[Tue Sep 4 03:21:08 UTC 2018] /home/ubuntu/tesseract/src/training/text2image --fonts_dir=../.fonts --font=FreeSerif --outputbase=/home/ubuntu/tmp//fc-cache/sample_text.txt --text=/home/ubuntu/tmp//fc-cache/sample_text.txt --fontconfig_tmpdir=/home/ubuntu/tmp//fc-cache

Rendered page 0 to file /home/ubuntu/tmp//fc-cache/sample_text.txt.tif

=== Phase I: Generating training images ===

Rendering using FreeSerif

[Tue Sep 4 03:21:10 UTC 2018] /home/ubuntu/tesseract/src/training/text2image --fontconfig_tmpdir=/home/ubuntu/tmp//fc-cache --fonts_dir=../.fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0 --max_pages=1 --font=FreeSerif --text=../langdata_lstm/sin/sin.training_text

Stripped 1 unrenderable words

Rendered page 0 to file /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===

[Tue Sep 4 03:21:11 UTC 2018] /home/ubuntu/tesseract/src/training/unicharset_extractor --output_unicharset /tmp/sin-2018-09-04.Wa5/sin.unicharset --norm_mode 2 /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box

Extracting unicharset from box file /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box

Wrote unicharset file /tmp/sin-2018-09-04.Wa5/sin.unicharset

[Tue Sep 4 03:21:11 UTC 2018] /home/ubuntu/tesseract/src/training/set_unicharset_properties -U /tmp/sin-2018-09-04.Wa5/sin.unicharset -O /tmp/sin-2018-09-04.Wa5/sin.unicharset -X /tmp/sin-2018-09-04.Wa5/sin.xheights --script_dir=../langdata_lstm

Loaded unicharset of size 111 from file /tmp/sin-2018-09-04.Wa5/sin.unicharset

Setting unichar properties

Setting script properties

Warning: properties incomplete for index 7 = ි

Warning: properties incomplete for index 9 = ු

Warning: properties incomplete for index 17 = ්‌

Warning: properties incomplete for index 19 = ී

Warning: properties incomplete for index 38 = ්‍ර

Warning: properties incomplete for index 66 = ₹

Warning: properties incomplete for index 73 = ූ

Warning: properties incomplete for index 79 = ්‍ය

Warning: properties incomplete for index 89 = ක්‍

Writing unicharset to file /tmp/sin-2018-09-04.Wa5/sin.unicharset

=== Phase E: Generating lstmf files ===

Using TESSDATA_PREFIX=../tessdata_best

[Tue Sep 4 03:21:12 UTC 2018] /home/ubuntu/tesseract/src/api/tesseract /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0 --psm 6 lstm.train

Tesseract Open Source OCR Engine v4.0.0-beta.4-93-ge4b9c with Leptonica

Page 1

=== Constructing LSTM training data ===

[Tue Sep 4 03:21:13 UTC 2018] /home/ubuntu/tesseract/src/training/combine_lang_model --input_unicharset /tmp/sin-2018-09-04.Wa5/sin.unicharset --script_dir ../langdata_lstm --words ../langdata_lstm/sin/sin.wordlist --numbers ../langdata_lstm/sin/sin.numbers --puncs ../langdata_lstm/sin/sin.punc --output_dir ../tesstutorial/sintest --lang sin --pass_through_recoder

Loaded unicharset of size 111 from file /tmp/sin-2018-09-04.Wa5/sin.unicharset

Setting unichar properties

Setting script properties

Warning: properties incomplete for index 7 = ි

Warning: properties incomplete for index 9 = ු

Warning: properties incomplete for index 17 = ්‌

Warning: properties incomplete for index 19 = ී

Warning: properties incomplete for index 38 = ්‍ර

Warning: properties incomplete for index 66 = ₹

Warning: properties incomplete for index 73 = ූ

Warning: properties incomplete for index 79 = ්‍ය

Warning: properties incomplete for index 89 = ක්‍

Config file is optional, continuing...

Failed to read data from: ../langdata_lstm/sin/sin.config

Reducing Trie to SquishedDawg

=== Saving box/tiff pairs for training data ===

Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box to ../tesstutorial/sintest

Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif to ../tesstutorial/sintest

=== Moving lstmf files for training data ===

Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.lstmf to ../tesstutorial/sintest

Created starter traineddata for language 'sin'

Run lstmtraining to do the LSTM training for language 'sin'

real 0m5.238s

user 0m3.792s

sys 0m0.256s
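In the log above, the combine_lang_model call receives --words, --numbers, and --puncs paths that all sit under the --langdata_dir and --lang values given to tesstrain.sh. The sketch below only reproduces that path pattern as seen in this one log; it is not taken from the tesstrain.sh source, and the function name combine_lang_model_args is hypothetical.

```python
from pathlib import PurePosixPath


def combine_lang_model_args(unicharset, langdata_dir, lang, output_dir):
    """Build an argument list shaped like the combine_lang_model call in the log.

    The wordlist, numbers, and punc paths follow the
    <langdata_dir>/<lang>/<lang>.<suffix> pattern visible in that log line.
    """
    lang_dir = PurePosixPath(langdata_dir) / lang
    return [
        "combine_lang_model",
        "--input_unicharset", str(unicharset),
        "--script_dir", str(langdata_dir),
        "--words", str(lang_dir / f"{lang}.wordlist"),
        "--numbers", str(lang_dir / f"{lang}.numbers"),
        "--puncs", str(lang_dir / f"{lang}.punc"),
        "--output_dir", str(output_dir),
        "--lang", lang,
        "--pass_through_recoder",
    ]
```

Called with "../langdata_lstm" and "sin", this yields the same three langdata paths that appear in the logged command, which suggests the files are picked up by location rather than passed explicitly by the user.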

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/71472620-135e-4777-8913-688e95fb9be3%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Shandigutt

Sep 4, 2018, 5:55:20 PM

to tesseract-ocr

Thank you very much for sorting things out, Shree. But I have one more question.

When I run tesstrain.sh, I don't pass my word list, punctuation, and numbers files as input parameters, but I keep those files in the langdata folder. So when it executes combine_lang_model internally, does it pass these files as arguments to the combine_lang_model script?

Now that this step is complete, can I move straight to running lstmtraining?


Shree Devi Kumar

Sep 5, 2018, 12:11:51 AM

to tesser...@googlegroups.com

The easiest way to check is to use combine_tessdata to unpack the starter traineddata file and see what is included. You can then use dawg2wordlist to verify that the correct files were included.

Yes, after you have created the starter traineddata, you can run lstmtraining.
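The check described above can be sketched as two commands: unpack the starter traineddata, then dump its word dawg back to a plain word list. The helper below only builds the argument lists and runs nothing. The function name inspection_commands is hypothetical, and the component names lstm-unicharset and lstm-word-dawg are an assumption about what the unpacked starter traineddata contains; check them against the files combine_tessdata actually writes.

```python
def inspection_commands(traineddata, unpack_prefix, wordlist_out):
    """Return argv lists for unpacking a traineddata file and dumping its word dawg.

    unpack_prefix is the output path prefix for combine_tessdata -u and
    normally ends with a dot, for example "/tmp/unpacked/sin.".
    Assumption: the unpacked components are named <prefix>lstm-unicharset
    and <prefix>lstm-word-dawg.
    """
    # Step 1: unpack every component of the traineddata file.
    unpack = ["combine_tessdata", "-u", str(traineddata), unpack_prefix]
    # Step 2: convert the word dawg back into a readable word list.
    dump = [
        "dawg2wordlist",
        unpack_prefix + "lstm-unicharset",
        unpack_prefix + "lstm-word-dawg",
        str(wordlist_out),
    ]
    return [unpack, dump]
```

Each returned list can be passed to subprocess.run(cmd, check=True) once the Tesseract training tools are on the PATH. The dumped word list can then be compared with the sin.wordlist in the langdata folder.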


Shandigutt

Sep 8, 2018, 9:39:22 AM

to tesseract-ocr

Thank you very much, Shree.
