Error when executing combine_lang_model script

Shandigutt

Sep 3, 2018, 5:11:50 PM
to tesseract-ocr
Hi,

I'm training Tesseract for a new language, following the training guidelines on the Tesseract wiki.

After building Tesseract from source and installing it, I first created my own langdata set.

Then I created training data and eval data using the tesstrain.sh script.

Then I tried to create a starter traineddata file using the combine_lang_model script. I used the following command:

./build/src/training/combine_lang_model --input_unicharset ../training/sintrain/sin/sin.unicharset --script_dir ../langdata --words ../langdata/sin/sin.wordlist --puncs ../langdata/sin/sin.punc --numbers ../langdata/sin/sin.numbers --output_dir ../training/combined_sin --version_str 1.0 --lang sin

In this command I pointed to the langdata I created myself for the word list, punctuation and numbers, and to the unicharset file that was generated when creating the training data. But I got the following error output:

Loaded unicharset of size 90 from file ../training/sintrain/sin/sin.unicharset
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 4 = ී
Warning: properties incomplete for index 6 = ි
Warning: properties incomplete for index 11 = ු
Warning: properties incomplete for index 15 = ්‌
Warning: properties incomplete for index 30 = ූ
Warning: properties incomplete for index 44 = ්‍ර
Warning: properties incomplete for index 79 = ්‍ය
Warning: properties incomplete for index 82 = ක්‍
Warning: properties incomplete for index 89 = ර්‍
Error writing unicharset!!

Can somebody help me with this?

Thanks

Shandigutt

Sep 3, 2018, 5:19:51 PM
to tesseract-ocr
Adding more details to my query,

My Tesseract version:
tesseract 4.0.0-beta.4-74-gd8237
 leptonica-1.77.0
  libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
 Found SSE

My OS details,
tharaka@tharaka-laptop-ubuntu:/tmp/sin-2018-09-01.E4T$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.1 LTS
Release: 18.04
Codename: bionic

Thanks

Shree Devi Kumar

Sep 3, 2018, 11:25:37 PM
to tesser...@googlegroups.com
> Then I tried to create a starter traineddata file using combine_lang_model script. I used the below command for that,

When you run tesstrain.sh, it creates the starter traineddata itself by calling combine_lang_model.

See below for messages from a small test run.

+ /home/ubuntu/tesseract/src/training/tesstrain.sh --fonts_dir ../.fonts --lang sin --linedata_only --noextract_font_properties --langdata_dir ../langdata_lstm --tessdata_dir ../tessdata_best --fontlist FreeSerif --training_text ../langdata_lstm/sin/sin.training_text --workspace_dir /home/ubuntu/tmp/ --save_box_tiff --maxpages 1 --output_dir ../tesstutorial/sintest

=== Starting training for language 'sin'
[Tue Sep 4 03:21:08 UTC 2018] /home/ubuntu/tesseract/src/training/text2image --fonts_dir=../.fonts --font=FreeSerif --outputbase=/home/ubuntu/tmp//fc-cache/sample_text.txt --text=/home/ubuntu/tmp//fc-cache/sample_text.txt --fontconfig_tmpdir=/home/ubuntu/tmp//fc-cache
Rendered page 0 to file /home/ubuntu/tmp//fc-cache/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using FreeSerif
[Tue Sep 4 03:21:10 UTC 2018] /home/ubuntu/tesseract/src/training/text2image --fontconfig_tmpdir=/home/ubuntu/tmp//fc-cache --fonts_dir=../.fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0 --max_pages=1 --font=FreeSerif --text=../langdata_lstm/sin/sin.training_text
Stripped 1 unrenderable words
Rendered page 0 to file /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Tue Sep 4 03:21:11 UTC 2018] /home/ubuntu/tesseract/src/training/unicharset_extractor --output_unicharset /tmp/sin-2018-09-04.Wa5/sin.unicharset --norm_mode 2 /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box
Extracting unicharset from box file /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box
Wrote unicharset file /tmp/sin-2018-09-04.Wa5/sin.unicharset
[Tue Sep 4 03:21:11 UTC 2018] /home/ubuntu/tesseract/src/training/set_unicharset_properties -U /tmp/sin-2018-09-04.Wa5/sin.unicharset -O /tmp/sin-2018-09-04.Wa5/sin.unicharset -X /tmp/sin-2018-09-04.Wa5/sin.xheights --script_dir=../langdata_lstm
Loaded unicharset of size 111 from file /tmp/sin-2018-09-04.Wa5/sin.unicharset
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 7 = ි
Warning: properties incomplete for index 9 = ු
Warning: properties incomplete for index 17 = ්‌
Warning: properties incomplete for index 19 = ී
Warning: properties incomplete for index 38 = ්‍ර
Warning: properties incomplete for index 66 = ₹
Warning: properties incomplete for index 73 = ූ
Warning: properties incomplete for index 79 = ්‍ය
Warning: properties incomplete for index 89 = ක්‍
Writing unicharset to file /tmp/sin-2018-09-04.Wa5/sin.unicharset

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=../tessdata_best
[Tue Sep 4 03:21:12 UTC 2018] /home/ubuntu/tesseract/src/api/tesseract /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0 --psm 6 lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.4-93-ge4b9c with Leptonica
Page 1

=== Constructing LSTM training data ===
[Tue Sep 4 03:21:13 UTC 2018] /home/ubuntu/tesseract/src/training/combine_lang_model --input_unicharset /tmp/sin-2018-09-04.Wa5/sin.unicharset --script_dir ../langdata_lstm --words ../langdata_lstm/sin/sin.wordlist --numbers ../langdata_lstm/sin/sin.numbers --puncs ../langdata_lstm/sin/sin.punc --output_dir ../tesstutorial/sintest --lang sin --pass_through_recoder
Loaded unicharset of size 111 from file /tmp/sin-2018-09-04.Wa5/sin.unicharset
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 7 = ි
Warning: properties incomplete for index 9 = ු
Warning: properties incomplete for index 17 = ්‌
Warning: properties incomplete for index 19 = ී
Warning: properties incomplete for index 38 = ්‍ර
Warning: properties incomplete for index 66 = ₹
Warning: properties incomplete for index 73 = ූ
Warning: properties incomplete for index 79 = ්‍ය
Warning: properties incomplete for index 89 = ක්‍
Config file is optional, continuing...
Failed to read data from: ../langdata_lstm/sin/sin.config
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg

=== Saving box/tiff pairs for training data ===
Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.box to ../tesstutorial/sintest
Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.tif to ../tesstutorial/sintest

=== Moving lstmf files for training data ===
Moving /tmp/sin-2018-09-04.Wa5/sin.FreeSerif.exp0.lstmf to ../tesstutorial/sintest

Created starter traineddata for language 'sin'


Run lstmtraining to do the LSTM training for language 'sin'


real 0m5.238s
user 0m3.792s
sys 0m0.256s


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/71472620-135e-4777-8913-688e95fb9be3%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Shandigutt

Sep 4, 2018, 5:55:20 PM
to tesseract-ocr
Thank you very much for sorting things out, Shree. But I have one more question.

When I run tesstrain.sh I don't pass my word list, punctuation and numbers files as input parameters, but I keep those files in the langdata folder. So when it runs combine_lang_model internally, does it pass these files as arguments to combine_lang_model?

Now that this step is complete, can I move straight on to running lstmtraining?

Shree Devi Kumar

Sep 5, 2018, 12:11:51 AM
to tesser...@googlegroups.com
The easiest way to check is to use combine_tessdata to unpack the starter traineddata file and see what is included. You can then use dawg2wordlist to verify that the correct files were included.
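
As a rough sketch of that check (not taken from the thread): the script below unpacks the starter traineddata and converts the word dawg back into a plain word list that can be compared with your own sin.wordlist. The paths, the `check_starter` helper name and the `sin.` output prefix are assumptions based on the tesstrain.sh test run shown earlier (`--output_dir ../tesstutorial/sintest --lang sin`); adjust them to your setup.

```shell
#!/bin/sh
# Assumed paths, based on the tesstrain.sh test run earlier in this thread.
STARTER=${STARTER:-../tesstutorial/sintest/sin/sin.traineddata}
UNPACK_DIR=${UNPACK_DIR:-./sin_unpacked}

# check_starter is a hypothetical helper name, not part of Tesseract.
check_starter() {
    starter=$1
    out_dir=$2
    if [ ! -f "$starter" ]; then
        echo "starter traineddata not found: $starter"
        return 1
    fi
    mkdir -p "$out_dir" || return 1
    # Unpack all components to files named <prefix><component>,
    # e.g. sin.lstm-unicharset and sin.lstm-word-dawg.
    combine_tessdata -u "$starter" "$out_dir/sin." || return 1
    # Convert the word dawg back to plain text:
    # dawg2wordlist <unicharset> <dawg> <wordlist output>
    dawg2wordlist "$out_dir/sin.lstm-unicharset" \
                  "$out_dir/sin.lstm-word-dawg" \
                  "$out_dir/sin.words.txt" || return 1
    echo "word list written to $out_dir/sin.words.txt"
}

check_starter "$STARTER" "$UNPACK_DIR" || echo "check did not complete"
```

Comparing sin.words.txt with your own sin.wordlist (for example with sort and diff) should show whether your list was picked up. The combine_lang_model command line that tesstrain.sh printed in the log above also shows the --words, --numbers and --puncs paths it was given.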

Yes, once you have created the starter traineddata, you can run lstmtraining.
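
For orientation, here is a minimal sketch of that next step, assuming the directory layout from the tesstrain.sh test run above. The `make_listfile` helper, the output paths, the network spec and the iteration count are illustrative assumptions; the spec is copied from the from-scratch example in the Tesseract training tutorial and is not tuned for Sinhala. Check `lstmtraining --help` for your build before relying on it.

```shell
#!/bin/sh
# Assumed layout, matching --output_dir and --lang from the test run above.
TRAIN_DIR=${TRAIN_DIR:-../tesstutorial/sintest}
MODEL_DIR=${MODEL_DIR:-../tesstutorial/sinoutput}

# make_listfile is a hypothetical helper: it writes the path of every
# .lstmf file in a directory to a list file, one path per line.
make_listfile() {
    dir=$1
    listfile=$2
    : > "$listfile" || return 1
    for f in "$dir"/*.lstmf; do
        if [ -f "$f" ]; then
            echo "$f" >> "$listfile"
        fi
    done
    # Succeed only if at least one .lstmf file was listed.
    [ -s "$listfile" ]
}

if [ -f "$TRAIN_DIR/sin/sin.traineddata" ] && command -v lstmtraining >/dev/null 2>&1; then
    mkdir -p "$MODEL_DIR"
    make_listfile "$TRAIN_DIR" "$MODEL_DIR/sin.training_files.txt" &&
    lstmtraining \
        --traineddata "$TRAIN_DIR/sin/sin.traineddata" \
        --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
        --model_output "$MODEL_DIR/base" \
        --train_listfile "$MODEL_DIR/sin.training_files.txt" \
        --max_iterations 5000
else
    echo "starter traineddata or lstmtraining not found; nothing run"
fi
```

A single rendered page, as in the small test run above, is unlikely to be enough data for a useful model; the sketch only shows the shape of the command.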

Shandigutt

Sep 8, 2018, 9:39:22 AM
to tesseract-ocr
Thank you very much, Shree.