Failure when creating training data

Shandigutt

Sep 30, 2018, 6:27:46 PM
to tesseract-ocr
Hi,

I tried to create training data with the following command:

./src/training/tesstrain.sh --fonts_dir ../Support/font --lang sin --linedata_only \
  --noextract_font_properties --langdata_dir ../training \
  --tessdata_dir ../tessdata_best --output_dir ../training/sintrain \
  --fontlist "BhashitaComplex" --training_text ../training/sin/sin.training_text


I could capture only part of the log output. The highlights are below:

Word started with a combiner:0xddc
Normalization failed for string 'ො'
Word started with a combiner:0xdca
Word started with a combiner:0x200d
Normalization failed for string '්‍ය'
Word started with a combiner:0xdcf
Normalization failed for string 'ා'

Wrote unicharset file /tmp/sin-2018-09-29.aN0/sin.unicharset
[Sat Sep 29 21:33:19 UTC 2018] /usr/local/bin/set_unicharset_properties -U /tmp/sin-2018-09-29.aN0/sin.unicharset -O /tmp/sin-2018-09-29.aN0/sin.unicharset -X /tmp/sin-2018-09-29.aN0/sin.xheights --script_dir=../training
Loaded unicharset of size 114 from file /tmp/sin-2018-09-29.aN0/sin.unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:../training/Latin.unicharset
Failed to load script unicharset from:../training/Sinhala.unicharset
Warning: properties incomplete for index 3 = ස
Warning: properties incomplete for index 4 = ී
Warning: properties incomplete for index 5 = ග

=== Constructing LSTM training data ===
Creating new directory ../training/sintrain
[Sun Sep 30 05:32:18 UTC 2018] /usr/local/bin/combine_lang_model --input_unicharset /tmp/sin-2018-09-29.aN0/sin.unicharset --script_dir ../training --words ../training/sin/sin.wordlist --numbers ../training/sin/sin.numbers --puncs ../training/sin/sin.punc --output_dir ../training/sintrain --lang sin --pass_through_recoder
Loaded unicharset of size 114 from file /tmp/sin-2018-09-29.aN0/sin.unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:../training/Latin.unicharset
Failed to load script unicharset from:../training/Sinhala.unicharset
Warning: properties incomplete for index 3 = ස
Warning: properties incomplete for index 4 = ී
Warning: properties incomplete for index 5 = ග


Warning: properties incomplete for index 112 = ෴
Warning: properties incomplete for index 113 = ෲ
Config file is optional, continuing...
Failed to read data from: ../training/sin/sin.config
Failed to read data from: ../training/radical-stroke.txt
Error reading radical code table ../training/radical-stroke.txt

=== Moving lstmf files for training data ===
Moving /tmp/sin-2018-09-29.aN0/sin.BhashitaComplex.exp0.lstmf to ../training/sintrain

Created starter traineddata for language 'sin'


Run lstmtraining to do the LSTM training for language 'sin'

For the full log capture, please see the attached file.

The Tesseract version I am using:
tesseract --version
tesseract 4.0.0-beta.4-158-g02f9d
 leptonica-1.77.0
  libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found SSE

OS details:
Linux ip-172-31-13-179 4.15.0-1021-aws #21-Ubuntu SMP Tue Aug 28 10:23:07 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Please let me know what has gone wrong. 

Thanks
Log.txt

Shree Devi Kumar

Sep 30, 2018, 7:33:54 PM
to tesser...@googlegroups.com
Looks like your langdata dir does not have the script unicharset files for Signals and Latin scripts.

Failed to load script unicharset from:../training/Latin.unicharset
Failed to load script unicharset from:../training/Sinhala.unicharset
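
One quick way to confirm this is to check the directory passed as --langdata_dir for the files the log reports as unreadable. The snippet below is only a rough sketch: the function name check_langdata is made up for illustration, and the file list is taken from the "Failed to load" and "Failed to read" lines in the log above (the optional sin.config is left out).

```shell
#!/bin/sh
# Illustrative helper (hypothetical name): report which of the files named
# in the log are missing from a langdata directory.
# Prints one line per missing file and returns non-zero if any are missing.
check_langdata() {
    dir=$1
    missing=0
    for f in Latin.unicharset Sinhala.unicharset radical-stroke.txt; do
        if [ ! -r "$dir/$f" ]; then
            echo "missing: $dir/$f"
            missing=1
        fi
    done
    return "$missing"
}

# Example call, using the directory from the original command:
#   check_langdata ../training
```

If any file is reported missing, copying it into that directory from a langdata checkout and re-running tesstrain.sh should show whether those messages go away.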

Shree Devi Kumar

Sep 30, 2018, 7:41:11 PM
to tesser...@googlegroups.com
Sinhala script

Sorry about the wrong autocorrect on phone
