Hi,
I tried to create training data with the following command:
./src/training/tesstrain.sh --fonts_dir ../Support/font --lang sin --linedata_only \
--noextract_font_properties --langdata_dir ../training \
--tessdata_dir ../tessdata_best --output_dir ../training/sintrain \
--fontlist "BhashitaComplex" --training_text ../training/sin/sin.training_text
I could capture only part of the log output. The highlights are below:
Word started with a combiner:0xddc
Normalization failed for string 'ො'
Word started with a combiner:0xdca
Word started with a combiner:0x200d
Normalization failed for string '්ය'
Word started with a combiner:0xdcf
Normalization failed for string 'ා'
Wrote unicharset file /tmp/sin-2018-09-29.aN0/sin.unicharset
[Sat Sep 29 21:33:19 UTC 2018] /usr/local/bin/set_unicharset_properties -U /tmp/sin-2018-09-29.aN0/sin.unicharset -O /tmp/sin-2018-09-29.aN0/sin.unicharset -X /tmp/sin-2018-09-29.aN0/sin.xheights --script_dir=../training
Loaded unicharset of size 114 from file /tmp/sin-2018-09-29.aN0/sin.unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:../training/Latin.unicharset
Failed to load script unicharset from:../training/Sinhala.unicharset
Warning: properties incomplete for index 3 = ස
Warning: properties incomplete for index 4 = ී
Warning: properties incomplete for index 5 = ග
=== Constructing LSTM training data ===
Creating new directory ../training/sintrain
[Sun Sep 30 05:32:18 UTC 2018] /usr/local/bin/combine_lang_model --input_unicharset /tmp/sin-2018-09-29.aN0/sin.unicharset --script_dir ../training --words ../training/sin/sin.wordlist --numbers ../training/sin/sin.numbers --puncs ../training/sin/sin.punc --output_dir ../training/sintrain --lang sin --pass_through_recoder
Loaded unicharset of size 114 from file /tmp/sin-2018-09-29.aN0/sin.unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:../training/Latin.unicharset
Failed to load script unicharset from:../training/Sinhala.unicharset
Warning: properties incomplete for index 3 = ස
Warning: properties incomplete for index 4 = ී
Warning: properties incomplete for index 5 = ග
Warning: properties incomplete for index 112 = ෴
Warning: properties incomplete for index 113 = ෲ
Config file is optional, continuing...
Failed to read data from: ../training/sin/sin.config
Failed to read data from: ../training/radical-stroke.txt
Error reading radical code table ../training/radical-stroke.txt
=== Moving lstmf files for training data ===
Moving /tmp/sin-2018-09-29.aN0/sin.BhashitaComplex.exp0.lstmf to ../training/sintrain
Created starter traineddata for language 'sin'
Run lstmtraining to do the LSTM training for language 'sin'
The full log is in the attached file.
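To narrow down the "Failed to load" and "Failed to read" messages, here is a minimal sketch that checks whether the four paths named in the log are readable. It assumes the same directory that was passed as --langdata_dir (../training). The function name check_files and the variable LANGDATA_DIR are illustrative only and are not part of Tesseract.

```shell
#!/bin/sh
# Sketch: report which of the files named in the log are missing or unreadable.
# LANGDATA_DIR is a hypothetical variable; it defaults to the --langdata_dir
# value used in the tesstrain.sh command.
LANGDATA_DIR="${LANGDATA_DIR:-../training}"

# check_files DIR FILE...  (hypothetical helper)
# Prints one line per file: "ok" if readable, "missing" otherwise.
check_files() {
    dir="$1"
    shift
    for f in "$@"; do
        if [ -r "$dir/$f" ]; then
            echo "ok       $dir/$f"
        else
            echo "missing  $dir/$f"
        fi
    done
}

# Paths copied from the log messages. The log itself says the config file
# is optional, so a missing sin/sin.config may be harmless.
check_files "$LANGDATA_DIR" \
    Latin.unicharset \
    Sinhala.unicharset \
    radical-stroke.txt \
    sin/sin.config
```

Any path reported as "missing" matches a failure line in the log. This only confirms whether the files are present; it does not explain the "Normalization failed" messages.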
Tesseract version:
tesseract 4.0.0-beta.4-158-g02f9d
libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
OS details:
Linux ip-172-31-13-179 4.15.0-1021-aws #21-Ubuntu SMP Tue Aug 28 10:23:07 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Please let me know what went wrong.
Thanks