Hi,
I am trying to train Tesseract for the Sinhala language, following the
training guidelines in the GitHub wiki. I get an error at the 4th step, "Creating Starter Traineddata". This is the command I executed:
training/combine_lang_model --input_unicharset ../training/sin/sin.unicharset --script_dir ../langdata --words ../langdata/sin/sin.wordlist --puncs ../langdata/sin/sin.punc --numbers ../langdata/sin/sin.numbers --output_dir ../training/combined_sin --version_str 1.0 --lang sin
It produces the following output:
Loaded unicharset of size 94 from file ../training/sin/sin.unicharset
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 4 = ී
Warning: properties incomplete for index 6 = ි
Warning: properties incomplete for index 11 = ු
Warning: properties incomplete for index 15 = ්
Warning: properties incomplete for index 33 = ූ
Warning: properties incomplete for index 52 = ්ර
Warning: properties incomplete for index 56 = ්ය
Warning: properties incomplete for index 87 = ක්
Warning: properties incomplete for index 93 = ර්
Config file is optional, continuing...
Null char=2
Invalid format in radical table at line 4: 3400 1.4
Creation of encoded unicharset failed!!
Error writing recoder!!
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
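In case it helps with diagnosis, here is a minimal Python sketch (not part of the Tesseract tooling) that prints the Unicode code point, general category, and name of every character in the unichars flagged by the "properties incomplete" warnings above. The `flagged` list and the `describe` helper are my own illustrative names, and the strings are copied by hand from the log, so they may not match the bytes in sin.unicharset exactly.

```python
import unicodedata

# Unichars copied by hand from the "properties incomplete" warnings in the log.
# Assumption: these match the entries in sin.unicharset byte for byte.
flagged = ["ී", "ි", "ු", "්", "ූ", "්ර", "්ය", "ක්", "ර්"]


def describe(unichar):
    """Return (code point, general category, Unicode name) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in unichar
    ]


for u in flagged:
    # repr() keeps combining marks visible even when the terminal merges them.
    print(repr(u), describe(u))
```

This only shows what each flagged entry consists of (for example, whether it is a lone combining mark or a consonant followed by one). It does not explain the "Invalid format in radical table" error itself.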
For more information, I have attached my sin.unicharset and sin.config files.
I am using the following Tesseract version:
tesseract -v
tesseract 4.00.00dev-696-geba0ae3
leptonica-1.74.4
libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8
Found SSE
I am using the following OS:
uname -a
Linux shandigutt-laptop-ubuntu 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
I would appreciate it if somebody could help me with this.
Thanks