sumedhe@vm-linux:~/tesseract-ocr/tesseract/training$ ./tesstrain.sh --fonts_dir /usr/share/fonts --lang sin --linedata_only --noextract_font_properties --langdata_dir ../langdata --tessdata_dir ../tessdata --output_dir ~/tesstutorial/sintrain --fontlist "Iskoola Pota"
=== Starting training for language 'sin'
[2018 ජනවාරි 19 වැනි සිකුරාදා 02:27:15 +0530] /home/sumedhe/.local/bin/text2image --fonts_dir=/usr/share/fonts --font=Iskoola Pota --outputbase=/tmp/font_tmp.QehGCA4qGJ/sample_text.txt --text=/tmp/font_tmp.QehGCA4qGJ/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.QehGCA4qGJ
Rendered page 0 to file /tmp/font_tmp.QehGCA4qGJ/sample_text.txt.tif
=== Phase I: Generating training images ===
Rendering using Iskoola Pota
[2018 ජනවාරි 19 වැනි සිකුරාදා 02:27:42 +0530] /home/sumedhe/.local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.QehGCA4qGJ --fonts_dir=/usr/share/fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.3KQffzcZmX/sin/sin.Iskoola_Pota.exp0 --max_pages=3 --font=Iskoola Pota --text=../langdata/sin/sin.training_text
Rendered page 0 to file /tmp/tmp.3KQffzcZmX/sin/sin.Iskoola_Pota.exp0.tif
Stripped 1 unrenderable words
Rendered page 1 to file /tmp/tmp.3KQffzcZmX/sin/sin.Iskoola_Pota.exp0.tif
Rendered page 2 to file /tmp/tmp.3KQffzcZmX/sin/sin.Iskoola_Pota.exp0.tif
=== Phase UP: Generating unicharset and unichar properties files ===
[2018 ජනවාරි 19 වැනි සිකුරාදා 02:27:45 +0530] /home/sumedhe/.local/bin/unicharset_extractor --output_unicharset /tmp/tmp.3KQffzcZmX/sin/sin.unicharset --norm_mode 2 /tmp/tmp.3KQffzcZmX/sin/sin.Iskoola_Pota.exp0.box
Extracting unicharset from box file /tmp/tmp.3KQffzcZmX/sin/sin.Iskoola_Pota.exp0.box
Wrote unicharset file /tmp/tmp.3KQffzcZmX/sin/sin.unicharset
[2018 ජනවාරි 19 වැනි සිකුරාදා 02:27:45 +0530] /home/sumedhe/.local/bin/set_unicharset_properties -U /tmp/tmp.3KQffzcZmX/sin/sin.unicharset -O /tmp/tmp.3KQffzcZmX/sin/sin.unicharset -X /tmp/tmp.3KQffzcZmX/sin/sin.xheights --script_dir=../langdata
Loaded unicharset of size 117 from file /tmp/tmp.3KQffzcZmX/sin/sin.unicharset
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 8 = ්
Warning: properties incomplete for index 22 = ි
Warning: properties incomplete for index 27 = ු
Warning: properties incomplete for index 31 = ී
Warning: properties incomplete for index 48 = ූ
Writing unicharset to file /tmp/tmp.3KQffzcZmX/sin/sin.unicharset
=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=../tessdata
[2018 ජනවාරි 19 වැනි සිකුරාදා 02:27:45 +0530] /home/sumedhe/.local/bin/tesseract /tmp/tmp.3KQffzcZmX/sin/sin.Iskoola_Pota.exp0.tif /tmp/tmp.3KQffzcZmX/sin/sin.Iskoola_Pota.exp0 lstm.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
Page 2
Loaded 52/52 pages (1-52) of document /tmp/tmp.3KQffzcZmX/sin/sin.Iskoola_Pota.exp0.lstmf
Page 3
Loaded 104/104 pages (1-104) of document /tmp/tmp.3KQffzcZmX/sin/sin.Iskoola_Pota.exp0.lstmf
=== Constructing LSTM training data ===
[2018 ජනවාරි 19 වැනි සිකුරාදා 02:27:48 +0530] /home/sumedhe/.local/bin/combine_lang_model --input_unicharset /tmp/tmp.3KQffzcZmX/sin/sin.unicharset --script_dir ../langdata --words ../langdata/sin/sin.wordlist --numbers ../langdata/sin/sin.numbers --puncs ../langdata/sin/sin.punc --output_dir /home/sumedhe/tesstutorial/sintrain --lang sin --pass_through_recoder
Loaded unicharset of size 117 from file /tmp/tmp.3KQffzcZmX/sin/sin.unicharset
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 8 = ්
Warning: properties incomplete for index 22 = ි
Warning: properties incomplete for index 27 = ු
Warning: properties incomplete for index 31 = ී
Warning: properties incomplete for index 48 = ූ
Config file is optional, continuing...
Failed to read data from: ../langdata/sin/sin.config
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Moving /tmp/tmp.3KQffzcZmX/sin/sin.Iskoola_Pota.exp0.lstmf to /home/sumedhe/tesstutorial/sintrain
Completed training for language 'sin'
sumedhe@vm-linux:~/tesseract-ocr/tesseract/training$ ./tesstrain.sh --fonts_dir /usr/share/fonts --lang sin --linedata_only \
> --noextract_font_properties --langdata_dir ../langdata \
> --tessdata_dir ../tessdata \
> --fontlist "Iskoola Pota" --output_dir ~/tesstutorial/sineval
=== Starting training for language 'sin'
[2018 ජනවාරි 19 වැනි සිකුරාදා 02:28:26 +0530] /home/sumedhe/.local/bin/text2image --fonts_dir=/usr/share/fonts --font=Iskoola Pota --outputbase=/tmp/font_tmp.Ul8AYkBWaO/sample_text.txt --text=/tmp/font_tmp.Ul8AYkBWaO/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.Ul8AYkBWaO
Rendered page 0 to file /tmp/font_tmp.Ul8AYkBWaO/sample_text.txt.tif
=== Phase I: Generating training images ===
Rendering using Iskoola Pota
[2018 ජනවාරි 19 වැනි සිකුරාදා 02:28:48 +0530] /home/sumedhe/.local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.Ul8AYkBWaO --fonts_dir=/usr/share/fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.XqczAxMf9Z/sin/sin.Iskoola_Pota.exp0 --max_pages=3 --font=Iskoola Pota --text=../langdata/sin/sin.training_text
Rendered page 0 to file /tmp/tmp.XqczAxMf9Z/sin/sin.Iskoola_Pota.exp0.tif
Stripped 1 unrenderable words
Rendered page 1 to file /tmp/tmp.XqczAxMf9Z/sin/sin.Iskoola_Pota.exp0.tif
Rendered page 2 to file /tmp/tmp.XqczAxMf9Z/sin/sin.Iskoola_Pota.exp0.tif
=== Phase UP: Generating unicharset and unichar properties files ===
[2018 ජනවාරි 19 වැනි සිකුරාදා 02:28:50 +0530] /home/sumedhe/.local/bin/unicharset_extractor --output_unicharset /tmp/tmp.XqczAxMf9Z/sin/sin.unicharset --norm_mode 2 /tmp/tmp.XqczAxMf9Z/sin/sin.Iskoola_Pota.exp0.box
Extracting unicharset from box file /tmp/tmp.XqczAxMf9Z/sin/sin.Iskoola_Pota.exp0.box
Wrote unicharset file /tmp/tmp.XqczAxMf9Z/sin/sin.unicharset
[2018 ජනවාරි 19 වැනි සිකුරාදා 02:28:50 +0530] /home/sumedhe/.local/bin/set_unicharset_properties -U /tmp/tmp.XqczAxMf9Z/sin/sin.unicharset -O /tmp/tmp.XqczAxMf9Z/sin/sin.unicharset -X /tmp/tmp.XqczAxMf9Z/sin/sin.xheights --script_dir=../langdata
Loaded unicharset of size 117 from file /tmp/tmp.XqczAxMf9Z/sin/sin.unicharset
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 8 = ්
Warning: properties incomplete for index 22 = ි
Warning: properties incomplete for index 27 = ු
Warning: properties incomplete for index 31 = ී
Warning: properties incomplete for index 48 = ූ
Writing unicharset to file /tmp/tmp.XqczAxMf9Z/sin/sin.unicharset
=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=../tessdata
[2018 ජනවාරි 19 වැනි සිකුරාදා 02:28:50 +0530] /home/sumedhe/.local/bin/tesseract /tmp/tmp.XqczAxMf9Z/sin/sin.Iskoola_Pota.exp0.tif /tmp/tmp.XqczAxMf9Z/sin/sin.Iskoola_Pota.exp0 lstm.train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Page 1
Page 2
Loaded 52/52 pages (1-52) of document /tmp/tmp.XqczAxMf9Z/sin/sin.Iskoola_Pota.exp0.lstmf
Page 3
Loaded 104/104 pages (1-104) of document /tmp/tmp.XqczAxMf9Z/sin/sin.Iskoola_Pota.exp0.lstmf
=== Constructing LSTM training data ===
[2018 ජනවාරි 19 වැනි සිකුරාදා 02:28:54 +0530] /home/sumedhe/.local/bin/combine_lang_model --input_unicharset /tmp/tmp.XqczAxMf9Z/sin/sin.unicharset --script_dir ../langdata --words ../langdata/sin/sin.wordlist --numbers ../langdata/sin/sin.numbers --puncs ../langdata/sin/sin.punc --output_dir /home/sumedhe/tesstutorial/sineval --lang sin --pass_through_recoder
Loaded unicharset of size 117 from file /tmp/tmp.XqczAxMf9Z/sin/sin.unicharset
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 8 = ්
Warning: properties incomplete for index 22 = ි
Warning: properties incomplete for index 27 = ු
Warning: properties incomplete for index 31 = ී
Warning: properties incomplete for index 48 = ූ
Config file is optional, continuing...
Failed to read data from: ../langdata/sin/sin.config
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Moving /tmp/tmp.XqczAxMf9Z/sin/sin.Iskoola_Pota.exp0.lstmf to /home/sumedhe/tesstutorial/sineval
Completed training for language 'sin'
sumedhe@vm-linux:~/tesseract-ocr/tesseract/training$ cd -
/home/sumedhe/tesseract-ocr/tesseract/java
sumedhe@vm-linux:~/tesseract-ocr/tesseract/java$ ../training/lstmtraining --debug_interval 100 \
> --traineddata ~/tesstutorial/sintrain/sin/sin.traineddata \
> --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
> --model_output ~/tesstutorial/sinoutput/base --learning_rate 20e-4 \
> --train_listfile ~/tesstutorial/sintrain/sin.training_files.txt \
> --eval_listfile ~/tesstutorial/sineval/sin.training_files.txt \
> --max_iterations 5000
Warning: given outputs 111 not equal to unicharset of 117.
Num outputs,weights in Series:
1,36,0,1:1, 0
Num outputs,weights in Series:
C3,3:9, 0
Ft16:16, 160
Total weights = 160
[C3,3Ft16]:16, 160
Mp3,3:16, 0
Lfys48:48, 12480
Lfx96:96, 55680
Lrx96:96, 74112
Lfx256:256, 361472
Fc117:117, 30069
Total weights = 533973
Built network:[1,36,0,1[C3,3Ft16]Mp3,3Lfys48Lfx96Lrx96Lfx256Fc117] from request [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]
Training parameters:
Debug interval = 100, weights = 0.1, learning rate = 0.002, momentum=0.5
null char=2
Loaded 156/156 pages (1-156) of document /home/sumedhe/tesstutorial/sintrain/sin.Iskoola_Pota.exp0.lstmf
Loaded 156/156 pages (1-156) of document /home/sumedhe/tesstutorial/sineval/sin.Iskoola_Pota.exp0.lstmf
Encoding of string failed! Failure bytes: ffffffe0 ffffffb7 ffffff8a ffffffe0 ffffffb6 ffffffa1 ffffffe0 ffffffb7 ffffff9a ffffffe0 ffffffb6 ffffffaf ffffffe0 ffffffb6 ffffffba 20 ffffffe0 ffffffb7 ffffff83 ffffffe0 ffffffb7 ffffff92 ffffffe0 ffffffb6 ffffffaf ffffffe0 ffffffb7 ffffff94 ffffffe0 ffffffb7 ffffff80 ffffffe0 ffffffb6 ffffffb1 20 ffffffe0 ffffffb7 ffffff80 ffffffe0 ffffffb7 ffffff93 20 ffffffe0 ffffffb6 ffffff87 ffffffe0 ffffffb6 ffffffb3 ffffffe0 ffffffb7 ffffff93 ffffffe0 ffffffb6 ffffffb8 20 ffffffe0 ffffffb6 ffffffad ffffffe0 ffffffb6 ffffffb8 20 ffffffe0 ffffffb6 ffffffbb ffffffe0 ffffffb7 ffffff96 ffffffe0 ffffffb6 ffffffb4 20 ffffffe0 ffffffb6 ffffffb8 ffffffe0 ffffffb7 ffffff9a ffffffe0 ffffffb7 ffffff80 ffffffe0 ffffffb7 ffffff8f
Can't encode transcription: 'පරිච්ඡේදය සිදුවන වී ඇඳීම තම රූප මේවා' in language ''
Encoding of string failed! Failure bytes: ffffffe0 ffffffb7 ffffff8a ffffffe0 ffffffb7 ffffff80 ffffffe0 ffffffb6 ffffffba 20 ffffffe0 ffffffb6 ffffffbb ffffffe0 ffffffb7 ffffff9d ffffffe0 ffffffb6 ffffffb8 20 ffffffe0 ffffffb6 ffffff89 ffffffe0 ffffffb7 ffffff84 ffffffe0 ffffffb6 ffffffad 20 ffffffe0 ffffffb6 ffffffb8 ffffffe0 ffffffb7 ffffff99 ffffffe0 ffffffb7 ffffff84 ffffffe0 ffffffb7 ffffff92 20 ffffffe0 ffffffb6 ffffffa7 ffffffe0 ffffffb7 ffffff90 ffffffe0 ffffffb6 ffffff82 ffffffe0 ffffffb6 ffffffa2 ffffffe0 ffffffb6 ffffffb1 20 ffffffe0 ffffffb6 ffffffaf ffffffe0 ffffffb7 ffffff99 ffffffe0 ffffffb6 ffffff9a
Can't encode transcription: 'ඓතිහාසිකත්වය රෝම ඉහත මෙහි ටැංජන දෙක' in language ''
Encoding of string failed! Failure bytes: ffffffe0 ffffffb7 ffffff8a ffffffe0 ffffffb6 ffffffba ffffffe0 ffffffb7 ffffff9a ffffffe0 ffffffb7 ffffff82 ffffffe0 ffffffb6 ffffffab 20 ffffffe0 ffffffb6 ffffffb0 ffffffe0 ffffffb7 ffffff96 ffffffe0 ffffffb6 ffffffbb 20 ffffffe0 ffffffb7 ffffff80 ffffffe0 ffffffb7 ffffff90 ffffffe0 ffffffb7 ffffff85 ffffffe0 ffffffb7 ffffff90 ffffffe0 ffffffb6 ffffff9a ffffffe0 ffffffb7 ffffff8a ffffffe0 ffffffb7 ffffff80 ffffffe0 ffffffb7 ffffff93 ffffffe0 ffffffb6 ffffffb8 ffffffe0 ffffffb7 ffffff9a 20 32 30 30 36
Can't encode transcription: 'කෝණික වන පර්යේෂණ ධූර වැළැක්වීමේ 2006' in language ''
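
The "Failure bytes" dumps above are hard to read as printed. Each token looks like a single byte that was widened to a signed int before being printed in hex, which would explain the ffffff prefix on every byte >= 0x80. Assuming that is the case, masking each token to its low 8 bits and decoding the result as UTF-8 should recover the text at which encoding stopped. The sketch below does this for the start of the first dump; the helper name decode_failure_bytes is hypothetical and is not part of Tesseract.

```python
import unicodedata


def decode_failure_bytes(dump: str) -> str:
    """Convert a 'Failure bytes:' hex dump back into the text it encodes.

    Assumption: each whitespace-separated token is one byte printed as a
    sign-extended int, so masking with 0xFF recovers the original byte.
    """
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    return raw.decode("utf-8")


# First 16 tokens of the first dump in the log above.
sample = (
    "ffffffe0 ffffffb7 ffffff8a ffffffe0 ffffffb6 ffffffa1 "
    "ffffffe0 ffffffb7 ffffff9a ffffffe0 ffffffb6 ffffffaf "
    "ffffffe0 ffffffb6 ffffffba 20"
)

text = decode_failure_bytes(sample)
codepoints = [ord(ch) for ch in text]

# List each character with its code point and Unicode name.
for ch in text:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch, '<unnamed>')}")
```

All three dumps in the log begin with the same bytes, e0 b7 8a, which decode to U+0DCA. Comparing the decoded text with the unicharset entries (for example, checking whether U+0DCA only occurs there as part of a longer unichar) may help narrow down why these transcriptions cannot be encoded.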