Error when trying to run lstmtraining: Can't encode transcription



Sep 8, 2018, 2:29:26 PM
to tesseract-ocr

I was trying to run lstmtraining using the command below:

./build/src/training/lstmtraining --debug_interval 100 \
  --traineddata ../training/sintrain/sin/sin.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output /media/shandigutt/UUI/training/base --learning_rate 20e-4 \
  --train_listfile ../training/sintrain/sin.training_files.txt \
  --eval_listfile ../training/sineval/sin.training_files.txt \
  --max_iterations 5000 &> /media/shandigutt/UUI/training/basetrain.log

I got the following output:

Warning: given outputs 111 not equal to unicharset of 90.
Num outputs,weights in Series:
  1,36,0,1:1, 0
Num outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys48:48, 12480
  Lfx96:96, 55680
  Lrx96:96, 74112
  Lfx256:256, 361472
  Fc90:90, 23130
Total weights = 527034
Built network:[1,36,0,1[C3,3Ft16]Mp3,3Lfys48Lfx96Lrx96Lfx256Fc90] from request [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]
Training parameters:
  Debug interval = 100, weights = 0.1, learning rate = 0.002, momentum=0.5
null char=2
Loaded 106/106 pages (1-106) of document ../training/sintrain/sin.BhashitaComplex.exp0.lstmf
Loaded 106/106 pages (1-106) of document ../training/sineval/sin.BhashitaComplex.exp0.lstmf
Encoding of string failed! Failure bytes: ffffffe0 ffffffb7 ffffff8a ffffffe0 ffffffb6 ffffffaf 20 ffffffe0 ffffffb7 ffffff83 ffffffe0 ffffffb6 ffffff82 ffffffe0 ffffffb7 ffffff83 ffffffe0 ffffffb7 ffffff8a ffffffe0 ffffffb6 ffffff9a ffffffe0 ffffffb7 ffffff98 ffffffe0 ffffffb6 ffffffad ffffffe0 ffffffb6 ffffffba ffffffe0 ffffffb7 ffffff9a 20 ffffffe0 ffffffb7 ffffff84 ffffffe0 ffffffb6 ffffffb8 ffffffe0 ffffffb7 ffffff94 20 ffffffe0 ffffffb7 ffffff80 ffffffe0 ffffffb7 ffffff92 ffffffe0 ffffffb6 ffffffba 20 ffffffe0 ffffffb7 ffffff84 ffffffe0 ffffffb7 ffffff90 ffffffe0 ffffffb6 ffffff9a ffffffe0 ffffffb7 ffffff92 20 ffffffe0 ffffffb6 ffffffba 2e 20 ffffffe0 ffffffb7 ffffff83 ffffffe0 ffffffb7 ffffff92 ffffffe0 ffffffb6 ffffff82 ffffffe0 ffffffb7 ffffff84 ffffffe0 ffffffb6 ffffffbd ffffffe0 ffffffb6 ffffffba ffffffe0 ffffffb7 ffffff9a 20 ffffffe0 ffffffb6 ffffffb8 ffffffe0 ffffffb7 ffffff99 ffffffe0 ffffffb6 ffffffb8 20 ffffffe0 ffffffb6 ffffff8d 2c 20 ffffffe0 ffffffb6 ffffff8e 2c 20 ffffffe0 ffffffb6 ffffff8f 2c 20 ffffffe0 ffffffb6 ffffff90 20 ffffffe0 ffffffb6 ffffffba ffffffe0 ffffffb6 ffffffb1 20 ffffffe0 ffffffb6 ffffff85 ffffffe0 ffffffb6 ffffff9a ffffffe0 ffffffb7 ffffff8a ffffffe0 ffffffb7 ffffff82 ffffffe0 ffffffb6 ffffffbb 20 ffffffe0 ffffffb7 ffffff83 ffffffe0 ffffffb7 ffffff84 ffffffe0 ffffffb7 ffffff92 ffffffe0 ffffffb6 ffffffad 20 ffffffe0 ffffffb7 ffffff81 ffffffe0 ffffffb6 ffffffb6 ffffffe0 ffffffb7 ffffff8a ffffffe0 ffffffb6 ffffffaf 20 ffffffe0 ffffffb6 ffffff89 ffffffe0 ffffffb6 ffffffad ffffffe0 ffffffb7 ffffff8f ffffffe0 ffffffb6 ffffffb8 20 ffffffe0 ffffffb7 ffffff80 ffffffe0 ffffffb7 ffffff92 ffffffe0 ffffffb6 ffffffbb ffffffe0 ffffffb7 ffffff85 20 ffffffe0 ffffffb6 ffffffba 2e 20 ffffffe0 ffffffb6 ffffff92 20 ffffffe0 ffffffb6 ffffffb1 ffffffe0 ffffffb7 ffffff92 ffffffe0 ffffffb7 ffffff83 ffffffe0 ffffffb7 ffffff8f 20 ffffffe0 ffffffb6 ffffffaf ffffffe0 ffffffb7 ffffff9d 2c 20 ffffffe0 ffffffb6 ffffff8d 2c 20 ffffffe0 ffffffb6 ffffff8e 2c 
20 ffffffe0 ffffffb6 ffffff8f 2c 20 ffffffe0 ffffffb6 ffffff90
Can't encode transcription: 'ශබ්ද සංස්කෘතයේ හමු විය හැකි ය. සිංහලයේ මෙම ඍ, ඎ, ඏ, ඐ යන අක්ෂර සහිත ශබ්ද ඉතාම විරළ ය. ඒ නිසා දෝ, ඍ, ඎ, ඏ, ඐ' in language ''
Encoding of string failed! Failure bytes: ffffffe0 ffffffb7 ffffff8a ffffffe0 ffffffb7 ffffff80 ffffffe0 ffffffb6 ffffffbb 2c 20 ffffffe0 ffffffb6 ffffff8a ffffffe0 ffffffb6 ffffffad ffffffe0 ffffffb6 ffffffb1 2c 20 ffffffe0 ffffffb6 ffffff8a ffffffe0 ffffffb6 ffffffa2 ffffffe0 ffffffb7 ffffff92 ffffffe0 ffffffb6 ffffffb4 ffffffe0 ffffffb7 ffffff8a ffffffe0 ffffffb6 ffffffad ffffffe0 ffffffb7 ffffff94 ffffffe0 ffffffb7 ffffff80 2c 20 ffffffe0 ffffffb6 ffffff8a ffffffe0 ffffffb6 ffffffa7 20 ffffffe0 ffffffb7 ffffff80 ffffffe0 ffffffb7 ffffff90 ffffffe0 ffffffb6 ffffffb1 ffffffe0 ffffffb7 ffffff92 20 ffffffe0 ffffffb7 ffffff80 ffffffe0 ffffffb6 ffffffa0 ffffffe0 ffffffb6 ffffffb1 20 ffffffe0 ffffffb6 ffffff8a ffffffe0 ffffffb6 ffffffb1 ffffffe0 ffffffb7 ffffff92 ffffffe0 ffffffb6 ffffffaf ffffffe0 ffffffb7 ffffff8a 20 ffffffe0 ffffffb6 ffffffb6 ffffffe0 ffffffb7 ffffff8a ffffffe0 ffffffb6 ffffffbd ffffffe0 ffffffb6 ffffffba ffffffe0 ffffffb7 ffffff92 ffffffe0 ffffffb6 ffffffa7 ffffffe0 ffffffb6 ffffffb1 ffffffe0 ffffffb7 ffffff8a ffffffe0 ffffffb6 ffffff9c ffffffe0 ffffffb7 ffffff9a 20 ffffffe0 ffffffb6 ffffff8a ffffffe0 ffffffb6 ffffffbb ffffffe0 ffffffb7 ffffff92 ffffffe0 ffffffb6 ffffffba 20 ffffffe0 ffffffb6 ffffffb4 ffffffe0 ffffffb7 ffffff92 ffffffe0 ffffffb7 ffffff85 ffffffe0 ffffffb7 ffffff92 ffffffe0 ffffffb6 ffffffb6 ffffffe0 ffffffb6 ffffffb3 ffffffe0 ffffffb7 ffffff80 20 ffffffe0 ffffffb6 ffffff91 ffffffe0 ffffffb6 ffffffb1 20 ffffffe0 ffffffb6 ffffff8a ffffffe0 ffffffb7 ffffff85 ffffffe0 ffffffb6 ffffff9f 20 ffffffe0 ffffffb6 ffffff9a ffffffe0 ffffffb7 ffffff98 ffffffe0 ffffffb6 ffffffad ffffffe0 ffffffb7 ffffff92 ffffffe0 ffffffb6 ffffffba ffffffe0 ffffffb7 ffffff9a ffffffe0 ffffffb6 ffffffad ffffffe0 ffffffb7 ffffff8a 20 ffffffe0 ffffffb6 ffffff87 ffffffe0 ffffffb6 ffffffad ffffffe0 ffffffb7 ffffff94 ffffffe0 ffffffb7 ffffff85 ffffffe0 ffffffb6 ffffffad ffffffe0 ffffffb7 ffffff8a 20 ffffffe0 ffffffb7 ffffff80 ffffffe0 ffffffb6 ffffffb1 ffffffe0 
ffffffb7 ffffff94 20 ffffffe0 ffffffb6 ffffff87 ffffffe0 ffffffb6 ffffffad 2e 20 ffffffe0 ffffffb6 ffffff92 ffffffe0 ffffffb6 ffffffaf ffffffe0 ffffffb6 ffffffab ffffffe0 ffffffb7 ffffff8a ffffffe0 ffffffb6 ffffffa9 ffffffe0 ffffffb7 ffffff99 ffffffe0 ffffffb6 ffffffb1 ffffffe0 ffffffb7 ffffff8a
Can't encode transcription: 'ඊසාන, ඊනියා, ඊශ්වර, ඊතන, ඊජිප්තුව, ඊට වැනි වචන ඊනිද් බ්ලයිටන්ගේ ඊරිය පිළිබඳව එන ඊළඟ කෘතියේත් ඇතුළත් වනු ඇත. ඒදණ්ඩෙන්' in language ''

This kept repeating for many more sentences, until the log file grew very large. Can somebody explain what this issue is? In my command I was using the newly created traineddata file from when I created the training data. At the beginning, the output says "Warning: given outputs 111 not equal to unicharset of 90.", which I think is the problem. If you need any more files from my data set for analysis, please let me know.

For more info, my Tesseract version:
tesseract 4.0.0-beta.4-74-gd8237
  libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
 Found SSE

My OS details:
shandigutt@shandigutt-laptop-ubuntu:/tmp/sin-2018-09-01.E4T$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.1 LTS
Release: 18.04
Codename: bionic


Shree Devi Kumar

Sep 8, 2018, 2:46:54 PM
> Warning: given outputs 111 not equal to unicharset of 90.

Your starter traineddata has a unicharset of 90 entries.
In your --net_spec you have specified the number of unichars as 111 (the 111 in O1c111).
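
As a quick sanity check before starting a run, the two numbers can be compared with a small script. This is only a sketch: the helper name `output_classes` is made up for illustration, it assumes the output layer is the last element of the spec, and the value 90 is simply copied from the warning in the log. The log above also shows the network was built with Fc90 rather than 111 outputs.

```python
import re

def output_classes(net_spec):
    """Return N from a trailing output element such as O1c111 in a net spec string."""
    # Assumption: the output layer is the last element before the closing bracket.
    match = re.search(r"O\d[a-z](\d+)\s*\]\s*$", net_spec)
    if match is None:
        raise ValueError("no output layer found at the end of the net spec")
    return int(match.group(1))

net_spec = "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]"
unicharset_size = 90  # copied from the warning in the log above

requested = output_classes(net_spec)
if requested != unicharset_size:
    print(f"net spec asks for {requested} outputs, unicharset has {unicharset_size}")
```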

> Encoding of string failed! 

It means that some of the characters in the displayed string are NOT in the unicharset of your starter traineddata.
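
To see which characters the "Failure bytes" dump refers to, the hex values can be turned back into text. The sketch below assumes the values are UTF-8 bytes that were printed as signed values (hence the leading ffffff), so it keeps only the low 8 bits of each one; the function name is made up for illustration. The sample is the start of the first dump in the log. It decodes to U+0DCA followed by U+0DAF, while the transcription starts with 'ශබ්ද', so the dump seems to begin partway through the string, possibly at the point where encoding stopped.

```python
def decode_failure_bytes(dump):
    """Turn a 'Failure bytes' hex dump back into text.

    Assumption: each token is one UTF-8 byte printed as a signed value,
    so only the low 8 bits of each token are meaningful.
    """
    values = [int(token, 16) & 0xFF for token in dump.split()]
    return bytes(values).decode("utf-8", errors="replace")

# First seven values of the first dump in the log above.
sample = "ffffffe0 ffffffb7 ffffff8a ffffffe0 ffffffb6 ffffffaf 20"
decoded = decode_failure_bytes(sample)
print([f"U+{ord(ch):04X}" for ch in decoded])
```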

The errors seem to be in the lines from your eval set. It looks like the eval text contains some characters that are not in your training data.
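
A rough way to find such characters is to compare the code points of a failing line against the unicharset. The sketch below assumes the unicharset has been unpacked from the traineddata (for example with combine_tessdata -u) and that its text format is a count on the first line followed by one entry per line, with the unichar as the first field and NULL standing for space. The function names and the toy unicharset are made up for illustration. A code point appearing inside some unichar is necessary but not sufficient for a string to encode, so an empty result does not prove the line is fine.

```python
def load_unichars(unicharset_text):
    """Collect the unichar strings from unicharset file contents.

    Assumed format: first line is the entry count, each following line
    starts with the unichar, and the token NULL represents a space.
    """
    lines = unicharset_text.splitlines()
    count = int(lines[0])
    unichars = set()
    for line in lines[1:count + 1]:
        token = line.split(" ")[0]
        unichars.add(" " if token == "NULL" else token)
    return unichars

def missing_code_points(text, unichars):
    """Return the code points of text that occur in no unichar at all."""
    known = {ch for unichar in unichars for ch in unichar}
    return sorted({ch for ch in text if ch not in known})

# Toy unicharset for illustration only; the property columns are placeholders.
toy_unicharset = "3\nNULL 0 placeholder\n\u0d9a 1 placeholder\n\u0dba 1 placeholder\n"
missing = missing_code_points("\u0d9a\u0dba \u0d8d", load_unichars(toy_unicharset))
print([f"U+{ord(ch):04X}" for ch in missing])
```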

It is also possible that these lines don't meet the Sinhala normalization rules.

