Tesseract error while running combine_lang_model

Piyush Chandra

Apr 8, 2020, 11:12:44 AM

to tesseract-ocr

Hi,

I am trying to create a Hindi traineddata from scratch using eng.traineddata.

I used some PNG and TXT files to create box files with lstmbox, and edited those box files to correct the words.

Then I used lstm.train to create the lstmf files, and created a unicharset file from the box files using unicharset_extractor.

But now, when I use combine_lang_model to get the starter traineddata file, I get an error. Please help.

~/hindiFiles/hindi$ /usr/local/bin/combine_lang_model --input_unicharset ./langdata/hin/hin.unicharset --script_dir ./langdata --words ./langdata/hin.wordlist --numbers ./langdata/hin.numbers --puncs ./langdata/hin.punc --output_dir /home/piyush/hindiFiles/hindi/langdata/ --lang hin
Loaded unicharset of size 39 from file ./langdata/hin/hin.unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:./langdata/Latin.unicharset
Failed to load script unicharset from:./langdata/Devanagari.unicharset
Warning: properties incomplete for index 3 = मे
Warning: properties incomplete for index 4 = रा
Warning: properties incomplete for index 5 = ना
Warning: properties incomplete for index 6 = म
Warning: properties incomplete for index 7 = पी
Warning: properties incomplete for index 8 = यू
Warning: properties incomplete for index 9 = ष
Warning: properties incomplete for index 10 = है
Warning: properties incomplete for index 11 = ।
Warning: properties incomplete for index 12 = हाँ
Warning: properties incomplete for index 13 = ,
Warning: properties incomplete for index 14 = मु
Warning: properties incomplete for index 15 = झे
Warning: properties incomplete for index 16 = भू
Warning: properties incomplete for index 17 = ख
Warning: properties incomplete for index 18 = ल
Warning: properties incomplete for index 19 = गी
Warning: properties incomplete for index 20 = तु
Warning: properties incomplete for index 21 = म्‌
Warning: properties incomplete for index 22 = हा
Warning: properties incomplete for index 23 = क्‌
Warning: properties incomplete for index 24 = या
Warning: properties incomplete for index 25 = कै
Warning: properties incomplete for index 26 = से
Warning: properties incomplete for index 27 = हो
Warning: properties incomplete for index 28 = ?
Warning: properties incomplete for index 29 = क
Warning: properties incomplete for index 30 = ब
Warning: properties incomplete for index 31 = त
Warning: properties incomplete for index 32 = आ
Warning: properties incomplete for index 33 = ओ
Warning: properties incomplete for index 34 = गे
Warning: properties incomplete for index 35 = नीं
Warning: properties incomplete for index 36 = द
Warning: properties incomplete for index 37 = र
Warning: properties incomplete for index 38 = ही
Config file is optional, continuing...
Failed to read data from: ./langdata/hin/hin.config
Failed to read data from: ./langdata/radical-stroke.txt
Error reading radical code table ./langdata/radical-stroke.txt

Piyush Chandra

Apr 8, 2020, 11:29:58 AM

to tesseract-ocr

When I downloaded Devanagari.unicharset, Latin.unicharset and radical-stroke.txt, it worked. What are these files, and why do we need them? Do we need to use them every time we work on a new language, or do we need to create our own?

Shree Devi Kumar

Apr 8, 2020, 12:28:37 PM

to tesseract-ocr

Why do you want to fine-tune eng to get a Hindi traineddata?

You can fine-tune hin.traineddata or script/Devanagari.traineddata.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/77cf0099-a40e-4186-b76c-b844832e2240%40googlegroups.com.

Piyush Chandra

Apr 8, 2020, 2:41:11 PM

to tesseract-ocr

Hi Shree,

I am actually learning how to create traineddata for new languages. I would also like to contribute to Tesseract.

That is why I am working through this. I have followed all your posts as well as your projects on GitHub. (I wanted to thank you for helping and contributing so many things online :))

I have already tried fine-tuning the English language. Is there any information about why we need these files (Devanagari.unicharset, Latin.unicharset and radical-stroke.txt)? And do we need these files for a new language like Chhattisgarhi, or any other language that is not yet available for Tesseract?

Any help will be appreciated.

Shree Devi Kumar

Apr 8, 2020, 11:32:43 PM

to tesseract-ocr

Devanagari.unicharset, Latin.unicharset and radical-stroke.txt

The script unicharsets are used to set character properties. For most scripts they are already available in langdata_lstm. I don't think they are mandatory for LSTM training, but copying them once avoids the warning messages.

radical-stroke.txt is used only for CJK languages, but Tesseract checks for it during the training process, so you need to make it available.

For Chhattisgarhi, if you are training it as written in Devanagari, I suggest training from script/Devanagari.traineddata rather than English.

Please note that if you are starting from scratch, you don't need a starting traineddata. If you use one, you are fine-tuning.

Finally, you need to use the correct mode for Indic languages with unicharset_extractor. Your unicharset should contain Unicode codepoints, not aksharas (consonant and vowel sign combinations).
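
To illustrate the last point: an akshara such as मे is a sequence of separate Unicode code points (a consonant followed by a vowel sign). The following is a minimal sketch using only Python's standard unicodedata module; the helper name codepoints is made up for this illustration and is not part of Tesseract. It lists the code points behind three of the entries from the warning list earlier in the thread.

```python
import unicodedata

def codepoints(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

# Three aksharas from the warning list, written with escapes for clarity.
samples = [
    "\u092e\u0947",        # MA + VOWEL SIGN E
    "\u0930\u093e",        # RA + VOWEL SIGN AA
    "\u0939\u093e\u0901",  # HA + VOWEL SIGN AA + CANDRABINDU
]

for akshara in samples:
    print(akshara)
    for code, name in codepoints(akshara):
        print("   ", code, name)
```

Each code point printed here would be a separate unicharset entry, which matches the shape of the existing Devanagari unicharsets.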

Piyush Chandra

Apr 9, 2020, 2:45:09 AM

to tesseract-ocr

Thank you Shree for giving the overview.

Could you please help me understand your last point, "Your unicharset should have Unicode codepoints"? What does that mean? An example would be helpful. I was actually using aksharas (see the attached box file image).

box hindi.PNG

Shree Devi Kumar

Apr 9, 2020, 3:00:12 AM

to tesseract-ocr

Unicharset will look like the following:

द 1 34,72,192,192,100,122,0,0,99,114 Devanagari 11 0 11 द # द [926 ]x
र 1 58,64,192,192,84,119,0,0,81,110 Devanagari 12 0 12 र # र [930 ]x
् 0 3,32,61,197,12,181,0,0,0,1 Devanagari 13 17 13 ् # ् [94d ]
श 1 61,64,192,195,128,148,0,12,130,147 Devanagari 14 0 14 श # श [936 ]x
य 1 63,64,192,192,114,142,0,0,111,133 Devanagari 15 0 15 य # य [92f ]x
त 1 61,64,192,192,112,135,0,0,110,126 Devanagari 16 0 16 त # त [924 ]x
ि 0 62,65,228,253,132,279,0,0,40,65 Devanagari 17 0 17 ि # ि [93f ]
प 1 63,64,192,192,98,126,0,0,97,119 Devanagari 18 0 18 प # प [92a ]x
ू 0 1,35,67,197,33,193,0,0,0,1 Devanagari 19 17 19 ू # ू [942 ]
ज 1 63,64,192,192,138,165,0,0,128,157 Devanagari 20 0 20 ज # ज [91c ]x

You can unpack any of the existing traineddatas from tessdata_best or tessdata_fast and check.

combine_tessdata -u

and look at the lstm-unicharset among the unpacked components.

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Shree Devi Kumar

Apr 9, 2020, 3:04:38 AM

to tesseract-ocr

# Normalization mode - 2, 1 - for unicharset_extractor and Pass through Recoder for combine_lang_model

ifeq ($(LANG_TYPE),Indic)
NORM_MODE =2
RECODER =--pass_through_recoder
endif

Piyush Chandra

Apr 9, 2020, 7:13:53 AM

to tesseract-ocr

Thanks for the help! :)

Piyush Chandra

Apr 14, 2020, 7:31:20 AM

to tesseract-ocr

Hi Shree,

When I ran the unicharset_extractor command, I got these errors:

unicharset_extractor --norm_mode 2 --output_unicharset min.unicharset hin.exp1.box

Extracting unicharset from box file hin.exp1.box
Invalid start of grapheme sequence:M=0x93e
Normalization failed for string 'αñ╛'
Invalid start of grapheme sequence:D=0x901
Normalization failed for string 'αñü'
Invalid start of grapheme sequence:M=0x941
Normalization failed for string 'αÑü'
Invalid start of grapheme sequence:M=0x947
Normalization failed for string 'αÑç'
Invalid start of grapheme sequence:M=0x940
Normalization failed for string 'αÑÇ'
Invalid start of grapheme sequence:M=0x948
Normalization failed for string 'αÑê'
Mirror ] of [ is not in unicharset
Wrote unicharset file min.unicharset

The box file used was:

ह 28 33 261 74 0
ा 28 33 261 74 0
ँ 28 33 261 74 0
, 28 33 261 74 0
28 33 261 74 0
म 28 33 261 74 0
ु 28 33 261 74 0
झ 28 33 261 74 0
े 28 33 261 74 0
28 33 261 74 0
[ 28 33 261 74 0
ख 28 33 261 74 0
28 33 261 74 0
ल 28 33 261 74 0
ग 28 33 261 74 0
ी 28 33 261 74 0
28 33 261 74 0
ह 28 33 261 74 0
ै 28 33 261 74 0
। 28 33 261 74 0
28 33 261 74 0

Should I just ignore these errors, or am I missing something?
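
Every code point named in the "Invalid start of grapheme sequence" messages above (0x93e, 0x901, 0x941, 0x947, 0x940, 0x948) is a Unicode combining mark, and in the box file each of them sits on a line of its own. The following is a minimal sketch for finding such entries. It assumes each box line has the form "symbol left bottom right top page"; flag_box_lines is a hypothetical helper, not a Tesseract tool.

```python
import unicodedata

def flag_box_lines(lines):
    """Return (line number, code point) for entries whose symbol starts with a combining mark."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        # Split off the five numeric fields from the right; what remains is the symbol.
        fields = line.rsplit(None, 5)
        if len(fields) < 6:
            # No symbol field is present (for example a space entry); nothing to check.
            continue
        first = fields[0][0]
        # Unicode general categories Mn, Mc and Me are all combining marks.
        if unicodedata.category(first).startswith("M"):
            flagged.append((number, f"U+{ord(first):04X}"))
    return flagged

sample = [
    "\u0939 28 33 261 74 0",  # DEVANAGARI LETTER HA
    "\u093e 28 33 261 74 0",  # DEVANAGARI VOWEL SIGN AA on its own line
    "\u0901 28 33 261 74 0",  # DEVANAGARI SIGN CANDRABINDU on its own line
    "28 33 261 74 0",         # entry whose symbol field is missing
]
print(flag_box_lines(sample))
```

On a real file, the same function could be called with the result of open(path, encoding="utf-8").read().splitlines().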

Piyush Chandra

Apr 14, 2020, 4:22:43 PM

to tesseract-ocr

For other files, when I try on Linux, the output looks like this:

unicharset_extractor --norm_mode 2 hin.desk0.box hin.desk1.box
Extracting unicharset from box file hin.desk0.box
Invalid start of grapheme sequence:H=0x94d
Normalization failed for string '्'
Invalid start of grapheme sequence:M=0x93e
Normalization failed for string 'ा'
Invalid start of grapheme sequence:M=0x947
Normalization failed for string 'े'
Invalid start of grapheme sequence:M=0x947
Normalization failed for string 'े'
Invalid start of grapheme sequence:M=0x93e
Normalization failed for string 'ा'
Invalid start of grapheme sequence:M=0x93f
Normalization failed for string 'ि'
Invalid start of grapheme sequence:M=0x94b
Normalization failed for string 'ो'
Invalid start of grapheme sequence:D=0x902
Normalization failed for string 'ं'
Invalid start of grapheme sequence:M=0x940
Normalization failed for string 'ी'
Invalid start of grapheme sequence:M=0x93e
Normalization failed for string 'ा'
Invalid start of grapheme sequence:M=0x947
Normalization failed for string 'े'
Invalid start of grapheme sequence:M=0x948
Normalization failed for string 'ै'
Invalid start of grapheme sequence:D=0x902
Normalization failed for string 'ं'
Invalid start of grapheme sequence:M=0x93f
Normalization failed for string 'ि'

Shree Devi Kumar

Apr 14, 2020, 9:22:48 PM

to tesseract-ocr

How are you creating the box files?

Piyush Chandra

Apr 15, 2020, 12:16:27 AM

to tesseract-ocr

hin.des0.txt
These are the files I used.

To create the box file, I used the command below:

tesseract hin.des0.PNG hin.des0 -l hin lstmbox

hin.des0.PNG

Piyush Chandra

Apr 16, 2020, 9:22:25 AM

to tesseract-ocr

lstmtraining --traineddata ./langdata/min/min.traineddata --model_output ./output --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c73]' --train_listfile ./hin_lstmf/hin.training_files.txt --eval_listfile ./hin_lstmf/hin.training_files.txt --max_iterations 5000

I get the following output:

Num outputs,weights in Series:
1,36,0,1:1, 0
Num outputs,weights in Series:
C3,3:9, 0
Ft16:16, 160
Total weights = 160
[C3,3Ft16]:16, 160
Mp3,3:16, 0
Lfys48:48, 12480
Lfx96:96, 55680
Lrx96:96, 74112
Lfx256:256, 361472
Fc73:73, 18761
Total weights = 522665
Built network:[1,36,0,1[C3,3Ft16]Mp3,3Lfys48Lfx96Lrx96Lfx256Fc73] from request [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c73]
Training parameters:
Debug interval = 0, weights = 0.1, learning rate = 0.001, momentum=0.5
null char=72
Loaded 16/16 lines (1-16) of document ./hin_lstmf/hin.desk0.lstmf
Loaded 19/19 lines (1-19) of document ./hin_lstmf/hin.desk6.lstmf
Loaded 12/12 lines (1-12) of document ./hin_lstmf/hin.desk7.lstmf
Loaded 16/16 lines (1-16) of document ./hin_lstmf/hin.desk0.lstmf
Loaded 12/12 lines (1-12) of document ./hin_lstmf/hin.desk4.lstmf
Loaded 12/12 lines (1-12) of document ./hin_lstmf/hin.desk2.lstmf
Loaded 19/19 lines (1-19) of document ./hin_lstmf/hin.desk5.lstmf
Loaded 14/14 lines (1-14) of document ./hin_lstmf/hin.desk1.lstmf
Loaded 20/20 lines (1-20) of document ./hin_lstmf/hin.desk3.lstmf
Loaded 14/14 lines (1-14) of document ./hin_lstmf/hin.desk1.lstmf
Encoding of string failed! Failure bytes: e2 80 8d e0 a5 8d e0 a4 a4 e0 a5 8d e0 a4 b0 20 e0 a4 a7 e0 a4 b0 e0 a5 87 20 e0 a4 95 e0 a5 87 20 e0 a4 a7 e0 a4 b0 e0 a5 87 20 e0 a4 b0 e0 a4 b9 20 e0 a4 97 e0 a4 af e0 a5 87 20 e0 a4 b9 e0 a5 88 e0 a4 82 e0 a5 a4 e0 a4 95 e0 a5 81 e0 a4 9b 20 e0 a4 ad e0 a5 80 20 e0 a4 95 e0 a4 be e0 a4 ae 20 e0 a4 a8 e0 a4 b9 e0 a5 80 e0 a4 82 20 e0 a4 86 20 e0 a4 b0 e0 a4 b9 e0 a4 be 20 e0 a4 b9 e0 a5 88 e0 a5 a4 e0 a4 b8 e0 a5 8d e0 a4 a5 e0 a4 bf e0 a4 a4 e0 a4 bf 20 e0 a4 a4 e0 a5 8b 20 e0 a4 af e0 a4 b9 20 e0 a4 b9 e0 a5 8b 20 e0 a4 97 e0 a4 af e0 a5 80 20 e0 a4 b9 e0 a5 88 20 e0 a4 95 e0 a4 bf 20 e0 a4 9c e0 a5 8b 20 e0 a4 9c e0 a4 bf e0 a4 a4 e0 a4 a8 e0 a4 be
Can't encode transcription: 'बेबस और लाचार है।हमारे विकास के सारे तन्‍्त्र धरे के धरे रह गये हैं।कुछ भी काम नहीं आ रहा है।स्थिति तो यह हो गयी है कि जो जितना' in language ''
Encoding of string failed! Failure bytes: e0 a5 a5 21
Can't encode transcription: 'काश आप रहते तो समझ पाते॥!' in language ''
At iteration 100/100/102, Mean rms=8.457%, delta=69.25%, char train=105.103%, word train=100%, skip ratio=2%, New worst char error = 105.103 wrote checkpoint.

Q1> What are these "Failure bytes"?
Q2> What is the effect of "Can't encode transcription"?
Q3> Should I change anything to overcome this issue?

Please find attached the files used.

filesused.zip

Shree Devi Kumar

Apr 16, 2020, 9:56:04 AM

to tesseract-ocr

U+200D ‍ e2 80 8d ZERO WIDTH JOINER

Shree Devi Kumar

Apr 16, 2020, 9:56:15 AM

to tesseract-ocr

U+0965 ॥ e0 a5 a5 DEVANAGARI DOUBLE DANDA
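
The "Failure bytes" printed by lstmtraining earlier in the thread are the UTF-8 encoding of the text starting at the character that could not be encoded. The following is a minimal sketch, using only the Python standard library, of how the leading bytes can be mapped back to code points and names; describe_bytes is a hypothetical helper name.

```python
import unicodedata

def describe_bytes(hex_bytes):
    """Decode space-separated UTF-8 hex bytes and name each resulting character."""
    text = bytes.fromhex(hex_bytes).decode("utf-8")
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

# Leading bytes of the first failure, and the complete second failure, from the log.
for failure in ["e2 80 8d", "e0 a5 a5 21"]:
    print(failure, "->", describe_bytes(failure))
```

The first character in each decoded list is the one where encoding stopped.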

Piyush Chandra

Apr 16, 2020, 10:21:46 AM

to tesseract-ocr

Hi Shree,

Thanks for replying.

So should I remove them from the text file and create the unicharset file after that, or do I have to do something while creating the lstmf files?

Also, will this affect the training if I don't remove them? I saw that training continued, but the best char error was still 100 after 5000 iterations and only went down to 96 after 7800 iterations. Weird. :-\
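
Whichever way the characters end up being handled, it can help to first find out which ground-truth lines contain them. The following is a minimal sketch, assuming the "Can't encode transcription" messages come from characters that are missing from the unicharset. Both the helper name lines_with_unknown_chars and the known_chars set are hypothetical; in practice the set would be filled from the unicharset used for training.

```python
def lines_with_unknown_chars(lines, known_chars):
    """Return (line number, [code points]) for lines containing characters outside known_chars."""
    report = []
    for number, line in enumerate(lines, start=1):
        # Whitespace is ignored; every other character must be in the known set.
        unknown = sorted({ch for ch in line if not ch.isspace() and ch not in known_chars})
        if unknown:
            report.append((number, [f"U+{ord(ch):04X}" for ch in unknown]))
    return report

# Illustrative stand-in for the characters present in a unicharset.
known_chars = set("\u0915\u093e\u0936")   # KA, VOWEL SIGN AA, SHA

ground_truth = [
    "\u0915\u093e\u0936",                 # only known characters
    "\u0915\u093e\u0936 \u0965!",         # contains DOUBLE DANDA and '!'
    "\u0915\u200d\u093e",                 # contains ZERO WIDTH JOINER
]
print(lines_with_unknown_chars(ground_truth, known_chars))
```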

Shree Devi Kumar

Apr 16, 2020, 10:38:52 AM

to tesseract-ocr

You are training from scratch. It will take thousands of iterations. Try fine-tuning.

Piyush Chandra

Apr 27, 2020, 12:38:39 AM

to tesseract-ocr

Hi Shree,

Could you please help me with this issue: https://groups.google.com/forum/#!topic/tesseract-ocr/DvuCBEKoVOo

Sorry for tagging you in this post.

Thanks in advance, ma'am!
