Error on combine_lang_model script; Null char=2 Invalid format in radical table at line 4: 3400 1.4 Creation of encoded unicharset failed!! Error writing recoder!!

Shandigutt

Aug 6, 2018, 12:11:33 AM
to tesseract-ocr
Hi,

I am trying to train Tesseract for the Sinhala language, following the training guidelines in the GitHub wiki. I get an error at the fourth step, "Creating Starter Traineddata". This is the command I ran:

training/combine_lang_model --input_unicharset ../training/sin/sin.unicharset --script_dir ../langdata --words ../langdata/sin/sin.wordlist --puncs ../langdata/sin/sin.punc --numbers ../langdata/sin/sin.numbers --output_dir ../training/combined_sin --version_str 1.0 --lang sin

I get the following output:

Loaded unicharset of size 94 from file ../training/sin/sin.unicharset
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 4 = ී
Warning: properties incomplete for index 6 = ි
Warning: properties incomplete for index 11 = ු
Warning: properties incomplete for index 15 = ්‌
Warning: properties incomplete for index 33 = ූ
Warning: properties incomplete for index 52 = ්‍ර
Warning: properties incomplete for index 56 = ්‍ය
Warning: properties incomplete for index 87 = ක්‍
Warning: properties incomplete for index 93 = ර්‍
Config file is optional, continuing...
Null char=2
Invalid format in radical table at line 4: 3400    1.4
Creation of encoded unicharset failed!!
Error writing recoder!!
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg

For more information, I have attached my sin.unicharset and sin.config files.

I am using the following Tesseract version:

tesseract -v
tesseract 4.00.00dev-696-geba0ae3
 leptonica-1.74.4
  libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8

 Found SSE

I am using the following OS:

uname -a
Linux shandigutt-laptop-ubuntu 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

I would appreciate it if somebody could help me with this.

Thanks
sin.config
sin.unicharset

Shree Devi Kumar

Aug 6, 2018, 12:17:50 AM
to tesser...@googlegroups.com
You are using an old version of Tesseract. Please use the latest version from GitHub.

Make sure you remove/uninstall the old version first.

Your error is related to the radical-stroke file in langdata. Make sure you use the latest version of the langdata repo.

>Invalid format in radical table at line 4: 3400    1.4


zwwts...@gmail.com

Aug 14, 2018, 5:12:43 AM
to tesseract-ocr
I've come across the same error before.
It happened because I simply copied a langdata checkout that had been cloned on Windows over to a Linux server.
As a consequence, the radical-stroke.txt file, which needs "LF" line endings, ended up with "CR LF" line endings.
Everything worked after I converted this file.
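
For anyone who wants to do the same conversion, here is a minimal sketch in Python. It is not part of Tesseract; the helper name crlf_to_lf is made up for illustration, and the path in the last lines is an assumption based on the --script_dir ../langdata option in the original command. It works on raw bytes so the UTF-8 content of the file is left untouched.

```python
from pathlib import Path


def crlf_to_lf(path):
    """Rewrite the file at `path` with LF line endings.

    Returns the number of CR LF pairs that were replaced.
    (Hypothetical helper, not part of Tesseract or langdata.)
    """
    p = Path(path)
    data = p.read_bytes()  # read as bytes so no text decoding is involved
    count = data.count(b"\r\n")
    if count:
        p.write_bytes(data.replace(b"\r\n", b"\n"))
    return count


if __name__ == "__main__":
    # Assumed location of the file; adjust to wherever langdata was copied.
    replaced = crlf_to_lf("../langdata/radical-stroke.txt")
    print(f"replaced {replaced} CR LF line endings")
```

A result of 0 means the file already had LF endings, in which case the error probably has a different cause, such as an outdated langdata checkout.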

On Monday, August 6, 2018 at 12:11:33 PM UTC+8, Shandigutt wrote:

Ryhan Ahmed Tamim

Jun 7, 2020, 7:27:30 AM
to tesseract-ocr
Can you please share the converted radical-stroke.txt file?