This problem has been bothering me for a few days!
I use tesstrain.sh to generate the training files and the eval files, e.g.:
src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim --linedata_only \
--noextract_font_properties --langdata_dir ../langdata \
--tessdata_dir ./tessdata --output_dir ~/tesstraining/chi_simtrain --fontlist 'STFangsong' 'NotoSerifCJKjp-ExtraLight'
src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim --linedata_only \
--noextract_font_properties --langdata_dir ../langdata \
--tessdata_dir ./tessdata --output_dir ~/tesstraining/chi_simeval --fontlist "STFangsong"
I can successfully train from scratch and evaluate the resulting checkpoint. However, when I evaluate with the officially provided traineddata, it fails. I am training for Simplified Chinese (chi_sim).
The command is:
training/lstmeval --model ../tessdata/chi_sim.traineddata --eval_listfile ~/tessnew/chi_simeval/chi_sim.training_files.txt
The model was downloaded from tessdata_best, and the eval_listfile was generated by tesstrain.sh as above.
The error is as follows:
Truth:救市学校武汉 BLOG信息 www正式灸 姨外科楂洗浴过敏卢斐 Resources 儿需要过
OCR :救市学校武汉 BLOG信息 www正式灸 姨外科楂洗浴过敏卢斐 Resources 儿需要过
Encoding of string failed! Failure bytes: ffffffe6 ffffffbc ffffffa9 ffffffe6 ffffffb6 ffffffa1 20 ffffffe6 ffffffa1 ffffff80 ffffffe9 ffffffaa ffffff9c 54 58 ffffffe5 ffffff90 ffffffaf ffffffe5 ffffff8a ffffffa8 2e ffffffe7 ffffffa7 ffffff8d ffffffe6 ffffffa4 ffffff8d ffffffe5 ffffff9d ffffff91 ffffffe9 ffffff81 ffffff93 30 33 20 ffffffe5 ffffff81 ffffff9a 20 ffffffe9 ffffff92 ffffff88 ffffffe7 ffffff81 ffffffb8 20 ffffffe9 ffffff81 ffffffad ffffffe6 ffffffae ffffff83 20 38 31 3b ffffffe8 ffffff80 ffffff81 ffffffe7 ffffff88 ffffffb7 ffffffe5 ffffff86 ffffff85 ffffffe5 ffffffae ffffffb9
Can't encode transcription: '啊时3全新跑送防伪漩涡 桀骜TX启动.种植坑道03 做 针灸 遭殃 81;老爷内容' in language ''
Encoding of string failed! Failure bytes: ffffffe8 ffffffb8 ffffff9d ffffffe5 ffffff8a ffffffa8 ffffffe6 ffffff80 ffffff81 20 ffffffe5 ffffff9e ffffff84 ffffffe6 ffffff96 ffffffad 31 ffffffe5 ffffff85 ffffffa8 ffffffe9 ffffff9d ffffffa2 34 20 ffffffe8 ffffff84 ffffff82 ffffffe8 ffffff82 ffffffaa 20 5b ffffffe9 ffffff80 ffffff89 ffffffe9 ffffffa1 ffffffb9 20 42 6f 6f 6b 6d 61 72 6b 20 3a ffffffe6 ffffffad ffffffa4 ffffffe5 ffffff9f ffffffba ffffffe7 ffffff9d ffffffa3 ffffffe5 ffffffbe ffffffb7 ffffffe6 ffffffba ffffff90 ffffffe7 ffffffa0 ffffff81 55 6e 69 76 65 72 73 69 74 79 20 ffffffe6 ffffffb3 ffffffb0 ffffffe9 ffffff93 ffffffa2 43 2b 2b
Can't encode transcription: '内踝动态 垄断1全面4 脂肪 [选项 Bookmark :此基督德源码University 泰铢C++' in language ''
Encoding of string failed! Failure bytes: ffffffe5 ffffff97 ffffffa6 20 ffffffe6 ffffffa0 ffffffbc ffffffe5 ffffffae ffffffbd ffffffe6 ffffff81 ffffff95 ffffffe6 ffffffb1 ffffffa4 ffffffe7 ffffff82 ffffffb3 ffffffe6 ffffff9d ffffff83
Can't encode transcription: '上海 汇总顶部网站?Effect 合同do 外国共振页 (铝箔J彼此嗦 格宽恕汤炳权' in language ''
Encoding of string failed! Failure bytes: ffffffe8 ffffff9e ffffffaf 29 20 ffffffe8 ffffff85 ffffffa5 34 38 20 ffffffe8 ffffff81 ffffff94 ffffffe7 ffffffb3 ffffffbb 43 ffffffe6 ffffffb5 ffffffb7 ffffffe9 ffffff9a ffffff90 ffffffe7 ffffff9e ffffff92 ffffffe5 ffffff87 ffffffbb ffffffe4 ffffffb8 ffffff80 ffffffe5 ffffff8f ffffff8d ffffffe5 ffffffba ffffff94 ffffffe8 ffffffaf ffffffb4 ffffffe6 ffffff98 ffffff8e 20 ffffffe4 ffffffba ffffffa4 ffffffe6 ffffff98 ffffff93 32 34 20 ffffffe7 ffffff9b ffffffaf ffffffe6 ffffffa2 ffffffa2 ffffffe5 ffffffbb ffffff96 20 ffffffe8 ffffff81 ffffff98 ffffffe9 ffffffb9 ffffff8a ffffffe6 ffffffa1 ffffffa5
Can't encode transcription: '工具免责羌螯) 腥48 联系C海隐瞒击一反应说明 交易24 盯梢廖 聘鹊桥' in language ''
Truth:有效期也有瘤AZ -04 09尤其火炬and 05酰或温度他乡意网友 用户 数量2 处 ICP
OCR :有效期也有瘤AZ -04 09尤其火炬and 05酰或温度他乡意网友 用户 数量2 处 ICE
Encoding of string failed! Failure bytes: ffffffe6 ffffff9a ffffff84 20 ffffffe6 ffffff96 ffffffbd ffffffe8 ffffffa1 ffffff8c ffffffef ffffffbc ffffff8c ffffffe5 ffffff9c ffffffa8 ffffffe7 ffffffba ffffffbf ffffffe5 ffffff8d ffffff81 ffffffe5 ffffffad ffffff97 ffffffe6 ffffff9e ffffffb6 ffffffe5 ffffff9f ffffff95 30 33 42 61 73 65 64 20 ffffffe7 ffffffbd ffffff91 ffffffe9 ffffffa1 ffffffb5 20 ffffffe5 ffffffad ffffff97 ffffffe8 ffffffb0 ffffff9c ffffffe5 ffffffa4 ffffffaf ffffffe6 ffffff82 ffffffac ffffffe8 ffffff87 ffffff82 ffffffe6 ffffff89 ffffff8d ffffffe8 ffffff83 ffffffbd ffffffe9 ffffff9c ffffff80 ffffffe9 ffffffb2 ffffff86 ffffffe8 ffffff81 ffffff94 ffffffe8 ffffffb0 ffffff8a
Can't encode transcription: '旗下的排名暄 施行,在线十字架埕03Based 网页 字谜夯悬臂才能需鲆联谊' in language ''
Encoding of string failed! Failure bytes: ffffffe6 ffffffb6 ffffff8e ffffffe9 ffffff82 ffffff8b 20 ffffffe6 ffffff9c ffffffba ffffffe5 ffffff9c ffffffba 2e 20 ffffffe5 ffffff8d ffffff87 ffffffe7 ffffffba ffffffa7 20 ffffffe6 ffffff8b ffffffa3 20 ffffffe9 ffffffad ffffff8d ffffffe6 ffffff9d ffffffa5 ffffffe5 ffffffb4 ffffff83 ffffffe5 ffffffa5 ffffff84 20 ffffffe6 ffffff8c ffffff81 ffffffe7 ffffffbb ffffffad ffffffe5 ffffff85 ffffffb1 ffffffe6 ffffff9c ffffff89 3e 20 ffffffe9 ffffff80 ffffff9a ffffffe6 ffffff8a ffffffa5 2d
Can't encode transcription: '出租机关教程峪惨绝人寰涎邋 机场. 升级 拣 魍来崃奄 持续共有> 通报-' in language ''
Truth:矢量 监测vip绀贮存登记2003卑时髦局长 客服 .库原创网络情况。傻瓜12与
OCR :入量 监测vip绀贮存登记2003卑时紧局长 客服 .库原创网络情况。俊瓜12与
Encoding of string failed! Failure bytes: ffffffe8 ffffff8b ffffffa3 ffffffe4 ffffffb8 ffffffb4 ffffffe5 ffffffba ffffff8a ffffffe8 ffffffbf ffffff99 ffffffe5 ffffff8c ffffff88 ffffffe7 ffffff89 ffffff99 ffffffe5 ffffff88 ffffffa9 20 ffffffe8 ffffff8f ffffff9c ffffffe7 ffffff95 ffffffa6 ffffffe6 ffffffa0 ffffff97 ffffffe8 ffffffae ffffffb2 ffffffe8 ffffffaf ffffff9d 20 ffffffe5 ffffffb4 ffffffbd ffffffe5 ffffffad ffffff90 ffffffe5 ffffff92 ffffff8c ffffffe8 ffffffb0 ffffff90 31 34 ffffffe9 ffffff80 ffffff9a ffffffe7 ffffff89 ffffff92 33 38 ffffffe6 ffffff9c ffffff80 ffffffe7 ffffffba ffffffba ffffffe7 ffffffbb ffffff87 ffffffe5 ffffff93 ffffff81 ffffffe6 ffffff8e ffffffa5 ffffffe5 ffffff8f ffffff97 ffffffe6 ffffff88 ffffff91 ffffffe4 ffffffbb ffffffac 20 ffffffe6 ffffff96 ffffffb9 ffffffe4 ffffffbe ffffffbf
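The "Failure bytes" dumps are hard to read because every byte at or above 0x80 is printed sign-extended (ffffffe6 instead of e6). A minimal Python sketch for turning a dump back into text follows; the function name `decode_failure_bytes` is made up for illustration, and it assumes the dump is UTF-8 with one hex token per byte.

```python
def decode_failure_bytes(dump):
    """Turn a 'Failure bytes:' hex dump from the log back into text."""
    raw = bytearray()
    for token in dump.split():
        # High bytes are logged sign-extended (e.g. ffffffe6),
        # so keep only the low 8 bits of each token.
        raw.append(int(token, 16) & 0xFF)
    # errors="replace" keeps going if a dump was cut off mid-character.
    return raw.decode("utf-8", errors="replace")


# First bytes of the third dump in the log above.
dump = "ffffffe5 ffffff97 ffffffa6 20 ffffffe6 ffffffa0 ffffffbc"
print(decode_failure_bytes(dump))
```

Decoded this way, the third dump starts with 嗦, and the whole dump matches the tail of the transcription on the following line ('…嗦 格宽恕汤炳权'). That suggests each dump begins at the first character that could not be encoded, which would make 嗦 the offending character in that line.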
I have googled a lot and learned that it might be a problem with the unicharset, but I don't know how to solve it.
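One way to test the unicharset suspicion is to list the characters in a ground-truth line that have no entry in the model's unicharset. The sketch below is only an approximation under stated assumptions: it assumes the unicharset has been extracted from the traineddata to a text file (for example with `combine_tessdata -u`), that the first line of that file is the entry count, that every later line starts with the unichar followed by a space, and that the space character is written as `NULL`. The helper names `load_unichars` and `missing_chars` and the sample data are made up for illustration. It checks single characters only, so multi-character unichars are not handled.

```python
def load_unichars(unicharset_text):
    """Collect the unichars listed in the text of a unicharset file."""
    units = set()
    # Assumed layout: line 1 is the entry count, each later line
    # begins with the unichar, followed by a space and its properties.
    for line in unicharset_text.splitlines()[1:]:
        if not line.strip():
            continue
        unit = line.split(" ", 1)[0]
        # Assumption: the space character is stored as the word NULL.
        units.add(" " if unit == "NULL" else unit)
    return units


def missing_chars(text, units):
    """Return the distinct non-space characters of text not in units."""
    missing = []
    for ch in text:
        if ch.isspace() or ch in units or ch in missing:
            continue
        missing.append(ch)
    return missing


# Made-up, simplified unicharset content, not taken from chi_sim.traineddata.
sample_unicharset = "3\nNULL 0\n格 1\n宽 2\n"
units = load_unichars(sample_unicharset)
print(missing_chars("嗦 格宽", units))
```

If the characters reported for the real unicharset are the same ones that open the "Failure bytes" dumps, the evaluation text contains characters the tessdata_best model was never given an entry for.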
In addition, when I fine-tune this traineddata (from tessdata_best) with training files that were also generated by tesstrain.sh, it reports the same error.
Can anyone help me? Thanks!