Encoding of string failed when fine-tuning to add new fonts for the fas language

john

Jun 30, 2018, 5:53:02 AM
to tesseract-ocr
Encoding of string failed! Failure bytes: ffffffc2 ffffffa9 20 ffffffd8 ffffffa8 ffffffd8 ffffffa7 ffffffd8 ffffffae ffffffd8 ffffffaa ffffffd9 ffffff86 ffffffd8 ffffffa7 20 ffffffd9 ffffff84 ffffffd8 ffffffa7 ffffffd8 ffffffa4 ffffffd8 ffffffb3 20 ffffffdb ffffff8c ffffffd9 ffffff86 ffffffd8 ffffffa7 ffffffd8 ffffffb1 ffffffdb ffffff8c ffffffd8 ffffffa7 20 ffffffd8 ffffffa7 ffffffd8 ffffffa8 20 ffffffd8 ffffffaa ffffffd8 ffffffa8 ffffffd8 ffffffab ffffffd9 ffffff87 20 ffffffd8 ffffffaf ffffffd8 ffffffa7 ffffffd9 ffffff81 ffffffd8 ffffffaa ffffffd8 ffffffb3 ffffffd8 ffffffa7 20 ffffffd9 ffffff86 ffffffdb ffffff8c ffffffd9 ffffff86 ffffffda ffffff86 ffffffd9 ffffff85 ffffffd9 ffffff87 20 ffffffd9 ffffff82 ffffffd9 ffffff84 ffffffd8 ffffffb7 ffffffd9 ffffff85
Can't encode transcription: '۱۹ 2006© باختنا لاؤس یناریا اب تبثه دافتسا نینچمه قلطم' in language ''
^C

When I fine-tune the network for the fas language, I see the error above.
What is wrong with my training?
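
For reference, the "Failure bytes" in the log look like UTF-8 bytes printed as sign-extended hex (`ffffffc2` is the byte `c2`), so they can be turned back into text to see where encoding stopped. Below is a minimal sketch in POSIX shell. `decode_failure_bytes` is a hypothetical helper name, not part of Tesseract, and it assumes every token is a two-digit hex byte, optionally prefixed with `ffffff`.

```shell
# Hypothetical helper (not a Tesseract tool): turn the "Failure bytes" hex dump
# back into text. Assumes each token is one byte, optionally prefixed by "ffffff".
decode_failure_bytes() {
  for tok in $1; do
    byte=${tok#ffffff}                  # strip the sign-extension prefix
    octal=$(printf '%03o' "0x$byte")    # hex -> three-digit octal
    printf '%b' "\\0$octal"             # emit the raw byte
  done
}

# First few bytes of the dump from the log above
decode_failure_bytes "ffffffc2 ffffffa9 20 ffffffd8 ffffffa8 ffffffd8 ffffffa7"
echo
```

The first two bytes, `c2 a9`, are the UTF-8 encoding of `©`, so the dump starts at the copyright symbol in the transcription.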

Shree Devi Kumar

Jun 30, 2018, 6:47:26 AM
to tesser...@googlegroups.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/11d5277e-2ef1-4ae9-8cb3-3f38290c1dfc%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

john

Jun 30, 2018, 7:18:52 AM
to tesseract-ocr
I saw that link. This error occurred many times. How can I prevent it?

Shree Devi Kumar

Jun 30, 2018, 10:35:34 AM
to tesser...@googlegroups.com
Then there must be a mismatch between the unicharset you are using and the training text. For example, check whether the copyright symbol is in your unicharset.
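
A quick way to run that check from the shell is sketched below. It assumes the usual unicharset text layout, where the first line is the entry count and every later line starts with the character followed by a space. The helper name `in_unicharset` and the path in the usage comment are placeholders for illustration, so substitute the unicharset your training run actually uses.

```shell
# Hypothetical helper: succeed if CHAR has its own entry in a unicharset file.
# Assumes line 1 is the entry count and each later line begins with the
# character followed by a space.
in_unicharset() {
  CHAR="$1" awk 'NR > 1 && $1 == ENVIRON["CHAR"] { found = 1 }
                 END { exit !found }' "$2"
}

# Usage (placeholder path):
#   in_unicharset "©" path/to/fas.unicharset && echo "present" || echo "missing"
```

When fine-tuning, the unicharset that matters is the one packed inside the starting traineddata; `combine_tessdata -u` can unpack its components so the file can be inspected this way.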


Shree Devi Kumar

Jun 30, 2018, 10:43:30 AM
to tesser...@googlegroups.com
Also check that there is no tab or other unprintable character in your training text.
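
One way to scan for such characters is sketched below, assuming a POSIX `grep`. The function name `find_control_chars` and the path in the usage comment are placeholders. It prints every line, with its line number, that contains a tab, carriage return, or other ASCII control character.

```shell
# Hypothetical helper: list lines containing ASCII control characters
# (tab, carriage return, etc.). LC_ALL=C makes grep match byte by byte, so
# only the ASCII control bytes (0x00-0x1F, 0x7F) are reported.
find_control_chars() {
  LC_ALL=C grep -n '[[:cntrl:]]' "$1"
}

# Usage (placeholder path):
#   find_control_chars "$langdata_dir/$Lang/$Lang.training_text"
```

The function returns a non-zero status when nothing is found. It does not flag non-ASCII invisible characters such as zero-width joiners, which would need a separate check.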

Which version of Tesseract are you using? Please show the output of:

tesseract -v

john

Jul 2, 2018, 12:45:13 AM
to tesseract-ocr
I use Tesseract 4.0.0-beta.1, downloaded from this link (UB Mannheim).

ran go

Jul 2, 2018, 2:37:01 AM
to tesser...@googlegroups.com
In my opinion, the error depends on the font: some fonts produce no error, but others do.

Shree Devi Kumar

Jul 2, 2018, 10:16:10 AM
to tesser...@googlegroups.com
You can use find_fonts with your training_text to locate the fonts to use.

Modify the following command to match your directory setup and try it:

echo "###### FIND FONTS ######"
# Find fonts which can render your training_text. Run `fc-cache -vf` to refresh cache.
# You can change the minimum coverage % as needed.
# This process can take a while if you have a number of installed fonts.
# Review the generated fontlist and modify, if needed.
# 2000 fonts found. Use a smaller set

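# `|&` is bash shorthand for `2>&1 |`. The grep/sed pipeline keeps the log lines
# containing "raw" and rewrites each as the text before " :" wrapped in single
# quotes, followed by " \".
nice text2image --find_fonts \
  --fonts_dir "$fonts_dir" \
  --text "$langdata_dir/$Lang/$Lang.training_text" \
  --min_coverage 0.999 \
  --render_per_font=false \
  --outputbase "$langdata_dir/$Lang/$Lang" \
  |& grep raw \
  | sed -e 's/ :.*/@ \\/g' \
  | sed -e "s/^/ '/" \
  | sed -e "s/@/'/g" > "$langdata_dir/$Lang/$Lang.fontslist.txt"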

Shree Devi Kumar

Jul 2, 2018, 4:25:33 PM
to tesser...@googlegroups.com

ran go

Jul 3, 2018, 2:48:53 AM
to tesser...@googlegroups.com
I looked at those links, but the problem is still there.
