ERROR: /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.unicharset does not exist or is not readable

roberty...@gmail.com

Sep 14, 2017, 4:20:26 AM

to tesseract-ocr

Hello,

I'm trying to train a traineddata model with Tesseract 4.0, following the commands in the TrainingTesseract 4.00 tutorial. The first command, which creates the training data, is:

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim --linedata_only \
--noextract_font_properties --langdata_dir ../langdata \
--fontlist "SIMSUN" --tessdata_dir ./tessdata --output_dir ~/tesstutorial/trainspecial

The execution log for this command is:

=== Phase I: Generating training images ===
Rendering using SIMSUN
[2017年 09月 14日星期四 16:01:57 CST] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.whlzhytMkp --fonts_dir=/usr/share/fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.SIMSUN.exp0 --max_pages=3 --font=SIMSUN --text=../langdata/chi_sim/chi_sim.training_text
Rendered page 0 to file /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.SIMSUN.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[2017年 09月 14日星期四 16:01:58 CST] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.unicharset --norm_mode 1 /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.SIMSUN.exp0.box
Extracting unicharset from box file /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.SIMSUN.exp0.box
Invalid Unicode codepoint: 0xffffffe8
IsValidCodepoint(ch):Error:Assert failed:in file normstrngs.cpp, line 225
ERROR: /tmp/tmp.8JcoYdZI17/chi_sim/chi_sim.unicharset does not exist or is not readable

However, this step fails: unicharset_extractor aborts on an assertion, so chi_sim.unicharset is never written. I checked the directory /tmp/tmp.8JcoYdZI17/chi_sim/, and the chi_sim.unicharset file does not exist there.
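
A note on the value in the log: 0xffffffe8 is what the byte 0xE8 becomes when it is read as a signed 8-bit value and widened to 32 bits. 0xE8 is the lead byte of the three-byte UTF-8 sequences for U+8000 to U+8FFF, so it is common in Chinese text. This would be consistent with a raw UTF-8 byte being treated as a whole code point, but that is an inference from the log, not a confirmed diagnosis. A minimal Python sketch of the arithmetic (the helper name widen_signed_byte is made up for illustration):

```python
import struct


def widen_signed_byte(byte_value: int) -> int:
    """Read a byte as a signed 8-bit value, then view it as unsigned 32-bit.

    This mimics sign extension when a signed char is widened to a
    32-bit integer. It is an illustration only, not Tesseract code.
    """
    (signed,) = struct.unpack("b", bytes([byte_value]))
    return signed & 0xFFFFFFFF


# U+8A00 is one example of a CJK character whose UTF-8 form starts with 0xE8.
lead_byte = "\u8a00".encode("utf-8")[0]

print(hex(lead_byte))                     # 0xe8
print(hex(widen_signed_byte(lead_byte)))  # 0xffffffe8
```

ASCII bytes (below 0x80) are unaffected by this widening, which may be why the problem shows up with chi_sim text.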

How can I fix this error? Any help would be appreciated. Thanks.

ShreeDevi Kumar

Sep 14, 2017, 4:30:40 AM

to tesser...@googlegroups.com

This is a known problem with the latest code on GitHub; see https://github.com/tesseract-ocr/tesseract/issues/1114

We are waiting for a fix from Ray.
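
In the meantime, it may be worth ruling out a malformed input file. The following is a minimal Python sketch, assuming the .box file named in the log is meant to be UTF-8 text; the function name find_invalid_utf8_lines is hypothetical. An empty result would only suggest that the box file is not the cause; it would not prove it.

```python
def find_invalid_utf8_lines(path):
    """Return (line_number, raw_bytes) for every line that is not valid UTF-8."""
    bad_lines = []
    # Read in binary mode so that decoding errors can be caught per line.
    with open(path, "rb") as box_file:
        for line_number, raw in enumerate(box_file, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                bad_lines.append((line_number, raw))
    return bad_lines
```

Calling it on the path of chi_sim.SIMSUN.exp0.box and printing the returned list shows which lines, if any, contain bytes that do not decode.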

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

roberty...@gmail.com

Sep 14, 2017, 4:47:27 AM

to tesseract-ocr

Shree, thanks for your reply.

I have another problem in this project that I need your help with:

My data contains some italic characters that need to be recognized, but recognition accuracy for them tends to be low. Can I add italic characters to the training data for our model?

As far as I can tell, italic characters cannot be added to the chi_sim.training_text file at https://github.com/tesseract-ocr/langdata/tree/master/chi_sim.

How should I train on these italic characters?

shree

Sep 16, 2017, 12:51:14 AM

to tesseract-ocr

Please see comment by Ray at https://github.com/tesseract-ocr/tesseract/issues/1074#issuecomment-327814244 regarding training for italics.

shree

Sep 16, 2017, 4:52:47 AM

to tesseract-ocr

https://github.com/tesseract-ocr/tesseract/pull/1134/files should fix it.

shree

Sep 18, 2017, 1:22:41 AM

to tesseract-ocr

Sorry, https://github.com/tesseract-ocr/tesseract/pull/1134/files is not the correct fix.
