Fine-tuning a few characters: error generating training images


Jingjing Lin

Jun 13, 2019, 3:47:13 PM
to tesseract-ocr
When I tried to create new training data for fine-tuning a few characters, using the command below:
src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang chi_sim --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/train

It's taking forever (I think it's actually stuck in Phase I: Generating training images), rendering page after page to the same .tif file:

Rendered page 1285 to file /tmp/chi_sim-2019-06-13.rk6/chi_sim.AR_PL_UKai_CN.exp0.tif

Rendered page 1286 to file /tmp/chi_sim-2019-06-13.rk6/chi_sim.AR_PL_UKai_CN.exp0.tif


and sometimes gives the error below:

src/training/tesstrain_utils.sh: line 72: 20849 Segmentation fault      (core dumped) "${cmd}" "$@" 2>&1

     20850 Done                    | tee -a ${LOG_FILE}



What's the problem here?

Jingjing Lin

Jun 13, 2019, 4:04:45 PM
to tesseract-ocr
Just before the

src/training/tesstrain_utils.sh: line 72: 20849 Segmentation fault      (core dumped) "${cmd}" "$@" 2>&1

     20850 Done                    | tee -a ${LOG_FILE}


it also shows:

Error in pixCreateNoInit: pix_malloc fail for data

Error in pixCreate: pixd not made
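For context, these two messages come from Leptonica: pixCreateNoInit failed to malloc the pixel buffer for a new image, i.e. text2image ran out of memory while rendering a very large page, and the segfault follows from the NULL pix. A quick way to see how much headroom you have before starting Phase I (Linux-specific sketch, assuming /proc/meminfo is available):

```shell
# Print available memory in kB (Linux only; /proc/meminfo).
# If this is far below the size of the page image being rendered,
# pix_malloc will fail exactly as shown above.
awk '/MemAvailable/ {print $2, $3}' /proc/meminfo
```

Running this periodically while tesstrain.sh is in Phase I shows whether the rendering step is exhausting RAM.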


On Thursday, June 13, 2019 at 3:47:13 PM UTC-4, Jingjing Lin wrote:

Jingjing Lin

Jun 13, 2019, 4:21:28 PM
to tesseract-ocr
I didn't have any problem when following the instructions to add '±' to eng.traineddata. Is it because Chinese has many more characters?

On Thursday, June 13, 2019 at 4:04:45 PM UTC-4, Jingjing Lin wrote:

Jingjing Lin

Jun 13, 2019, 5:50:21 PM
to tesseract-ocr
It turns out it is indeed because the chi_sim.training_text I was using was too large.
I had downloaded it from the langdata_lstm repository rather than the langdata repository, which appears to be the problem. (Sometimes it's bad to be too thorough :) ) The .training_text from langdata is only 199 KB, but the one from langdata_lstm is about 20 MB.

I found this by checking the temporary .tif file generated, which turned out to be 60 MB, far too large.
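If you hit the same symptom, a quick sanity check is to look at the size of your training text before rendering and, if it is tens of MB, truncate it. A minimal sketch using stand-in files in /tmp (the real file would be e.g. langdata/chi_sim/chi_sim.training_text; paths here are examples only):

```shell
# Simulate an oversized training text (stand-in for the ~20 MB
# chi_sim.training_text from langdata_lstm) by repeating one line.
yes "一些训练文本" | head -n 500000 > /tmp/big.training_text
du -h /tmp/big.training_text   # roughly 10 MB here; too big for one render run

# Keep only the first few thousand lines for a fine-tuning run:
head -n 2000 /tmp/big.training_text > /tmp/small.training_text
wc -l < /tmp/small.training_text   # 2000 lines
```

Pointing --langdata_dir at a copy containing the truncated file keeps the rendered .tif files at a manageable size.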

On Thursday, June 13, 2019 at 4:21:28 PM UTC-4, Jingjing Lin wrote:

Peyi Oyelo

Apr 20, 2020, 12:36:32 AM
to tesseract-ocr
Thanks for the insight. I'm experiencing the same issue; my .tif file was also 66 MB.