Another spurious error message while attempting to train Tesseract.

David Maung

Sep 17, 2019, 3:51:43 PM
to tesseract-ocr
This time I ran the following command to try to prepare one font for training:

src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only --noextract_font_properties --langdata_dir ~/tesstutorial/langdata \
--tessdata_dir ~/tesstutorial/tesseract/tessdata --output_dir ~/tesstutorial/engtrain --fontlist "Courier New" --overwrite

This gets much further than the command in my post titled "Unclear error message when running tesstrain.sh".

It ran for hours and then ended with this:


Page 3302
Loaded 171652/171652 lines (1-171652) of document /tmp/eng-2019-09-16.2SS/eng.Courier_New.exp0.lstmf
src/training/tesstrain_utils.sh: line 72: 12131 Segmentation fault      (core dumped) "${cmd}" "$@" 2>&1
     12132 Done                    | tee -a "${LOG_FILE}"
ERROR: Program tesseract failed. Abort.



Frankly, failing to get the TessTutorial to work after two weeks of attempts is rather unsatisfying, and so are the uninformative messages.

What are the minimum system requirements for this to work? I am using Ubuntu 16.04 in VirtualBox with 8 GB of RAM and 4 cores.

David



Shree Devi Kumar

Sep 17, 2019, 9:57:00 PM
to tesseract-ocr

Page 3302
Loaded 171652/171652 lines (1-171652)

If you are trying the tutorial, I suggest that you run the whole process with a small training text file. The one in the langdata repo for English is less than 100 lines.

Once you get the process working correctly (you need to have all the required files in the right places), you can expand to a larger training text, which is required only for training from scratch or replacing the top layer.
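
One way to try the small-text suggestion is sketched below. It only trims a training text to its first N lines; `make_small_training_text` is a hypothetical helper, not part of Tesseract, and the file paths and the `engtrain_small` directory in the comments are assumptions based on the command quoted earlier in this thread. The sketch also assumes that tesstrain.sh accepts a `--training_text` option pointing at the trimmed file, so check the script's usage text before relying on it.

```shell
#!/usr/bin/env bash
# Sketch only: copy the first N lines of a training text into a smaller file.
# make_small_training_text is a hypothetical helper, not part of Tesseract.
make_small_training_text() {
  local src="$1"               # full training text
  local dst="$2"               # where to write the trimmed copy
  local max_lines="${3:-100}"  # number of lines to keep (default 100)
  head -n "${max_lines}" "${src}" > "${dst}"
}

# Possible use (all paths are assumptions, adjust to your own setup):
#   make_small_training_text ~/tesstutorial/langdata/eng/eng.training_text \
#       /tmp/eng.small.training_text 100
#   src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
#       --noextract_font_properties --langdata_dir ~/tesstutorial/langdata \
#       --tessdata_dir ~/tesstutorial/tesseract/tessdata \
#       --output_dir ~/tesstutorial/engtrain_small --fontlist "Courier New" \
#       --training_text /tmp/eng.small.training_text --overwrite
```

With far fewer lines to render, a run like this should finish quickly, which makes it easier to tell whether a failure comes from missing files or from the size of the job.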



Shree Devi Kumar

Sep 17, 2019, 9:58:34 PM
to tesseract-ocr

On Wed, Sep 18, 2019, 01:21 David Maung <davidm...@gmail.com> wrote: