New issue 477 by kdsfin...@gmail.com: tesseract 3.00 training segmentation
fault
http://code.google.com/p/tesseract-ocr/issues/detail?id=477
What steps will reproduce the problem?
1. tesseract eng.font_0.exp0.tif eng.font_0.exp0 nobatch box.train.stderr
What is the expected output? What do you see instead?
Expecting a .tr file.
Instead I get:
Tesseract Open Source OCR Engine with LibTiff
Segmentation fault
What version of the product are you using? On what operating system?
Tesseract 3.00, Leptonica 1.68, libtiff 3.9.4 on a Linux box
Please provide any additional information below.
I composed the tif and box files so that the single characters sit on the
same baseline with a constant inter-character distance of 5 pixels, one row
of characters per page. The result is highly sensitive to the inter-character
distance. When I rearrange the characters between pages, the segmentation
fault appears.
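For anyone trying to reproduce this, the layout described above (one row per page, shared baseline, fixed gap) can be sketched as a small box-line generator. This is only an illustration: the Glyph struct and MakeBoxLines function are hypothetical names, and the line format "<char> <left> <bottom> <right> <top> <page>" with a bottom-left origin is assumed from the usual Tesseract 3.x box-file convention rather than taken from the attached files.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical description of one glyph: its label and pixel size.
struct Glyph {
    std::string label;
    int width;
    int height;
};

// Build box-file lines for a single row of glyphs on a shared baseline.
// Assumed line format: "<char> <left> <bottom> <right> <top> <page>",
// with the origin at the bottom-left corner of the image.
std::vector<std::string> MakeBoxLines(const std::vector<Glyph>& glyphs,
                                      int start_x, int baseline_y,
                                      int gap, int page) {
    std::vector<std::string> lines;
    int x = start_x;
    for (const Glyph& g : glyphs) {
        std::ostringstream line;
        line << g.label << ' ' << x << ' ' << baseline_y << ' '
             << (x + g.width) << ' ' << (baseline_y + g.height) << ' ' << page;
        lines.push_back(line.str());
        x += g.width + gap;  // constant inter-character distance
    }
    return lines;
}
```

Calling MakeBoxLines with gap set to 5 and a different page index per row would mirror the setup in this report; varying gap is one way to probe the sensitivity mentioned above.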
Attachments:
eng.font_6.exp0.box 620 bytes
eng.font_6.exp0.tif 8.5 KB
Here is the valgrind info:
==16840== Invalid read of size 1
==16840== at 0x45CF956: strstr (in /lib/libc-2.10.1.so)
==16840== by 0x4033745: tesseract::TessBaseAPI::Recognize(ETEXT_STRUCT*)
(baseapi.cpp:568)
==16840== by 0x403488F: tesseract::TessBaseAPI::GetUTF8Text()
(baseapi.cpp:681)
==16840== by 0x804A17F: TesseractImage(char const*, IMAGE*, Pix*, int,
tesseract::TessBaseAPI*, STRING*) (tesseractmain.cpp:140)
==16840== by 0x804A5DF: main (tesseractmain.cpp:343)
==16840== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==16840==
==16840==
==16840== Process terminating with default action of signal 11 (SIGSEGV)
==16840== Access not within mapped region at address 0x0
==16840== at 0x45CF956: strstr (in /lib/libc-2.10.1.so)
==16840== by 0x4033745: tesseract::TessBaseAPI::Recognize(ETEXT_STRUCT*)
(baseapi.cpp:568)
==16840== by 0x403488F: tesseract::TessBaseAPI::GetUTF8Text()
(baseapi.cpp:681)
==16840== by 0x804A17F: TesseractImage(char const*, IMAGE*, Pix*, int,
tesseract::TessBaseAPI*, STRING*) (tesseractmain.cpp:140)
==16840== by 0x804A5DF: main (tesseractmain.cpp:343)
==16840== If you believe this happened as a result of a stack
==16840== overflow in your program's main thread (unlikely but
==16840== possible), you can try to increase the size of the
==16840== main thread stack using the --main-stacksize= flag.
==16840== The main thread stack size used in this run was 8388608.
==16840==
==16840== HEAP SUMMARY:
==16840== in use at exit: 244,200 bytes in 17,390 blocks
==16840== total heap usage: 24,225 allocs, 6,835 frees, 438,699 bytes
allocated
==16840==
==16840== LEAK SUMMARY:
==16840== definitely lost: 0 bytes in 0 blocks
==16840== indirectly lost: 0 bytes in 0 blocks
==16840== possibly lost: 2,052 bytes in 1 blocks
==16840== still reachable: 242,148 bytes in 17,389 blocks
==16840== suppressed: 0 bytes in 0 blocks
==16840== Rerun with --leak-check=full to see details of leaked memory
==16840==
==16840== For counts of detected and suppressed errors, rerun with: -v
==16840== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 5 from 5)
Segmentation fault
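The trace shows strstr reading address 0x0 when called from Recognize, which is consistent with a null char pointer reaching strstr. As a rough sketch of the kind of guard that avoids this class of crash (SafeContains is a hypothetical helper, not the actual Tesseract code or fix):

```cpp
#include <cstring>

// Hypothetical helper, not part of Tesseract: returns true only when both
// pointers are non-null and `needle` occurs somewhere in `haystack`.
bool SafeContains(const char* haystack, const char* needle) {
    if (haystack == nullptr || needle == nullptr) {
        return false;  // passing a null pointer to strstr is undefined behaviour
    }
    return std::strstr(haystack, needle) != nullptr;
}
```

A check like this only hides the symptom; which string was null at baseapi.cpp:568, and why, would still need to be traced in the 3.00 source.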
Comment #2 on issue 477 by zde...@gmail.com: tesseract 3.00 training
segmentation fault
http://code.google.com/p/tesseract-ocr/issues/detail?id=477
I cannot reproduce the problem with current tesseract (r684, aka 3.02) on
openSUSE 12.1 64-bit:
$ tesseract eng.font_6.exp0.tif eng.font_6.exp0 nobatch box.train.stderr
Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 0
APPLY_BOXES:
Boxes read from boxfile: 35
Found 35 good blobs.
TRAINING ... Font name = font_6
Generated training data for 1 words
Maybe it was fixed in the meantime.