I work at a SaaS firm which provides cloud storage services specializing in documents. As a part of our service, we try to create PDFs with searchable text layers from scanned documents. When processing PPMs which are created by ImageMagick from the original document, Leptonica mangles the image before it can be OCR'd properly by Tesseract. This results in a PDF unreadable by both human eyes and Tesseract. This only seems to happen for some specific documents.
I have executed Tesseract with the config values "tessedit_write_images 1" and "tessedit_pageseg_mode 0". From my understanding, the second option does not enable OCR at all while processing with Tesseract (which speeds up my test cases) and the first option outputs a .tif debug image which is apparently what Leptonica feeds to Tesseract after processing. That image is also mangled.
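For anyone trying to reproduce this, here is a minimal sketch of how the two config values above could be passed on the command line. It assumes the tesseract CLI's "-c VAR=VALUE" option; the helper name build_tesseract_cmd and the output base "pg_0009" are hypothetical and only there for illustration.

```python
import shlex


def build_tesseract_cmd(image_path, output_base, config_vars):
    """Build an argv list for the tesseract CLI (hypothetical helper).

    Each entry in config_vars becomes one "-c NAME=VALUE" override.
    """
    cmd = ["tesseract", image_path, output_base]
    for name, value in config_vars.items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


# Assumed file names: the PPM from the report and a made-up output base.
cmd = build_tesseract_cmd(
    "pg_0009.ppm",
    "pg_0009",
    {"tessedit_write_images": 1, "tessedit_pageseg_mode": 0},
)

# Print a shell-quoted command line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Building the arguments as a list rather than one string avoids quoting problems when the page paths contain spaces.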
I have extracted a single page from a PDF -- the process works on a page-by-page basis and most of the documents we work with contain highly sensitive information, so I had no other option but to do this. Regardless, it is good sample data. The "pg_0009.ppm" file is the original input fed into Tesseract on the command line which was converted from the original scanned document by ImageMagick. The "tessinput.tif" file is the image produced by the "tessedit_write_images 1" option which is supposed to be OCR'd by Tesseract. This particular page caused a seg fault in Tesseract, something that doesn't usually happen, and I suspect it is because the text is overlapped so many times that the OCR engine has too much to handle.
The sample files are on Google Drive since they are too large for an attachment: https://drive.google.com/file/d/1UCzXYu7iusep-bOD6EcKyBs2qXCqVdu5/view?usp=sharing
I expected Leptonica to leave the image mostly intact so that Tesseract can provide a proper text layer for the output PDF. Alternatively, a configuration option to bypass Leptonica would also solve the problem.
Any and all help is appreciated with this issue. Thanks for reading.
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/7c2b81a0-b44c-4519-84ce-1b864e2d0f7f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Error in fopenWriteStream: stream not opened
Error in pixWrite: stream not opened
convert -depth 16 -density 300 -colorspace RGB -despeckle -flatten -compress lzw -background white -alpha off "/path/pg_0010.pdf" "/path/pg_0010.tif"
tesseract -l eng "/path/pg_0010.tif" "/path/pg_0010" pdf
tessedit_create_pdf 1
tessedit_pageseg_mode 3
tessedit_write_images true
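The three settings above follow the one "name value" pair per line layout of a Tesseract config file. As a rough sketch, such a file could be generated from a dictionary like this; the helper name render_tesseract_config is hypothetical, and how the resulting file is handed to tesseract is left out here.

```python
def render_tesseract_config(config_vars):
    """Render config variables as "name value" lines (hypothetical helper)."""
    lines = [f"{name} {value}" for name, value in config_vars.items()]
    return "\n".join(lines) + "\n"


# The values are copied from the settings quoted in this thread.
config_text = render_tesseract_config(
    {
        "tessedit_create_pdf": 1,
        "tessedit_pageseg_mode": 3,
        "tessedit_write_images": "true",
    }
)

# Show the generated file contents; write them to a file of your choice.
print(config_text, end="")
```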
Thanks for replying. I will post an issue on Leptonica's board as well. I was not sure if it was an issue with Leptonica itself or merely a configuration/parameter issue in the way that Tesseract calls it.
tesseract-ocr is already the newest version (4.00~git2844-607e8fd8-2).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.