Leptonica sometimes mangles images when using PDF output mode

Lucas L.

Mar 28, 2019, 2:31:36 PM
to tesseract-ocr

Environment

  • Tesseract 4.0.0-beta.3-249-g607e
  • leptonica-1.76.0
  • Linux (hostname removed) 4.18.0-16-generic #17-Ubuntu SMP Fri Feb 8 00:06:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Current Behavior:

I work at a SaaS firm which provides cloud storage services specializing in documents. As a part of our service, we try to create PDFs with searchable text layers from scanned documents. When processing PPMs which are created by ImageMagick from the original document, Leptonica mangles the image before it can be OCR'd properly by Tesseract. This results in a PDF unreadable by both human eyes and Tesseract. This only seems to happen for some specific documents.

How do I know it's Leptonica, specifically?

I have executed Tesseract with the config values tessedit_write_images 1 and tessedit_pageseg_mode 0. From my understanding, the second option does not enable OCR at all while processing with Tesseract (which speeds up my test cases) and the first option outputs a .tif debug image which is apparently what Leptonica feeds to Tesseract after processing. That image is also mangled.
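
For anyone who wants to repeat this diagnostic, here is a minimal sketch rather than the tooling used in this post. It assumes the `tesseract` binary is on `PATH`, passes the two settings through the `--psm` and `-c name=value` command-line options instead of a config file, and assumes the debug image is written as `tessinput.tif` relative to the working directory. The function names are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def build_debug_command(image, output_base):
    """Build a tesseract command that preprocesses the page without OCR.

    --psm 0 selects orientation/script detection only, and
    -c tessedit_write_images=1 asks tesseract to save its working image.
    """
    return [
        "tesseract", str(image), str(output_base),
        "--psm", "0",
        "-c", "tessedit_write_images=1",
    ]


def dump_debug_image(image):
    """Run the diagnostic in a scratch directory and return the debug image path.

    Assumption: tessinput.tif is written relative to the working directory.
    Returns None if no such file appears.
    """
    image = Path(image).resolve()  # absolute path, because cwd changes below
    workdir = Path(tempfile.mkdtemp(prefix="tess-debug-"))
    subprocess.run(build_debug_command(image, workdir / "out"),
                   cwd=workdir, check=False)
    debug_image = workdir / "tessinput.tif"
    return debug_image if debug_image.exists() else None
```

Running each test in its own scratch directory keeps the debug images of different pages from overwriting one another.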

Sample data

I have extracted a single page from a PDF -- the process works on a page-by-page basis and most of the documents we work with contain highly sensitive information, so I had no other option but to do this. Regardless, it is good sample data. The "pg_0009.ppm" file is the original input fed into Tesseract on the command line which was converted from the original scanned document by ImageMagick. The "tessinput.tif" file is the image produced by the tessedit_write_images 1 option which is supposed to be OCR'd by Tesseract. This particular page caused a seg fault in Tesseract, something that doesn't usually happen, and I suspect it is because the text is overlapped so many times that the OCR engine has too much to handle.

Google Drive since it's too large for an attachment: https://drive.google.com/file/d/1UCzXYu7iusep-bOD6EcKyBs2qXCqVdu5/view?usp=sharing

Expected Behavior:

Leptonica leaves the image mostly intact so that Tesseract can provide a proper text layer for the output PDF. Alternatively, a configuration option is available to bypass Leptonica.

Any and all help is appreciated with this issue. Thanks for reading.

Zdenko Podobny

Mar 29, 2019, 3:17:00 AM
to tesser...@googlegroups.com, Dan Bloomberg
Thanks for the report. Can you use another image format for input?
It seems to be related to the PNM format: after converting your image to TIFF/JPEG/PNG, the PDF output looks correct.


Zdenko
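
Since the problem appears tied to PNM input, one way to compare affected and unaffected files is to look at their headers. The sketch below reads the magic number, dimensions, and maximum sample value from a PGM/PPM header using only the standard library. Whether any of these fields plays a role in the mangling is not established in this thread; this is only a way to gather facts for a bug report. The function names are hypothetical.

```python
def read_pnm_header(data):
    """Parse the header of a PGM/PPM file given as bytes.

    Returns (magic, width, height, maxval). Comments (# to end of line)
    are skipped. Bitmap variants (P1/P4) have no maxval and are rejected.
    """
    tokens = []
    pos = 0
    while len(tokens) < 4 and pos < len(data):
        ch = data[pos:pos + 1]
        if ch == b"#":
            # Skip the comment up to the end of the line.
            while pos < len(data) and data[pos:pos + 1] not in (b"\n", b"\r"):
                pos += 1
        elif ch.isspace():
            pos += 1
        else:
            start = pos
            while (pos < len(data)
                   and not data[pos:pos + 1].isspace()
                   and data[pos:pos + 1] != b"#"):
                pos += 1
            tokens.append(data[start:pos])
    if len(tokens) < 4 or tokens[0] not in (b"P2", b"P3", b"P5", b"P6"):
        raise ValueError("not a PGM/PPM header")
    magic = tokens[0].decode("ascii")
    width, height, maxval = (int(t) for t in tokens[1:])
    return magic, width, height, maxval


def bytes_per_sample(maxval):
    """Binary PGM/PPM stores 1 byte per sample up to 255, 2 bytes above that."""
    return 1 if maxval < 256 else 2
```

Reading the first few hundred bytes of a file in binary mode (for example `open("pg_0009.ppm", "rb").read(512)`) is enough input for `read_pnm_header`.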


On Thu, Mar 28, 2019 at 19:31, Lucas L. <infinit...@gmail.com> wrote:
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/7c2b81a0-b44c-4519-84ce-1b864e2d0f7f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shree Devi Kumar

Mar 29, 2019, 4:06:17 AM
to tesser...@googlegroups.com, dan bloomberg
Please also post as an issue for leptonica at https://github.com/DanBloomberg/leptonica/issues.



--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Lucas L.

Mar 29, 2019, 11:39:30 AM
to tesseract-ocr
Thanks for replying. I will post an issue on Leptonica's board as well. I was not sure if it was an issue with Leptonica itself or merely a configuration/parameter issue in the way that Tesseract calls it.

@zdenop, thanks so much for letting me know that the .TIF format works on your end. The service I am working on is supposed to make a first pass using a compressed image format, then try again with PPM only if it fails. It would appear that there is a code issue with my service and the first pass is failing when it shouldn't. I was also able to get the image to process correctly (with a nicely-read OCR layer, no less) by calling ImageMagick and then Tesseract from the command line. I just took ownership of this service so I do not know it by heart. Regardless, it is good that I was able to discover a possible issue related to PPM processing with Leptonica. 
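
The two-pass behavior described above (compressed format first, PPM only on failure) can be sketched as follows. This is not the service's actual code: `convert_and_ocr` is a hypothetical caller-supplied function that raises an exception when conversion or OCR fails.

```python
def ocr_with_fallback(page, formats, convert_and_ocr):
    """Try each image format in order; return (format, result) on first success.

    convert_and_ocr(page, fmt) is a hypothetical callable that raises an
    exception when conversion or OCR fails for that format.
    """
    errors = []
    for fmt in formats:
        try:
            return fmt, convert_and_ocr(page, fmt)
        except Exception as exc:
            # Keep the reason so that a fallback never happens silently.
            errors.append((fmt, exc))
    raise RuntimeError(f"all formats failed for {page}: {errors}")
```

Recording why the first pass failed matters here, because a first pass that fails when it should not is exactly what sends pages down the PPM path.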

Zdenko Podobny

Mar 29, 2019, 11:49:03 AM
to tesser...@googlegroups.com
BTW: TIFF with LZW compression produced a smaller PDF than PNG or JPEG for this specific image.


Zdenko



Lucas L.

Mar 29, 2019, 12:42:43 PM
to tesseract-ocr
OK, I am running up against another issue, and it's getting weirder. Since Tesseract does not take PDFs as input, this service does the deed of breaking a PDF into pages, and then converting each of those pages to an image format (either lzw-compressed TIFF or uncompressed PPM if that fails). Somehow, if I run ImageMagick and then Tesseract on these pages individually from a command line using the same parameters in the service code, it runs fine processing the TIFF. But when the service runs, I get: 
Error in fopenWriteStream: stream not opened
Error in pixWrite: stream not opened

And the output pdf has all of the pages and they are not mangled... however they are shrunk into a tiny corner of the page. I have attached the resulting file. I feel that it is obvious from the fact that it works when I run it outside the service that it is a code issue... however I really am not sure what it could be doing differently from my command line. The pages come out looking great when I run tesseract on the individual pages manually. The errors do not appear when I run the command lines manually.

The command lines and params I am using:

Convert the input PDF (which is scanned and has no OCR layer) to input image:
convert -depth 16 -density 300 -colorspace RGB -despeckle -flatten -compress lzw -background white -alpha off "/path/pg_0010.pdf" "/path/pg_0010.tif"
Process the input image for OCR and output to PDF:
tesseract -l eng "/path/pg_0010.tif" "/path/pg_0010" pdf

Configuration parameters from /usr/share/tesseract-ocr/4.00/tessdata/configs
tessedit_create_pdf 1
tessedit_pageseg_mode 3
tessedit_write_images true
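
For comparing a manual run against the service, the two command lines above can be expressed as argument lists. This is a minimal sketch, assuming `convert` and `tesseract` are on `PATH`; the function names are hypothetical, and `--psm` is added so that the page segmentation mode does not depend on which config file is picked up.

```python
import subprocess


def convert_command(pdf_path, tif_path):
    """ImageMagick call from the post: one PDF page to an LZW-compressed TIFF."""
    return [
        "convert", "-depth", "16", "-density", "300", "-colorspace", "RGB",
        "-despeckle", "-flatten", "-compress", "lzw",
        "-background", "white", "-alpha", "off",
        str(pdf_path), str(tif_path),
    ]


def tesseract_command(tif_path, output_base, psm=3):
    """Tesseract call from the post, with the segmentation mode made explicit."""
    return [
        "tesseract", "-l", "eng", "--psm", str(psm),
        str(tif_path), str(output_base), "pdf",
    ]


def ocr_page(pdf_path, tif_path, output_base):
    """Run both steps; check=True raises CalledProcessError on a non-zero exit."""
    subprocess.run(convert_command(pdf_path, tif_path), check=True)
    subprocess.run(tesseract_command(tif_path, output_base), check=True)
```

Logging the exact lists returned by these builders from inside the service makes it easier to spot any difference from what was typed at the shell.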


pgc_0009.pdf

Shree Devi Kumar

Mar 29, 2019, 1:00:58 PM
to tesser...@googlegroups.com
The default page segmentation mode is different for the command line and the API. Specify it explicitly and test.


Shree Devi Kumar

Mar 29, 2019, 1:04:17 PM
to tesser...@googlegroups.com

Lucas L.

Mar 29, 2019, 3:24:02 PM
to tesseract-ocr
Well yes, that's because I changed it. It's a config file. Config files are designed to be changed.
I find your suggestion strange because specifying the page seg mode is exactly what I did in my config. Then you told me I shouldn't have changed my config.

Zdenko Podobny

Mar 29, 2019, 3:28:17 PM
to tesser...@googlegroups.com
Dan created an issue -

Please respond there to help resolve the issue.

Zdenko



Lucas L.

Mar 29, 2019, 3:29:19 PM
to tesseract-ocr
Also, please see this issue regarding the default page segmentation mode for PDFs:

Zdenko Podobny

Mar 29, 2019, 3:52:28 PM
to tesser...@googlegroups.com
First of all: upgrade to the latest Tesseract code. A lot of fixes have been implemented in the meantime.

Next: "Error in fopenWriteStream" indicates a problem with writing. Check privileges, disk space, etc. Then try another format (JPEG, PNG) to see if it helps.

Zdenko
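
The privilege and space checks suggested above can be automated and run as the same user as the service. This is a rough sketch with hypothetical names. It also checks the working directory, on the assumption that the debug image enabled by tessedit_write_images is written there; the free-space threshold is an arbitrary illustrative value.

```python
import os
import shutil


def preflight(output_dir, working_dir=None, min_free_bytes=50 * 1024 * 1024):
    """Return a list of problems that could stop files from being written.

    min_free_bytes is an arbitrary illustrative threshold, not a known limit.
    """
    problems = []
    checks = (("output", output_dir), ("working", working_dir or os.getcwd()))
    for label, path in checks:
        if not os.path.isdir(path):
            problems.append(f"{label} directory does not exist: {path}")
            continue
        # Creating a file needs both write and search permission on the directory.
        if not os.access(path, os.W_OK | os.X_OK):
            problems.append(f"{label} directory is not writable: {path}")
        if shutil.disk_usage(path).free < min_free_bytes:
            problems.append(f"{label} directory is low on disk space: {path}")
    return problems
```

An empty list means none of these checks found a problem; it does not rule out other causes such as permissions on the input file itself.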



Lucas L.

Mar 29, 2019, 4:04:19 PM
to tesseract-ocr
OK, I appreciate the suggestion and clarification, but the aptitude package manager doesn't seem to have a later version than the one that I have now. I suppose I should build it from source, but your own page for installing from source suggests using aptitude first. 
tesseract-ocr is already the newest version (4.00~git2844-607e8fd8-2).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
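
A package manager reports "newest" only relative to its configured repositories, so it can help to look at the build string itself. The sketch below parses a version line of the form shown in the Environment section of the first post and flags pre-release builds. It assumes the first line of the `tesseract --version` output has that form; the function names are hypothetical.

```python
import re


def parse_tesseract_version(first_line):
    """Split a line such as 'tesseract 4.0.0-beta.3-249-g607e' into parts.

    Returns ((major, minor, patch), suffix); suffix is '' for a plain release.
    """
    match = re.match(r"tesseract\s+(\d+)\.(\d+)\.(\d+)(?:-(\S+))?",
                     first_line.strip())
    if match is None:
        raise ValueError(f"unrecognized version line: {first_line!r}")
    numbers = tuple(int(g) for g in match.group(1, 2, 3))
    return numbers, match.group(4) or ""


def is_prerelease(first_line):
    """True when the suffix names an alpha, beta, or release-candidate build."""
    _, suffix = parse_tesseract_version(first_line)
    return suffix.startswith(("alpha", "beta", "rc"))
```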

Also, how could there be a permissions issue when the PDF is created, just not sized correctly? I would expect the PDF to not be created at all if that were the case.

Lucas L.

Mar 29, 2019, 4:46:33 PM
to tesseract-ocr
Well, apparently you are correct and it is tied into permissions somehow. I imagine it must need specific permissions for some read/write operations that occur within Leptonica and Tesseract. I was able to reproduce those errors using Tesseract from the command line just now, after I had messed around with the read/write/execute permissions on an input file. I'll keep drilling down...