Avoid compression in PDF output from images


John Muccigrosso

May 23, 2017, 9:51:42 PM
to tesseract-ocr
I'm using tesseract to output to pdf, using pgm files as input. The resulting PDF shows jpeg compression. Is there any way to avoid this? TIA.

Zdenko Podobný

May 24, 2017, 7:13:26 AM
to tesser...@googlegroups.com
Which tesseract version do you use?

Zdenko

Zdenko Podobný

May 24, 2017, 7:28:43 AM
to tesser...@googlegroups.com
The current logic (tesseract 3.05/4.00) is that flate[1] compression is used for png, and for the rest of the formats the leptonica function l_generateCIDataForPdf[2] is used, which should use jpeg compression only for jpeg files...

[1] https://github.com/tesseract-ocr/tesseract/blob/master/api/pdfrenderer.cpp#L720
[2] https://github.com/DanBloomberg/leptonica/blob/master/src/pdfio2.c#L519

Zdenko

John Muccigrosso

May 24, 2017, 10:35:10 AM
to tesseract-ocr
On Wednesday, May 24, 2017 at 7:13:26 AM UTC-4, zdenop wrote:
Which tesseract version do you use?

Zdenko


tesseract 3.05.00
 leptonica-1.74.1
  libjpeg 8d : libpng 1.6.29 : libtiff 4.0.8 : zlib 1.2.5 

John Muccigrosso

May 24, 2017, 10:42:58 AM
to tesseract-ocr


On Wednesday, May 24, 2017 at 7:28:43 AM UTC-4, zdenop wrote:
The current logic (tesseract 3.05/4.00) is that flate compression is used for png, and for the rest of the formats the leptonica function l_generateCIDataForPdf is used, which should use jpeg compression only for jpeg files...

That's definitely not happening. Here's what I do:

10:35 ~/ > tesseract a11.pgm test pdf
Tesseract Open Source OCR Engine v3.05.00 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
10:40 ~/ > pdfimages -list test.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1099  1705  gray    1   8  jpeg   no        11  0    70    71  397K  22%
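
For anyone who wants to repeat the check above on many files, here is a minimal sketch that parses `pdfimages -list` output and reports images stored with jpeg encoding. It assumes the column layout shown in the listing above (with `enc` as the ninth column) and that poppler's `pdfimages` is on the PATH; the function names are made up for illustration.

```python
import subprocess


def parse_pdfimages_list(text):
    """Return one dict per image row of `pdfimages -list` output."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with two integers (page, num); the header line
        # and the dashed separator do not, so they are skipped.
        if len(fields) >= 9 and fields[0].isdigit() and fields[1].isdigit():
            rows.append({
                "page": int(fields[0]),
                "num": int(fields[1]),
                "type": fields[2],
                "width": int(fields[3]),
                "height": int(fields[4]),
                "color": fields[5],
                "enc": fields[8],  # assumed position of the "enc" column
            })
    return rows


def jpeg_images(text):
    """Return only the rows whose encoding is reported as jpeg."""
    return [row for row in parse_pdfimages_list(text) if row["enc"] == "jpeg"]


def list_pdf_images(pdf_path):
    """Run `pdfimages -list` on a PDF and parse the result."""
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        check=True, capture_output=True, text=True,
    )
    return parse_pdfimages_list(result.stdout)
```

Calling `jpeg_images(...)` on the listing above should return the single 1099x1705 gray image, since its `enc` column reads `jpeg`.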

Tom

May 24, 2017, 11:45:12 AM
to tesseract-ocr
John, please also read the discussion here https://groups.google.com/forum/#!topic/tesseract-ocr/1W-aN8UUs0E

"Tesseract 3.03: PDF-OCR generated PDFs show coding artefacts => do not use lossy (jpg) compression! Use lossless compression (png)!!"


Tesseract should, in my view, never use lossy compression, because this will inevitably introduce coding artefacts around the font characters. Both computer-generated images (characters!) and sharp-edged content (characters!) should not be compressed with lossy compression algorithms.

John Muccigrosso

May 26, 2017, 9:11:18 AM
to tesseract-ocr
Thanks, Tom. I had already seen that thread.

My solution right now is to use convert to go from pgm to png before handing those files off to tesseract. png is smaller than pgm anyway and is lossless, so that makes some sense, and tesseract leaves the pngs alone.

Still, tesseract shouldn't be compressing pgm without being told to. My preference would be for it to leave images alone when putting them into a PDF unless told otherwise, to be honest.
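
The workaround described above (pgm to png first, then tesseract on the png) could be scripted roughly like this. This is only a sketch: it assumes `convert` is the ImageMagick command and that both `convert` and `tesseract` are on the PATH; the helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(pgm_path):
    """Return the two command lines for one input file.

    First convert the pgm to a png (assumed: ImageMagick's `convert`),
    then run tesseract on the png with the `pdf` config, as in
    `tesseract a11.pgm test pdf` earlier in the thread.
    """
    pgm = Path(pgm_path)
    png = pgm.with_suffix(".png")
    out_base = pgm.with_suffix("")  # tesseract adds the .pdf extension itself
    return [
        ["convert", str(pgm), str(png)],
        ["tesseract", str(png), str(out_base), "pdf"],
    ]


def ocr_via_png(pgm_path):
    """Run both steps, stopping if either command fails."""
    for cmd in build_commands(pgm_path):
        subprocess.run(cmd, check=True)
```

For `a11.pgm` this runs `convert a11.pgm a11.png` followed by `tesseract a11.png a11 pdf`. Newer ImageMagick releases may prefer `magick` over `convert`, so adjust the first command if needed.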

Tom

May 26, 2017, 9:32:06 AM
to tesseract-ocr
John,

please also see this issue: https://github.com/tesseract-ocr/tesseract/issues/660

"Implement a way to integrate (original image file, detected text) → searchable PDF"

and these somewhat related OCRmyPDF issues:

https://github.com/jbarlow83/OCRmyPDF/issues/125

"Output PDFs have decreased quality"

https://github.com/jbarlow83/OCRmyPDF/issues/163