Reducing output image quality to make PDF smaller

xtro...@gmail.com

Nov 13, 2018, 1:49:09 AM

to tesseract-ocr

I've not used Tesseract in many years until today. I'm very impressed with what I see now.

I need to process a 300 DPI black-and-white PNG image and have Tesseract create an indexed (searchable) PDF file from it. I ran a command line to do this and was very happy with the quality of the result, but I would like to feed it the same high-quality image (about 100 KB) and get a smaller PDF. Currently the PDF is about the same size as the PNG file.

Is it possible to pass a value on the command line that tells Tesseract to compress the image or reduce its quality, while still retaining the good OCR indexing, so that PDF readers still know where in the image each word is?

I'm not so concerned with the image quality in the PDF output, but the OCR indexing matters.

If I can't do this, is there a series of steps that can be used? Maybe I need to create an hOCR output and combine it with a reduced-quality image?

Zdenko Podobny

Nov 13, 2018, 7:07:29 AM

to tesser...@googlegroups.com

Tesseract's approach is not to recompress or change the image type of the input image when creating the PDF.

So you need to use other tools to create a smaller PDF.

Zdenko

On Tue, Nov 13, 2018 at 7:49, <xtro...@gmail.com> wrote:

Shree Devi Kumar

Nov 13, 2018, 8:37:37 AM

to tesser...@googlegroups.com

You can try

https://pypi.org/project/ocrmypdf/

which uses Tesseract.
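
For anyone who wants to try this suggestion, here is a minimal sketch of how an OCRmyPDF run on a single image might be assembled. It is written in Python only to keep the arguments explicit. The helper name `build_ocrmypdf_command` and the file names are made up for illustration; `--image-dpi` and `--optimize` are OCRmyPDF options, but the higher optimization levels are lossy and may depend on optional tools being installed, so check `ocrmypdf --help` for your version.

```python
import shlex


def build_ocrmypdf_command(image_path, pdf_path, dpi=300, optimize_level=3):
    """Return an argv list for running OCRmyPDF on a single image.

    The helper itself is hypothetical; only the command-line options
    belong to OCRmyPDF.
    """
    return [
        "ocrmypdf",
        "--image-dpi", str(dpi),              # DPI to assume for an image input
        "--optimize", str(optimize_level),    # 0 = off, higher = more aggressive
        image_path,
        pdf_path,
    ]


command = build_ocrmypdf_command("myimage.png", "myimage.pdf")
print(shlex.join(command))

# To actually run it (requires OCRmyPDF to be installed):
#   import subprocess
#   subprocess.run(command, check=True)
```

Whether this makes the PDF smaller than the source PNG depends on the image and on which optimizers are available, so it is worth comparing file sizes at a couple of optimization levels.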

xtro...@gmail.com

Nov 22, 2018, 2:11:52 AM

to tesseract-ocr

On Wednesday, November 14, 2018 at 12:07:37 AM UTC+10:30, shree wrote:

You can try

https://pypi.org/project/ocrmypdf/

which uses Tesseract.

Thanks, ocrmypdf looks useful, but I don't think it will do what I need.

I created a "hocr" output file using:
tesseract myimage.png myimage_2 hocr

Then I combined it back with the same image, just as a test. The idea is that later I would reduce the image resolution and/or quality before running this:
hocr2pdf -i myimage.png -o test.pdf < myimage_2.hocr

The above creates an indexed PDF, but searched text is not highlighted in the correct position in the PDF rendering, even though it's the same image.

Am I doing something wrong, or is this a bug?

Here is what I'm using:
$ tesseract --version
tesseract 4.0.0-beta.1
leptonica-1.75.3
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

Found AVX2
Found AVX
Found SSE

The hocr2pdf version (from the ExactImage package):
ExactImage hOCR to PDF converter, version 1.0.1
Copyright (C) 2008 - 2015 René Rebe, ExactCODE GmbH
Copyright (C) 2008 Archivista
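
A note on the planned later step of reducing the image resolution: hOCR stores word positions as `bbox` values in pixel coordinates of the image that was OCRed, so if the image is resampled to fewer pixels, those coordinates would presumably need to be scaled by the same factor before combining. The following is only a sketch under that assumption. It does not explain the highlighting problem described above, where the image is unchanged. The function name `scale_hocr_bboxes` and the sample line are hypothetical, and other pixel-based hOCR properties (for example `baseline` or `x_size`) are left untouched here.

```python
import re

# Matches the hOCR "bbox x0 y0 x1 y1" property inside title attributes.
BBOX_PATTERN = re.compile(r"\bbbox (\d+) (\d+) (\d+) (\d+)")


def scale_hocr_bboxes(hocr_text, factor):
    """Return hocr_text with every bbox coordinate multiplied by factor.

    factor is the ratio of new pixel size to original pixel size,
    e.g. 0.5 when a 300 DPI scan is downsampled to 150 DPI.
    """
    def _scale(match):
        scaled = [round(int(value) * factor) for value in match.groups()]
        return "bbox " + " ".join(str(value) for value in scaled)

    return BBOX_PATTERN.sub(_scale, hocr_text)


# Hypothetical hOCR fragment, for illustration only.
sample = '<span class="ocrx_word" title="bbox 100 200 300 400; x_wconf 95">word</span>'
print(scale_hocr_bboxes(sample, 0.5))
```

If only the compression quality is reduced and the pixel dimensions stay the same, no scaling should be needed.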
