Reducing output image quality to make PDF smaller


xtro...@gmail.com

Nov 13, 2018, 1:49:09 AM
to tesseract-ocr
I've not used Tesseract in many years until today.  I'm very impressed with what I see now.

I need to process a 300 DPI b/w PNG image and have Tesseract create an Indexed PDF file from it.  I ran a command line to do this and was very happy with the quality of the result, but I would like to feed it the same high-quality image (about 100 KB) and get a smaller PDF.  At the moment the PDF is about the same size as the PNG file.

Is it possible to do this by passing a value on the command line that tells Tesseract to compress or reduce the image quality, while still retaining the good OCR indexing so that PDF readers still know where in the graphical part each word is?

I'm not so concerned with the image quality in the PDF output, but the OCR indexing is important.

If I can't do this, is there a series of steps that can be used instead?  Maybe I need to create hOCR output and combine it with a reduced-quality image?

Zdenko Podobny

Nov 13, 2018, 7:07:29 AM
to tesser...@googlegroups.com
Tesseract's approach is not to re-compress or change the image type of the input image when creating the PDF.
So you need to use other tools to create a smaller PDF.

Zdenko


On Tue, Nov 13, 2018 at 7:49, <xtro...@gmail.com> wrote:

Shree Devi Kumar

Nov 13, 2018, 8:37:37 AM
to tesser...@googlegroups.com
Message has been deleted

xtro...@gmail.com

Nov 22, 2018, 2:11:52 AM
to tesseract-ocr


On Wednesday, November 14, 2018 at 12:07:37 AM UTC+10:30, shree wrote:
You can try


Thanks, ocrmypdf looks useful, but I don't think it will do what I need.

I created a "hocr" output file using:
tesseract myimage.png myimage_2 hocr

Then I combined the hOCR with the same image, just as a test.  The idea is that later I would reduce the image resolution and/or quality before running this:
hocr2pdf -i myimage.png -o test.pdf < myimage_2.hocr

The above creates an Indexed PDF but searched text does not highlight correctly in the PDF rendering even though it's the same image.

Am I doing something incorrect or is this a bug?

Here is what I'm using:
$ tesseract --version
tesseract 4.0.0-beta.1
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX2
 Found AVX
 Found SSE

The hocr2pdf version (from the ExactImage package):
ExactImage hOCR to PDF converter, version 1.0.1
Copyright (C) 2008 - 2015 René Rebe, ExactCODE GmbH
Copyright (C) 2008 Archivista
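
A note on the reduce-then-recombine idea described above: hOCR bbox values are pixel coordinates in the image that was OCR'd, so if the image is downscaled in pixels before being passed to hocr2pdf, the boxes would presumably need to be scaled by the same factor. Below is a minimal Python sketch of that one step. It is not part of Tesseract or ExactImage; the function name scale_hocr_bboxes and the sample line are made up for illustration, and only bbox is touched (other pixel-based properties such as baseline or x_size are left as they are). It does not explain the highlighting problem seen with the unchanged image.

```python
import re

# Matches the hOCR "bbox x0 y0 x1 y1" property (integer pixel coordinates).
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def scale_hocr_bboxes(hocr_text: str, factor: float) -> str:
    """Return hocr_text with every bbox coordinate multiplied by factor.

    Hypothetical helper: factor is assumed to be the ratio between the
    pixel size of the reduced image and that of the image that was OCR'd.
    """

    def _scale(match):
        # Scale each of the four coordinates and round to whole pixels.
        coords = [round(int(value) * factor) for value in match.groups()]
        return "bbox " + " ".join(str(c) for c in coords)

    return BBOX_RE.sub(_scale, hocr_text)


if __name__ == "__main__":
    # Made-up sample line in the general style of hOCR word elements.
    sample = (
        "<span class='ocrx_word' id='word_1_1' "
        "title='bbox 100 200 300 260; x_wconf 96'>Hello</span>"
    )
    print(scale_hocr_bboxes(sample, 0.5))
```

With an image reduced to half its pixel dimensions, the hOCR text could be passed through this with a factor of 0.5 before being handed to hocr2pdf. Whether hocr2pdf then places the text correctly has not been tested here. If only the compression or quality is changed and the pixel dimensions stay the same, no scaling should be needed.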