How to reduce the size of an OCRed PDF file using the Tesseract OCR APIs

Sharp Subbu

Apr 14, 2021, 8:27:21 AM
to tesseract-ocr
Dear friends,

Kindly help us find a solution to the following question:
=============================
How to reduce the size of an OCRed PDF file using the Tesseract OCR APIs?
=============================

Merlijn B.W. Wajer

Apr 14, 2021, 8:57:43 AM
to tesser...@googlegroups.com
Hi,
Not sure exactly what use case you have in mind (OS, etc), but I have a
suggestion, as I dealt with this in the recent past.

I developed something similar to the foxit/luratech "PDF compression",
in Python and it is entirely open source. It uses the Tesseract hOCR
result files. This can lead to 3-15x compression ratios (sometimes more,
depending on the image formats that you use).

It converts images to JPEG2000 for best compression (but slower loading
times) and also attempts to create a "foreground", "background" and
"mask" image (Mixed Raster Content [0]), which can significantly improve
compression. It inserts a text layer just like Tesseract does (the code
is a port of Tesseract's C++).
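
To make the foreground/background/mask idea concrete, here is a toy Python sketch. It is only an illustration of the MRC principle under simplifying assumptions (a fixed grayscale threshold, tiny in-memory images), not the segmentation that archive-pdf-tools actually performs; the names `split_mrc` and `merge_mrc` are made up for this example.

```python
# Toy illustration of Mixed Raster Content (MRC); NOT the archive-pdf-tools algorithm.
# A page is a list of rows of grayscale values (0 = black, 255 = white).

def split_mrc(page, threshold=128):
    """Split a grayscale page into (mask, foreground, background).

    mask[y][x] is True where the pixel is dark enough to count as text.
    The foreground keeps only the text pixels and the background only the
    rest; the holes are filled with a flat value so each layer is smoother.
    """
    mask = [[px < threshold for px in row] for row in page]
    foreground = [[px if m else 0 for px, m in zip(row, mrow)]
                  for row, mrow in zip(page, mask)]
    background = [[255 if m else px for px, m in zip(row, mrow)]
                  for row, mrow in zip(page, mask)]
    return mask, foreground, background


def merge_mrc(mask, foreground, background):
    """Recombine the layers: use the foreground wherever the mask is set."""
    return [[f if m else b for m, f, b in zip(mrow, frow, brow)]
            for mrow, frow, brow in zip(mask, foreground, background)]


page = [
    [250, 245, 30, 248],
    [252, 20, 25, 247],
]
mask, fg, bg = split_mrc(page)
restored = merge_mrc(mask, fg, bg)
print(restored == page)
```

In this toy version the round trip is exact. In a real MRC PDF the savings come from storing the mask as a bilevel image and compressing (and possibly downsampling) the two colour layers lossily, so the result is no longer bit-identical.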

Here is some info [1], and here is the source code [2].

There is an "openjpeg-wip" branch that can use OpenJPEG instead of Kakadu
for image compression.

Example usage to create a PDF from a set of images:

recode_pdf --from-imagestack 'images/*.jp2' --hocr-file \
    combined_tesseract_results.html -o out.pdf -v --use-openjpeg -m 2

There is also the --from-pdf option instead of --from-imagestack, but
that has only seen light testing.

You can combine the hOCR result files using hocr-combine-stream [3].

If this suits your use case, I'd be happy to help/assist here or off
list. There aren't many users of the software yet (the same offer
extends for others reading this list). If you have an example PDF that
you can send me, I'd be happy to try to send you a compressed PDF back.

Cheers,
Merlijn


[0] https://en.wikipedia.org/wiki/Mixed_raster_content
[1] https://archive.org/~merlijn/projects/archive-pdf-tools/index.html
[2] https://git.archive.org/merlijn/archive-pdf-tools
[3] https://git.archive.org/merlijn/archive-hocr-tools/-/blob/master/bin/hocr-combine-stream

Sharp Subbu

Apr 14, 2021, 9:26:23 AM
to tesseract-ocr
Dear Merlijn,

Thank you very much for your reply.
We are doing a feasibility study on using Tesseract OCR features in our project on Windows 10 (English, 32/64-bit).
As part of this study, I am trying to find out whether it is possible to compress or reduce the size of the PDF file created by Tesseract OCR (command line: tesseract input.tif outputFile pdf).
I checked the Tesseract forums and the Tesseract APIs but did not find any related information, so I posted the question in the Tesseract Google group.
Thank you very much for your helpful reply.
First, please clarify whether the Tesseract OCR API supports reducing or compressing the OCRed PDF file. Is this support present in the Tesseract OCR source code or not?

Please find the attached sample PDF file "Sample.pdf" for your reference. Kindly compress it and send back the compressed PDF file.

Thank you very much for your nice help.
Subramanyam
Sample.pdf

Merlijn B.W. Wajer

Apr 14, 2021, 12:51:14 PM
to tesser...@googlegroups.com
Hi,

On 14/04/2021 15:26, Sharp Subbu wrote:
> Dear Merlijn,
>
> Thank you very much for your reply.
> We are doing a feasibility study on using Tesseract OCR features in our
> project on Windows 10 (English, 32/64-bit).
> As part of this study, I am trying to find out whether it is possible to
> compress or reduce the size of the PDF file created by Tesseract OCR
> (command line: tesseract input.tif outputFile pdf).
> I checked the Tesseract forums and the Tesseract APIs but did not find
> any related information, so I posted the question in the Tesseract
> Google group.
> Thank you very much for your helpful reply.
> First, please clarify whether the Tesseract OCR API supports reducing or
> compressing the OCRed PDF file. Is this support present in the Tesseract
> OCR source code or not?
>
> Please find the attached sample PDF file "Sample.pdf" for your reference.
> Kindly compress it and send back the compressed PDF file.

Please see attached 'sample-out.pdf' (compression ratio ~3.5), and
'sample-out2.pdf' (compression ratio ~7.5).

I also generated one other PDF which illustrates how the compression
works, rather than being actually compressed, but that file is not
attached due to file size reasons (~460KB). You can find that file (plus
the other two files) here: https://archive.org/~merlijn/tmp/mrc-pdf/

Please see the commands I used:

1. Extract image from PDF (not necessary if you start with an image,
even better to start with the image and not have Tesseract generate the PDF)

> $ pdfimages -all /tmp/Sample.pdf /tmp/sample

2. OCR:

> $ tesseract /tmp/sample-000.jpg - hocr > /tmp/sample-000.hocr.html

3. Create PDF (dpi taken from JPEG image):

> $ PATH=$PATH:/home/merlijn/archive/pdf/bin recode_pdf --from-imagestack /tmp/sample-000.jpg --hocr-file /tmp/sample-000.hocr.html -o /tmp/sample-out.pdf -m 2 -v --dpi 200
> [...]
> Processed 1 pages at 1.80 seconds/page
> Compression ratio: 3.677732

4. Create more compressed PDF (ditto for dpi):

> $ PATH=$PATH:/home/merlijn/archive/pdf/bin recode_pdf --from-imagestack /tmp/sample-000.jpg --hocr-file /tmp/sample-000.hocr.html -o /tmp/sample-out2.pdf -m 2 --dpi 200 --bg-downsample 3 -v
> [...]
> Processed 1 pages at 1.59 seconds/page
> Compression ratio: 7.581879

Size comparison:

> $ ls -lsh /tmp/sample-out*.pdf /tmp/Sample.pdf
> 40K -rw-r--r-- 1 merlijn merlijn 38K Apr 14 18:08 /tmp/sample-out2.pdf
> 80K -rw-r--r-- 1 merlijn merlijn 79K Apr 14 18:11 /tmp/sample-out.pdf
> 296K -rw-r--r-- 1 merlijn merlijn 293K Apr 14 17:56 /tmp/Sample.pdf
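
As a quick cross-check of the listing, the ratios can also be computed from file sizes alone. This is a small Python sketch; the helper names are made up here, and the numbers are the rounded sizes printed by ls, so the results differ slightly from the ratios recode_pdf reports (how the tool computes its own figure is not stated in this thread).

```python
import os


def compression_ratio(original_size, compressed_size):
    """Return original/compressed; sizes may be in any common unit."""
    if compressed_size <= 0:
        raise ValueError("compressed size must be positive")
    return original_size / compressed_size


def ratio_for_files(original_path, compressed_path):
    # Same calculation, using the exact byte sizes of two files on disk.
    return compression_ratio(os.path.getsize(original_path),
                             os.path.getsize(compressed_path))


# Rounded sizes taken from the ls output above.
ratio_out = round(compression_ratio(293, 79), 2)
ratio_out2 = round(compression_ratio(293, 38), 2)
print(ratio_out)   # 3.71
print(ratio_out2)  # 7.71
```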

Note that:

1. Compression is not lossless, but text should nevertheless be quite sharp.
2. I run Tesseract on the image in the PDF, so for your purposes you
might want to instead generate hOCR from the image and let 'recode_pdf'
make the PDF for you (it's pretty much the same code that Tesseract uses).
3. The mask in the PDF is encoded with 'ccitt' and not 'jbig2', which
would give you slightly better compression still. (This is a bug in
mupdf which will be fixed in the next mupdf release; I have a patched
version somewhere, but not at hand.)
4. I have only tested this on Linux.
5. The above run uses Kakadu for JPEG2000 compression, but you could
also use Grok [0] or OpenJPEG [1] (OpenJPEG already works as per my
previous email).

Finally, for some reason it looks like the PDFs created with the recode
tool actually look better than the sample you sent me -- I think that is
because yours suffers from JPEG artifacts, which get mostly cancelled
out by the mask technique that MRC employs.

If this looks like something you might want to use, we could talk
off-list about how to make it work on Windows, to not bother the list
with details not relevant to Tesseract.

Cheers,
Merlijn

[0] https://github.com/GrokImageCompression/grok/
[1] https://www.openjpeg.org
sample-out.pdf
sample-out2.pdf

Zdenko Podobny

Apr 14, 2021, 1:03:27 PM
to tesser...@googlegroups.com
Tesseract is an OCR engine and it does not change the input image.
For recompressing a PDF you need other tools, e.g. jbig2enc [1], mupdf [2]...


Zdenko



Sharp Subbu

Apr 17, 2021, 2:02:31 PM
to tesseract-ocr
Dear Merlijn,

Thank you for your efforts in providing the files "sample-out.pdf" and "sample-out2.pdf".
I have checked these files. The file sample-out2.pdf is highly compressed with good quality.
Kindly share the source code or the tools that you used to generate the file "sample-out2.pdf".
Kindly let us know whether I can use your source code or tools on a Windows 10 PC.

Thanks and Regards,
Subramanyam

Sharp Subbu

Apr 19, 2021, 7:04:04 PM
to tesseract-ocr
Dear Merlijn,

Kindly reply to my previous mail.

Thanks and Regards,
Subramanyam

Merlijn B.W. Wajer

Apr 19, 2021, 7:32:09 PM
to tesser...@googlegroups.com, sharp...@gmail.com
Hi,

On 20/04/2021 01:04, Sharp Subbu wrote:
> Dear Merlijn,
>
> Kindly reply to my previous mail.

I will reply tomorrow -- off-list so that we don't bother others on this
list.

Regards,
Merlijn