Hi,
On 14/04/2021 15:26, Sharp Subbu wrote:
> Dear Merlijn,
>
> Thank you very much for your reply.
> We are doing a feasibility study on using Tesseract OCR features in our
> project on Windows 10 English 32/64-bit OS.
> As part of this study, I am trying to find out whether it is possible to
> compress / reduce the size of the PDF file created by Tesseract OCR
> (command line: tesseract input.tif outputFile pdf).
> To find answer for this question, I have checked tesseract forums, and
> Tesseract APIs. I did not find any related information. Hence, I have
> posted the same question in Tesseract Google forums.
> Regarding this, I received a nice reply from you. Thank you very much for
> that.
> Firstly, please clarify whether the Tesseract OCR API supports reducing /
> compressing the OCRed PDF file. Is this support present in the Tesseract
> OCR source code or not?
>
> Kindly find the attached sample pdf file "Sample.pdf" for your reference.
> Kindly compress it and send the compressed pdf file.
Please see attached 'sample-out.pdf' (compression ratio ~3.5), and
'sample-out2.pdf' (compression ratio ~7.5).
I also generated one other PDF which illustrates how the compression
works, rather than being actually compressed, but that file is not
attached due to file size reasons (~460KB). You can find that file (plus the
other two files) here:
https://archive.org/~merlijn/tmp/mrc-pdf/
Please see the commands I used:
1. Extract image from PDF (not necessary if you start with an image; it is
even better to start with the image and not have Tesseract generate the PDF)
> $ pdfimages -all /tmp/Sample.pdf /tmp/sample
2. OCR:
> $ tesseract /tmp/sample-000.jpg - hocr > /tmp/sample-000.hocr.html
3. Create PDF (dpi taken from JPEG image):
> $ PATH=$PATH:/home/merlijn/archive/pdf/bin recode_pdf --from-imagestack /tmp/sample-000.jpg --hocr-file /tmp/sample-000.hocr.html -o /tmp/sample-out.pdf -m 2 -v --dpi 200
> [...]
> Processed 1 pages at 1.80 seconds/page
> Compression ratio: 3.677732
4. Create more compressed PDF (ditto for dpi):
> $ PATH=$PATH:/home/merlijn/archive/pdf/bin recode_pdf --from-imagestack /tmp/sample-000.jpg --hocr-file /tmp/sample-000.hocr.html -o /tmp/sample-out2.pdf -m 2 --dpi 200 --bg-downsample 3 -v
> [...]
> Processed 1 pages at 1.59 seconds/page
> Compression ratio: 7.581879
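As a rough sketch, steps 2 to 4 could be chained into one small script along these lines. It only uses the flags shown in the commands above; the function name 'ocr_and_recode', the 'run' helper and the DRY_RUN switch are made up for illustration, and it assumes 'tesseract' and 'recode_pdf' are already on your PATH. By default it only prints the commands, so nothing is executed until you set DRY_RUN=0.

```shell
#!/bin/sh
# Sketch only: chains the OCR and recode steps for a single image.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# ocr_and_recode IMAGE OUT_PDF DPI [BG_DOWNSAMPLE]
ocr_and_recode() {
    image=$1            # input image, e.g. /tmp/sample-000.jpg
    out_pdf=$2          # PDF to create
    dpi=$3              # resolution of the input image
    downsample=${4:-}   # optional value for --bg-downsample
    hocr="${image%.*}.hocr.html"

    # OCR step: hOCR goes to stdout, so it needs a redirect.
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ tesseract $image - hocr > $hocr"
    else
        tesseract "$image" - hocr > "$hocr"
    fi

    # PDF step: same flags as in the commands above.
    if [ -n "$downsample" ]; then
        run recode_pdf --from-imagestack "$image" --hocr-file "$hocr" \
            -o "$out_pdf" -m 2 --dpi "$dpi" --bg-downsample "$downsample" -v
    else
        run recode_pdf --from-imagestack "$image" --hocr-file "$hocr" \
            -o "$out_pdf" -m 2 --dpi "$dpi" -v
    fi
}

# Only act when called with at least IMAGE OUT_PDF DPI.
if [ "$#" -ge 3 ]; then
    ocr_and_recode "$@"
fi
```

For example, saved under a name of your choosing, 'DRY_RUN=0 sh recode.sh /tmp/sample-000.jpg /tmp/sample-out2.pdf 200 3' would correspond to step 4.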
Size comparison:
> $ ls -lsh /tmp/sample-out*.pdf /tmp/Sample.pdf
> 40K -rw-r--r-- 1 merlijn merlijn 38K Apr 14 18:08 /tmp/sample-out2.pdf
> 80K -rw-r--r-- 1 merlijn merlijn 79K Apr 14 18:11 /tmp/sample-out.pdf
> 296K -rw-r--r-- 1 merlijn merlijn 293K Apr 14 17:56 /tmp/Sample.pdf
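If you want to check the ratio yourself from the file sizes, something like the following should do. This is just a sketch; the function name 'size_ratio' is made up. From the rounded sizes listed above, 293K/79K is roughly 3.7 and 293K/38K is roughly 7.7, which is close to, but not exactly, what the tool prints (the 'ls -h' sizes are rounded, and the tool may well measure against a different baseline).

```shell
#!/bin/sh
# size_ratio ORIGINAL COMPRESSED
# Prints (size of ORIGINAL in bytes) / (size of COMPRESSED in bytes).
size_ratio() {
    orig_bytes=$(wc -c < "$1")
    comp_bytes=$(wc -c < "$2")
    awk -v o="$orig_bytes" -v c="$comp_bytes" \
        'BEGIN { if (c == 0) exit 1; printf "%.2f\n", o / c }'
}

# Example with the paths used above; skipped if the files are not there.
if [ -f /tmp/Sample.pdf ] && [ -f /tmp/sample-out.pdf ]; then
    size_ratio /tmp/Sample.pdf /tmp/sample-out.pdf
fi
```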
Note that:
1. Compression is not lossless, but text should nevertheless be quite sharp.
2. I ran Tesseract on the image in the PDF, so for your purposes you
might want to instead generate hOCR from the image and let 'recode_pdf'
make the PDF for you (it's pretty much the same code that Tesseract uses).
3. The mask in the PDF is encoded with 'ccitt' and not 'jbig2', which
would give you slightly better compression still. (This is a bug in
mupdf which will be fixed in the next mupdf release, I have a patched
version somewhere, but not at hand)
4. I have only tested this on Linux.
5. The above run uses Kakadu for JPEG2000 compression, but you could
also use Grok [0] or OpenJPEG [1] (OpenJPEG already works as per my
previous email).
Finally, for some reason it looks like the PDFs created with the recode
tool actually look better than the sample you sent me -- I think that is
because yours suffers from JPEG artifacts, which get mostly cancelled
out by the mask technique that MRC employs.
If this looks like something you might want to use, we could talk
off-list about how to make it work on Windows, to not bother the list
with details not relevant to Tesseract.
Cheers,
Merlijn
[0] https://github.com/GrokImageCompression/grok/
[1] https://www.openjpeg.org