Compression with FPDFImageObj_LoadJpegFile


Jeroen Bobbeldijk

Oct 17, 2023, 7:59:55 AM
to pdfium
Hi there,

I'm trying to extend pdfium-cli with functionality to compress PDFs. My current setup is:
 - FPDF_GetPageCount
 - Loop through every page
   - FPDF_LoadPage
   - FPDFPage_CountObjects
   - Loop through every object
     - FPDFPage_GetObject
     - FPDFPageObj_GetType == FPDF_PAGEOBJ_IMAGE
       - FPDFImageObj_GetBitmap
       - FPDFBitmap_GetStride
       - FPDFBitmap_GetWidth
       - FPDFBitmap_GetHeight
       - FPDFBitmap_GetFormat
       - Pull the image through libjpeg-turbo with the given compression
       - FPDFImageObj_LoadJpegFileInline the new file
   - FPDFPage_GenerateContent
   - FPDF_SaveAsCopy with FPDF_NO_INCREMENTAL
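
One step in the outline above that is easy to get wrong is handing the bitmap buffer to the JPEG encoder: rows may be padded out to the stride, and colour bitmaps are stored in BGR order. The following is a minimal sketch of that repacking, not the code used in pdfium-cli. The function name `pack_for_jpeg` is hypothetical, and the format codes are assumed to match the `FPDFBitmap_*` constants in `fpdfview.h`.

```python
# Bitmap format codes; assumed to match the FPDFBitmap_* constants in fpdfview.h.
FPDFBITMAP_GRAY = 1
FPDFBITMAP_BGR = 2
FPDFBITMAP_BGRX = 3
FPDFBITMAP_BGRA = 4

BYTES_PER_PIXEL = {
    FPDFBITMAP_GRAY: 1,
    FPDFBITMAP_BGR: 3,
    FPDFBITMAP_BGRX: 4,
    FPDFBITMAP_BGRA: 4,
}


def pack_for_jpeg(buf, width, height, stride, fmt):
    """Return tightly packed gray or RGB rows from a strided bitmap buffer.

    Hypothetical helper: `buf` stands for the bytes behind the bitmap, and
    the other arguments for the values returned by the FPDFBitmap_Get* calls.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    bpp = BYTES_PER_PIXEL[fmt]  # raises KeyError for an unknown format
    row_bytes = width * bpp
    if stride < row_bytes or len(buf) < stride * (height - 1) + row_bytes:
        raise ValueError("buffer too small for the given geometry")

    out = bytearray()
    for y in range(height):
        # Take only the pixel bytes of this row; the rest is stride padding.
        row = buf[y * stride : y * stride + row_bytes]
        if fmt == FPDFBITMAP_GRAY:
            out += row
            continue
        for x in range(0, row_bytes, bpp):
            b, g, r = row[x], row[x + 1], row[x + 2]
            # Swap BGR to RGB; a fourth byte (unused or alpha) is dropped.
            out += bytes((r, g, b))
    return bytes(out)
```

If the encoder accepts a row pitch and a BGR pixel format directly, as libjpeg-turbo's TurboJPEG API does, this repacking may be unnecessary. Note that dropping the fourth byte of a BGRA bitmap discards transparency.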

This all works pretty well (with some quirks that I still have to figure out), but the resulting file is bigger than the original. That is quite odd, since by my calculations the files got between 75% and 99% smaller with a JPEG compression level of 80.

After inspecting the PDF data, it seems that the images lost their FlateDecode filter, and that the image streams now don't have any compression applied. So the question is: would it be possible to have FPDFImageObj_LoadJpegFile add the stream of the new file with a FlateDecode filter?

Justin Pierce

Oct 17, 2023, 11:30:46 PM
to pdfium
JPEGs replace FlateDecode with DCTDecode, don't they?

Jeroen Bobbeldijk

Oct 18, 2023, 3:12:21 AM
to pdfium
You can use FlateDecode and DCTDecode together. Or would you say that it would have no (or minimal) effect because of DCT?
In that case I need to figure out what is causing the PDF to become larger after the compression...

geisserml

Oct 18, 2023, 9:32:06 AM
to pdfium
In general, I've got some suggestions for how you might be able to improve the algorithm:
- I'd recommend using FPDFImageObj_GetImageFilter() to check the image's current filters and excluding images that already use a high-compression codec such as JPX, JBIG2, CCITTFax, or DCT itself, to avoid [generation loss](https://en.wikipedia.org/wiki/Generation_loss).
- Another concern is 1bpp B/W images, which FPDFImageObj_GetBitmap() would convert to 8-bit grayscale, leading to a major size increase. You could presumably check FPDFImageObj_GetImageMetadata() to exclude such images from the processing. For quality reasons, I would also suggest checking the colorspace and excluding CMYK images, since GetBitmap() would transcode them to RGB.
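
The two exclusion rules above could be combined into a single predicate. This is a rough sketch under stated assumptions: `should_recompress` is a hypothetical name, its arguments stand for values already read via FPDFImageObj_GetImageFilter() and FPDFImageObj_GetImageMetadata(), and the CMYK constant is assumed to match `FPDF_COLORSPACE_DEVICECMYK` in `fpdf_edit.h`.

```python
# Filter names as they appear in a PDF image stream's /Filter entry.
ALREADY_COMPACT = {"DCTDecode", "JPXDecode", "JBIG2Decode", "CCITTFaxDecode"}

# Assumed to match FPDF_COLORSPACE_DEVICECMYK in fpdf_edit.h; verify locally.
COLORSPACE_DEVICECMYK = 3


def should_recompress(filters, bits_per_pixel, colorspace):
    """Decide whether an image is a reasonable candidate for JPEG re-encoding.

    Hypothetical helper: `filters` is a list of filter names, while
    `bits_per_pixel` and `colorspace` come from the image metadata.
    """
    # Already stored with a high-compression codec: re-encoding risks
    # generation loss and is unlikely to save much.
    if any(name in ALREADY_COMPACT for name in filters):
        return False
    # 1bpp images would be expanded to 8-bit grayscale before encoding.
    if bits_per_pixel == 1:
        return False
    # CMYK images would be transcoded to RGB, which may hurt quality.
    if colorspace == COLORSPACE_DEVICECMYK:
        return False
    return True
```

For example, `should_recompress(["FlateDecode"], 24, 2)` accepts an ordinary Flate-compressed RGB image, while any image whose filter list includes `"DCTDecode"` is rejected.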

However, I would expect that replacing an RGB Flate image with a correctly encoded DCT equivalent should almost always lead to higher compression, so likely none of these points explain the size increase you're experiencing.
In that case, it would be helpful if you could share a before/after sample to see what's going on.
It might even be that pdfium does not remove the old Flate stream from the PDF, similar to how FPDFDoc_DeleteAttachment() does not actually remove the attachment stream but merely unlinks it from the view...

Lei Zhang

Oct 18, 2023, 12:13:40 PM
to Jeroen Bobbeldijk, pdfium
It may be helpful to provide a sample input and output PDF, so folks
can examine them and see why the output is bigger, instead of
speculating.

I also think JPEGs generally don't need to be flate encoded. While
some JPEG files can be further compressed, many JPEGs are only 1%
smaller after compression. That gain is probably not worth the effort.
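
The point about Flate adding little on top of JPEG can be illustrated with the standard `zlib` module. This is only an approximation: pseudo-random bytes stand in for entropy-coded JPEG data (an assumption, since real JPEG files also carry headers and metadata that compress somewhat), and the helper names are hypothetical.

```python
import random
import zlib


def flate_ratio(data):
    """Return compressed size divided by original size (zlib, level 9)."""
    return len(zlib.compress(data, 9)) / len(data)


def pseudo_random_bytes(n, seed=0):
    """Deterministic high-entropy bytes, used as a stand-in for JPEG data."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


if __name__ == "__main__":
    jpeg_like = pseudo_random_bytes(65536)
    flat_area = b"\x00" * 65536  # e.g. raw pixels of a blank region
    print("high-entropy data:", round(flate_ratio(jpeg_like), 4))
    print("uniform raw data: ", round(flate_ratio(flat_area), 4))
```

Data that is already entropy-coded stays at roughly its original size, while uniform raw pixel data shrinks dramatically, which is why Flate pays off on raw bitmaps but hardly at all on a DCT stream.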

Jeroen Bobbeldijk

Oct 18, 2023, 1:42:16 PM
to pdfium
Thanks all!

I inspected the PDF with iText RUPS and it doesn't look like it's keeping a reference to the old stream.
I then compared both versions of the file, image by image. I noticed that some images didn't compress well at all and increased in size, while others compressed really well. Tomorrow I'll look into @geisserml's hints and see which images are best left out of the compression logic. I will also compare the size of the resulting image with the size returned by FPDFImageObj_GetImageDataRaw to make sure I never replace an image with a bigger one.
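
A minimal sketch of such a size guard, assuming the original stream bytes have been read with FPDFImageObj_GetImageDataRaw and the re-encoded JPEG is available as bytes. The name `choose_stream` and the `"replace"`/`"keep"` labels are hypothetical.

```python
def choose_stream(original_raw, recompressed):
    """Pick the recompressed stream only when it is strictly smaller.

    Hypothetical helper: `original_raw` is the image's existing encoded
    stream, `recompressed` the new JPEG bytes (or None if encoding failed).
    """
    if recompressed is not None and len(recompressed) < len(original_raw):
        return "replace", recompressed
    # Equal or larger: keep the original and skip the replacement call.
    return "keep", original_raw
```

Comparing against the raw (still encoded) stream rather than the decoded bitmap matters here, because the decoded bitmap is almost always larger than what is actually stored in the file.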

The pages I have been testing with contain only small images (logos, for example). Only one had a full-page image as background, and that was mostly white. I will also do some more tests with larger images (from a scanner, for example).