Issue 1144 in pdfium: Saved document file size is increased using Pdfium library.


rparthi… via monorail

Aug 31, 2018, 1:58:27 AM
to pdfiu...@googlegroups.com
Status: New
Owner: ----
Labels: Type-Defect Priority-Medium

New issue 1144 by rparthi...@gmail.com: Saved document file size is increased using Pdfium library.
https://bugs.chromium.org/p/pdfium/issues/detail?id=1144

What steps will reproduce the problem?
1. Load the attached document (file size: 138 KB).
2. Save the document without making any changes.
3. The saved document is larger than the original document (file size: 143 KB).

What is the expected output? What do you see instead?
Expected output:

The file size should be the same after saving the document without any changes.

Current output:
The file size increases even though no changes were made to the document.



What version of the product are you using? On what operating system?
Latest version
Windows 10.

Please provide any additional information below.

I use the code below to save the document:

FPDF_SaveAsCopy(m_documentPointer, stream, PdfiumNative.FPDF_SAVE_FLAGS.FPDF_NO_INCREMENTAL);


Note: I am using the PDFium library from C# via P/Invoke.



Attachments:
original.pdf 137 KB
output.pdf 142 KB


rharri… via monorail

Aug 31, 2018, 11:55:52 AM
to pdfiu...@googlegroups.com
Updates:
Status: WontFix

Comment #1 on issue 1144 by rhar...@chromium.org: Saved document file size is increased using Pdfium library.
https://bugs.chromium.org/p/pdfium/issues/detail?id=1144#c1

I ran both files through qpdf, decompressing as much as possible, to see what is going on.

It looks like the original document was built around a single content object stream that has all of the objects embedded in it. PDFium puts these objects in their own streams. This introduces a bit of boilerplate for the additional stream entries. Additionally this significantly changes the size of the xref table, since the original doc has a single entry. So this is probably where the 5kB (~3.6%) increase is coming from.
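
As a rough illustration of the per-object cost described above, the sketch below estimates the bytes added when an object that used to live inside an object stream is written as a top-level indirect object. It assumes a classic cross-reference table, where each entry is a fixed 20 bytes, and counts only the "N G obj" / "endobj" wrapper plus that entry. The function names are hypothetical and this is not PDFium's code. It also ignores the compression that is lost when objects leave a compressed stream, which may well matter more than the wrapper itself.

```python
# Rough, hypothetical estimate of per-object boilerplate; not PDFium code.

XREF_ENTRY_BYTES = 20  # fixed width of one entry in a classic xref table


def per_object_overhead(obj_num: int, gen: int = 0) -> int:
    """Bytes for the 'N G obj' / 'endobj' wrapper plus one xref entry."""
    header = f"{obj_num} {gen} obj\n"
    footer = "endobj\n"
    return len(header) + len(footer) + XREF_ENTRY_BYTES


def total_overhead(object_count: int) -> int:
    """Sum the wrapper cost for objects numbered 1..object_count."""
    return sum(per_object_overhead(n) for n in range(1, object_count + 1))
```

For a hypothetical document with 10 objects, total_overhead(10) comes to 351 bytes. The actual number of objects in original.pdf is not stated in this thread, so the sketch only shows how the cost scales with object count.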

I believe this is working as intended (WAI), since there are multiple ways to encode a PDF to display the same data. If you load a PDF in PDFium and ask for it to be saved, you are asking for the PDF to be saved in the manner that PDFium encodes PDFs. If you know there have been no changes to the document, then you should just use the original data as the saved version, since that has the advantage of requiring fewer computing resources. Loading and saving is fundamentally a transformative process, since the data needs to be converted to the library's internal representation.

I am not enough of a guru of PDF layout to say whether one of these encodings is intrinsically superior. Obviously the original document is smaller, but I don't know if there is some trade-off being made in PDFium, like speeding up loading/rendering, that leads to choosing to lay out documents in this fashion. From my experience, I would say the manner in which PDFium encodes documents, with a stream per object, is more common, but that might just be because a popular tool does it.

rharri… via monorail

Sep 5, 2018, 9:15:10 AM
to pdfiu...@googlegroups.com

Comment #2 on issue 1144 by rhar...@chromium.org: Saved document file size is increased using Pdfium library.
https://bugs.chromium.org/p/pdfium/issues/detail?id=1144#c2

Issue 1146 has been merged into this issue.

rparthi… via monorail

Sep 6, 2018, 1:56:21 AM
to pdfiu...@googlegroups.com

Comment #3 on issue 1144 by rparthi...@gmail.com: Saved document file size is increased using Pdfium library.
https://bugs.chromium.org/p/pdfium/issues/detail?id=1144#c3

Thanks for your update.

I explored both PDF documents and checked the content stream lengths. The content stream length is the same in both documents.

As you mentioned, "Additionally this significantly changes the size of the xref table, since the original doc has a single entry. So this is probably where the 5kB (~3.6%) increase is coming from." The original document is small, so a 5 kB increase may not be a big concern from your point of view. But if we load a large document of more than 10 MB, it becomes a big concern, because the saved document size rises to as much as 15 MB. Is this the correct behavior of PDFium?

Can you provide a workaround to save the document without increasing its size?

rharri… via monorail

Sep 6, 2018, 1:41:09 PM
to pdfiu...@googlegroups.com

Comment #4 on issue 1144 by rhar...@chromium.org: Saved document file size is increased using Pdfium library.
https://bugs.chromium.org/p/pdfium/issues/detail?id=1144#c4

An increase from 10MB to 15MB would be more concerning, since that is a far larger and more substantial percentage increase. Are you able to provide an example of that? If so, please file a new bug with those docs.
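
To put the two cases side by side, here is the arithmetic as a small Python sketch. The helper name is hypothetical; the sizes are the ones quoted in this thread (138 KB to 143 KB, and 10 MB to 15 MB).

```python
def percent_increase(original_size: float, saved_size: float) -> float:
    """Relative growth of the saved file, as a percentage of the original."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return (saved_size - original_size) / original_size * 100


# Sizes reported in this thread; units cancel out, so KB or MB both work.
small_case = percent_increase(138, 143)  # about 3.6
large_case = percent_increase(10, 15)    # 50.0
```

The first case is a growth of roughly 3.6%, while the second would be 50%.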

If the doc display is still the same, i.e. visually indistinguishable, then I would say PDFium's output is technically correct. If there are visual differences, then a bug needs to be filed about them. PDFium's encoding may not be optimal, but to my knowledge it is correct. I am not knowledgeable enough in the details of PDF encoding to dive into optimizing it. If you have improvements for the output encoding, patches in this area would be very welcome.

With respect to a workaround, as mentioned before, if you know that there have been no changes to the doc after loading it into PDFium, then I would copy the input data source that you sent to PDFium to the output location and bypass PDFium for saving. If there have been edits to the document in PDFium, then you will have to accept the encoding that PDFium is generating.
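
A minimal sketch of that workaround, written in Python for brevity, might look like the following. The `modified` flag and the `save_with_pdfium` callback are hypothetical: the application has to track edits itself, and in the C# setup from the original report the callback would wrap the existing FPDF_SaveAsCopy call.

```python
import shutil


def save_document(source_path, output_path, modified, save_with_pdfium):
    """Copy the original bytes when nothing changed; otherwise re-encode.

    modified         -- hypothetical flag the application maintains itself
    save_with_pdfium -- hypothetical callback that writes the loaded
                        document to output_path (e.g. via FPDF_SaveAsCopy)
    """
    if not modified:
        # Byte-for-byte copy: the output is identical to the input,
        # so the file size cannot change.
        shutil.copyfile(source_path, output_path)
        return "copied"
    # The document was edited, so PDFium's own encoding has to be accepted.
    save_with_pdfium(output_path)
    return "re-encoded"
```

With this shape, an unmodified document never goes through PDFium's writer at all, which also avoids the cost of re-encoding.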