Re: Handle larger PDF files


Lei Zhang

Jul 1, 2024, 1:11:49 PM
to farhad....@gmail.com, pdfium
+pdfium mailing list
-pdfium-review mailing list

Thanks for offering to help improve PDFium. It is probably best to examine the limitations individually and see whether it makes sense to raise or remove each one. For example, making GetSecondXRefStreamEntry() and friends return uint64_t should be fine. Redoing how PDFs are parsed entirely requires a bit more thought.

In any case, please file bug reports on https://crbug.com/pdfium/new and let's go from there.

On Mon, Jul 1, 2024 at 6:44 AM farhad....@gmail.com <farhad....@gmail.com> wrote:
The current Pdfium library has a few rather arbitrary limitations that prevent its use for handling larger PDF files. One of these limitations is in parsing cross-reference (xref) streams. The second field of an entry in such a stream can hold the file offset at which a PDF object is located. The function that parses this field (GetSecondXRefStreamEntry) currently returns a uint32, so the offset silently overflows if the field is wider than 4 bytes.
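To illustrate the overflow described above, here is a minimal C++ sketch. It is not PDFium's actual code: DecodeField32 and DecodeField64 are hypothetical helpers that decode a big-endian field of a given byte width, which is how xref stream fields are stored. The 32-bit version drops the high bytes once the field is wider than 4 bytes; the 64-bit version keeps them and rejects widths it cannot represent.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical helper (not PDFium's implementation): decodes a big-endian
// field of `width` bytes into 32 bits. When width > 4 the high-order bytes
// are shifted out and lost without any error being reported.
uint32_t DecodeField32(const uint8_t* data, size_t width) {
  uint32_t result = 0;
  for (size_t i = 0; i < width; ++i)
    result = result * 256 + data[i];  // wraps modulo 2^32
  return result;
}

// Hypothetical 64-bit variant: handles fields up to 8 bytes wide and
// returns std::nullopt for wider fields instead of overflowing.
std::optional<uint64_t> DecodeField64(const uint8_t* data, size_t width) {
  if (width > sizeof(uint64_t))
    return std::nullopt;
  uint64_t result = 0;
  for (size_t i = 0; i < width; ++i)
    result = (result << 8) | data[i];
  return result;
}
```

For a 5-byte field holding 0x0100000010 (an offset just past 4 GiB), the 32-bit helper yields 16, while the 64-bit helper yields 4294967312.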

We use Pdfium in a .NET environment for all our PDF rendering needs. We often have to handle files that are several GB in size and have many thousands of pages and millions of objects. Such files always fail in Chrome or Edge after a long pause, with a sad-face icon.

There is probably no good reason for having such limitations. In fact, our free Pdfium-based viewer (https://opait.com/Viewer/index.html) can open and view files of any size without significant delays. To achieve this, we normally patch the Pdfium library in the following areas:
  • Remove artificial limits (kPageMaxNum, kMaxObjectNumber, etc.).
  • Use the page tree structure within the PDF file directly, instead of parsing it in full before opening the file. This up-front parsing is the main cause of the long delay with Pdfium, even for moderately large PDF files. We actually keep the current pre-parsing for documents with fewer than 1024 pages and switch to direct access for larger files. Even this switching is probably not necessary.
  • Fix the design bugs like the one reported here.
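
As a rough sketch of what direct page-tree access could look like (an illustration only, not our patch and not PDFium's API): PageTreeNode and FindPage below are hypothetical, and the tree is assumed to be already in memory. A real implementation would load the /Kids of each node lazily from the file and would also have to guard against cycles and inconsistent /Count values in malformed documents.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical in-memory model of a PDF page tree node.
struct PageTreeNode {
  size_t count = 1;  // /Count for an intermediate node; 1 for a leaf page
  std::vector<std::unique_ptr<PageTreeNode>> kids;  // empty for a leaf page
  int page_id = -1;  // illustrative payload for leaf pages
};

// Finds the page with zero-based `index` by descending from the root and
// using each child's count to skip whole subtrees, so only the nodes on one
// root-to-leaf path are visited. Returns nullptr if the index is out of
// range or the counts are inconsistent.
const PageTreeNode* FindPage(const PageTreeNode* node, size_t index) {
  if (!node || index >= node->count)
    return nullptr;
  while (node && !node->kids.empty()) {
    const PageTreeNode* next = nullptr;
    for (const auto& kid : node->kids) {
      if (index < kid->count) {
        next = kid.get();
        break;
      }
      index -= kid->count;  // skip this subtree entirely
    }
    node = next;
  }
  return node;
}
```

With a reasonably balanced tree, the work per lookup depends on the tree depth and fan-out rather than on the total number of pages, which is why opening the file does not need to wait for a full traversal.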
I would be more than happy to contribute our patches for review by the community, provided someone can show me how to do that.

Thanks!


