Allowing Public Embedders to parse CPDF_Dictionaries

Aryan Krishnan

May 21, 2026, 12:02:08 PM

to pdfium

I have seen quite a few requests for CPDF_Dictionary support and have already created a few CLs for it. However, I just realized that I forgot to check whether PDFium is currently interested in such a feature (or whether there are specific concerns about public read-only APIs for these objects). I have seen this discussed across many isolated threads, so I am creating this thread to explore the feature further.

Thanks!

- Aryan

geisserml

May 22, 2026, 12:07:32 PM

to pdfium

From an embedder perspective, I can only say this would be very useful on our end.
Currently we have to combine PDFium with another PDF library (qpdf-based) that provides low-level read/write APIs. This basically doubles complexity, binary size and so on, which is not ideal.

Without direct access to the underlying PDF data structures, the public API is (and always would be) limited.

Lei Zhang

May 22, 2026, 9:27:57 PM

to geisserml, pdfium

One of my concerns is that PDFium performs CPDF_Object manipulation
internally for its own needs. This happens even if the PDFium embedder
is not modifying anything. While a low-level PDF object access API is
powerful and useful, it may also cause many surprises, ranging from
unexpected results to use-after-frees. The low-level APIs are also
less insulated, so if PDFium changes how it modifies CPDF_Objects
internally, that can impact the embedders using these APIs.

While I appreciate the efforts to add these APIs, I'm also concerned
about the potential for bugs and long term maintenance.


Aryan Krishnan

May 23, 2026, 1:18:40 AM

to pdfium

Hi Lei,

Thanks for the points. I agree that there are risks associated with adding a whole new suite of APIs (as with any API), and this is especially true for something so dynamic. However, I think most of these points can be addressed through better API design.

To frame this concretely upfront: if we have a Formula One car and it is designed poorly, it will obviously require a lot of changes to engine components, aero surfaces, etc. But if we have a better-designed Formula One car (think Red Bull 2023 :)), then we'd need less maintenance as a whole. Our current API gives embedders access to the output from our main computer (the speed, brake input, current gear) and allows them to write to some of these as well. Dictionaries would let them understand the engine further, and if the computer (the API) is built well enough, it won't need as much maintenance, because it would already be robust enough to prevent these bugs.

Yes, PDFium does manipulate CPDF_Objects (and hence dictionaries) internally, but this can be addressed by smarter API design. For example, if a CPDF_Object (and hence its dictionary) may be deleted, it is always possible to construct a new CPDF_Dictionary with the needed keys whenever one is requested. This is also part of the reason the initial APIs are read-only for now; write support is something we can consider further down the line, once we figure out a better way to get this to work. Using these copies also helps to mitigate UAFs and reduce unexpected behaviour (although one drawback is that the embedder may only get a "snapshot" of the dictionary).
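A minimal sketch of the snapshot idea, in plain C++ and under stated assumptions: `InternalDict`, `DictSnapshot` and `SnapshotAfterInternalFree` are hypothetical names invented for illustration, not PDFium types or APIs. The point is only that a deep copy handed to the embedder stays readable after the internal object is mutated or freed.

```cpp
#include <map>
#include <memory>
#include <string>

namespace sketch {

// Hypothetical stand-in for an internal, mutable dictionary. This is NOT
// CPDF_Dictionary; it only models an object the engine may change or free.
struct InternalDict {
  std::map<std::string, std::string> entries;
};

// Hypothetical read-only snapshot handed to the embedder. It owns a deep copy
// of the entries, so it never points back into engine-owned memory.
class DictSnapshot {
 public:
  explicit DictSnapshot(const InternalDict& src) : entries_(src.entries) {}

  // Copies the value for `key` into `out`. Returns false if the key is absent.
  bool Get(const std::string& key, std::string* out) const {
    auto it = entries_.find(key);
    if (it == entries_.end() || out == nullptr) return false;
    *out = it->second;
    return true;
  }

 private:
  std::map<std::string, std::string> entries_;
};

// Walks through the lifetime scenario: take a snapshot, let the "engine"
// mutate and then free its object, and read from the snapshot afterwards.
std::string SnapshotAfterInternalFree() {
  std::unique_ptr<InternalDict> internal(new InternalDict());
  internal->entries["Type"] = "Page";

  DictSnapshot snapshot(*internal);        // embedder receives a copy
  internal->entries["Type"] = "Catalog";   // engine mutates its own object
  internal.reset();                        // engine frees its own object

  std::string value;
  snapshot.Get("Type", &value);            // still safe: reads the copy
  return value;
}

}  // namespace sketch
```

The read after `reset()` is safe because it touches only the copy, and it returns the value as of snapshot time, which is exactly the "snapshot" drawback mentioned above.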

The churn concern is real, but it's worth noting that the churn already exists; it's just happening invisibly on the embedder side. Pypdfium is already running qpdf alongside PDFium to compensate, and other embedders may be facing similar issues. This is not zero maintenance; it's just maintenance we can't see or control. A sufficiently significant internal change would likely be caught by tests anyway, and this is part of the reason why the APIs themselves are "Experimental" and can change at any time. Embedders who adopt experimental APIs likely understand that they could change, since this is stated in the API documentation.

Moreover, wouldn't adding more public APIs for each individual use case an embedder may need also add to the churn? I feel by doing this we are not removing the churn, only moving it to the API-side.

About the potential for bugs: any public API can have bugs, and the dynamic nature of dictionaries does increase the potential surface area, but again this can be kept in mind while designing the API. Shadow copies and an initial read-only phase act as a sort of "waterfall" for this feature. This could perhaps also be coupled with a feature flag (or a bool that we control, to disable the API if anything gets out of hand). About long-term maintenance: the amount of maintenance might be high if the API isn't designed in a sufficiently robust manner, but it reduces the need to add many additional APIs for each use case an embedder may have.

Compared to a purpose-built API per embedder use case, a single well-designed dictionary API is actually the shorter maintenance tail, not the longer one. That level of design is possible: other libraries such as qpdf have managed to provide such APIs.
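As a rough illustration of the "bool that we can control" idea, here is a small sketch in plain C++. Everything in it (`g_dict_api_enabled`, `SetDictApiEnabled`, `ReadValue`) is a hypothetical name for illustration and not an existing PDFium flag or function; the dictionary is modelled as a plain `std::map`.

```cpp
#include <atomic>
#include <map>
#include <string>

namespace sketch {

// Hypothetical kill switch for the experimental read-only API.
std::atomic<bool> g_dict_api_enabled{true};

void SetDictApiEnabled(bool enabled) {
  g_dict_api_enabled.store(enabled);
}

// Hypothetical read-only lookup. It fails closed: when the switch is off,
// every call reports failure and `out` is left untouched.
bool ReadValue(const std::map<std::string, std::string>& dict,
               const std::string& key,
               std::string* out) {
  if (!g_dict_api_enabled.load() || out == nullptr) return false;
  auto it = dict.find(key);
  if (it == dict.end()) return false;
  *out = it->second;
  return true;
}

}  // namespace sketch
```

With a single gate like this, embedders already have to handle a `false` return for missing keys, so disabling the feature would look to them like "value not available" rather than a crash.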

If it helps, I am willing to create a document of more possible API designs (like a design-doc of sorts) so we can spend time creating the "better designed Formula 1 car" and explore the potential implications of the design.

Thanks!

- Aryan

geisserml

May 23, 2026, 8:09:53 AM

to pdfium

Again thanks for the explanation, Lei.
Since I only use PDFium's surface, I was completely unaware of these valid concerns.

FWIW I believe it is theoretically possible to build a PDF library in a way that provides both low-level and higher-level access (qpdf shows it is), but I acknowledge the difficulties in trying to hack this into an existing library that may not have been designed for that use from the ground up.

Aryan Krishnan

May 23, 2026, 11:52:47 AM

to pdfium

In hindsight I feel I may have slightly downplayed these concerns in my last message, and I just want to correct that here:

These concerns are more fundamental than my earlier message may have made them sound. Especially after the discussion here, I believe the core challenge is shifting more towards the "how do we implement it" side of things. It is possible in theory, but what would be impacted in practice? This is one major question that is left unanswered.

I still believe that adding this feature would have quite a major impact on embedders, but at the same time it is difficult to retrofit such a feature into an engine not designed around stable access to CPDF_Objects/Dictionaries. I believe the most productive next step for this feature is to think through the API in a more concrete manner: lifetimes, UAFs, invalidation, and what should/shouldn't be publicly exposed. As I mentioned previously, I'd like to take some time to think this through, mostly as a design exercise but also to look for some potential options for such an API.

Thanks for the detailed discussion so far (and sorry for some of the confusion earlier)

- Aryan

Aryan Krishnan

May 25, 2026, 1:19:33 PM

to pdfium

Hi all, quick update: Just thinking out loud with a rough idea (@thestig - let me know if this helps).

Instead of exposing CPDF_Dictionaries in the API directly, what if we create some kind of query-style API?

What I mean by this is that the embedder requests values through traversal paths or keys (for example: Get sub dictionary X and then Get key Y as an array with {"X", "Y"}, "X/Y"). The exact form of this input is something we can discuss later, but it could probably just be a const char** or a const char* array or something of that sort.

Something more like:

Embedder -> Path query -> FPDF headers -> Corresponding .cpp that deals with the dictionary behaviour -> Makes some calls to CPDF_Dict and other CPDF Helpers -> Goes back up the chain -> Gives the embedder rects, floats, c-strings, ints, bools or something of that sort.

Instead of Embedder -> Get an FPDF_Dict which is basically a reinterpreted CPDF_Dict -> Which could get deleted -> UAF later on -> Problems

I think we still need to explore this further. It is still a very rough idea at this point; I just thought I'd share it here for further thoughts/suggestions, or in case there is anything I have missed that could cause unintended/confusing behaviour.

But this way allows us to keep the dictionary aspects internal to PDFium for now while also providing this read only api without many of the downsides of the shadow-copy/snapshot approach.
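To make the query-style flow a bit more concrete, here is a minimal sketch in plain C++. It assumes the path is passed as an array of keys, and every name in it (`Node`, `Resolve`, `QueryInt`, `MakeSampleTree`) is hypothetical and invented for illustration; none of it is PDFium code. The embedder only ever receives a copied primitive, never a pointer into the object tree.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

namespace sketch {

// Hypothetical stand-in for an internal PDF object; NOT a PDFium type.
struct Node {
  enum class Kind { kDict, kInt };
  Kind kind = Kind::kDict;
  int int_value = 0;                                      // used when kInt
  std::map<std::string, std::unique_ptr<Node>> children;  // used when kDict
};

// Internal helper: follows `count` keys starting at `root`. Returns nullptr
// if a key is missing or a step goes through a non-dictionary node.
const Node* Resolve(const Node& root, const char* const* path, size_t count) {
  const Node* cur = &root;
  for (size_t i = 0; i < count; ++i) {
    if (cur->kind != Node::Kind::kDict || path[i] == nullptr) return nullptr;
    auto it = cur->children.find(path[i]);
    if (it == cur->children.end()) return nullptr;
    cur = it->second.get();
  }
  return cur;
}

// Public-style entry point: copies the integer at `path` into `out`.
// The internal pointer never leaves this function.
bool QueryInt(const Node& root, const char* const* path, size_t count,
              int* out) {
  const Node* node = Resolve(root, path, count);
  if (node == nullptr || node->kind != Node::Kind::kInt || out == nullptr) {
    return false;
  }
  *out = node->int_value;
  return true;
}

// Builds a tiny tree for demonstration: root -> "X" (dict) -> "Y" (int 90).
Node MakeSampleTree() {
  Node root;
  std::unique_ptr<Node> x(new Node());
  std::unique_ptr<Node> y(new Node());
  y->kind = Node::Kind::kInt;
  y->int_value = 90;
  x->children["Y"] = std::move(y);
  root.children["X"] = std::move(x);
  return root;
}

}  // namespace sketch
```

A call such as `QueryInt(root, path, 2, &value)` with `path = {"X", "Y"}` either fills `value` or returns false, so a deleted or reshaped internal object shows up as a failed query rather than a dangling handle.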

Let me know if this could work as an alternative approach, or if you have any suggestions.

Thanks!

- Aryan

Aryan Krishnan

May 25, 2026, 1:20:19 PM

to pdfium

Clarification
Replace: as an array with {"X", "Y"}, "X/Y"
with: as an array with {"X", "Y"} or perhaps as "X/Y"

Aryan Krishnan

Jun 8, 2026, 7:27:10 AM

to pdfium

@thestig, following up: any thoughts on whether the new strategy helps, or is there a specific additional concern I'm missing here?
