Hi Lei,
Thanks for the points. I agree that there are risks associated with adding a whole new suite of APIs (as with any API), and that this is especially true for something as dynamic as dictionaries. However, I think most of these points can be addressed through better API design.
To frame this concretely upfront: if we have a poorly designed Formula 1 car, it will obviously require a lot of changes to engine components, aero surfaces, etc. But a better designed Formula 1 car (think Red Bull 2023 :)) needs less maintenance overall. Our current API gives embedders access to the output of the car's main computer (the speed, brake input, current gear) and allows them to write to some of these as well. Dictionaries would let them understand the engine further, and if the computer (the API) is built well enough, it won't need as much maintenance, because it would already be robust enough to prevent these bugs.
Yes, PDFium does manipulate CPDF_Objects (and hence dictionaries) internally, but this can be addressed by smarter API design. For example, if a CPDF_Object (and hence its dictionary) may be deleted, we can always construct a new CPDF_Dictionary with the needed keys when one is requested. This is also part of the reason the initial APIs are read-only for now; write support is something we can consider further down the line, once we figure out a better way to make it work. Using these copies also helps mitigate UAFs and reduce unexpected behaviour (although one drawback is that the embedder only gets a "snapshot" of the dictionary).
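To make the snapshot idea concrete, here is a minimal sketch of what I have in mind. It is an illustration only and does not use any real PDFium types: `LiveDict`, `DictSnapshot` and `SnapshotDemo` are hypothetical names, and the values are plain strings rather than CPDF_Objects. The point is only that a deep copy taken at request time stays valid after the internal object is mutated or destroyed.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <optional>
#include <string>

// Hypothetical stand-in for a dictionary owned and mutated by the library.
// This is NOT CPDF_Dictionary; it only models "string keys, can be deleted".
struct LiveDict {
  std::map<std::string, std::string> entries;
};

// Hypothetical read-only view handed to the embedder. It deep-copies the
// entries on construction, so it holds no pointer into the live object.
class DictSnapshot {
 public:
  explicit DictSnapshot(const LiveDict& live) : entries_(live.entries) {}

  // Returns the value for `key`, or std::nullopt if the key is absent.
  std::optional<std::string> Get(const std::string& key) const {
    auto it = entries_.find(key);
    if (it == entries_.end()) return std::nullopt;
    return it->second;
  }

  std::size_t Size() const { return entries_.size(); }

 private:
  std::map<std::string, std::string> entries_;  // no setters: read-only
};

// Usage demo: the snapshot outlives the live dictionary and does not
// observe later changes to it (the "snapshot" trade-off).
void SnapshotDemo() {
  auto live = std::make_unique<LiveDict>();
  live->entries["Type"] = "Page";

  DictSnapshot snap(*live);          // copy taken at request time
  live->entries["Type"] = "Catalog"; // later internal mutation
  live.reset();                      // internal object is deleted

  assert(snap.Size() == 1);
  assert(snap.Get("Type") == std::optional<std::string>("Page"));
}
```

The cost is the copy itself and the staleness shown in the demo, which is why I see this as a fit for a read-only first phase rather than for write support.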
The churn concern is real, but it's worth noting that the churn already exists; it's just happening invisibly on the embedder side. Pypdfium is already running qpdf alongside PDFium to compensate, and other embedders may be facing similar issues. This is not zero maintenance, it's just maintenance we can't see or control. Breakage from a significant internal change would likely be caught by tests anyway, and this is part of the reason the APIs themselves are marked "Experimental" and can change at any time. Embedders who adopt experimental APIs presumably understand that they may change, since this is stated in the API documentation.
Moreover, wouldn't adding more public APIs for each individual use case an embedder may need also add to the churn? I feel that by adding per-use-case APIs we are not removing the churn, only moving it to the API side.
About the potential for bugs: any public API can have bugs, and the dynamic nature of dictionaries does increase the potential surface area. Again, though, this can be kept in mind while designing the API. Shadow copies and an initial read-only phase act as a sort of staged "waterfall" rollout for this feature. This could also be coupled with a feature flag (or a bool that we control, to disable the API if anything gets out of hand). About long-term maintenance: the amount of maintenance might be high if the API isn't designed robustly enough, but a robust design reduces the need to add many additional APIs for each use case an embedder may have.
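As a rough sketch of the "bool that we control" idea, under the assumption that a process-wide switch is acceptable: `g_dict_api_enabled`, `SetDictApiEnabled` and `ReadDictValue` below are hypothetical names, not existing PDFium symbols, and `Dict` is a plain map standing in for a dictionary copy.

```cpp
#include <atomic>
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical kill switch for the experimental dictionary API.
// Enabled by default; can be turned off without rebuilding embedders.
std::atomic<bool> g_dict_api_enabled{true};

void SetDictApiEnabled(bool enabled) { g_dict_api_enabled.store(enabled); }

// Plain map standing in for a dictionary snapshot.
using Dict = std::map<std::string, std::string>;

// Read-only accessor: returns std::nullopt when the API is disabled or the
// key is absent, so a disabled API looks like "no data", not a crash.
std::optional<std::string> ReadDictValue(const Dict& dict,
                                         const std::string& key) {
  if (!g_dict_api_enabled.load()) return std::nullopt;
  auto it = dict.find(key);
  if (it == dict.end()) return std::nullopt;
  return it->second;
}

// Usage demo: flipping the switch hides values without touching the data.
void KillSwitchDemo() {
  const Dict dict{{"Type", "Page"}};
  assert(ReadDictValue(dict, "Type").has_value());
  SetDictApiEnabled(false);
  assert(!ReadDictValue(dict, "Type").has_value());
  SetDictApiEnabled(true);  // restore the default
}
```

Since there is no write path at all in this phase, the worst case when something goes wrong is that embedders stop receiving values until the switch is turned back on.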
Compared to a purpose-built API per embedder use case, a single well-designed dictionary API is actually the shorter maintenance tail, not the longer one. That level of design is achievable: other libraries, such as qpdf, have managed to provide such APIs.
If it helps, I am willing to create a document of more possible API designs (like a design-doc of sorts) so we can spend time creating the "better designed Formula 1 car" and explore the potential implications of the design.
Thanks!
- Aryan