Confused about ownership of font objects

Victor Suadicani

Jul 19, 2023, 10:54:15 AM
to pdfium
Hi,

I'm looking at extracting fonts from PDF documents using PDFium's public API. For this, there is the (experimental) FPDFTextObj_GetFont(FPDF_PAGEOBJECT text) API. The API notes in the comments: "Returns a handle to the font object held by |text| which retains ownership."

This wording suggests to me that the text object retains ownership of the font and that I should not use the font object after having freed the text object. However, text objects don't need to be freed, as they are owned by their containing page. So that makes me think that I can use the font object as long as the page containing the text object isn't closed.

But this seems a bit weird to me, since fonts are not (to my knowledge) associated with specific pages in a PDF. As far as I understand, fonts are just global objects in the PDF that can be referenced by any page. So that makes me think that a font object should live for as long as the containing PDF document isn't closed. But the API does not state this.

To add to the confusion, in a document with a single font and multiple pages, I can see that the pointers to font objects extracted from two different text objects on different pages are equal - they refer to the same font object. Does this mean that both pages refer to the same font in some reference-counted manner, or are fonts actually owned by the containing documents, not by the pages?

I would really appreciate it if anyone has some insight here, as the ownership of fonts is not clear to me. Thanks! :)

Best,
Victor Nordam Suadicani

geisserml

Jul 19, 2023, 3:55:39 PM
to pdfium
I would guess that the doc comment is correct, that fonts are stored at the document level in a reference-counted manner, so they're freed when the last reference is destroyed.
To verify this theory, you could attempt to access a font after all referrers (i.e. pages and loose pageobjects) have been closed, and see if it crashes or misbehaves.

Lei Zhang

Jul 19, 2023, 4:09:55 PM
to Victor Suadicani, pdfium
Internally, the fonts are ref-counted, so there really isn't an owner.
The document does hold references to fonts it uses, so I believe a
font can outlive the page. Patches to improve PDFium are welcome. In this
case, preferably with a test that shows closing a page does not affect
the font handle's validity.
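
To make the ref-counting claim concrete, here is a toy C++ model. It is not PDFium code: `Font`, `Document`, and `Page` are made-up stand-ins, and the assumption that the document keeps a strong reference is exactly the claim being illustrated, not something verified against PDFium's internals.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Toy stand-ins; none of these types exist in PDFium.
struct Font {
  std::string name;
};

struct Document {
  // Assumption: the document keeps one strong reference per loaded font.
  std::map<std::string, std::shared_ptr<Font>> fonts;

  std::shared_ptr<Font> GetFont(const std::string& name) {
    auto& slot = fonts[name];
    if (!slot) slot = std::make_shared<Font>(Font{name});
    return slot;
  }
};

struct Page {
  // Each page holds its own strong reference to every font it uses.
  std::vector<std::shared_ptr<Font>> fonts;
};

// Returns true if a font shared by two pages is a single object that is
// still alive after both pages have been closed.
bool FontOutlivesPages() {
  Document doc;
  auto page1 = std::make_unique<Page>();
  auto page2 = std::make_unique<Page>();
  page1->fonts.push_back(doc.GetFont("Helvetica"));
  page2->fonts.push_back(doc.GetFont("Helvetica"));

  // Both pages see the same object, hence equal pointers.
  const Font* handle = page1->fonts[0].get();
  assert(handle == page2->fonts[0].get());

  page1.reset();  // "close" page 1
  page2.reset();  // "close" page 2

  // Only the document's reference remains, and it keeps the object alive.
  const std::shared_ptr<Font>& kept = doc.fonts.at("Helvetica");
  return kept.use_count() == 1 && kept.get() == handle;
}
```

In this model nothing "owns" the font in the exclusive sense; it lives as long as any holder does. Whether a raw handle stays usable after a page closes depends entirely on whether some other strong reference really exists.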


geisserml

Jul 19, 2023, 4:18:16 PM
to pdfium
Ok, I was wrong then. Thanks for clarifying & sorry for the additional confusion.

Victor Suadicani

Jul 20, 2023, 3:48:22 AM
to pdfium
Thanks for the clarification. It'd be nice if there were an API to just get all the fonts from a given document, so you don't have to go through all the text objects to find the fonts. Anyway, thanks!

Victor Suadicani

Jul 20, 2023, 9:03:12 AM
to Victor Suadicani, pdfium
So I just tried it out and it doesn't work, at least not entirely. I get a text object and its font through FPDFTextObj_GetFont. As long as the page is alive, I can call FPDFFont_GetFontData and get the data. However, as soon as I close the page, FPDFFont_GetFontData always returns a buffer of 0 length. In some cases (maybe all cases?) I also run into undefined behavior (at one point I got a crash as my program tried to allocate 1.3 terabytes). Strangely, I seem to still be able to call FPDFFont_GetWeight fine, but that might just be luck (undefined behavior and all that).

I can also see that FPDFTextObj_GetFont has this comment just before it returns the font handle:
// Unretained reference in public API. NOLINTNEXTLINE
return FPDFFontFromCPDFFont(pTextObj->GetFont());

The commit message adding that comment says:
"Add NOLINTNEXTLINE() to public methods with murky ownership.

Otherwise, clang analyzer is concerned about object lifetimes."

Perhaps clang analyzer is right to be concerned about the lifetime? I don't know the details of PDFium's internals, but the comment sounds like it is not "retaining" (incrementing the reference counter?) the font. Is this what is causing the font to be freed when the page is closed?

I would love to contribute/improve PDFium, but the code is quite opaque to me (I say that as someone with intermediate C++ skills) and barely commented, so it is quite difficult to figure out how this problem could be fixed. Would greatly appreciate any insight you might have :)
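
A toy sketch of the "unretained" hypothesis, under the assumption that the page is the only strong holder of the font. `Font` and `Page` are made-up types, not PDFium's, and a `std::weak_ptr` stands in for the raw handle so that the expiry can be checked safely instead of triggering undefined behavior.

```cpp
#include <cassert>
#include <memory>

struct Font { int weight = 400; };  // toy stand-in, not a PDFium type

struct Page {
  // Assumption: the page holds the only strong reference to its font.
  std::shared_ptr<Font> font = std::make_shared<Font>();
};

// Returns true if the font is destroyed as soon as the page is closed.
bool UnretainedHandleExpiresWithPage() {
  auto page = std::make_unique<Page>();

  // An unretained handle: it observes the font without adding a reference.
  // A raw pointer would behave the same way, except that reading through it
  // after the font is gone would be undefined behavior.
  std::weak_ptr<Font> handle = page->font;
  assert(!handle.expired());

  page.reset();  // "close" the page, dropping the only strong reference

  return handle.expired();
}
```

With a raw pointer in place of the `weak_ptr`, a later read might appear to work or might crash, which would be consistent with one call seeming fine while another fails.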

K. Moon

Jul 20, 2023, 12:06:32 PM
to Victor Suadicani, pdfium
Retaining a reference in this case probably would result in a memory leak, since the C API can't run C++ destructors. We generally don't expect returned objects to be retained by the caller, unless there's an explicit create/destroy API.
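
As a rough illustration of that trade-off, here is a minimal sketch of a C-style handle that does retain a reference. All names (`ToyFontHandle`, `ToyFont_Retain`, `ToyFont_Release`, `ToyFont_GetWeight`) are hypothetical and exist only in this example; they are not part of PDFium's API.

```cpp
#include <cassert>
#include <memory>

struct Font { int weight = 400; };  // toy stand-in, not a PDFium type

// What a C caller would see as an opaque handle. Internally it is just one
// extra strong reference stored on the heap.
struct ToyFontHandle {
  std::shared_ptr<Font> ref;
};

// Hands out a retained handle; the caller must pass it to ToyFont_Release.
ToyFontHandle* ToyFont_Retain(const std::shared_ptr<Font>& font) {
  return new ToyFontHandle{font};
}

int ToyFont_GetWeight(const ToyFontHandle* h) { return h->ref->weight; }

// The explicit "destroy" half of the pair; it drops the caller's reference.
void ToyFont_Release(ToyFontHandle* h) { delete h; }

struct Counts { long before, held, after; };

// Records the font's reference count before retaining, while the caller
// holds a handle, and after the caller releases it.
Counts RetainReleaseCounts() {
  auto library_ref = std::make_shared<Font>();
  Counts c{};
  c.before = library_ref.use_count();  // library's reference only
  ToyFontHandle* h = ToyFont_Retain(library_ref);
  c.held = library_ref.use_count();    // library + caller's handle
  assert(ToyFont_GetWeight(h) == 400);
  ToyFont_Release(h);                  // skipping this would leak a reference
  c.after = library_ref.use_count();   // caller's reference is gone
  return c;
}
```

A C caller has no destructor to drop the extra reference automatically, so a retained handle is only safe when it comes with a matching release function that the caller is required to call.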

Victor Suadicani

Jul 21, 2023, 3:56:15 AM
to K. Moon, pdfium
I would personally much prefer an explicit create/destroy API. If I want to extract all fonts from a PDF (without duplicates) as it is right now, I need to keep *all* pages in memory to be able to compare the handles for uniqueness. For large documents, I would expect that to not be feasible. Most of the other APIs have an explicit destroy function - why not for fonts as well? It would really help in my use case.
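
A toy sketch of that use case, assuming font handles were retained references: the set of unique fonts keeps each font alive, so only one page needs to be open at a time. `Font`, `Document`, and `LoadPage` are invented for this example and do not correspond to PDFium types or functions.

```cpp
#include <cstddef>
#include <memory>
#include <set>
#include <string>
#include <vector>

struct Font { std::string name; };  // toy stand-in, not a PDFium type

struct Document {
  std::vector<std::shared_ptr<Font>> fonts;         // fonts in the file
  std::vector<std::vector<int>> page_font_indices;  // fonts used per page

  // "Opens" page i and returns retained references to the fonts it uses.
  std::vector<std::shared_ptr<Font>> LoadPage(std::size_t i) const {
    std::vector<std::shared_ptr<Font>> out;
    for (int idx : page_font_indices[i]) out.push_back(fonts[idx]);
    return out;
  }
};

// Collects the distinct fonts of a document, one open page at a time.
std::size_t CountUniqueFonts(const Document& doc) {
  // shared_ptr compares by pointer value, so the set deduplicates by
  // object identity, and its strong references keep addresses stable.
  std::set<std::shared_ptr<Font>> unique;
  for (std::size_t i = 0; i < doc.page_font_indices.size(); ++i) {
    auto page_fonts = doc.LoadPage(i);  // only this page is "open"
    unique.insert(page_fonts.begin(), page_fonts.end());
  }  // the page's font list goes away at the end of each iteration
  return unique.size();
}

// Three pages that use two distinct fonts between them.
std::size_t CountUniqueFontsDemo() {
  Document doc;
  doc.fonts = {std::make_shared<Font>(Font{"A"}),
               std::make_shared<Font>(Font{"B"})};
  doc.page_font_indices = {{0}, {0, 1}, {1}};
  return CountUniqueFonts(doc);
}
```

With unretained handles the same loop would be unsafe, because a pointer stored from a closed page could dangle or have its address reused by a different font.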

Victor Suadicani

Jul 21, 2023, 4:05:01 AM
to K. Moon, pdfium
Alternatively, having an API that would directly get all the fonts of a document from the document itself (i.e. FPDF_DOCUMENT) without having to go through pages and text objects would also work.

Victor Suadicani

Jul 21, 2023, 4:21:39 AM
to K. Moon, pdfium
Also, there is actually already an FPDFFont_Close function, which seems to destroy the font object. So presumably it would just have to be called, and then it would be fine?

Victor Suadicani

Jul 21, 2023, 6:38:15 AM
to K. Moon, pdfium
I think I actually managed to figure out the changes necessary for doing this and wrote a test as well. I'll try to do the first contributor stuff (corporate contributor agreement, adding to authors and such) and make a change request (not sure if that's the correct term).

Victor Suadicani

Aug 25, 2023, 5:35:11 AM
to K. Moon, pdfium
So I _finally_ got the legal department to sign the contributor agreement and I put up this change: https://pdfium-review.googlesource.com/c/pdfium/+/111470

Feedback appreciated, would love to have this merged :)