FPDFTextObj_GetText returns glyph ids instead of text


Jens Thomsen

Feb 25, 2021, 9:46:20 AM
to pdfium
Hi

I'm using FPDFTextObj_GetText to extract the text of PDFs, but there is a problem with the attached PDF. Instead of outputting "Schaal" I get "VFKDDO". It looks like the function is returning the glyph IDs instead of the char codes.
Should I create a ticket for this bug?

Jens Thomsen

Feb 25, 2021, 9:47:38 AM
to pdfium
WrongText.pdf

Olivia Yingst

Feb 25, 2021, 2:26:58 PM
to Jens Thomsen, pdfium
Hi Jens,

Text extraction uses a ToUnicode map to map each char code to its Unicode characters, and the ToUnicode map is optional in a PDF.
In this particular PDF, no ToUnicode map is available, so PDFium treats the char codes as Unicode code points and extracts "VFKDDO", i.e. the code points "0056 0046 004B 0044 0044 004F".
If you test extracting the text with Acrobat or macOS Preview, you will notice that you can only extract unknown/control characters.
If you test extracting the text with Okular, it extracts the char codes directly without converting them to their mapped Unicode values.
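
To illustrate the fallback described above, here is a minimal Python sketch. It is not PDFium's actual code: `decode_char_codes` and `hypothetical_map` are made-up names, and the map is reconstructed by hand from the two strings in this thread purely to show what a ToUnicode map would have provided if the PDF had included one.

```python
def decode_char_codes(char_codes, to_unicode=None):
    """Turn char codes into text.

    With a ToUnicode-style map, each code is looked up in it.
    Without one, each code is treated as a Unicode code point
    (the fallback behaviour described in this thread).
    """
    if to_unicode is None:
        # No map available: interpret the char codes as code points.
        return "".join(chr(code) for code in char_codes)
    # Map available: unknown codes become U+FFFD (replacement character).
    return "".join(to_unicode.get(code, "\ufffd") for code in char_codes)


# The char codes reported for the text object in the attached PDF.
codes = [0x56, 0x46, 0x4B, 0x44, 0x44, 0x4F]

# Hypothetical map, built by hand from "VFKDDO" vs. "Schaal";
# the real PDF does not contain this information.
hypothetical_map = {0x56: "S", 0x46: "c", 0x4B: "h", 0x44: "a", 0x4F: "l"}

print(decode_char_codes(codes))                    # VFKDDO
print(decode_char_codes(codes, hypothetical_map))  # Schaal
```

The point of the sketch is that the char codes alone do not carry the information needed to recover "Schaal". That information lives in the ToUnicode map, which this PDF omits.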

It really comes down to whether we want FPDFTextObj_GetText() to extract the text when the PDF deliberately doesn't want you to do so. I think it's worth filing a bug to discuss whether text extraction should return nothing when the ToUnicode map is not available.

Thanks,
Olivia


Jens Thomsen

Feb 26, 2021, 8:58:34 AM
to pdfium
Hi Olivia

It makes sense now, but is there a way to get the Unicode mapping that PDFium uses to render PDFs, so I can map "VFKDDO" to "Schaal"?
Kind regards
Jens

Olivia Yingst

Feb 26, 2021, 1:12:53 PM
to Jens Thomsen, pdfium
Hi Jens,

The answer is no. A PDF creator should have the freedom to prevent the text inside the PDF from being extracted.
Using the mapping meant for rendering to do text extraction would go against the PDF standard, and the same holds the other way around.
Also, the mapping used for rendering often maps char codes to the glyphs' bitmaps, not to their Unicode values.

Thanks.
Olivia
