Garbled characters appear when using pdfium to extract text content

Haye Lee

Oct 30, 2023, 2:18:43 PM
to pdfium
As the title says, I have a PDF document, and when I wrote a simple demo that calls the pdfium API to extract its text content, the extracted output is garbled.

Haye Lee

Oct 30, 2023, 9:48:22 PM
to pdfium
Here is a portion of my code:
int nPageCount = GetPageCount(pDocument);
CStringW strAllTextW;
for (int i = 0; i < nPageCount; i++)
{
    void* page = LoadPage(pDocument, i);
    if (page != NULL)
    {
        void* textpage = LoadTextPage(page);
        if (textpage != NULL)
        {
            int nCharCount = CountTextChars(textpage);
            // Two bytes per UTF-16 character, plus room for the terminator
            byte* bufferW = new byte[(nCharCount + 1) * 2];
            GetText(textpage, 0, nCharCount, bufferW);

            CStringW strTextW((WCHAR *)bufferW, nCharCount);
            strAllTextW += strTextW;

            delete[] bufferW;
            bufferW = NULL;
        }
    }
}

When I inspect the returned string, it is not the text content of the PDF document but unreadable characters.

Lei Zhang

Oct 31, 2023, 5:58:02 PM
to Haye Lee, pdfium
Does this issue occur on all PDFs, or only certain PDFs? If it only
happens on certain PDFs, then please provide a sample.

PDFium provides FPDF_GetPageCount(). Is GetPageCount() in your code
snippet some kind of wrapper function?

Haye Lee

Oct 31, 2023, 10:56:50 PM
to pdfium
Thank you for your reply. The problem only appears with specific PDF documents, and I have solved it: it was caused by anomalous data in those documents.
But now there is a new problem: the image data I extract from the PDF document is abnormal. Again, this only happens with some specific PDF documents. Below are part of my code and one of the affected PDF documents.
Looking forward to your reply.
int nPageCount = FPDF_GetPageCount(pDocument);
int nImageCount = 0;
for (int i = 0; i < nPageCount; i++)
{
    void* page = FPDF_LoadPage(pDocument, i);
    int nPageObjectCount = FPDFPage_CountObjects(page);
    for (int j = 0; j < nPageObjectCount; j++)
    {
        void* pageObject = FPDFPage_GetObject(page, j);
        int nType = FPDFPageObj_GetType(pageObject);
        if (nType == 3)  // 3 == FPDF_PAGEOBJ_IMAGE
        {
            nImageCount++;
            // First call with a NULL buffer returns the required size in bytes
            DWORD buflen = FPDFImageObj_GetImageDataDecoded(pageObject, NULL, 0);
            byte* buffer = new byte[buflen];
            FPDFImageObj_GetImageDataDecoded(pageObject, buffer, buflen);

            CStringW strFilePath(strDirPath + L'\\');
            strFilePath.AppendFormat(L"Image%d.jpg", nImageCount);
            WriteBufToFile(strFilePath, buffer, buflen);

            delete[] buffer;
            buffer = NULL;
        }
    }
    FPDF_ClosePage(page);  // release the page once its objects are processed
}
WriteBufToFile does only one thing: it saves a buffer of the specified length to the specified file.
Test.pdf

Justin Pierce

Nov 1, 2023, 2:51:55 AM
to pdfium
Try using `FPDFImageObj_GetBitmap` for the easiest extraction. Getting decoded image data is more complicated and requires more knowledge to use properly. `FPDFImageObj_GetImageDataRaw` is also available.

Haye Lee

Nov 1, 2023, 4:25:04 AM
to pdfium
I can get the data using FPDFImageObj_GetBitmap, but how do I turn this data into an image?

geisserml

Nov 1, 2023, 4:37:20 PM
to pdfium
You'd have to use some imaging library that can encode the raw bitmap data into a file format, e.g. pypdfium2 can use pillow or numpy+cv2.
It should also be relatively straightforward to build your own PPM writer without additional libraries.

However, FPDFImageObj_GetBitmap() isn't ideal for extracting to a file because of the re-encoding. For JPEG/JP2 images, FPDFImageObj_GetImageDataDecoded() should already give you the data losslessly.
See https://crbug.com/pdfium/1930 for discussion of other formats.
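
To illustrate the PPM route, here is a minimal sketch of such a writer. It assumes the pixel data is in BGR, BGRx or BGRA byte order with a row stride, which is what FPDFBitmap_GetFormat() / FPDFBitmap_GetStride() report for color bitmaps; grayscale bitmaps are not handled. EncodePpmFromBgr is a made-up helper name, not a PDFium function.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of PDFium): packs a BGR/BGRx/BGRA pixel
// buffer into a binary PPM (P6) image held in memory.
//   pixels        - first byte of the top row
//   stride        - bytes per row; may exceed width * bytesPerPixel
//   bytesPerPixel - 3 for BGR, 4 for BGRx/BGRA (the 4th byte is dropped)
// Returns an empty vector on invalid input.
std::vector<std::uint8_t> EncodePpmFromBgr(const std::uint8_t* pixels,
                                           int width, int height,
                                           int stride, int bytesPerPixel)
{
    std::vector<std::uint8_t> out;
    if (pixels == nullptr || width <= 0 || height <= 0 ||
        (bytesPerPixel != 3 && bytesPerPixel != 4) ||
        stride < width * bytesPerPixel)
        return out;

    // PPM header: magic number, dimensions, maximum channel value.
    const std::string header = "P6\n" + std::to_string(width) + " " +
                               std::to_string(height) + "\n255\n";
    out.reserve(header.size() +
                static_cast<std::size_t>(width) * height * 3);
    out.insert(out.end(), header.begin(), header.end());

    for (int y = 0; y < height; ++y) {
        const std::uint8_t* row =
            pixels + static_cast<std::size_t>(y) * stride;
        for (int x = 0; x < width; ++x) {
            const std::uint8_t* px = row + x * bytesPerPixel;
            out.push_back(px[2]);  // R
            out.push_back(px[1]);  // G
            out.push_back(px[0]);  // B
        }
    }
    return out;
}
```

With a bitmap from FPDFImageObj_GetBitmap(), the arguments would presumably come from FPDFBitmap_GetBuffer(), FPDFBitmap_GetWidth(), FPDFBitmap_GetHeight() and FPDFBitmap_GetStride(), and the returned bytes can be written to a .ppm file unchanged.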

Haye Lee

Nov 1, 2023, 10:04:15 PM
to pdfium
The meaning of your statement is not entirely clear to me, but are you suggesting that even after using FPDFImageObj_GetImageDataDecoded(), I still need to further process the resulting data and compile it into an actual image file?

Haye Lee

Nov 2, 2023, 3:03:39 AM
to pdfium
Following your description, I read up on the PPM image file format and implemented a function that generates the PPM file header. After assembling the header and the pixel data, I can now open the result as an image file. Thank you.

geisserml

Nov 2, 2023, 9:04:08 AM
to pdfium
Nice that you got it working. Some clarifications: after FPDFImageObj_GetBitmap(), one always has to encode the data to save an image file.
Note that FPDFImageObj_GetImageDataDecoded() only decodes a subset of filters and leaves complex filters in place. So if the image uses the DCT (JPEG) filter, the returned data is still a JPEG stream, and you can save it as-is without encoding.
If the image has no complex filters, however, you get raw pixel data and need to encode it yourself. (I agree the function name is a bit confusing.)
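
One rough way to tell the two cases apart is to look at the first bytes of the returned buffer. The sketch below only checks the well-known JPEG and JP2 file signatures; GuessImageExtension is a made-up helper name, and raw pixel data could by coincidence start with the same bytes, so treat it as a heuristic. Querying the filter names via FPDFImageObj_GetImageFilterCount() and FPDFImageObj_GetImageFilter() is likely the more reliable check.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of PDFium): guesses a file extension from the
// leading bytes of a buffer. Returns "" when no known signature matches,
// in which case the data is probably raw pixels that still need encoding.
const char* GuessImageExtension(const std::uint8_t* data, std::size_t size)
{
    // JPEG streams start with the SOI marker followed by another marker.
    static const std::uint8_t kJpeg[] = {0xFF, 0xD8, 0xFF};
    // JP2 files start with the 12-byte JPEG 2000 signature box.
    static const std::uint8_t kJp2[] = {0x00, 0x00, 0x00, 0x0C, 0x6A, 0x50,
                                        0x20, 0x20, 0x0D, 0x0A, 0x87, 0x0A};
    if (data == nullptr)
        return "";
    if (size >= sizeof(kJpeg) && std::memcmp(data, kJpeg, sizeof(kJpeg)) == 0)
        return ".jpg";
    if (size >= sizeof(kJp2) && std::memcmp(data, kJp2, sizeof(kJp2)) == 0)
        return ".jp2";
    return "";
}
```

If the function returns a non-empty extension, the buffer can be written to disk unchanged; otherwise it would go through an encoder such as a PPM writer.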