Issue 1597 in pdfium: Wrong text when using FPDFTextObj_GetText


gipet… via monorail

Oct 12, 2020, 8:41:55 AM
to pdfiu...@googlegroups.com
Status: Unconfirmed
Owner: ----
Labels: Type-Defect Priority-Medium

New issue 1597 by gipet...@gmail.com: Wrong text when using FPDFTextObj_GetText
https://bugs.chromium.org/p/pdfium/issues/detail?id=1597

What steps will reproduce the problem?

Use the following code to read the text of the attached PDF file.
// Initialize PDFium with a default configuration.
FPDF_LIBRARY_CONFIG config;
config.version = 2;
config.m_pUserFontPaths = NULL;
config.m_pIsolate = NULL;
config.m_v8EmbedderSlot = 0;
FPDF_InitLibraryWithConfig(&config);

FPDF_DOCUMENT pdfDocument = FPDF_LoadDocument("PATH_TO_embedded_images.pdf", NULL);
FPDF_PAGE page = FPDF_LoadPage(pdfDocument, 0);
FPDF_TEXTPAGE textPage = FPDFText_LoadPage(page);
int objectCount = FPDFPage_CountObject(page);
for (int i = 0; i < objectCount; i++)
{
    FPDF_PAGEOBJECT pageObject = FPDFPage_GetObject(page, i);
    int type = FPDFPageObj_GetType(pageObject);
    if (type == FPDF_PAGEOBJ_TEXT)
    {
        // The first call returns the required size in bytes, including the trailing NUL.
        unsigned long size = FPDFTextObj_GetText(pageObject, textPage, nullptr, 0);
        std::vector<unsigned short> buffer(size / 2);
        FPDFTextObj_GetText(pageObject, textPage, buffer.data(), size);
        if (!buffer.empty()) buffer.pop_back();  // drop the trailing NUL
        std::wstring str(buffer.begin(), buffer.end());
        std::wcout << "Size: " << size << " \"" << str << "\"" << std::endl;
    }
}

// Release resources in reverse order of acquisition.
FPDFText_ClosePage(textPage);
FPDF_ClosePage(page);
FPDF_CloseDocument(pdfDocument);
FPDF_DestroyLibrary();

What is the expected output? What do you see instead?
The fourth text object is returned as "LZW tiff; Flate RGB jpeg; Flate CMYK j"
Shouldn't this return three separate text objects?
When we try to visualize this, we get a wrong result, as depicted in WrongText.png

Attachments:
embedded_images.pdf 33.5 KB
WrongText.png 64.6 KB


n… via monorail

Oct 30, 2020, 3:48:48 PM
to pdfiu...@googlegroups.com
Updates:
Status: WontFix

Comment #1 on issue 1597 by ni...@chromium.org: Wrong text when using FPDFTextObj_GetText
https://bugs.chromium.org/p/pdfium/issues/detail?id=1597#c1

Once you uncompress the PDF file, you will see that "LZW tiff; Flate RGB jpeg; Flate CMYK j" was written within a single text object (shown below):
[(L)-11(Z)-8(W)4( t)-6(if)12(f)10(;)-4( )-3( )-3( )-3( )-3( )-3( )-3( )19( )-3( )-3( )-3( )-3( )-3( )-3( )21( )-3( F)3(la)4(t)-5(e )-2(RG)-7(B )20(j)-10(peg)-5(;)-4( )-3( )19( )-3( )-3( )-3( )21( )-3( )-3( )-3( )-3( )-3( )-3( )-3( )-3( )-3( )21( F)3(la)4(t)-5(e )-2(C)10(M)-8(Y)10(K )-3(j)] TJ

Since all characters are drawn with one TJ operator, FPDFTextObj_GetText() is working as intended.

gipet… via monorail

Oct 30, 2020, 9:18:31 PM
to pdfiu...@googlegroups.com

Comment #2 on issue 1597 by gipet...@gmail.com: Wrong text when using FPDFTextObj_GetText
https://bugs.chromium.org/p/pdfium/issues/detail?id=1597#c2

So, perhaps there are missing spaces in the returned text? Otherwise, how does the PDF have spaces in Chrome?

sharo… via monorail

Nov 18, 2020, 1:42:33 AM
to pdfiu...@googlegroups.com

Comment #3 on issue 1597 by sharo...@gmail.com: Wrong text when using FPDFTextObj_GetText
https://bugs.chromium.org/p/pdfium/issues/detail?id=1597#c3

(No comment was entered for this change.)

Attachments:
457641_314432011953759_499152530_o.jpg 249 KB

sharo… via monorail

Nov 18, 2020, 1:43:37 AM
to pdfiu...@googlegroups.com

Comment #4 on issue 1597 by sharo...@gmail.com: Wrong text when using FPDFTextObj_GetText
https://bugs.chromium.org/p/pdfium/issues/detail?id=1597#c4

thank you