Issue 1618 in pdfium: FPDFText_GetText mangles first arabic character


st… via monorail

Nov 13, 2020, 1:59:32 AM
to pdfiu...@googlegroups.com
Status: Unconfirmed
Owner: ----
Labels: Type-Defect Priority-Medium

New issue 1618 by st...@logicore.se: FPDFText_GetText mangles first arabic character
https://bugs.chromium.org/p/pdfium/issues/detail?id=1618

What steps will reproduce the problem?
1. Download https://dyt.se/wp-content/uploads/2018/01/99-names-1.pdf
2. Extract the text of the first page using FPDFText_LoadPage()/FPDFText_GetText()
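
For anyone reproducing step 2 without writing C, the call sequence can be sketched in Python through ctypes. This is a minimal sketch, not the reporter's code: `extract_page_text`, `decode_utf16le`, `lib_path` and `pdf_path` are hypothetical names, and it assumes a little-endian host and a PDFium shared library on disk (for example the pdfium.dll shipped in the binaries mentioned in this report). Error handling is reduced to null checks.

```python
import contextlib
import ctypes
import os


def decode_utf16le(raw: bytes) -> str:
    """Decode a UTF-16LE buffer and cut it at the first NUL terminator."""
    return raw.decode("utf-16-le").split("\x00", 1)[0]


def extract_page_text(lib_path: str, pdf_path: str, page_index: int = 0) -> str:
    """Return the text of one page via FPDFText_LoadPage()/FPDFText_GetText()."""
    pdfium = ctypes.CDLL(lib_path)  # lib_path is an assumption, e.g. "pdfium.dll"
    vp, ci = ctypes.c_void_p, ctypes.c_int
    # Declare signatures so 64-bit handles are not truncated to int.
    signatures = {
        "FPDF_InitLibrary": (None, []),
        "FPDF_DestroyLibrary": (None, []),
        "FPDF_LoadDocument": (vp, [ctypes.c_char_p, ctypes.c_char_p]),
        "FPDF_CloseDocument": (None, [vp]),
        "FPDF_LoadPage": (vp, [vp, ci]),
        "FPDF_ClosePage": (None, [vp]),
        "FPDFText_LoadPage": (vp, [vp]),
        "FPDFText_ClosePage": (None, [vp]),
        "FPDFText_CountChars": (ci, [vp]),
        "FPDFText_GetText": (ci, [vp, ci, ci, ctypes.POINTER(ctypes.c_ushort)]),
    }
    for name, (restype, argtypes) in signatures.items():
        func = getattr(pdfium, name)
        func.restype, func.argtypes = restype, argtypes

    # ExitStack releases every handle in reverse order, even on errors.
    with contextlib.ExitStack() as stack:
        pdfium.FPDF_InitLibrary()
        stack.callback(pdfium.FPDF_DestroyLibrary)

        doc = pdfium.FPDF_LoadDocument(os.fsencode(pdf_path), None)
        if not doc:
            raise RuntimeError("FPDF_LoadDocument failed")
        stack.callback(pdfium.FPDF_CloseDocument, doc)

        page = pdfium.FPDF_LoadPage(doc, page_index)
        if not page:
            raise RuntimeError("FPDF_LoadPage failed")
        stack.callback(pdfium.FPDF_ClosePage, page)

        text_page = pdfium.FPDFText_LoadPage(page)
        if not text_page:
            raise RuntimeError("FPDFText_LoadPage failed")
        stack.callback(pdfium.FPDFText_ClosePage, text_page)

        count = pdfium.FPDFText_CountChars(text_page)
        if count < 0:
            raise RuntimeError("FPDFText_CountChars failed")
        # One extra unit for the terminator that FPDFText_GetText appends.
        buf = (ctypes.c_ushort * (count + 1))()
        pdfium.FPDFText_GetText(text_page, 0, count, buf)
        return decode_utf16le(bytes(buf))
```

Calling `extract_page_text("pdfium.dll", "99-names-1.pdf")` and printing the second line of the result should be enough to compare against the expected output below.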


What is the expected output? What do you see instead?
The second line should read:
1 Allah ( الله ) Gud, Gud allena
But the received text is:
1 Allah (ال (Gud, Gud allena

What version of the product are you using? On what operating system?
Windows 10, latest 64-bit binaries from https://github.com/bblanchon/pdfium-binaries

Please provide any additional information below.

Note that the parentheses are also wrong: the closing parenthesis comes out as an opening one. The same thing happens when the PDF is opened in the Chrome browser on Windows and the text is copied out.
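
Because mixed left-to-right and right-to-left text renders differently from its stored order, comparing the expected and received lines by eye is unreliable. A small helper that lists each character in logical order makes the difference explicit. This is only a sketch for inspecting output; `describe` is a hypothetical name, the sample string is illustrative rather than real extracted text, and the bidi class and mirroring columns are shown for context, not as a diagnosis of the bug.

```python
import unicodedata


def describe(text):
    """Return (code point, bidi class, mirrored, name) per character, in logical order."""
    rows = []
    for ch in text:
        rows.append((
            f"U+{ord(ch):04X}",
            unicodedata.bidirectional(ch),     # e.g. "AL" for Arabic letters
            bool(unicodedata.mirrored(ch)),    # True for characters such as "(" and ")"
            unicodedata.name(ch, "<unnamed>"),
        ))
    return rows


# Illustrative input only: an opening parenthesis followed by two Arabic letters.
for row in describe("(\u0627\u0644"):
    print(*row)
```

Running `describe()` on the second line returned by FPDFText_GetText() and on the same line copied from another viewer shows exactly which code points are missing or swapped.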

n… via monorail

Nov 13, 2020, 1:47:52 PM
to pdfiu...@googlegroups.com
Updates:
Status: Available

Comment #1 on issue 1618 by ni...@chromium.org: FPDFText_GetText mangles first arabic character
https://bugs.chromium.org/p/pdfium/issues/detail?id=1618#c1

Confirmed that both Okular (on Linux) and Acrobat (on Windows) extract "لله" and the parentheses "(" ")" correctly.
Chrome's PDF viewer (dev channel 88.0.4315.5) extracts the wrong text across platforms.

n… via monorail

Nov 13, 2020, 5:26:24 PM
to pdfiu...@googlegroups.com

Comment #2 on issue 1618 by ni...@chromium.org: FPDFText_GetText mangles first arabic character
https://bugs.chromium.org/p/pdfium/issues/detail?id=1618#c2

A simplified test case for the single Arabic character "لله " is attached below. Its character code "!" (0x21) maps to the Unicode sequence <0644 0644 0647> in the ToUnicode map. And "لله" 's composition is ا (U+0627) - ل (U+0644) - ل (U+0644) - ه (U+0647). It could be that PDFium is not handling this translation from character code to Unicode gracefully, so that only ل (U+0644) is returned.
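
The mapping described in this comment can be illustrated in isolation. The sketch below assumes the ToUnicode CMap contains a bfchar entry of the form `<21> <064406440647>`; that entry text is reconstructed from the comment, not copied from the attached file. `parse_bfchar`, `decode_full` and `decode_first_unit_only` are hypothetical names, and the last one is only a stand-in for the suspected behaviour, not PDFium's actual code.

```python
import re

# Assumed bfchar entry: character code 0x21 maps to U+0644 U+0644 U+0647.
BFCHAR_LINE = "<21> <064406440647>"


def parse_bfchar(line):
    """Parse one '<src> <dst>' bfchar entry into (char_code, unicode_string)."""
    match = re.fullmatch(r"\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*", line)
    if match is None:
        raise ValueError(f"not a bfchar entry: {line!r}")
    code = int(match.group(1), 16)
    # The destination is a UTF-16BE string and may hold several code units.
    text = bytes.fromhex(match.group(2)).decode("utf-16-be")
    return code, text


def decode_full(mapping, code):
    """Return every code point the ToUnicode entry maps the code to."""
    return mapping[code]


def decode_first_unit_only(mapping, code):
    """Hypothetical faulty decoder: keep only the first mapped character."""
    return mapping[code][:1]


code, text = parse_bfchar(BFCHAR_LINE)
mapping = {code: text}
print([f"U+{ord(c):04X}" for c in decode_full(mapping, 0x21)])
print([f"U+{ord(c):04X}" for c in decode_first_unit_only(mapping, 0x21)])
```

If the text extractor behaves like `decode_first_unit_only`, a single glyph that stands for several letters loses everything after the first one, which would be consistent with only U+0644 being returned.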

As for the parenthesis issue, isolating the parentheses in a simpler test no longer triggers it. We still need to create a minimal test case that reproduces this issue.

Attachments:
bug_1618_letter.pdf 19.6 KB

n… via monorail

Nov 13, 2020, 6:01:14 PM
to pdfiu...@googlegroups.com

Comment #3 on issue 1618 by ni...@chromium.org: FPDFText_GetText mangles first arabic character
https://bugs.chromium.org/p/pdfium/issues/detail?id=1618#c3

A simplified PDF that still reproduces the reversed parenthesis issue is attached below. This test case keeps the strings "Guds 99 namn (ur koranen och hadit", "1" and "Allah" in place, since deleting this text stops the reversed parenthesis issue from triggering.

Attachments:
bug_1618_parenthesis.pdf 50.9 KB