Issue 1525 in pdfium: Extra spaces between characters


gdv1… via monorail

May 8, 2020, 5:23:59 AM
to pdfiu...@googlegroups.com
Status: Unconfirmed
Owner: ----
Labels: Type-Defect Priority-Medium

New issue 1525 by gdv1...@gmail.com: Extra spaces between characters
https://bugs.chromium.org/p/pdfium/issues/detail?id=1525

What steps will reproduce the problem?
1. Open the PDF in the Chrome browser
2. Select the text line
3. Paste it into Notepad

What is the expected output? What do you see instead?
expected: Monday, February 13, 2012
actual: M o n d a y , F e b ru a ry 13, 2 0 1 2

What version of the product are you using? On what operating system?
Windows Chrome browser v81.0.4044.138

Please provide any additional information below.
We are trying to extract text from a PDF document, but the extracted text contains extra spaces between characters.
PDFium returns a CharBox with zero height and width for the extra characters.
I do not have the customer's permission to post the PDF file publicly, but I can send it by email.
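
A possible client-side workaround, sketched in Python under stated assumptions: the CharBox tuple and the drop_zero_size_spaces helper below are hypothetical names, and the sample boxes are made up for illustration. The idea is only to skip space characters whose box has zero width and height. Note that this could also drop legitimate word spaces if those are reported with zero-size boxes as well.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    """Hypothetical record: one extracted character and its bounding box."""
    char: str
    left: float
    right: float
    bottom: float
    top: float


def drop_zero_size_spaces(boxes: List[CharBox]) -> str:
    """Join characters, skipping spaces whose box has no width and no height."""
    kept = []
    for box in boxes:
        is_zero_size = (box.right - box.left) == 0 and (box.top - box.bottom) == 0
        if box.char == " " and is_zero_size:
            continue  # assumed to be an extra space, not a real glyph
        kept.append(box.char)
    return "".join(kept)


# Made-up sample: one zero-size space and one space with a real box.
sample = [
    CharBox("M", 10.0, 18.0, 0.0, 10.0),
    CharBox(" ", 18.0, 18.0, 0.0, 0.0),   # zero-size box: dropped
    CharBox("o", 19.0, 24.0, 0.0, 10.0),
    CharBox(" ", 24.0, 27.0, 0.0, 10.0),  # real space glyph: kept
    CharBox("n", 27.0, 32.0, 0.0, 10.0),
]
print(drop_zero_size_spaces(sample))  # Mo n
```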

n… via monorail

May 8, 2020, 12:55:27 PM
to pdfiu...@googlegroups.com
Updates:
Labels: Needs-Feedback

Comment #1 on issue 1525 by ni...@chromium.org: Extra spaces between characters
https://bugs.chromium.org/p/pdfium/issues/detail?id=1525#c1

@Reporter:
Is there any way that you can remove the sensitive information and only leave the date text line in the PDF file?

n… via monorail

May 8, 2020, 1:15:11 PM
to pdfiu...@googlegroups.com

Comment #2 on issue 1525 by ni...@chromium.org: Extra spaces between characters
https://bugs.chromium.org/p/pdfium/issues/detail?id=1525#c2

@Reporter:
Two ways to get a test case for this issue:
1. If you have any PDF editing tools, you can try to remove all the sensitive information from the PDF and leave only the date text on a PDF page. Then check with the customer whether it's OK to upload this new PDF to the bug tracker.
2. Send the PDF to this Chromium email address, since you have permission to do so. We can then try to create a new test case based on it, without any sensitive information included.

n… via monorail

May 12, 2020, 12:35:15 PM
to pdfiu...@googlegroups.com
Updates:
Labels: -Needs-Feedback
Owner: ni...@chromium.org
Status: Assigned

Comment #3 on issue 1525 by ni...@chromium.org: Extra spaces between characters
https://bugs.chromium.org/p/pdfium/issues/detail?id=1525#c3

Got the permission to upload a simplified test case: test.pdf

This issue also occurs when extracting text in Acrobat and Okular, although the extracted characters differ between viewers.
Acrobat: "M on d a y , F e b ru a ry 13, 2012"
Okular: "M o n d a y , F e b ru a ry 13, 2 0 1 2"
PDF viewer: "M o n d a y , F e b ru a ry 13, 2 0 1 2"

Although Okular and PDF viewer produce the same extraction result for this date text, that does not mean theirs is the best result. In general, Acrobat's text extraction for this specific font yields far fewer spaces than the PDF viewer's.

There are more characters which are extracted with discrepancies in the original PDF file. We will create a test case providing more test coverage on this issue based on test.pdf.

Attachments:
test.pdf 2.7 KB

ivars… via monorail

Oct 26, 2020, 3:46:37 AM
to pdfiu...@googlegroups.com

Comment #4 on issue 1525 by ivars...@gmail.com: Extra spaces between characters
https://bugs.chromium.org/p/pdfium/issues/detail?id=1525#c4

There is a character mapping that maps the characters, in a seemingly random order, to offsets starting at 01.
The file spaces the characters individually in the TJ command with negative kerning.
It has a Widths modifier for the used font.
Speculation: Could it be that the spaces occur because of rounding differences between the characters, making the detection of spaces inconclusive?
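
To illustrate the speculation, here is a minimal Python sketch. It is not PDFium's actual algorithm: the extract_text function, the threshold of 200, and the sample numbers are all assumptions made up for this example. It relies only on the TJ convention that numbers in the array are in thousandths of a text space unit and are subtracted from the current position, so a negative number widens the gap before the next glyph. A gap sitting close to the threshold can flip between "space" and "no space" after rounding.

```python
from typing import List, Union

TJItem = Union[str, int, float]


def extract_text(tj_array: List[TJItem], space_threshold: float = 200) -> str:
    """Hypothetical heuristic: emit a space when a TJ adjustment widens
    the gap by at least space_threshold (thousandths of a text unit)."""
    out: List[str] = []
    for item in tj_array:
        if isinstance(item, str):
            out.append(item)  # glyph run, copied as-is
        elif out and -item >= space_threshold:
            out.append(" ")   # negative adjustment = wider gap
    return "".join(out)


# Made-up adjustments that sit right at the assumed threshold.
exact = ["M", -199.6, "o", -200.4, "n"]
rounded = [x if isinstance(x, str) else round(x) for x in exact]

print(extract_text(exact))    # Mo n
print(extract_text(rounded))  # M o n
```

With the unrounded values only the second gap counts as a space. After rounding both gaps reach the threshold, so an extra space appears between "M" and "o".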