Issue 1609 in pdfium: Missing spaces from text

gipet… via monorail

Oct 31, 2020, 6:20:06 AM
to pdfiu...@googlegroups.com
Status: Unconfirmed
Owner: ----
Labels: Type-Defect Priority-Medium

New issue 1609 by gipet...@gmail.com: Missing spaces from text
https://bugs.chromium.org/p/pdfium/issues/detail?id=1609

Check issue #1597 which was prematurely closed.
Looking at CPDF_TextPage::GetTextByPredicate and its use of IsContainPreChar: does this mean that only one space is allowed between letters in the text? Does this need to be changed to allow multiple spaces in order to get the correct text?

n… via monorail

Nov 11, 2020, 9:18:17 PM
to pdfiu...@googlegroups.com

Comment #1 on issue 1609 by ni...@chromium.org: Missing spaces from text
https://bugs.chromium.org/p/pdfium/issues/detail?id=1609#c1

That's correct. Currently we insert only one space when there is a gap between characters but no actual " " character.
Different PDF viewers might use a space or "\n" to separate text, but what they have in common is that when there is a large gap between characters within the same text object, a single " " or "\n" is used to separate them.

If you want to file a feature request to increase the number of spaces in the extracted text depending on the distance between characters, it will be hard to find a one-size-fits-all requirement that meets every user's needs. (Imagine a really wide PDF with a huge distance between the characters "a" and "b": if you copy and paste it into a fairly narrow text reader, you might find "b" several lines below "a". Another common use case is copying the text into a search bar with limited width.)

n… via monorail

Nov 11, 2020, 9:27:06 PM
to pdfiu...@googlegroups.com
Updates:
Labels: -Type-Defect Type-Enhancement

Comment #2 on issue 1609 by ni...@chromium.org: Missing spaces from text
https://bugs.chromium.org/p/pdfium/issues/detail?id=1609#c2

Yes, regardless of how wide the distance between characters is, a single space is used to separate the characters in the extracted text.

Different PDF viewers might choose " " or "\n" to separate extracted characters when there is a huge distance between them, and it is common that only a single " " or "\n" is used, regardless of how big the distance is.

If you want to file a feature request to increase the number of spaces between two characters in the extracted text depending on the distance between them, it will be hard to find a one-size-fits-all solution that properly defines the feature on questions such as how many pixels each added space should represent. Imagine copying text from a very wide PDF into a narrow text editor: if two characters are at opposite ends of a line, the copied text might end up on different lines in the text editor. Another use case would be copying a long text, like the example you attached in issue #1587, and pasting it into a search bar with limited width. You won't be able to see the whole text in the search bar, because the spaces take up much of the room.
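
To make the trade-off concrete, here is a minimal C++ sketch of the two policies discussed above. It is not the code in CPDF_TextPage: `PlacedChar`, `gap_threshold`, and `units_per_space` are hypothetical names chosen for illustration, and the input is assumed to be a single line of characters already in reading order.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record (not a PDFium type): one character and the horizontal
// extent of its box in page units.
struct PlacedChar {
  char ch;
  double left;
  double right;
};

// Single-space policy: any gap wider than `gap_threshold` yields exactly one
// synthetic space, whether the gap is 5 units or 500.
std::string ExtractWithSingleSpaces(const std::vector<PlacedChar>& chars,
                                    double gap_threshold) {
  assert(gap_threshold >= 0.0);
  std::string out;
  for (std::size_t i = 0; i < chars.size(); ++i) {
    if (i > 0) {
      const double gap = chars[i].left - chars[i - 1].right;
      const bool prev_is_space = !out.empty() && out.back() == ' ';
      // Do not add a synthetic space next to a real one.
      if (gap > gap_threshold && chars[i].ch != ' ' && !prev_is_space) {
        out.push_back(' ');
      }
    }
    out.push_back(chars[i].ch);
  }
  return out;
}

// Proportional policy: one space per `units_per_space` of gap, at least one.
// The result depends entirely on the arbitrary `units_per_space` value.
std::string ExtractWithProportionalSpaces(const std::vector<PlacedChar>& chars,
                                          double gap_threshold,
                                          double units_per_space) {
  assert(gap_threshold >= 0.0);
  assert(units_per_space > 0.0);
  std::string out;
  for (std::size_t i = 0; i < chars.size(); ++i) {
    if (i > 0) {
      const double gap = chars[i].left - chars[i - 1].right;
      if (gap > gap_threshold) {
        const auto count = static_cast<std::size_t>(gap / units_per_space);
        out.append(count > 0 ? count : 1, ' ');
      }
    }
    out.push_back(chars[i].ch);
  }
  return out;
}
```

With the same input, changing `units_per_space` from 100 to 10 multiplies the number of inserted spaces roughly tenfold, which is the "how many pixels per space" question that has no single right answer.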

gipet… via monorail

Nov 12, 2020, 7:07:16 AM
to pdfiu...@googlegroups.com

Comment #3 on issue 1609 by gipet...@gmail.com: Missing spaces from text
https://bugs.chromium.org/p/pdfium/issues/detail?id=1609#c3

Thanks for the information. What I am trying to achieve is to render the PDF using my own graphics library, not to search the text. So, how does Chrome show the correct spacing for the letters, compared to my screenshot in issue #1597? Does it use a different API than FPDFTextObj_GetText? Does it render the characters one by one?

n… via monorail

Nov 12, 2020, 1:17:31 PM
to pdfiu...@googlegroups.com

Comment #4 on issue 1609 by ni...@chromium.org: Missing spaces from text
https://bugs.chromium.org/p/pdfium/issues/detail?id=1609#c4

1. FPDFTextObj_GetText is for text extraction only; it should not be involved in the rendering process.

2. Chrome's rendering result is attached below and shows a large distance between these pieces of text. (Other PDF viewers, such as Acrobat and Okular, also show the large distance between the texts, unlike the screenshot you attached in issue #1597.)

3. Chrome uses the PDFium API FPDF_RenderPage_* to render pages. If you want to see how characters are rendered individually, you can look into the rendering process using pdfium_test (built as part of PDFium): run pdfium_test --png /PATH_TO_PDF/yourfile.pdf.
Characters are rendered one by one in PDFium, since PDFium finds the glyph for each character and reads each character's position from the PDF (that's how you get the large gap). A key function to look into is GetCharPosList(), which calculates each character's rendering position.
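
As a rough illustration of why per-character positioning produces the large gap, here is a minimal C++ sketch of the horizontal text-positioning model from the PDF specification (glyph advance, TJ adjustment, character spacing, word spacing, horizontal scaling). It is not PDFium's GetCharPosList(): the `Glyph` and `TextState` types and the `ComputeGlyphOrigins` function are hypothetical, and the sketch assumes a simple horizontal font with single-byte codes.

```cpp
#include <cassert>
#include <vector>

// Hypothetical glyph record (not a PDFium type).
struct Glyph {
  char ch;
  double width_1000;      // glyph advance, in thousandths of a text-space unit
  double tj_adjust_1000;  // TJ array number following this glyph; it is
                          // subtracted, so a negative value widens the gap
};

// Hypothetical subset of the PDF text state.
struct TextState {
  double font_size;         // Tfs
  double char_spacing;      // Tc
  double word_spacing;      // Tw, applied to the space character only
  double horizontal_scale;  // Th, where 1.0 means 100%
};

// Returns the x origin of every glyph, relative to the start of the string.
// Each glyph is placed individually, so a renderer never needs the extracted
// text (and its single spaces) to reproduce the gaps.
std::vector<double> ComputeGlyphOrigins(const std::vector<Glyph>& glyphs,
                                        const TextState& ts) {
  assert(ts.font_size > 0.0);
  std::vector<double> origins;
  origins.reserve(glyphs.size());
  double x = 0.0;
  for (const Glyph& g : glyphs) {
    origins.push_back(x);
    const double word = (g.ch == ' ') ? ts.word_spacing : 0.0;
    // Displacement: ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th, with w0 = width/1000.
    x += ((g.width_1000 - g.tj_adjust_1000) / 1000.0 * ts.font_size +
          ts.char_spacing + word) *
         ts.horizontal_scale;
  }
  return origins;
}
```

For example, with a 10-unit font and three glyphs of width 500, a TJ adjustment of -20000 after the second glyph places the third glyph 205 units after the second, with no space character involved. Text extraction would report that gap as a single space, while a renderer that draws each glyph at its computed origin shows the full distance.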

n… via monorail

Nov 12, 2020, 1:20:37 PM
to pdfiu...@googlegroups.com

Comment #5 on issue 1609 by ni...@chromium.org: Missing spaces from text
https://bugs.chromium.org/p/pdfium/issues/detail?id=1609#c5

Chrome's rendering result is attached here.

Attachments:
Screenshot 2020-11-12 at 10.18.53 AM - Display 1.png 54.2 KB