Issue 301 in pdfium: FPDFText_GetText - the extracted text is returned with a very weird structure

prumyantsev@gmail.com via Monorail

Dec 5, 2015, 9:26:47 PM
to pdfiu...@googlegroups.com
Status: New
Owner: ----
Labels: Type-Defect Priority-Medium

New issue 301 by prumyant...@gmail.com: FPDFText_GetText - the extracted
text is returned with a very weird structure
https://bugs.chromium.org/p/pdfium/issues/detail?id=301

I'm using FPDFText_GetText in order to extract text from a pdf file and in
that particular file the extracted text is returned with a very weird
structure. If I open the same pdf using, for example, Acrobat Reader and
then select and copy the text to the clipboard the structure is a
more "normal" one.

Attached are the pdf file and the resulting text from the GetText operation.



Attachments:
61958969.pdf 210 KB
GetText.txt 9.7 KB

thestig@chromium.org via Monorail

Dec 31, 2015, 6:20:56 PM
to pdfiu...@googlegroups.com
Updates:
Cc: jun_f...@foxitsoftware.com

Comment #1 on issue 301 by thes...@chromium.org: FPDFText_GetText - the
extracted text is returned with a very weird structure
https://bugs.chromium.org/p/pdfium/issues/detail?id=301#c1

If you uncompress the PDF and look inside, most of the text in the PDF file
is written out one character at a time. So FPDFText_GetText() is accurately
giving you back the text in the PDF.

Working as intended?

farhad.khalafi@gmail.com via Monorail

Jun 11, 2016, 10:32:49 AM
to pdfiu...@googlegroups.com

Comment #2 on issue 301 by farhad.k...@gmail.com: FPDFText_GetText - the extracted text is returned with a very weird structure
https://bugs.chromium.org/p/pdfium/issues/detail?id=301#c2

I have experienced a similar issue. My attached file is much simpler. The order of characters in the PDF file is different from their reading order. In my case, the lines in the file are extracted starting from the last line on the page. The character indices start from the first character on the last line, follow that line, then jump to the first character on the line before the last, and so on.

Your case is a lot more complicated. You have multiple columns and the characters are placed in what appears as a random order on the page.

I am not sure whether Pdfium attempts to sort characters using their boundaries and/or attempts to recognize columns and the reading order. The primitives are there to do your own text analysis, but it is a lot of work.
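
For illustration, here is a minimal Python sketch of the simplest form of such an analysis: grouping characters into lines by vertical position and emitting the lines top to bottom. It assumes the character boxes have already been read out of the page (for example with FPDFText_CountChars, FPDFText_GetUnicode and FPDFText_GetCharBox). The Char record, the make_line helper and the tolerance value are hypothetical, and this is not how Pdfium itself orders text.

```python
from collections import namedtuple

# Hypothetical record: one character and its bounding box in PDF page
# coordinates (origin at the bottom-left corner, y grows upward).
Char = namedtuple("Char", "text left bottom right top")


def group_into_lines(chars, tolerance=2.0):
    """Cluster characters whose bottom edge is within `tolerance` units
    of the bottom edge of the first character of the current line."""
    lines = []
    # Visit characters from the top of the page to the bottom.
    for ch in sorted(chars, key=lambda c: -c.bottom):
        if lines and abs(lines[-1][0].bottom - ch.bottom) <= tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    return lines


def lines_to_text(chars, tolerance=2.0):
    """Return the text with lines top to bottom and characters left to right."""
    lines = group_into_lines(chars, tolerance)
    return "\n".join(
        "".join(c.text for c in sorted(line, key=lambda c: c.left))
        for line in lines
    )


def make_line(text, bottom, size=10.0):
    """Build made-up character boxes for one line of text (test data only)."""
    width = size * 0.6
    return [
        Char(t, i * width, bottom, (i + 1) * width, bottom + size)
        for i, t in enumerate(text)
    ]


# Characters stored with the bottom line first, as described above.
page = make_line("second", 100.0) + make_line("first", 120.0)
print(lines_to_text(page))
```

This only handles a single column of horizontal text; multiple columns, rotated text and mixed font sizes need considerably more than a sort.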

Hope the good people at Pdfium will address this.

Cheers!

Attachments:
weblinks.pdf 1.1 KB
weblinks.txt 362 bytes

farhad.k…@gmail.com via Monorail

Jul 7, 2016, 12:49:35 AM
to pdfiu...@googlegroups.com

Comment #3 on issue 301 by farhad.k...@gmail.com: FPDFText_GetText - the extracted text is returned with a very weird structure
https://bugs.chromium.org/p/pdfium/issues/detail?id=301#c3

The problem you have encountered relates to automatic segmentation and discovery of the reading order in a class of PDF documents often found in scientific publications and especially consumer magazines. Without proper identification of the document structure (columns, headers, footnotes, tables, graphs, equations and so on), it is difficult to come up with a method to discover the proper reading order in such documents.

I did some research (there is no shortage of it on this subject, especially in recent years) and built a utility using Pdfium to investigate the various algorithms.

I started by treating a PDF text page as a bag of characters with only positional and font info, similar to what you get from an OCR engine.

From the character collection, I detected text line fragments, combined them into text blocks and assigned a reading order using spatial ordering and some heuristics.
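
To make the idea concrete, here is a rough Python sketch of the last two steps (fragments to blocks, blocks to reading order). It is not the utility described here; the Fragment record, the greedy merge rule, and the max_gap and column_width values are all assumptions chosen for illustration.

```python
from collections import namedtuple

# Hypothetical record: a text line fragment with its bounding box in PDF
# page coordinates (origin bottom-left, y grows upward).
Fragment = namedtuple("Fragment", "text left bottom right top")


def overlaps_horizontally(a, b):
    """True when the x-ranges of two fragments share some width."""
    return min(a.right, b.right) - max(a.left, b.left) > 0


def build_blocks(fragments, max_gap=6.0):
    """Greedy merge: a fragment joins a block when it overlaps the block's
    last fragment horizontally and sits directly below it."""
    blocks = []
    for frag in sorted(fragments, key=lambda f: -f.top):
        for block in blocks:
            last = block[-1]
            gap = last.bottom - frag.top  # vertical space between the two
            if overlaps_horizontally(last, frag) and -1.0 <= gap <= max_gap:
                block.append(frag)
                break
        else:
            blocks.append([frag])  # no block accepted it: start a new one
    return blocks


def reading_order(blocks, column_width=100.0):
    """Order blocks column by column: coarse left position first, then
    top to bottom inside each column bucket."""
    def key(block):
        left = min(f.left for f in block)
        top = max(f.top for f in block)
        return (int(left // column_width), -top)
    return sorted(blocks, key=key)


def page_text(fragments):
    """Return one string per block, in the estimated reading order."""
    ordered = reading_order(build_blocks(fragments))
    return [" ".join(f.text for f in block) for block in ordered]


# Two columns of two lines each, supplied in scrambled order.
scrambled = [
    Fragment("Right 2", 300.0, 678.0, 500.0, 688.0),
    Fragment("Left 1", 50.0, 690.0, 250.0, 700.0),
    Fragment("Right 1", 300.0, 690.0, 500.0, 700.0),
    Fragment("Left 2", 50.0, 678.0, 250.0, 688.0),
]
print(page_text(scrambled))
```

The fixed column bucket is the weakest part of this sketch. Real pages with headers spanning columns, tables or sidebars need the kind of heuristics mentioned above.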

I applied the methodology to your sample page and have attached the result in a text file. I have a favor to ask. Since I don't know Portuguese, I would appreciate it if you could take a look at the text file and let me know where the algorithm has made mistakes (even minor ones).

I have also attached a screenshot of the detected reading order for this sample page.

Thanks!

Attachments:
ReadingOrder.png 437 KB
61958969.txt 4.2 KB

dsincl… via monorail

Sep 30, 2016, 7:45:00 PM
to pdfiu...@googlegroups.com
Updates:
Cc: n...@chromium.org

Comment #4 on issue 301 by dsin...@chromium.org: FPDFText_GetText - the extracted text is returned with a very weird structure
https://bugs.chromium.org/p/pdfium/issues/detail?id=301#c4

(No comment was entered for this change.)

wiebre… via monorail

Mar 29, 2017, 11:02:14 AM
to pdfiu...@googlegroups.com

Comment #5 on issue 301 by wiebre...@gmail.com: FPDFText_GetText - the extracted text is returned with a very weird structure
https://bugs.chromium.org/p/pdfium/issues/detail?id=301#c5

As a workaround, I tried loading the PDF in the PDF viewer of pdfium, selecting all the text, and then using that.
Then I noticed that it came out the same weird way.
If you open the PDF and select the text manually, you'll notice that the selection jumps around instead of going from top to bottom and left to right.
It almost seems as if there is a tab index on the PDF document that controls the order in which the elements come out, and that the document was edited along the way so that the tab index is now all over the document.
Even the X and Y positions don't make much sense, and some elements have 0, 0 as their position. The position of the text must be stored somewhere, since it renders fine.

farhad.k… via monorail

Mar 29, 2017, 11:39:31 AM
to pdfiu...@googlegroups.com

Comment #6 on issue 301 by farhad.k...@gmail.com: FPDFText_GetText - the extracted text is returned with a very weird structure
https://bugs.chromium.org/p/pdfium/issues/detail?id=301#c6

I wrote a Pdfium-based viewer that handles the text layout problem for documents where the character order differs greatly from the intended reading order. This is a free program available from www.pdfgold.com (3.2MB download). After you load the document, press F12 to toggle the (undocumented) text layout mode, which displays a toolbar at the bottom of each page. This visually shows how characters, words, text line segments, text blocks and the reading order are computed. The program still needs more work/debugging. Please let me know if you see any bugs.

dsincl… via monorail

Mar 29, 2017, 12:20:15 PM
to pdfiu...@googlegroups.com
Updates:
Cc: -jun_f...@foxitsoftware.com
Labels: -Priority-Medium Priority-Low
Owner: dsin...@chromium.org
Status: Accepted

Comment #7 on issue 301 by dsin...@chromium.org: FPDFText_GetText - the extracted text is returned with a very weird structure
https://bugs.chromium.org/p/pdfium/issues/detail?id=301#c7

I've been looking into the text extraction code a bit; hopefully we can fix it up. Marking the issue as low priority as I'm not sure when I'll get to it.

Any resources or references for the research mentioned in comment #3 would be greatly appreciated.

dsincl… via monorail

Apr 3, 2017, 11:47:50 AM
to pdfiu...@googlegroups.com
Issue 301: FPDFText_GetText - the extracted text is returned with a very weird structure
https://bugs.chromium.org/p/pdfium/issues/detail?id=301

This issue is now blocking issue 647.
See https://bugs.chromium.org/p/pdfium/issues/detail?id=647

dsincl… via monorail

Apr 3, 2017, 11:48:10 AM
to pdfiu...@googlegroups.com
Issue 301: FPDFText_GetText - the extracted text is returned with a very weird structure
https://bugs.chromium.org/p/pdfium/issues/detail?id=301

This issue is now blocking issue 199.
See https://bugs.chromium.org/p/pdfium/issues/detail?id=199

dsincl… via monorail

Apr 3, 2017, 11:48:39 AM
to pdfiu...@googlegroups.com
Issue 301: FPDFText_GetText - the extracted text is returned with a very weird structure
https://bugs.chromium.org/p/pdfium/issues/detail?id=301

This issue is now blocking issue 521.
See https://bugs.chromium.org/p/pdfium/issues/detail?id=521

rttsoftw… via monorail

Jan 23, 2020, 9:58:27 AM
to pdfiu...@googlegroups.com

Comment #12 on issue 301 by rttsoftw...@gmail.com: FPDFText_GetText - the extracted text is returned with a very weird structure
https://bugs.chromium.org/p/pdfium/issues/detail?id=301#c12

Can't the ideas, or even code, from the Tesseract OCR project be used to fix this issue? E.g. Hybrid Page Layout Analysis via Tab-Stop Detection ( https://research.google/pubs/pub35094/ )
The problem here seems to be much simpler, as we already have the coordinates of the text objects/characters.

Features such as read out loud or text reflow are unreliable given the way PDFium currently returns text. :-(

thes… via monorail

Jan 23, 2020, 1:05:08 PM
to pdfiu...@googlegroups.com

Comment #14 on issue 301 by the...@chromium.org: FPDFText_GetText - the extracted text is returned with a very weird structure
https://bugs.chromium.org/p/pdfium/issues/detail?id=301#c14

re: comment 12 - One can certainly separately use OCR software or other means to scan the rendered PDF and extract information that way.

rttsoftw… via monorail

Jan 23, 2020, 7:53:58 PM
to pdfiu...@googlegroups.com

Comment #15 on issue 301 by rttsoftw...@gmail.com: FPDFText_GetText - the extracted text is returned with a very weird structure
https://bugs.chromium.org/p/pdfium/issues/detail?id=301#c15

I'm referring to the techniques used to join the character blobs into words, lines and columns and to return the resulting text in reading order, as comment #3 explains. The Tesseract-OCR project has code (https://github.com/tesseract-ocr/tesseract/tree/master/src/textord) to do this, and comment #7 is asking for references...

In our case the "blobs" are already characters, with position and font information, so the OCR step is not needed.
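
As a small illustration of the first of those steps (characters to words), here is a Python sketch that needs no OCR at all. It assumes each character of one text line is available with its horizontal extent and font size; the Glyph record and the gap_ratio threshold are hypothetical, and this is not Tesseract's or PDFium's algorithm.

```python
from collections import namedtuple

# Hypothetical record: one character of a single text line, with its
# horizontal extent and font size in page units.
Glyph = namedtuple("Glyph", "text left right font_size")


def join_into_words(glyphs, gap_ratio=0.25):
    """Split one line into words wherever the horizontal gap between
    neighbouring characters exceeds gap_ratio * font size."""
    words = []
    current = []
    prev = None
    # The stored order is not trusted: sort by horizontal position first.
    for g in sorted(glyphs, key=lambda g: g.left):
        if prev is not None and g.left - prev.right > gap_ratio * prev.font_size:
            words.append("".join(current))
            current = []
        current.append(g.text)
        prev = g
    if current:
        words.append("".join(current))
    return words


# "ab cd" written one character at a time, in scrambled order.
scrambled = [
    Glyph("c", 14.0, 19.0, 10.0),
    Glyph("a", 0.0, 5.0, 10.0),
    Glyph("d", 19.0, 24.0, 10.0),
    Glyph("b", 5.0, 10.0, 10.0),
]
print(join_into_words(scrambled))  # prints ['ab', 'cd']
```

The threshold would need tuning per font and does not account for kerning or justified text, which is where the more elaborate techniques come in.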