Comment #3 on issue 301 by
farhad.k...@gmail.com: FPDFText_GetText - the extracted text is returned with a very weird structure
https://bugs.chromium.org/p/pdfium/issues/detail?id=301#c3
The problem you have encountered relates to automatic segmentation and discovery of the reading order in a class of PDF documents often found in scientific publications and especially consumer magazines. Without proper identification of the document structure (columns, headers, footnotes, tables, graphs, equations and so on), it is difficult to come up with a method that discovers the proper reading order in such documents.
I did some research (there is no shortage of it on this subject, especially in recent years) and built a utility using PDFium to investigate the various algorithms.
I started by treating a PDF text page as a bag of characters with only positional and font info, similar to what you get from an OCR engine.
From the character collection, I detected text line fragments, combined them into text blocks and assigned a reading order using spatial ordering and some heuristics.
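For illustration only, the three steps above (line fragments, text blocks, reading order) could be sketched roughly as follows. This is not the code of the utility: the `Box` type, the function names, and every threshold are assumptions made up for the example, and the ordering rule (bucket blocks by left edge, then go top to bottom) is just one crude spatial heuristic. Coordinates follow the PDF convention of a bottom-left origin with y growing upward.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Hypothetical axis-aligned box in page space (origin bottom-left)."""
    text: str
    left: float
    bottom: float
    right: float
    top: float


def merge_chars(chars: List[Box], space_gap: float) -> Box:
    """Join left-to-right sorted characters, adding a space at wide gaps."""
    text = chars[0].text
    for prev, cur in zip(chars, chars[1:]):
        if cur.left - prev.right > space_gap:
            text += " "
        text += cur.text
    return Box(text, chars[0].left, min(c.bottom for c in chars),
               chars[-1].right, max(c.top for c in chars))


def line_fragments(chars: List[Box], y_tol: float = 2.0,
                   space_gap: float = 3.0, split_gap: float = 15.0) -> List[Box]:
    """Group characters by baseline, then split each row at large horizontal gaps."""
    rows: List[List[Box]] = []
    for c in sorted(chars, key=lambda c: -c.bottom):
        if rows and abs(rows[-1][0].bottom - c.bottom) <= y_tol:
            rows[-1].append(c)
        else:
            rows.append([c])
    fragments = []
    for row in rows:
        row.sort(key=lambda c: c.left)
        run = [row[0]]
        for c in row[1:]:
            if c.left - run[-1].right > split_gap:  # likely a column gutter
                fragments.append(merge_chars(run, space_gap))
                run = [c]
            else:
                run.append(c)
        fragments.append(merge_chars(run, space_gap))
    return fragments


def text_blocks(fragments: List[Box], line_gap: float = 6.0) -> List[Box]:
    """Stack fragments into blocks when they overlap horizontally and sit close vertically."""
    blocks: List[Box] = []
    for f in sorted(fragments, key=lambda f: -f.top):
        for b in blocks:
            overlaps = f.left < b.right and b.left < f.right
            if overlaps and abs(b.bottom - f.top) <= line_gap:
                b.text += "\n" + f.text
                b.left, b.right = min(b.left, f.left), max(b.right, f.right)
                b.bottom = f.bottom
                break
        else:
            blocks.append(Box(f.text, f.left, f.bottom, f.right, f.top))
    return blocks


def reading_order(blocks: List[Box], column_tol: float = 20.0) -> List[Box]:
    """Crude heuristic: bucket blocks by left edge (columns), then top to bottom."""
    return sorted(blocks, key=lambda b: (b.left // column_tol, -b.top))


def chars_of(text: str, x: float, y: float,
             width: float = 5.0, height: float = 10.0) -> List[Box]:
    """Test helper: fixed-width character boxes for a string; spaces leave a gap."""
    return [Box(ch, x + i * width, y, x + (i + 1) * width, y + height)
            for i, ch in enumerate(text) if ch != " "]
```

Fed a two-column page whose characters arrive in arbitrary order, `reading_order(text_blocks(line_fragments(chars)))` returns the left column as one block followed by the right column, instead of interleaving the two columns line by line. A real page needs more than this: headers spanning several columns, footnotes, tables and mixed font sizes all defeat fixed thresholds.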
I applied the methodology to your sample page and have attached the result in a text file. I have a favor to ask: since I don't know Portuguese, I would appreciate it if you could take a look at the text file and let me know where the algorithm has made mistakes (even minor ones).
I have also attached a screenshot of the detected reading order for this sample page.
Thanks!
Attachments:
ReadingOrder.png 437 KB
61958969.txt 4.2 KB