Support
Mar 3, 2009, 9:49:51 PM
to PDFTron PDFNet SDK
Q: It sometimes appears that the method TextExtractor.GetAsText()
returns text in the order of the internal physical structure of the
PDF document, not in the order in which the document reads when it is
displayed or printed. I need to be able to extract the text in
"reading order" instead of PDF "layout order". Is there a way for me
to do this?
------
A: TextExtractor.GetAsText() does not return text as it is stored in
the internal physical structure of the PDF document. Instead, this
method attempts to reconstruct the "reading order". Unfortunately,
this is an inexact, error-prone process. For many PDF documents the
method returns the correct reading order; however, there will always
be some files (especially those with multi-column or scattered text)
for which the reconstructed reading order is incorrect. If you send us
a sample file (to support at pdftron), we will take a look at it and
try to improve the text recognition algorithm.
Please keep in mind that using TextExtractor you can also access text
flows, blocks, lines, and words (along with their positioning and
styling information). You can use this information to build your own
text reading order.
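
As a rough illustration of building your own reading order from word
positions, here is a minimal Python sketch. It does not call PDFNet at
all: the Word class and the functions split_columns, split_lines, and
reading_order are hypothetical names made up for this example, and it
assumes you have already copied each word's text and bounding box out
of TextExtractor into plain Python objects. It also assumes PDF-style
coordinates (y grows upward) and a page laid out in clean vertical
columns separated by a visible gap. A heading that spans several
columns, or scattered text, would defeat this simple approach.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical container: one word plus its bounding box."""
    text: str
    x1: float  # left
    y1: float  # bottom (assumed: y grows upward, as in PDF user space)
    x2: float  # right
    y2: float  # top


def split_columns(words, column_gap=20.0):
    """Group words into columns, left to right.

    A new column starts when a word begins more than `column_gap`
    to the right of everything seen so far in the current column.
    """
    columns = []
    right_edge = None
    for w in sorted(words, key=lambda w: w.x1):
        if right_edge is None or w.x1 - right_edge > column_gap:
            columns.append([])
            right_edge = w.x2
        else:
            right_edge = max(right_edge, w.x2)
        columns[-1].append(w)
    return columns


def split_lines(column, line_tol=3.0):
    """Group one column's words into lines, top to bottom."""
    lines = []
    for w in sorted(column, key=lambda w: -w.y1):
        # Same line if the baseline is close to the line's first word.
        if lines and abs(lines[-1][0].y1 - w.y1) <= line_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    # Order the words inside each line from left to right.
    return [sorted(line, key=lambda w: w.x1) for line in lines]


def reading_order(words, column_gap=20.0, line_tol=3.0):
    """Return text read column by column, then line by line."""
    out = []
    for column in split_columns(words, column_gap):
        for line in split_lines(column, line_tol):
            out.append(" ".join(w.text for w in line))
    return "\n".join(out)


if __name__ == "__main__":
    # Two columns, two lines each, deliberately given out of order.
    sample = [
        Word("two", 340, 680, 360, 690),
        Word("Left", 50, 700, 80, 710),
        Word("Right", 300, 680, 335, 690),
        Word("one", 85, 700, 105, 710),
        Word("Right", 300, 700, 335, 710),
        Word("two", 85, 680, 105, 690),
        Word("one", 340, 700, 360, 710),
        Word("Left", 50, 680, 80, 690),
    ]
    print(reading_order(sample))
```

The two thresholds (column_gap and line_tol) are guesses in page
units and would need tuning per document. Splitting into columns
first and into lines second is what keeps the left column from being
interleaved with the right one.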