Command Line Tool for Extracting Text Coordinates in PDF


Support

May 5, 2008, 2:59:01 PM
to PDF2Text
Q: We are trying to extract coordinates from pdfs based on a text
search string. We would prefer to keep you as our sole vendor for
this sort of thing. Do you have a product that would fit the bill?

--------
A: One option is the PDF2Text command-line utility (for Windows, Linux, or
Mac). You can download it from the downloads page
(http://www.pdftron.com/downloads.html#PTCMD). PDF2Text can extract text
word by word, or as XML output that can include both positioning and
styling information.

If you find that PDF2Text doesn't fit the bill, you can also use the
TextExtractor class in PDFNet SDK (http://www.pdftron.com/net) to
implement the required functionality (PDF2Text itself is a simple
utility developed using the PDFNet SDK API). As a starting point, you may
want to take a look at the TextExtract sample project
(www.pdftron.com/net/samplecode.html#TextExtract).

Support

May 12, 2008, 8:07:01 PM
to PDF2Text
Q: You mention that in case PDF2Text does not fit the bill, you have
another product. Can you tell me the known limitations that might keep
PDF2Text from working? We have several PDFs that we process every day and
would like to know the limitations and dependencies ahead of time.

----
A: The main difference is that PDF2Text is a simple-to-use command-line
application, whereas PDFNet SDK is a software development kit (SDK).
The advantage of PDF2Text is that you don’t need to be a developer to
use it; however, it does not have all of the features that are
available in PDFNet SDK (http://www.pdftron.com/net). PDF2Text itself
is built using the PDFNet SDK API.

When it comes to text extraction from PDF, there are several things to
keep in mind. Most PDF documents do not store logical structure.
Logical structure is the meta-information that groups graphical page
elements into a hierarchy. For example, a document is a collection of
text flows, a flow is a list of paragraphs, a paragraph is a list of
lines, a line is a list of words, a word is a list of text runs, etc.
In order to properly extract text, a text extractor must reconstruct
parts of the missing logical structure. Because this information is not
explicitly specified, the reconstruction is an error-prone process
(similar in that respect to OCR -
http://en.wikipedia.org/wiki/Optical_character_recognition).
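
To make the reconstruction idea concrete, here is a minimal Python
sketch of one small piece of it: grouping positioned words into lines
and then finding the bounding box of a search phrase. It is only an
illustration and not how PDF2Text or PDFNet SDK work internally. The
Word record, the group_into_lines and find_phrase helpers, and the
2-unit baseline tolerance are all assumptions made for this example; the
word boxes are assumed to have already been extracted by some tool, with
y growing upwards as in PDF page space.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical record for one extracted word and its bounding box."""
    text: str
    x1: float  # left
    y1: float  # bottom
    x2: float  # right
    y2: float  # top


def group_into_lines(words, y_tolerance=2.0):
    """Group words into lines, top to bottom.

    A word joins the current line when its bottom edge is within
    y_tolerance of the bottom edge of the first word placed on that line.
    The tolerance is a guess; real documents need smarter rules.
    """
    lines = []
    for w in sorted(words, key=lambda w: (-w.y1, w.x1)):
        if lines and abs(lines[-1][0].y1 - w.y1) <= y_tolerance:
            lines[-1].append(w)
        else:
            lines.append([w])
    for line in lines:
        line.sort(key=lambda w: w.x1)  # reading order within the line
    return lines


def find_phrase(lines, phrase):
    """Return (x1, y1, x2, y2) boxes for each match of phrase on one line."""
    target = phrase.split()
    if not target:
        return []
    hits = []
    for line in lines:
        texts = [w.text for w in line]
        for i in range(len(texts) - len(target) + 1):
            if texts[i:i + len(target)] == target:
                span = line[i:i + len(target)]
                hits.append((
                    min(w.x1 for w in span),
                    min(w.y1 for w in span),
                    max(w.x2 for w in span),
                    max(w.y2 for w in span),
                ))
    return hits


if __name__ == "__main__":
    sample = [
        Word("world", 60, 700, 90, 712),
        Word("Hello", 20, 700.5, 55, 712),
        Word("Next", 20, 680, 45, 692),
    ]
    print(find_phrase(group_into_lines(sample), "Hello world"))
```

Even this toy version shows where errors creep in: a phrase that wraps
across two lines is missed, and text set on a slightly shifted baseline
(superscripts, rotated text, multi-column layouts) can be grouped
incorrectly.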

Another thing to keep in mind is that PDF documents are generated by
all kinds of PDF creators, some of them buggy, and may contain
custom-encoded text and broken Unicode mapping tables. Such files may
present problems to text extraction engines even though the document
appears completely fine on-screen.
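
If you process many PDFs every day, one way to catch such files early is
to sanity-check the extracted text. The Python sketch below is a rough
heuristic of our own for illustration, not a feature of PDF2Text or
PDFNet SDK: the suspicious_ratio helper and the choice of "suspicious"
characters (the Unicode replacement character plus private-use,
unassigned, and control code points) are assumptions, and any threshold
you apply to the result would need tuning on your own documents.

```python
import unicodedata

# Unicode general categories treated as suspicious in this sketch:
# Co = private use, Cn = unassigned, Cc = control.
SUSPICIOUS_CATEGORIES = {"Co", "Cn", "Cc"}


def suspicious_ratio(text):
    """Return the fraction of non-whitespace characters that look broken."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1
        for c in chars
        if c == "\ufffd" or unicodedata.category(c) in SUSPICIOUS_CATEGORIES
    )
    return bad / len(chars)


if __name__ == "__main__":
    print(suspicious_ratio("Hello world"))
    print(suspicious_ratio("ab\ufffd\ue000"))
```

A high ratio suggests the file's text encoding could not be mapped to
Unicode reliably and the page may need manual review. A low ratio does
not prove the text is right, since a broken mapping table can also
produce valid but wrong characters.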

Having said this, both PDF2Text and PDFNet SDK employ state-of-the-art
techniques to get the best possible text extraction results. Besides
high-quality text extraction, the attributes that set these products
apart from other offerings are efficiency, robustness, and
reliability.

