Q: You mention that if PDF2Text does not fit the bill, you have
another product. Can you tell me the known limitations that might
prevent PDF2Text from working? We process several PDFs every day and
would like to know the limitations and dependencies ahead of time.
A: The main difference is that PDF2Text is a simple-to-use
command-line application, whereas PDFNet SDK is a software development
kit (SDK). The advantage of PDF2Text is that you don’t need to be a
developer to use it. However, it does not have all of the features
that are available in PDFNet SDK (http://www.pdftron.com/net).
PDF2Text itself is built on the PDFNet SDK API.
When it comes to text extraction from PDF, there are several things to
keep in mind. Most PDF documents do not store logical structure.
Logical structure is the meta-information that groups graphical page
elements into a hierarchy. For example, a document is a collection of
text flows, a flow is a list of paragraphs, a paragraph is a list of
lines, a line is a list of words, a word is a list of text runs, etc.
In order to properly extract text, a text extractor must reconstruct
parts of the missing logical structure. Because this information is
not explicitly specified, the reconstruction is an error-prone process
(similar in concept to OCR).
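
To make the reconstruction step concrete, here is a minimal Python
sketch of how positioned text runs might be regrouped into lines and
words. It is not how PDF2Text or PDFNet SDK work internally. The Run
type, the function names, and both tolerance values are assumptions
made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A hypothetical text run: a string plus its position on the page."""
    text: str
    x: float      # left edge of the run
    y: float      # baseline; PDF y coordinates grow upward
    width: float  # horizontal extent of the run


def group_into_lines(runs, y_tolerance=2.0):
    """Cluster runs whose baselines are within y_tolerance of each other."""
    lines = []
    # Visit runs from the top of the page down, left to right.
    for run in sorted(runs, key=lambda r: (-r.y, r.x)):
        if lines and abs(lines[-1][0].y - run.y) <= y_tolerance:
            lines[-1].append(run)
        else:
            lines.append([run])
    # Within each line, restore left-to-right reading order.
    return [sorted(line, key=lambda r: r.x) for line in lines]


def line_to_text(line, gap_threshold=1.5):
    """Join the runs of one line, guessing a space wherever the gap is wide."""
    parts = [line[0].text]
    for prev, cur in zip(line, line[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > gap_threshold:
            parts.append(" ")
        parts.append(cur.text)
    return "".join(parts)


def extract_text(runs):
    """Rebuild plain text from an unordered collection of runs."""
    return "\n".join(line_to_text(line) for line in group_into_lines(runs))


runs = [
    Run("world", 40, 700, 25),
    Run("Next", 10, 685, 20),
    Run("lo", 25, 700.5, 10),
    Run("Hel", 10, 700, 15),
]
print(extract_text(runs))  # prints "Hello world" and "Next" on two lines
```

Both thresholds are guesses. A tolerance that suits one document can
merge separate lines or split words in another, which is why this kind
of reconstruction is error-prone.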
Another thing to keep in mind is that many PDF documents are generated
by buggy PDF creators and may contain custom-encoded text and broken
Unicode mapping tables. Such files may present problems to text
extraction engines even though the document appears completely fine
on-screen.
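
If you process many files every day, one way to catch such documents
early is to run a sanity check on the extracted text. The Python
sketch below is only a heuristic and is not part of PDF2Text or PDFNet
SDK. The function names and the 0.2 threshold are assumptions. It
counts characters that usually indicate a failed Unicode mapping:
the replacement character, private-use, unassigned, surrogate, and
control code points.

```python
import unicodedata

# General categories that rarely appear in correctly extracted text:
# Co = private use, Cn = unassigned, Cs = surrogate, Cc = control.
SUSPICIOUS_CATEGORIES = {"Co", "Cn", "Cs", "Cc"}


def suspicious_ratio(text):
    """Fraction of non-whitespace characters that look like mapping failures."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1
        for c in chars
        if c == "\ufffd" or unicodedata.category(c) in SUSPICIOUS_CATEGORIES
    )
    return bad / len(chars)


def looks_garbled(text, threshold=0.2):
    """Flag text whose share of suspicious characters exceeds the threshold."""
    return suspicious_ratio(text) > threshold


print(looks_garbled("Hello world"))            # False
print(looks_garbled("\ue001\ue002\ue003 ok"))  # True
```

A check like this will not catch text that maps to valid but wrong
characters, so flagged files still need a visual comparison against
the rendered page.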
Having said this, both PDF2Text and PDFNet SDK employ state-of-the-art
techniques to get the best possible text extraction results. Besides
high-quality text extraction, additional attributes that set apart
these products from other offerings are efficiency, and robustness and