At my company we need to compare two versions of a PDF file. I found a lot
of .NET components that generate and merge PDFs, but none of them could
simply read the text. The PDF file is in German and contains many photos and
text boxes, not just plain text (but only the text has to be compared).
Can anybody help me find a .NET component (or a COM one!) that extracts only
the text from such a file?
Thanks a lot in advance.
I am sure there is nothing free that will suit your needs; only commercial
products provide such a professional text extraction implementation.
Maybe this one could give you a starter, though:
http://www.codeproject.com/cpp/ExtractPDFText.asp
It provides an open source text extraction implementation that can handle
Flate (zlib/deflate, often loosely called GZIP) compressed streams of text
in some formats. Flate is one of at least 3 common compression algorithms
used for text.
But it is far from a complete PDF text extraction tool.
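
To show roughly what that kind of extraction involves, here is a minimal
sketch in Python (not the code from the linked article). It assumes the text
sits in Flate-compressed content streams and is drawn with the Tj and TJ
operators using a Latin-1 compatible font encoding. The names
inflate_streams and extract_text are made up for illustration. Custom font
encodings, LZW streams, encryption and object streams are not handled.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of each PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
# Text-showing operators: "[...] TJ" (array form) or "(...) Tj" (single string).
# Simplification: a "]" inside a string of a TJ array is not supported.
SHOW_RE = re.compile(rb"\[([^\]]*)\]\s*TJ|(\((?:\\.|[^\\()])*\))\s*Tj", re.DOTALL)
# One literal string operand, allowing backslash escapes inside it.
STRING_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.DOTALL)
# A backslash escape: up to three octal digits, or one escaped character.
ESCAPE_RE = re.compile(rb"\\([0-7]{1,3}|.)", re.DOTALL)


def _unescape(match):
    """Turn one PDF string escape into the byte(s) it stands for."""
    token = match.group(1)
    if token[:1] in b"01234567":
        # Octal character code, e.g. \374 for a Latin-1 u-umlaut.
        return bytes([int(token, 8) & 0xFF])
    return {b"n": b"\n", b"r": b"\r", b"t": b"\t"}.get(token, token)


def inflate_streams(pdf_bytes):
    """Return the decoded bytes of every stream that zlib can inflate."""
    decoded = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            decoded.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            # Not Flate data (image, LZW, encrypted, ...): skip it.
            continue
    return decoded


def extract_text(pdf_bytes):
    """Return one string per Tj/TJ operation found in the inflated streams."""
    texts = []
    for content in inflate_streams(pdf_bytes):
        for match in SHOW_RE.finditer(content):
            operand = match.group(1) if match.group(1) is not None else match.group(2)
            parts = [ESCAPE_RE.sub(_unescape, s) for s in STRING_RE.findall(operand)]
            # Assumption: the font uses a Latin-1 compatible encoding.
            texts.append(b"".join(parts).decode("latin-1"))
    return texts
```

Calling extract_text(open("a.pdf", "rb").read()) would give a list of text
fragments in drawing order, which is not necessarily reading order.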
If the text to be compared is stored in separate content streams within the
documents, you could perhaps do the job by comparing those streams at the
byte level.
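
A rough sketch of that byte-level idea in Python, under the assumption that
both versions keep their streams in the same order and that the same
producer wrote both files. The names stream_digests and changed_streams are
hypothetical. Streams are inflated where possible, so that a change in
compression settings alone is not reported as a difference.

```python
import hashlib
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of each PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def stream_digests(pdf_bytes):
    """Return a SHA-256 digest for every stream, in file order."""
    digests = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            data = zlib.decompress(raw)
        except zlib.error:
            # Not Flate data: fall back to comparing the raw bytes.
            data = raw
        digests.append(hashlib.sha256(data).hexdigest())
    return digests


def changed_streams(old_pdf, new_pdf):
    """Return the indexes of streams that differ between the two files."""
    old = stream_digests(old_pdf)
    new = stream_digests(new_pdf)
    diffs = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
    # Streams that exist in only one version also count as changed.
    diffs.extend(range(min(len(old), len(new)), max(len(old), len(new))))
    return diffs
```

An empty result only says the streams are byte-identical. A non-empty result
does not say whether the text or a photo changed, so the flagged streams
still have to be inspected.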
--
Regards,
Dennis JD Myrén
Oslo Kodebureau
"Stefan Rosi" <inv...@invalid.com> wrote in message
news:%237a$BAojE...@TK2MSFTNGP09.phx.gbl...