Reading PDF file

Stefan Rosi

Aug 30, 2004, 6:56:46 AM

Hello,

My company needs to compare two versions of a PDF file. I found a lot of
.NET components that generate and merge PDFs, but none of them could simply
read the text. The PDF file is in German and contains many photos and text
boxes, not just plain text (but only the text has to be compared).
Can anybody help me find a .NET component (or COM!) that extracts only
the text from such a file?

Thanks a lot in advance

Dennis Myrén

Aug 30, 2004, 8:10:17 AM

Extracting text from PDF streams is an extremely complicated task.
I won't go deeper into that here, but you may read chapters 4-5 of the
PDF specification to get an idea.
There is not just one but many different formats that can be used to
describe text in PDF documents.

I doubt there is anything free that will suit your needs; as far as I know,
only commercial products provide a professional text extraction
implementation.
Maybe this one could give you a start, though:
http://www.codeproject.com/cpp/ExtractPDFText.asp
It provides an open source text extraction implementation that can handle
streams of text compressed with Flate (zlib/deflate, often loosely called
GZIP), which is one of at least 3 common compression algorithms used for
text, in some formats.
But it is far from a complete PDF text extraction tool.
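
To give an idea of what such a starter does, here is a minimal Python sketch (not the CodeProject code, and not .NET). It assumes the text sits in Flate-compressed content streams and is shown with the simple Tj operator as literal strings. The names STREAM_RE, TJ_RE, inflate_streams and extract_simple_text are made up for this illustration. It ignores TJ arrays, hex strings, font encodings and every other filter, so real documents will often give partial or garbled results.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords (naive scan, no xref parsing).
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A literal string operand "( ... )" followed by the Tj (show text) operator.
TJ_RE = re.compile(rb"\((.*?)(?<!\\)\)\s*Tj", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Yield the inflated bytes of every stream that zlib can decompress."""
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes left after the data
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (image, other filter, uncompressed)


def extract_simple_text(pdf_bytes):
    """Return literal strings shown with Tj, in the order they are found."""
    pieces = []
    for content in inflate_streams(pdf_bytes):
        for m in TJ_RE.finditer(content):
            # latin-1 is an assumption; the real mapping depends on the font
            pieces.append(m.group(1).decode("latin-1"))
    return pieces
```

Calling extract_simple_text(open("old.pdf", "rb").read()) on both versions and diffing the two lists would be the simplest comparison; "old.pdf" is only a placeholder file name.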

If the text to be compared is stored in separate content streams within
the documents, maybe you could achieve it by comparing the streams at
byte level.
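
A rough Python sketch of that byte-level idea follows. It assumes both versions list their streams in the same order, and that the tool which saved them did not recompress unchanged streams (otherwise identical text can produce different bytes). The names stream_digests and changed_stream_indexes are hypothetical.

```python
import hashlib
import re

# Raw bytes between the "stream" and "endstream" keywords (naive scan, no xref parsing).
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def stream_digests(pdf_bytes):
    """Return one SHA-256 hex digest per raw stream, in file order."""
    return [hashlib.sha256(m.group(1)).hexdigest()
            for m in STREAM_RE.finditer(pdf_bytes)]


def changed_stream_indexes(old_pdf, new_pdf):
    """Return positions of streams whose raw bytes differ between two files."""
    old, new = stream_digests(old_pdf), stream_digests(new_pdf)
    changed = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
    # Streams that exist in only one of the files count as changed too.
    changed.extend(range(min(len(old), len(new)), max(len(old), len(new))))
    return changed
```

An empty result only says the raw streams are identical. A non-empty result does not say whether text or an image changed, so the differing streams still have to be inspected.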

--
Regards,
Dennis JD Myrén
Oslo Kodebureau
"Stefan Rosi" <inv...@invalid.com> wrote in message
news:%237a$BAojE...@TK2MSFTNGP09.phx.gbl...
