Delphi PDF Interpreter - Text Extractor?

Lachlan Coote

Jul 30, 2002, 2:07:14 AM

I am looking for free Delphi source code for a PDF interpreter, if anyone
knows of its existence, for a personal web crawling project. Essentially I
need to extract text from PDF documents, and I am hesitant to write a
full-blown parser if I don't need to.

My searches on the web seem to have yielded only libraries that convert text
TO PDFs, except the Ghostscript libraries, where all the PDF conversion
source code is written in PostScript.

Any help would be greatly appreciated.

Brian Moelk

Jul 30, 2002, 9:29:20 AM

> I am looking for free Delphi source code for a PDF interpreter, if anyone
> knows of its existence, for a personal web crawling project. Essentially I
> need to extract text from PDF documents, and I am hesitant to write a
> full-blown parser if I don't need to.

There's only one library that I've seen that addresses this:

http://www.sedtech.com/isedquickpdf/

Take a look at the reference for "GetPageText"

It's not free; it costs $50. I have yet to make the time to evaluate or use
this library, so I don't know if it will do what you want.

HTH

--
Brian Moelk
bmoe...@SPAMbrainendeavorFOR.MEcom
http://www.brainendeavor.com

Rene Tschaggelar

Jul 31, 2002, 11:16:48 AM

I only know the options in Acrobat (not the Reader).
A PDF can contain images in various formats, text, and font libraries.
The compression of images, text, and fonts can be chosen with various
options. It also has options such as 'view but not print' and
more.

Some companies have their datasheets scanned from a paper version.
The content of the PDF is then a bitmap image.

I'd assume your task to be a decompression nightmare.
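
To get a rough idea of how much decompression a given file would need, one option is to count the stream filter names that appear in its raw bytes. The sketch below is only an illustration and is not part of any library mentioned in this thread: the program name and the helpers LoadFile and CountOccurrences are made up for this example, and the filter list is a partial selection of the standard PDF filter names. The counts are approximate, since a name can also occur by chance inside binary data, and a scanned datasheet would typically show image filters such as /DCTDecode or /CCITTFaxDecode.

```pascal
program PdfFilterSurvey;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

const
  // Partial list of standard PDF stream filter names (assumption: enough for a first look).
  FilterNames: array[0..4] of AnsiString = (
    '/FlateDecode', '/LZWDecode', '/DCTDecode', '/CCITTFaxDecode', '/ASCII85Decode');

// Hypothetical helper: reads the whole file into a byte string.
function LoadFile(const FileName: string): AnsiString;
var
  Stream: TFileStream;
  Size: Integer;
begin
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    Size := Integer(Stream.Size);
    SetLength(Result, Size);
    if Size > 0 then
      Stream.ReadBuffer(Result[1], Size);
  finally
    Stream.Free;
  end;
end;

// Hypothetical helper: naive count of how often Needle occurs in Haystack.
function CountOccurrences(const Needle, Haystack: AnsiString): Integer;
var
  I, J: Integer;
  Match: Boolean;
begin
  Result := 0;
  if Needle = '' then
    Exit;
  for I := 1 to Length(Haystack) - Length(Needle) + 1 do
  begin
    Match := True;
    for J := 1 to Length(Needle) do
      if Haystack[I + J - 1] <> Needle[J] then
      begin
        Match := False;
        Break;
      end;
    if Match then
      Inc(Result);
  end;
end;

var
  Data: AnsiString;
  I: Integer;
begin
  if ParamCount < 1 then
  begin
    WriteLn('Usage: PdfFilterSurvey <file.pdf>');
    Halt(1);
  end;
  Data := LoadFile(ParamStr(1));
  // Print one line per filter name with its approximate number of uses.
  for I := Low(FilterNames) to High(FilterNames) do
    WriteLn(FilterNames[I], ': ', CountOccurrences(FilterNames[I], Data));
end.
```

A file that reports only image filters and no /FlateDecode streams is probably a scan, in which case text extraction would not help and OCR would be needed instead.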

Good luck.

Rene
--
Ing.Buero R.Tschaggelar - http://www.ibrtses.com
& commercial newsgroups - http://www.talkto.net

Mike Leftwich

Jul 31, 2002, 1:23:11 PM

"Lachlan Coote" <wester...@hotmail.com> wrote in message
news:3d462d13_2@dnews...

> I am looking for free Delphi source code for a PDF interpreter, if anyone
> knows of its existence, for a personal web crawling project. Essentially I
> need to extract text from PDF documents, and I am hesitant to write a
> full-blown parser if I don't need to.

The Acrobat Reader 5.0 can save a PDF as an RTF file. In my experience the
formatting suffers considerably, but tools that read and manipulate RTF are
easy to find. It's not a complete solution, but perhaps it could be made to
work.

HTH

Mike

Alexander Halser

Jul 31, 2002, 6:56:13 PM

"Lachlan Coote" <wester...@hotmail.com> wrote

> I am looking for free Delphi source code for a PDF interpreter if anyone

The component wPDF from www.wptools.com can extract text from PDF files.
This library is actually meant to _create_ PDF files but it can partially
read them, too. However, it's not free...

--
Alexander Halser
http://www.ec-software.com

Brian Moelk

Jul 31, 2002, 7:27:02 PM

> The component wPDF from www.wptools.com can extract text from PDF files.
> This library is actually meant to _create_ PDF files but it can partially
> read them, too. However, it's not free...

Have you used it for that purpose? How well does it do? All I can see on
the website is "Reads PDF files and can convert the pages into watermarks
(useful for form printing!)"

TIA

Arioch

Aug 1, 2002, 1:58:20 AM

The stars so gaily glistened...
...while the fading voice of Lachlan Coote whispered through the darkness...

try to ask ElcomSoft for that <g>

Alexander Halser

Aug 1, 2002, 4:40:53 AM

"Brian Moelk" <bmo...@NObrainSPAMendeavorFOR.MEcom> wrote in message
news:3d48726d$1_1@dnews...

> Have you used it for such purpose? How well does it do?

I am not sure how good it is; I have never tried it myself. But wPDF installs a
component called "TWPDFPagesImport" which can read PDF files. The read
function is limited (as far as the author, Julian Ziersch, told me) but can
at least retrieve text from a PDF.

Geurt Lagemaat

Aug 2, 2002, 4:15:59 AM

My experience is that the text is not returned in the correct order. Even
words in a single paragraph can be mixed up. To try it yourself (with the
trial components):

WPDFPagesImport1.PDFFile := 'C:\Temp\MyFile.PDF';
WPDFPagesImport1.Execute;
memoTest.Text := WPDFPagesImport1.Pages[0].Text;
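
If a library returns text fragments out of order but also reports where each fragment is drawn, one possible workaround is to sort the fragments by position before joining them. The sketch below assumes exactly that. TTextFragment, ReadsBefore and SortIntoReadingOrder are hypothetical names and are not part of wPDF, and nothing in this thread confirms that wPDF exposes positions at all. The tolerance-based comparison is a heuristic for single-column pages, not a general layout analysis.

```pascal
program SortFragments;

{$APPTYPE CONSOLE}

type
  // Hypothetical record: one run of text plus the point where it is drawn.
  TTextFragment = record
    X, Y: Double;   // PDF user space, where Y grows upward
    Text: string;
  end;

// True if A should be read before B: higher on the page first,
// then left to right when both sit on (roughly) the same line.
function ReadsBefore(const A, B: TTextFragment; LineTolerance: Double): Boolean;
begin
  if Abs(A.Y - B.Y) > LineTolerance then
    Result := A.Y > B.Y
  else
    Result := A.X < B.X;
end;

// Plain insertion sort; adequate for the few hundred fragments on a page.
procedure SortIntoReadingOrder(var Items: array of TTextFragment;
  LineTolerance: Double);
var
  I, J: Integer;
  Key: TTextFragment;
begin
  for I := 1 to High(Items) do
  begin
    Key := Items[I];
    J := I - 1;
    while (J >= 0) and ReadsBefore(Key, Items[J], LineTolerance) do
    begin
      Items[J + 1] := Items[J];
      Dec(J);
    end;
    Items[J + 1] := Key;
  end;
end;

var
  Frags: array[0..3] of TTextFragment;
  I: Integer;
begin
  // Made-up sample data: two lines of text delivered in scrambled order.
  Frags[0].X := 120; Frags[0].Y := 700;   Frags[0].Text := 'world';
  Frags[1].X := 72;  Frags[1].Y := 680;   Frags[1].Text := 'second';
  Frags[2].X := 72;  Frags[2].Y := 700.5; Frags[2].Text := 'Hello';
  Frags[3].X := 130; Frags[3].Y := 680;   Frags[3].Text := 'line';

  SortIntoReadingOrder(Frags, 2.0);

  // Prints the fragments in reading order: Hello world second line
  for I := Low(Frags) to High(Frags) do
    Write(Frags[I].Text, ' ');
  WriteLn;
end.
```

The line tolerance (2.0 units here) absorbs small baseline differences between fragments on the same line. Multi-column pages would need the columns separated first, which this sketch does not attempt.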

Regards,

Geurt Lagemaat

Oriana Automatisering

"Brian Moelk" <bmo...@NObrainSPAMendeavorFOR.MEcom> wrote in message
news:3d48726d$1_1@dnews...