[PDF] ? Regarding extracting text from PDFs

Thomas Connolly

Apr 1, 2003, 12:43:00 PM

The PDF list is a service provided by PDFzone.com | http://www.pdfzone.com
__________________________________________________________________

I am currently looking for some way (preferably automated through software
or scripts) to extract the text from a multi-page PDF document and insert it
into a MySQL database, with a separate record for each page containing an
identifier (page #) and the text from that page.
Any ideas?

--
Tom Connolly
Webmaster
Curran & Connors
http://www.curran-connors.com
631.435.0400

Scott Brenner

Apr 1, 2003, 1:25:32 PM

Dear Tom Connolly,

In Adobe Acrobat 5.0, the document object has a method called getPageNthWord(nPage, nWord, bStrip) that lets you loop through all the words in a document.
You also have access to the Adobe ADBC object. ADBC allows you to link to any ODBC data source.
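
As a rough sketch of that word loop (not a tested Acrobat script): the function below walks every page and joins the words into one string per page. It assumes the doc object also exposes numPages and getPageNumWords(nPage) alongside getPageNthWord; the function name extractPageTexts and the shape of the returned records are made up for illustration.

```javascript
// Collect the text of each page, one record per page.
// "doc" is the document object (inside Acrobat this would normally be "this").
function extractPageTexts(doc) {
  var pages = [];
  for (var p = 0; p < doc.numPages; p++) {
    var words = [];
    var count = doc.getPageNumWords(p);
    for (var w = 0; w < count; w++) {
      // bStrip = true asks for the word without surrounding punctuation/whitespace.
      words.push(doc.getPageNthWord(p, w, true));
    }
    // Page indexes are zero-based in the loop; store a one-based page number.
    pages.push({ page: p + 1, text: words.join(" ") });
  }
  return pages;
}
```

Because the words come back stripped and are rejoined with single spaces, the original punctuation and line breaks are lost, which matters if the text has to be recorded exactly.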

I am not sure about the limits of ODBC/ADBC text fields, or how perfectly you need the text recorded, but it is an idea you can explore.
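
To give an idea of the database side, here is a minimal sketch that turns per-page records into INSERT statements and hands them to a statement object. The table and column names (pdf_pages, page_num, page_text) are hypothetical, and the escaping assumes MySQL's default mode, where both the single quote and the backslash need handling. The "statement" argument is assumed to be anything with an execute(sql) method, such as an ADBC statement.

```javascript
// Quote a string for use as a single-quoted SQL literal.
// Assumption: MySQL default mode, where backslash is also an escape character.
function sqlQuote(text) {
  return "'" + String(text).replace(/\\/g, "\\\\").replace(/'/g, "''") + "'";
}

// Build one INSERT per page. Table and column names are hypothetical.
function buildInsert(page, text) {
  return "INSERT INTO pdf_pages (page_num, page_text) VALUES (" +
         Number(page) + ", " + sqlQuote(text) + ")";
}

// Run one INSERT per record; "pages" is an array of { page, text } objects.
// Returns the number of statements executed.
function storePages(statement, pages) {
  for (var i = 0; i < pages.length; i++) {
    statement.execute(buildInsert(pages[i].page, pages[i].text));
  }
  return pages.length;
}
```

Inside Acrobat, the statement would presumably come from something like ADBC.newConnection("MyDSN").newStatement(), where "MyDSN" stands in for whatever ODBC data source name points at the MySQL database.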

Once you get the JavaScript working, you could turn it into a batch job and run it against entire PDF directories on demand.

Good Luck,
Sam Brenner

Steven M. Harris

Apr 1, 2003, 3:17:27 PM

For text extraction you might consider "AutoTextExtractor":
http://www.pdfstore.com/details.asp?ProdID=578

For something that shows you where key words and phrases are, by page
number, within a PDF file (or lots of PDF files), "SuperFindar" will create
an index in database form (it has two other index options as well) listing
the page numbers on which a matching word or phrase was found:
http://www.pdfstore.com/details.asp?ProdID=602

SuperFindAR-Pro does the same thing but will accept an unlimited number of
words and phrases to index.

Leonard Rosenthol

Apr 1, 2003, 3:30:20 PM

There are a number of applications, on different OS platforms and with
different programming interfaces, that could be used for this purpose.

Leonard
---------------------------------------------------------------------------
Leonard Rosenthol <mailto:leon...@pdfsages.com>
Chief Technical Officer <http://www.pdfsages.com>
PDF Sages, Inc. 215-629-3700 (voice)

Ahmed Zaki

Apr 2, 2003, 1:28:02 AM

Well, if it is a TIFF or JPG file rather than a PDF, then I would recommend
Form Reader (www.ABBYY.com).
It will extract the text and insert it into your ODBC database.

Ahmed
