Re: PDF to Access DB

Jeff Boyce

Mar 31, 2009, 5:15:55 PM

Not sure why you posted twice ...

A PDF file is an image. To get the "data" out of it, you'd need to convert
it to something other than an image.

If you have a "pro" version of something like Adobe, you can try the OCR
(optical character recognition) feature to re-build the underlying data ...
but be aware that OCR is less than 100% accurate. Plan on having some of
your data 'lost in translation'.

Regards

Jeff Boyce
Microsoft Office/Access MVP

"FordsAngel" <Fords...@discussions.microsoft.com> wrote in message
news:5B120BEF-D9DF-4499...@microsoft.com...
> Is there a way to load tables in MSAccess with information from a PDF
> file? I
> have been told that it is possible with pdf pro, but have not yet figured
> out
> the process...
>
> Thank you!

a a r o n _ k e m p f

Mar 31, 2009, 5:54:43 PM

SQL Server can search through PDF files using FullText Search.
SQL Server can search through PDF files using FullText Search.
SQL Server can search through PDF files using FullText Search.
SQL Server can search through PDF files using FullText Search.

a a r o n _ k e m p f

Mar 31, 2009, 5:56:50 PM

I believe that you just need to register / install something that
implements the Acrobat IFilter interface

http://dineshasanka.spaces.live.com/Blog/cns!22A79FCE82651673!248.entry

Jeff Boyce

Mar 31, 2009, 7:27:43 PM

Did you also notice that some SQL-Server gurus point out that modifying
SQL-Server's FullText search to add the ability to search PDF files also
increases the security risk?

Regards

Jeff Boyce
Microsoft Office/Access MVP

"a a r o n _ k e m p f" <aaron...@hotmail.com> wrote in message
news:7290eace-9c7f-4aa5...@y34g2000prb.googlegroups.com...

a a r o n _ k e m p f

Mar 31, 2009, 8:44:40 PM

uh, I don't believe everything I hear.. especially from so-called
'mvps'

I'd take any MCP over any MVP any day of the week.. it's about
demonstrable knowledge, not 'who you know'.. you know?

-Aaron

On Mar 31, 4:27 pm, "Jeff Boyce" <nonse...@nonsense.com> wrote:
> Did you also notice that some SQL-Server gurus point out that modifying
> SQL-Server's FullText search to add the ability to search PDF files also
> increases the security risk?
>
> Regards
>
> Jeff Boyce
> Microsoft Office/Access MVP
>
> "a a r o n _ k e m p f" <aaron_ke...@hotmail.com> wrote in message
> news:7290eace-9c7f-4aa5...@y34g2000prb.googlegroups.com...
>
> > SQL Server can search through PDF files using FullText Search.
> > SQL Server can search through PDF files using FullText Search.
> > SQL Server can search through PDF files using FullText Search.
> > SQL Server can search through PDF files using FullText Search.
>
> > Jeff Boyce wrote:
> >> Not sure why you posted twice ...
>
> >> A PDF file is an image. To get the "data" out of it, you'd need to
> >> convert
> >> it to something other than an image.
>
> >> If you have a "pro" version of something like Adobe, you can try the OCR
> >> (optical character recognition) feature to re-build the underlying data
> >> ...
> >> but be aware that OCR is less than 100% accurate. Plan on having some of
> >> your data 'lost in translation'.
>
> >> Regards
>
> >> Jeff Boyce
> >> Microsoft Office/Access MVP
>

> >> "FordsAngel" <FordsAn...@discussions.microsoft.com> wrote in message

Paul Shapiro

Mar 31, 2009, 9:03:07 PM

A PDF file is not necessarily an image. It can contain text if it was
created from a text-based source. But it's still an "uncomfortable" medium
for extracting data since that's not its intended purpose. If you open a
PDF in Adobe Acrobat, you can try the Save As menu to see what options you
have. Plain text is one of them, but it will still be tough to count on
extracting the data cleanly and reliably. If you have any option to get data
in a data-centric format, you'll have a much easier time.
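
If the plain-text export is all you can get, one cautious approach is to parse it into rows and set aside every line that does not fit, rather than guessing. The Python sketch below assumes a hypothetical export in which columns are separated by runs of two or more spaces; the function names and column layout are illustrative only and do not come from Acrobat or Access. It writes CSV text, which Access can import as a delimited file.

```python
import csv
import io
import re


def text_to_rows(text, expected_columns):
    """Split each non-blank line on runs of two or more spaces.

    Returns (rows, rejects). Rows have exactly expected_columns fields;
    rejects are (line_number, line) pairs left for manual review.
    """
    rows, rejects = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines, but keep counting line numbers
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) == expected_columns:
            rows.append(fields)
        else:
            rejects.append((number, line))
    return rows, rejects


def rows_to_csv(rows, header):
    """Serialize the header and rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()
```

Keeping the rejects separate makes the "tough to count on" part visible: you can see how many lines failed to parse before anything reaches a table.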

"Jeff Boyce" <nons...@nonsense.com> wrote in message
news:ujNThXks...@TK2MSFTNGP03.phx.gbl...

Larry Linson

Mar 31, 2009, 9:57:17 PM

"a a r o n _ k e m p f" <aaron...@hotmail.com> wrote

> I'd take any MCP over any MVP any day of the week..
> it's about demonstrable knowledge, not 'who you know'..
> you know?

It would be interesting to have the psychic ability to know just how many
people are rolling on the floor, laughing at the idea of trusting Mr. Kempf
(who claims, but hasn't provided a link to prove, some flavor of MCP) over
any of the MVPs who post in this forum.

I'm not "calling for a vote", but I think I can hear the peals of laughter
from here.

Larry

James A. Fortune

Mar 31, 2009, 9:54:08 PM

Jeff Boyce wrote:
> Not sure why you posted twice ...
>
> A PDF file is an image. To get the "data" out of it, you'd need to convert
> it to something other than an image.
>
> If you have a "pro" version of something like Adobe, you can try the OCR
> (optical character recognition) feature to re-build the underlying data ...
> but be aware that OCR is less than 100% accurate. Plan on having some of
> your data 'lost in translation'.
>
> Regards
>
> Jeff Boyce
> Microsoft Office/Access MVP

A PDF file can contain images, but to claim that "a PDF file is an
image" seems shockingly simplistic, IMO, unless you are only considering
the output to your screen. For example, the PDF 1.7 Reference
describing the PDF format contains about 1310 pages. See the discussion
in the following thread:

http://groups.google.com/group/microsoft.public.access/browse_frm/thread/d34aa27e14854f45
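
As a rough illustration of the point that a PDF is more than an image, the Python sketch below looks inside a file's content streams for text-showing operators. It is a deliberately simplified sketch, not a PDF parser: it assumes streams are either uncompressed or FlateDecode (zlib) compressed, it only recognizes the `Tj` operator with a literal string, and it ignores escapes and font encodings. The function name is made up for this example.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)
# A literal string followed by the Tj (show text) operator, e.g. (Hello) Tj
TJ_RE = re.compile(rb"\((.*?)\)\s*Tj", re.DOTALL)


def find_text_fragments(pdf_bytes):
    """Return literal strings shown with Tj in the file's streams."""
    fragments = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # FlateDecode streams are zlib data; decompressobj tolerates
            # a trailing byte trimmed or added around the stream body.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            data = raw  # assume the stream was stored uncompressed
        for text in TJ_RE.finditer(data):
            fragments.append(text.group(1).decode("latin-1"))
    return fragments
```

If this returns nothing for a real file, that does not prove the file is a scanned image; the text may use other operators, other filters, or encodings this sketch does not handle.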

Basically, extracting text and images from a PDF file with 100% accuracy
ranges from fairly easy to very difficult depending on things like the
scope and method of compression used, the number of edits made and
whether or not PDF Linearization optimization was employed by the
program used to create the PDF file. For anything past "somewhat easy"
I recommend not using Access to perform the extraction from the data
streams even though Access theoretically has enough capability to
perform the task. I agree that image and text data can be extracted
from a screen capture (or try a simple copy/paste for text data), but I
consider those methods, especially the "lossy" OCR, to be last resorts.
I think I remember seeing a free software tool that can split a PDF
file into individual one page PDF files. Googling... Perhaps it was:

http://www.pdfhacks.com/pdftk/

Using something like that could possibly break a complex problem down to
smaller pieces that may be more amenable to data extraction. If all
else fails, there are likely many commercial software packages that can
extract data from PDF files and that cost under $100.00.
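
For the page-splitting idea, pdftk's `burst` operation writes one PDF per page. The Python sketch below only wraps that command line; it assumes pdftk is installed and on the PATH, and the function names and the `page_%04d.pdf` output pattern are choices made for this example.

```python
import subprocess


def burst_command(pdf_path, pattern="page_%04d.pdf"):
    """Build the pdftk argument list that writes one PDF per page."""
    return ["pdftk", pdf_path, "burst", "output", pattern]


def burst(pdf_path, pattern="page_%04d.pdf"):
    """Run pdftk; raises CalledProcessError if pdftk reports a failure."""
    subprocess.run(burst_command(pdf_path, pattern), check=True)
```

Calling `burst("report.pdf")` would be expected to leave files named page_0001.pdf, page_0002.pdf, and so on in the current directory, which can then be processed one page at a time.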

James A. Fortune
MPAP...@FortuneJames.com

Jeff Boyce

Apr 1, 2009, 1:41:57 PM

Thanks for the clarifications ... that may just satisfy my self-imposed
"learn one new thing each day" requirement!

Jeff B.

"James A. Fortune" <MPAP...@FortuneJames.com> wrote in message
news:OzgVY3ms...@TK2MSFTNGP04.phx.gbl...