Errors with PyPdf

flebber

Sep 26, 2010, 7:10:29 PM

I was trying to use pyPdf following a recipe from the ActiveState
cookbooks. However, I cannot get it to work. I am unsure whether it is me or
because sets are deprecated.

I have placed a PDF on my C:\ drive. It is called
"Components-of-Dot-NET.pdf". You could use anything; I was just testing with it.

I was using the last script on that page that was most recently
updated. I am using Python 2.6.

http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-converter/

import pyPdf

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.pdf"
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii", "ignore")

This is my error.

>>>

Warning (from warnings module):
  File "C:\Documents and Settings\Family\Application Data\Python\Python26\site-packages\pyPdf\pdf.py", line 52
    from sets import ImmutableSet
DeprecationWarning: the sets module is deprecated

Traceback (most recent call last):
  File "C:/Python26/Pdfread", line 15, in <module>
    print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii", "ignore")
  File "C:/Python26/Pdfread", line 6, in getPDFContent
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
IOError: [Errno 2] No such file or directory: 'Components-of-Dot-NET.pdf'
>>>

MRAB

Sep 26, 2010, 7:35:20 PM
to pytho...@python.org
On 27/09/2010 00:10, flebber wrote:
> I was trying to use pyPdf following a recipe from the ActiveState
> cookbooks. However, I cannot get it to work. I am unsure whether it is me or
> because sets are deprecated.
>
The 'sets' module pre-dates the built-in 'set' class. The warning just
informs you that the module will be removed in due course. It is still
present in Python 2.7 but not in Python 3, so you can keep using it on
Python 2.6 and 2.7.

You put the file in C:\, but you didn't tell Python where it is. You
gave just the filename "Components-of-Dot-NET.pdf", and it's looking in
the current directory, which probably isn't C:\.

Try providing the full pathname:

print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
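
A minimal sketch of how a bare file name gets resolved, in case it helps to see it. It uses Python 3 syntax and only the standard library; the file name is the one from this thread and does not need to exist for the sketch to run.

```python
import os

# File name from the thread; it does not have to exist for this demo.
name = "Components-of-Dot-NET.pdf"

# A bare file name is resolved against the current working directory,
# not against the place where the file happens to be stored.
resolved = os.path.abspath(name)
print("current directory:", os.getcwd())
print("bare name resolves to:", resolved)

# Checking first gives a clearer message than the IOError from open().
if not os.path.exists(resolved):
    print("not found:", resolved)
```

If the printed location is not C:\, that explains the IOError, and passing the full path avoids the problem.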

w.g.s...@gmail.com

Sep 26, 2010, 7:38:39 PM

On Sep 26, 7:10 pm, flebber <flebber.c...@gmail.com> wrote:
> I was trying to use pyPdf following a recipe from the ActiveState
> cookbooks. However, I cannot get it to work. I am unsure whether it is me or
> because sets are deprecated.
>
> I have placed a PDF on my C:\ drive. It is called
> "Components-of-Dot-NET.pdf". You could use anything; I was just testing with it.
>
> I was using the last script on that page that was most recently
> updated. I am using Python 2.6.
>
> http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-co...
>
> IOError: [Errno 2] No such file or directory: 'Components-of-Dot-NET.pdf'

Looks like an issue with finding the file.
How do you pass the path?

flebber

Sep 26, 2010, 8:39:28 PM

On Sep 27, 9:38 am, "w.g.sned...@gmail.com" <w.g.sned...@gmail.com>
wrote:

Okay, thanks. I thought that when I set content here

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.pdf"

I was defining where the file is.

But yes, I updated the script to the one below and it works; that is, the
contents are displayed in the interpreter. How do I output them to a .txt file?

import pyPdf

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.pdf"
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")


flebber

Sep 26, 2010, 10:08:28 PM

I have found far more advanced scripts while searching around, but I will have
to keep trying, as I cannot get an output file or specify the path.

Edit: very strangely, whilst searching for examples I found my own post,
just written here, ranking number 5 on Google within 2 hours. Bizarre.

http://www.eggheadcafe.com/software/aspnet/36237766/errors-with-pypdf.aspx

It replicates our thread as theirs. I was searching Google with "pypdf
return to txt file".

flebber

Sep 26, 2010, 10:19:40 PM

> http://www.eggheadcafe.com/software/aspnet/36237766/errors-with-pypdf...
>
> It replicates our thread as theirs. I was searching Google with "pypdf
> return to txt file".

Traceback (most recent call last):
  File "C:/Python26/Pdfread", line 16, in <module>
    open('x.txt', 'w').write(content)
NameError: name 'content' is not defined
>>>

That is the error I get when I use:

import pyPdf

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.txt"
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
open('x.txt', 'w').write(content)

MRAB

Sep 26, 2010, 10:49:21 PM
to pytho...@python.org

> def getPDFContent(path):
>     content = "C:\Components-of-Dot-NET.pdf"

That simply binds the string to a local name; 'content' is a local variable
in the function 'getPDFContent'.

>     # Load PDF into pyPDF
>     pdf = pyPdf.PdfFileReader(file(path, "rb"))

You're opening a file whose path is in 'path'.

>     # Iterate pages
>     for i in range(0, pdf.getNumPages()):
>         # Extract text from page and add to content
>         content += pdf.getPage(i).extractText() + "\n"

That appends to 'content'.

>     # Collapse whitespace

'content' now contains the text of the PDF, starting with
r"C:\Components-of-Dot-NET.pdf".

>     content = " ".join(content.replace(u"\xa0", " ").strip().split())
>     return content
>
> print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
>
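
To make the point about the starting value concrete, here is a small sketch in Python 3 syntax. It does not use pyPdf at all: the list page_texts and the helper join_pages are made-up stand-ins for what extractText() would return page by page.

```python
# Hypothetical stand-in for the text extractText() returns per page.
page_texts = ["First page", "Second page"]

def join_pages(pages, initial):
    # Start from 'initial' and append each page, as the recipe does.
    content = initial
    for text in pages:
        content += text + "\n"
    # Collapse whitespace, as in the recipe
    return " ".join(content.strip().split())

# Starting from the path string leaves the path glued to the output.
with_path = join_pages(page_texts, r"C:\Components-of-Dot-NET.pdf")
# Starting from an empty string gives only the page text.
with_empty = join_pages(page_texts, "")

print(with_path)
print(with_empty)
```

So the accumulator would normally start as an empty string, and the file location would only be passed in through 'path'.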

Outputting to a .txt file is simple: open the file for writing using
'open', write the string to it, and then close it.
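
A minimal sketch of those three steps, in Python 3 syntax. The string text is a made-up stand-in for whatever getPDFContent returns, and the output location is a temporary directory chosen only for the example.

```python
import os
import tempfile

# Hypothetical stand-in for the string returned by getPDFContent.
text = "Extracted text from the PDF"

# Example output location; a temporary directory keeps the demo tidy.
out_path = os.path.join(tempfile.mkdtemp(), "x.txt")

# The with-statement closes the file even if write() raises.
with open(out_path, "w") as outfile:
    outfile.write(text)

# Read the file back to confirm what was written.
with open(out_path) as infile:
    written = infile.read()
print(written)
```

The with-statement is also available in Python 2.6, so the same pattern should carry over to the script in this thread.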

flebber

Sep 26, 2010, 10:56:40 PM

That's what I was trying to do with

open('x.txt', 'w').write(content)

The rest of the script works; it won't output the text, though.

Tim Roberts

Sep 26, 2010, 11:12:14 PM

flebber <flebbe...@gmail.com> wrote:
>
>okay thanks I thought that when I set content here
>
>def getPDFContent(path):
> content = "C:\Components-of-Dot-NET.pdf"

You have a backslash problem here. You need to say:
    content = "C:\\Components-of-Dot-NET.pdf"
or
    content = "C:/Components-of-Dot-NET.pdf"
or
    content = r"C:\Components-of-Dot-NET.pdf"
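
A small sketch of why a single backslash in an ordinary string is risky, in Python 3 syntax. The path with "\new\test.pdf" is a made-up example. Note that "\C" is not a recognized escape sequence, so the string in the original script keeps its backslash; paths whose folder names start with letters such as n or t are the ones that silently break.

```python
# These spellings all name the same Windows path.
escaped = "C:\\Components-of-Dot-NET.pdf"   # doubled backslash
forward = "C:/Components-of-Dot-NET.pdf"    # forward slashes also work with open() on Windows
raw = r"C:\Components-of-Dot-NET.pdf"       # raw string keeps the backslash as-is

print(escaped == raw)

# Hypothetical path showing where a single backslash bites:
# "\n" becomes a newline and "\t" becomes a tab in an ordinary literal.
broken = "C:\new\test.pdf"
intended = r"C:\new\test.pdf"
print(repr(broken))
print(repr(intended))
```

Using a raw string or forward slashes for every Windows path avoids having to remember which letters form escape sequences.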
--
Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.

Dave Angel

Sep 27, 2010, 12:46:18 AM
to flebber, pytho...@python.org

On 2:59 PM, flebber wrote:
> <snip>
> Traceback (most recent call last):
>   File "C:/Python26/Pdfread", line 16, in <module>
>     open('x.txt', 'w').write(content)
> NameError: name 'content' is not defined
>
> That is the error I get when I use:
>
> import pyPdf
>
> def getPDFContent(path):
>     content = "C:\Components-of-Dot-NET.txt"
>     # Load PDF into pyPDF
>     pdf = pyPdf.PdfFileReader(file(path, "rb"))
>     # Iterate pages
>     for i in range(0, pdf.getNumPages()):
>         # Extract text from page and add to content
>         content += pdf.getPage(i).extractText() + "\n"
>     # Collapse whitespace
>     content = " ".join(content.replace(u"\xa0", " ").strip().split())
>     return content
>
> print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
> open('x.txt', 'w').write(content)
>

There's no global variable 'content'; that name was local to the function, so
it's lost when the function exits. The function does return the value, but you
pass it straight to print and don't save it anywhere.

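# Keep the returned text in a variable instead of only printing it
data = getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")

outfile = open('x.txt', 'w')
outfile.write(data)
# close() is a method of the file object, not a built-in function
outfile.close()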

I used a different name to emphasize that this is *not* the same
variable as content inside the function. In this case, it happens to
have the same value. And if you used the same name, you could be
confused about which is which.
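
The scoping point can be seen in isolation with a short sketch in Python 3 syntax. The function get_content and its string are made-up stand-ins for getPDFContent and the extracted text.

```python
def get_content():
    # 'content' exists only while the function runs.
    content = "text from the PDF"   # hypothetical stand-in value
    return content

# The return value has to be bound to a name to be kept.
data = get_content()
print(data)

# The function's local name is not visible at module level.
try:
    content
except NameError as exc:
    message = str(exc)
    print("NameError:", message)
```

Only 'data' survives the call, which is why the write has to use it rather than 'content'.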


DaveA

flebber

Sep 27, 2010, 10:19:34 AM

Thank you, everyone.
