I have placed a PDF in my C:\ drive. It is called
"Components-of-Dot-NET.pdf". You could use any PDF; I was just testing with this one.
I was using the last, most recently updated script on the page below.
I am using Python 2.6.
http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-converter/
import pyPdf

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.pdf"
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii", "ignore")
This is my error.
>>>
Warning (from warnings module):
  File "C:\Documents and Settings\Family\Application Data\Python\Python26\site-packages\pyPdf\pdf.py", line 52
    from sets import ImmutableSet
DeprecationWarning: the sets module is deprecated
Traceback (most recent call last):
  File "C:/Python26/Pdfread", line 15, in <module>
    print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii", "ignore")
  File "C:/Python26/Pdfread", line 6, in getPDFContent
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
IOError: [Errno 2] No such file or directory: 'Components-of-Dot-NET.pdf'
>>>
You put the file in C:\, but you didn't tell Python where it is. You
gave just the filename "Components-of-Dot-NET.pdf", and it's looking in
the current directory, which probably isn't C:\.
Try providing the full pathname:
print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
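To see why the bare filename fails, it can help to print where Python is actually looking. The following is only a minimal sketch and not part of the original script; the helper name resolve is made up for illustration, and it relies only on os.getcwd, os.path.abspath and os.path.exists from the standard library.

```python
import os

def resolve(filename):
    """Return the absolute path a relative filename would resolve to."""
    # A relative name is interpreted against the current working directory.
    return os.path.abspath(filename)

relative = "Components-of-Dot-NET.pdf"
print("Current directory: " + os.getcwd())
print("Python would open: " + resolve(relative))
print("File exists there: " + str(os.path.exists(resolve(relative))))
```

If the resolved path is not the one where the PDF really lives, passing the full pathname as suggested above is the straightforward fix.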
Okay, thanks. I thought that when I set content here

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.pdf"

that I was defining where the file is.
But yes, I updated the script to the one below and it works; that is, the
contents are displayed in the interpreter. How do I output to a .txt file?
import pyPdf

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.pdf"
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
I have found far more advanced scripts while searching around, but I will
have to keep trying, as I cannot get an output file or specify the path.
Edit: very strangely, whilst searching for examples I found my own post,
just written here, ranking number 5 on Google within 2 hours. Bizarre.
http://www.eggheadcafe.com/software/aspnet/36237766/errors-with-pypdf.aspx
It replicates our thread as theirs. I was searching Google with "pypdf
return to txt file".
Traceback (most recent call last):
  File "C:/Python26/Pdfread", line 16, in <module>
    open('x.txt', 'w').write(content)
NameError: name 'content' is not defined
>>>
When I use:

import pyPdf

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.txt"
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
open('x.txt', 'w').write(content)
The assignment to 'content' at the top of the function simply binds a
local name; 'content' is a local variable in the function 'getPDFContent'.
>     # Load PDF into pyPDF
>     pdf = pyPdf.PdfFileReader(file(path, "rb"))
You're opening a file whose path is in 'path'.
>     # Iterate pages
>     for i in range(0, pdf.getNumPages()):
>         # Extract text from page and add to content
>         content += pdf.getPage(i).extractText() + "\n"
That appends to 'content'.
>     # Collapse whitespace
'content' now contains the text of the PDF, starting with
r"C:\Components-of-Dot-NET.pdf".
>     content = " ".join(content.replace(u"\xa0", " ").strip().split())
>     return content
>
> print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
>
Outputting to a .txt file is simple: open the file for writing using
'open', write the string to it, and then close it.
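Those three steps might look roughly like the sketch below. It is only an illustration under some assumptions: the helper name save_text and the temporary output location are made up here, and the text is a plain placeholder string rather than real pyPdf output. The with statement (available in Python 2.6) closes the file automatically, even if the write fails.

```python
import os
import tempfile

def save_text(text, path):
    """Write text to path, closing the file when done."""
    # 'with' closes the file on exit from the block.
    with open(path, "w") as outfile:
        outfile.write(text)

# Hypothetical output location, used here so the sketch runs anywhere.
out_path = os.path.join(tempfile.gettempdir(), "x.txt")
save_text("example text extracted from a PDF", out_path)
```

In the script from this thread, the first argument would be the value returned by getPDFContent(...) (after the .encode("ascii", "ignore") step on Python 2), not the name content, which exists only inside the function.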
That's what I was trying to do with
open('x.txt', 'w').write(content)
The rest of the script works, but it won't output the text.
You have a backslash problem here. You need to say:
content = "C:\\Components-of-Dot-NET.pdf"
or
content = "C:/Components-of-Dot-NET.pdf"
or
content = r"C:\Components-of-Dot-NET.pdf"
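A small sketch of why the spelling matters, offered as an illustration rather than as part of the original reply. As far as I know, "\C" is not a recognized escape sequence, so the original literal happens to survive intact, but a path segment beginning with a letter such as n or t would be silently altered. The raw-string form is the one already used in the print call earlier in the thread; the file name new.pdf is hypothetical and chosen only to show the problem.

```python
doubled = "C:\\Components-of-Dot-NET.pdf"  # escaped backslash
forward = "C:/Components-of-Dot-NET.pdf"   # forward slash, also accepted by Windows
raw = r"C:\Components-of-Dot-NET.pdf"      # raw string keeps the backslash literally

# The doubled and raw spellings build the same string.
print(doubled == raw)

# Hypothetical file name: here "\n" is read as a newline character,
# so the resulting string is no longer a valid path.
risky = "C:\new.pdf"
print(repr(risky))
print("\n" in risky)
```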
--
Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.
On 2:59 PM, flebber wrote:
> <snip>
> Traceback (most recent call last):
>   File "C:/Python26/Pdfread", line 16, in <module>
>     open('x.txt', 'w').write(content)
> NameError: name 'content' is not defined
> When I use:
>
> import pyPdf
>
> def getPDFContent(path):
>     content = "C:\Components-of-Dot-NET.txt"
>     # Load PDF into pyPDF
>     pdf = pyPdf.PdfFileReader(file(path, "rb"))
>     # Iterate pages
>     for i in range(0, pdf.getNumPages()):
>         # Extract text from page and add to content
>         content += pdf.getPage(i).extractText() + "\n"
>     # Collapse whitespace
>     content = " ".join(content.replace(u"\xa0", " ").strip().split())
>     return content
>
> print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
> open('x.txt', 'w').write(content)
>
There's no global variable content; that name was local to the function,
so it's lost when the function exits. The function does return the value,
but you give it to print and don't save it anywhere.
data = getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
outfile = open('x.txt', 'w')
outfile.write(data)
outfile.close()
I used a different name to emphasize that this is *not* the same
variable as content inside the function. In this case, it happens to
have the same value. And if you used the same name, you could be
confused about which is which.
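The scoping point can be checked without pyPdf at all. The sketch below is only an illustration; the function names build_text and content_is_visible are made up for the example and do not appear in the original script.

```python
def build_text():
    # 'content' is bound only inside this function.
    content = "text built inside the function"
    return content

def content_is_visible():
    """Report whether a module-level name 'content' exists."""
    try:
        content  # looked up as a global, which was never defined
        return True
    except NameError:
        return False

# The returned value survives only because it is bound to a new name.
data = build_text()
print(data)
print(content_is_visible())
```

Calling build_text() without assigning or using the result would discard the text, which is what happened when the return value was handed straight to print.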
DaveA
Thank You everyone.