from pdftools.pdffile import PDFDocument
from pdftools.pdftext import Text

def contents_to_text (contents):
    for item in contents:
        if isinstance (item, type ([])):
            for i in contents_to_text (item):
                yield i
        elif isinstance (item, Text):
            yield item.text

doc = PDFDocument ("/home/dave/pruebas_ficheros/carlos.pdf")
n_pages = doc.count_pages ()
text = []

for n_page in range (1, (n_pages+1)):
    print "Page", n_page
    page = doc.read_page (n_page)
    contents = page.read_contents ().contents
    text.extend (contents_to_text (contents))

print "".join (text)

The problem is that on some PDFs it runs words together, and in
Spanish the accented characters ("acentos") come out as strange
characters. For example, "camión" becomes "cami/86n" and
"IMPLEMENTACIÓN" becomes "IMPLEMENTACI?".
If someone knows how to use pdftools and can help me, I would be
very happy.
Another thing: I can see the text read from the .pdf on the
screen, but I do not know how to create a .txt file and save this
text in it.
Sorry for my English.
Thanks to all.
If you have 'xpdf' installed on your system, the 'pdftotext'
command will be available.
To convert a PDF to text from Python, use a system call.
For example:
import os
os.system("pdftotext -layout my_pdf_file.pdf")
This will create the file 'my_pdf_file.txt'.
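
A slightly more defensive variant of the same call is sketched below. It assumes 'pdftotext' from xpdf is on the PATH and uses its '-layout' and '-enc' options; the helper names build_pdftotext_command and pdf_to_text are made up for this illustration and are not part of any library.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, txt_path=None, layout=True, encoding="UTF-8"):
    """Build the argument list for pdftotext (hypothetical helper)."""
    command = ["pdftotext"]
    if layout:
        # Keep the physical layout of the page as far as possible.
        command.append("-layout")
    # Ask for UTF-8 output so accented characters survive.
    command.extend(["-enc", encoding, pdf_path])
    if txt_path is not None:
        # Without this argument pdftotext writes next to the PDF, with a .txt suffix.
        command.append(txt_path)
    return command


def pdf_to_text(pdf_path, txt_path=None):
    """Run pdftotext and raise an error if it is missing or fails."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH; install xpdf first")
    subprocess.check_call(build_pdftotext_command(pdf_path, txt_path))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces, and check_call reports a failed conversion instead of silently ignoring it.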
Regards,
Baiju M
http://glyphandcog.com/Xpdf.html
Install it and then try what Baiju said; it should work.
I've used it and it's good, which is why I say it should work. If you
have any problems, post here again.
-------------------------------------------------------------------------------------------
Vasudev Ram
Independent software consultant
Personal site: http://www.geocities.com/vasudevram
PDF conversion tools: http://sourceforge.net/projects/xtopdf
-------------------------------------------------------------------------------------------
[...]
> for n_page in range (1, (n_pages+1)):
> print "Page", n_page
> page = doc.read_page (n_page)
> contents = page.read_contents ().contents
> text.extend (contents_to_text (contents))
>
> print "".join (text)
>
> the problem is that on some pdf´s it generates join words and In
> spanish the "acentos"
> in words like: "camión" goes to --> cami/86n or
> "IMPLEMENTACIÓN" -----> "IMPLEMENTACI?" give strange
> characters
pdftools just extracts the textual data in the file and stores it in
Text instances - it doesn't try to interpret or decode the text. I'd
like to fix the library so that it does try and decode the text
properly and put it into unicode strings, but I don't have the time
right now.
Remember that text can be stored in PDF files in many different
ways, and that the text cannot always be extracted in its original
form.
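
To give an idea of what "decoding" could involve, here is a rough sketch. It assumes the raw string uses PDF-style octal escapes (a backslash followed by up to three octal digits) and that the resulting bytes are Latin-1; neither assumption is confirmed for the file in this thread, and real PDFs can use per-font encodings that need more work. The function decode_pdf_literal is hypothetical and is not part of pdftools.

```python
import re

# A backslash followed by one to three octal digits, as in PDF literal strings.
_OCTAL_ESCAPE = re.compile(r"\\([0-7]{1,3})")


def decode_pdf_literal(raw, encoding="latin-1"):
    """Expand octal escapes, then decode the bytes (assumed Latin-1 by default)."""
    def to_char(match):
        # Convert the octal digits to a single byte value.
        return chr(int(match.group(1), 8) & 0xFF)

    unescaped = _OCTAL_ESCAPE.sub(to_char, raw)
    # Round-trip through Latin-1 to get the raw bytes, then decode them.
    return unescaped.encode("latin-1").decode(encoding)


if __name__ == "__main__":
    print(decode_pdf_literal(r"cami\363n"))
```

If the text comes out wrong with Latin-1, the font is probably using a different encoding, and that has to be read from the PDF itself.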
> if someone knows how to use the pdftools and can help me it makes me
> very happy.
>
> Another thing is that I can see the letters readden from .pdf on the
> screen, but I do not know how to create a file and save this
> information inside the file a .txt
You need to do something like this:
with open("myfilename", "w") as f:
    f.write("".join (text))
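
If the text contains accented characters, it may help to choose the output encoding explicitly. This is only a sketch: save_text is a made-up helper, and it assumes the extracted pieces are already properly decoded (unicode) strings, which pdftools does not guarantee at the moment.

```python
import io


def save_text(chunks, path, encoding="utf-8"):
    """Join the extracted pieces and write them to a text file (hypothetical helper)."""
    joined = u"".join(chunks)
    # io.open takes an encoding argument, so accents are written consistently.
    with io.open(path, "w", encoding=encoding) as handle:
        handle.write(joined)
    return len(joined)
```

You would call it as save_text(text, "myfilename.txt") after the loop that fills the text list.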
> Sorry for my english.
Don't worry about it. It's much better than my Spanish will ever be.
Sorry I couldn't give you more help with this. You may find that the
other tools mentioned by people in this thread will do what you
need better than pdftools can at the moment.
David