I want to compare two PDF or WORD files. Any Help?
thx
> I want to compare two PDF or WORD files.
Could you be more precise, please?
+ Do you only want to compare PDF-PDF or Word-Word? Or do
you want to be able to do PDF-Word?
+ In either case, are you only bothered about the text, or
is the formatting significant?
+ If it's only text, then use whatever method you want to
extract the text (antiword, ghostscript, COM automation,
xpdf, etc.) and then use the difflib module, or some external
diff tool.
+ If you want a structure/format comparison, you're into quite
difficult territory, I believe. It's easy enough to convert a
Word Doc to PDF if that were needed but PDFs are notoriously
difficult to disentangle, altho' relatively straightforward to
build. There's pdftools
(http://www.boddie.org.uk/david/Projects/Python/pdftools/)
which I can't say I've tried, but even once you've got the document
object into Python, I don't imagine it'll be easy to compare.
+ To do Word-Word comparison, there's more hope on the horizon
(if that's the metaphor I want). Word has built-in comparison
functionality, and recent versions of TortoiseSVN, for example,
include a script which will automate Word to do the right thing.
Which is, essentially, to open one doc and call its .Compare method
against the other.
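A rough sketch of that automation, under some assumptions: it is written for Python 3 with pywin32, the helper name compare_with_word is made up for illustration, and the Word application object is passed in so the function itself does not depend on Windows. Open and Compare are the Word object model calls; everything else is scaffolding. Word generally wants absolute paths here.

```python
def compare_with_word(word, original_path, revised_path):
    """Open original_path in Word and compare it against revised_path.

    `word` is assumed to be a Word.Application COM object, for example
    one obtained from win32com.client.Dispatch("Word.Application").
    Returns the opened document object.
    """
    doc = word.Documents.Open(original_path)
    # Document.Compare takes the name of the other document as its
    # first argument; Word then shows the differences as revisions.
    doc.Compare(revised_path)
    return doc


# Example use on a Windows machine with Word and pywin32 installed
# (left commented out because it cannot run anywhere else):
#
#   import win32com.client
#   word = win32com.client.Dispatch("Word.Application")
#   word.Visible = True
#   compare_with_word(word, r"c:\temp\old.doc", r"c:\temp\new.doc")
```

How the result is presented (a new document or tracked changes in an existing one) depends on the Word version and the optional arguments to Compare, so treat this as a starting point only.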
TJG
Thanks for your reply!
I want to compare PDF-PDF files and WORD-WORD files.
It seems that the right way is :
First, extract text from PDF file or Word file.
Then, use Difflib to compare these text files.
Would you please give me some more information about the external diff
tools?
Thx!
One more thing:
Are there any Python scripts that can extract text from a PDF or Word file?
Thx
OK. Well, that's clear enough.
> It seems that the right way is :
> First, extract text from PDF file or Word file.
> Then, use Difflib to compare these text files.
When you say "it seems that the right way is..." I'll
assume that this way meets your requirements. It
wouldn't be the right way if, for example, you
wanted to treat different header levels as different,
or to consider embedded graphics as significant etc.
> Would you please give me some more information
> about the external diff tools?
Well, I could mention the name of the ones
which I might use (WinMerge and GNU diff),
but I'm sure there are many of them around
the place, and you're far better off doing this:
http://www.google.co.uk/search?q=diff+tools
In case you didn't realise, the "difflib" I
referred to is a Python module from the standard
library:
<screendump>
Python 2.4.2 (#67, Sep 28 2005, 12:41:11) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import difflib
>>> `difflib`
"<module 'difflib' from 'c:\\python24\\lib\\difflib.pyc'>"
>>>
</screendump>
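For what it's worth, here is a minimal sketch of the comparison step itself, assuming the text has already been extracted into two strings. It is written for Python 3 rather than the 2.4 shown above, and the function name diff_texts and the default file labels are made up for illustration; difflib.unified_diff is the standard-library call doing the work.

```python
import difflib


def diff_texts(old_text, new_text, old_name="old.txt", new_name="new.txt"):
    """Return a unified diff of two strings as a list of lines."""
    # unified_diff works on sequences of lines; keep the line endings
    # so the result can be joined back together with "".join().
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return list(difflib.unified_diff(old_lines, new_lines,
                                     fromfile=old_name, tofile=new_name))


if __name__ == "__main__":
    old = "First line\nSecond line\nThird line\n"
    new = "First line\nSecond line, edited\nThird line\n"
    print("".join(diff_texts(old, new)))
```

An empty list means the two texts are identical line for line. difflib.HtmlDiff is worth a look if a side-by-side report is more useful than a patch-style diff.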
> Are there any Python scripts that can extract text
> from a PDF or Word file?
Well, I'm sure there are, but my honest opinion is that,
unless you've got some compelling reason to do this in
Python, you're better off using, say:
+ antiword: http://www.winfield.demon.nl/
+ pdftotext from xpdf: http://www.foolabs.com/xpdf/home.html
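If you go the external-tool route, a Python script can still drive the tools and collect their output. The following is only a sketch, for Python 3.7 or later, and it assumes both executables are on the PATH: antiword prints the document text to standard output, and xpdf's extractor (the executable is called pdftotext) does the same when "-" is given as the output file. The names build_command and extract_text are invented for this example.

```python
import os
import subprocess


def build_command(path):
    """Return the command line used to extract text from path."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".doc":
        # antiword writes the text to standard output by default
        return ["antiword", path]
    if ext == ".pdf":
        # "-" asks pdftotext to write to standard output
        return ["pdftotext", path, "-"]
    raise ValueError("Don't know how to extract text from %r" % path)


def extract_text(path):
    """Run the external tool and return its output as a string."""
    result = subprocess.run(build_command(path),
                            capture_output=True, check=True)
    # The output encoding depends on the tool and its settings;
    # UTF-8 with replacement characters is a guess, not a guarantee.
    return result.stdout.decode("utf-8", errors="replace")
```

The two strings returned by extract_text can then be handed straight to difflib or written out for an external diff tool.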
If you really wanted to go with Python (for the learning
experience, if nothing else) then the most obvious candidates
are:
+ Word: use the pywin32 modules to automate Word and save the document
as text:
Something like this (assumes doc called c:\temp\test.doc exists):
<code>
import win32com.client
# Start Word via COM; EnsureDispatch also makes the wd* constants available
word = win32com.client.gencache.EnsureDispatch ("Word.Application")
doc = word.Documents.Open (FileName="c:/temp/test.doc")
# Save a plain-text copy of the document
doc.SaveAs (FileName="c:/temp/test2.txt",
            FileFormat=win32com.client.constants.wdFormatText)
doc.Close ()
word.Quit ()
del word
# Read the text back in
text = open ("c:/temp/test2.txt").read ()
print text
</code>
+ PDF: David Boddie's pdftools looks like about the only possibility:
(ducks as a thousand people jump on him and point out the alternatives)
http://www.boddie.org.uk/david/Projects/Python/pdftools/
Something like this might do the business. I'm afraid I've
no idea how to determine where the line-breaks are. This
was the first time I'd used pdftools, and the fact that
I could do this much is a credit to its usability!
<code>
from pdftools.pdffile import PDFDocument
from pdftools.pdftext import Text

def contents_to_text (contents):
    # Walk the (possibly nested) contents list, yielding the text of each Text item
    for item in contents:
        if isinstance (item, type ([])):
            for i in contents_to_text (item):
                yield i
        elif isinstance (item, Text):
            yield item.text

doc = PDFDocument ("c:/temp/test.pdf")
n_pages = doc.count_pages ()
text = []
# Pages are numbered from 1
for n_page in range (1, n_pages+1):
    print "Page", n_page
    page = doc.read_page (n_page)
    contents = page.read_contents ().contents
    text.extend (contents_to_text (contents))
print "".join (text)
</code>
> + PDF: David Boddie's pdftools looks like about the only possibility:
> (ducks as a thousand people jump on him and point out the alternatives)
I might as well do that! Here are a couple of alternatives:
http://www.sourceforge.net/projects/pdfplayground
http://www.adaptive-enterprises.com.au/~d/software/pdffile/
Both of these are arguably more "Pythonic" than my solution, and
the first is also able to write out modified files.
Cameron Laird also maintains a page about PDF conversion tools:
http://phaseit.net/claird/comp.text.pdf/PDF_converters.html
> http://www.boddie.org.uk/david/Projects/Python/pdftools/
>
> Something like this might do the business. I'm afraid I've
> no idea how to determine where the line-breaks are. This
> was the first time I'd used pdftools, and the fact that
> I could do this much is a credit to its usability!
Thanks for the compliment! The read_text method in the PDFContents
class also lets you extract text from a given page in a document, but
you have to remember that text in PDF files isn't always composed as
a series of lines or paragraphs, and often doesn't even contain
whitespace characters.
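One way to cope with that when comparing two PDFs is to throw the whitespace away before comparing, so that differences in how the text was laid out don't show up as differences in content. This is just a sketch of the idea in Python 3 using the standard difflib module; the function names are made up for illustration, and it gives a similarity score rather than a line-by-line diff.

```python
import difflib


def strip_whitespace(text):
    """Remove all whitespace (spaces, tabs, newlines) from text."""
    return "".join(text.split())


def similarity_ignoring_whitespace(text_a, text_b):
    """Return a ratio between 0.0 and 1.0; 1.0 means the texts are
    identical once whitespace has been removed."""
    a = strip_whitespace(text_a)
    b = strip_whitespace(text_b)
    # autojunk=False turns off difflib's "popular element" heuristic,
    # which is unhelpful when comparing long strings character by character.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return matcher.ratio()
```

The same SequenceMatcher object also offers get_opcodes(), which reports where the two strings differ, if a yes/no answer isn't enough.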
David
So if I want to use these tools (antiword, pdftotext), can I pack them
and the Python script into a Windows EXE file? I know there is an open
source tool which can pack a Python script and its libraries and
generate a Windows EXE file.
Yes, this approach can't handle the pictures in the PDF/Word file.
Is there a way to work around it? Maybe it's very hard.
Regards
I'm not especially qualified to answer this, but I
think the answer's Yes. I think that you can just
tell py2exe that the executables and DLLs of the
other products are data files for the Python one.
Best look at the py2exe site and mailing list for
further info. An alternative is just to use an
installer to package the whole thing in the usual
Windows way.
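Whichever way the tools get shipped, the script then has to find them at run time. Here is a small sketch in Python 3 of one way to do that. It relies on the convention that py2exe (like other freezers) sets sys.frozen in the packaged program, and it assumes, purely for illustration, that the external executables were copied into a "tools" folder next to the program; the function names are invented too.

```python
import os
import sys


def application_dir():
    """Return the folder the program is running from."""
    if getattr(sys, "frozen", False):
        # Packaged EXE: sys.executable is the EXE itself
        return os.path.dirname(os.path.abspath(sys.executable))
    # Plain script: use the location of the script being run
    return os.path.dirname(os.path.abspath(sys.argv[0]))


def tool_path(name, subdir="tools"):
    """Return the expected full path of a bundled tool such as
    antiword.exe. The "tools" subfolder is an assumption."""
    return os.path.join(application_dir(), subdir, name)
```

Calling tool_path("antiword.exe") gives the path to pass to the subprocess module instead of relying on the tool being on the user's PATH.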
> Yes, this approach can't handle the pictures in
> the PDF/Word file. Is there a way to work around
> it? Maybe it's very hard.
I'm not even sure how I'd go about it conceptually.
How *do* you compare two pictures? Do you really
want to do this?
BTW, don't forget that if you're comparing Word with
Word, you can use its inbuilt comparison ability,
which just needs COM automation. (Don't know how
that takes care of pictures either, but if Word's
own Compare can't, no-one else has got a chance).
I will give it a try, maybe this weekend, and let you know the result.