Hello All,
Can anyone please tell me whether there are any Python modules available to
convert PDF documents to MS Word documents? If not, can you please
suggest how I can achieve this?
Many thanks in advance,
Regards
Deb
======
What you ask is quite difficult. My understanding is that PDF grew out of
PostScript and shares its imaging model, with its own file structure
wrapped around it. Depending on
the nature of the PDF (is it encrypted, are there other special
provisions?) you may be able to strip the raw text from the file and
create an RTF file from it. However, you will lose all formatting in
this case. If the formatting is "standard" across all the PDFs you may
be able to infer from the text something that will allow you to
replace some or all of it.
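
As a rough illustration of the "raw text to RTF" step, here is a minimal
sketch. It assumes you have already pulled the plain text out of the PDF by
some other means; text_to_rtf is a made-up helper name, not part of any
library, and the output carries no formatting beyond paragraph breaks.

```python
def text_to_rtf(text):
    """Wrap plain text in a minimal RTF document (hypothetical helper)."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)        # escape RTF control characters
        elif ch == "\n":
            out.append("\\par\n")        # newline becomes a paragraph break
        elif ch == "\t":
            out.append("\\tab ")
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # \uN takes a signed 16-bit value; "?" is the fallback character.
            units = ch.encode("utf-16-le")
            for i in range(0, len(units), 2):
                n = int.from_bytes(units[i:i + 2], "little", signed=True)
                out.append("\\u%d?" % n)
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\\f0\\fs24 "
    return header + "".join(out) + "}"


if __name__ == "__main__":
    # "extracted.txt" and "output.rtf" are placeholder file names.
    with open("extracted.txt", encoding="utf-8") as src:
        rtf = text_to_rtf(src.read())
    with open("output.rtf", "w", encoding="ascii") as dst:
        dst.write(rtf)
```

Word opens the resulting file directly, but everything will be in one font
and size, which is exactly the loss of formatting described above.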
--
Stand Fast,
tjg.
No Python modules that I know of, but:
- feeding the subject line to Google brings up some sponsored links that
claim to solve your problem
- http://www.quiss.org/swftools/ has a tool to convert PDF to Flash, so
there must be some code in it to detect text, fonts, etc.
Daniel
Pdf2swf is based on xpdf (http://www.foolabs.com/xpdf).
Another tool that is also based on xpdf is pdftohtml
(http://pdftohtml.sourceforge.net/). It can convert PDF to HTML (using
absolute CSS positioning) or to XML. I don't know if there are any RTF
or Word writers in Python, but in my previous VB life I wrote a
simple Word macro that would open an HTML page and save it as a .doc
document. It was the easiest way to get all the images embedded and the
formatting done correctly. I don't know, however, how it would handle
absolute positioning.
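
The same two steps could be driven from Python instead of a VB macro. This
is only a sketch under several assumptions: pdftohtml is on the PATH, it
writes <base>.html when given -c -noframes (check what your build actually
produces), and the machine runs Windows with MS Word and the pywin32 package
installed. The function names are made up for the example.

```python
import os
import subprocess

WD_FORMAT_DOCUMENT = 0  # value of Word's wdFormatDocument constant (.doc)


def pdftohtml_command(pdf_path, html_base):
    # -c: "complex" output with absolute positioning
    # -noframes: a single HTML file instead of a frameset
    return ["pdftohtml", "-c", "-noframes", pdf_path, html_base]


def html_to_doc(html_path, doc_path):
    # Imported here so the rest of the module loads without pywin32.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Word's automation interface wants absolute paths.
        doc = word.Documents.Open(os.path.abspath(html_path))
        doc.SaveAs(os.path.abspath(doc_path), WD_FORMAT_DOCUMENT)
        doc.Close()
    finally:
        word.Quit()


def pdf_to_doc(pdf_path, doc_path):
    base = os.path.splitext(pdf_path)[0]
    subprocess.run(pdftohtml_command(pdf_path, base), check=True)
    # Assumption: the converter wrote base + ".html".
    html_to_doc(base + ".html", doc_path)
```

Calling pdf_to_doc("report.pdf", "report.doc") would run both steps; the
file names are placeholders. How well Word copes with the absolutely
positioned HTML is still the open question.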
Another possible option is to convert the PDF to PS format, and then use
pstoedit (http://www.pstoedit.net/pstoedit) with the shareware RTF plugin
mentioned on that page. I don't have any experience with this option.
Ksenia.
I think that there is no public specification of the doc format. PDF and doc
are also different classes of format. So you can extract the text (with the
Ghostscript front end ps2ascii, hoping the encoding comes out right) and the
pictures. Typesetting the Word document is then up to you.
Maybe converting the PDF to HTML and importing the HTML into Word is a better
way, but again, you go from a stronger language to a weaker one.
Jan
> ----- Original Message -----
> From: JEET <hjee...@yahoo.com>
> Date: Tue, 28 Sep 2004 17:13:17 +0100 (BST)
> Subject: Using python to convert PDF document to MSWord documents
> To: pytho...@python.org
One of the problems with such a module would be that PDF is primarily a
display format, and so the structure of the file doesn't necessarily
conform with the structure of the document.
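
To make that concrete: a converter typically gives you positioned text
fragments, not paragraphs, and you have to rebuild the reading order
yourself. The sketch below assumes XML shaped like the output of
"pdftohtml -xml" (page elements containing text elements with integer top
and left attributes); the sample data is invented and lines_from_fragments
is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Invented sample; note the fragments are not stored in reading order.
SAMPLE = """<pdf2xml>
<page number="1">
<text top="100" left="200" width="50" height="12" font="0">world</text>
<text top="120" left="50" width="80" height="12" font="0">Second line</text>
<text top="101" left="50" width="40" height="12" font="0">Hello</text>
</page>
</pdf2xml>"""


def lines_from_fragments(xml_text, tolerance=2):
    """Group positioned fragments into lines, top to bottom, left to right."""
    root = ET.fromstring(xml_text)
    lines = []
    for page in root.iter("page"):
        frags = []
        for t in page.iter("text"):
            content = "".join(t.itertext()).strip()
            if content:
                frags.append((int(t.get("top")), int(t.get("left")), content))
        frags.sort()
        rows = []  # each row is [top, [(left, content), ...]]
        for top, left, content in frags:
            # Fragments whose tops differ by at most `tolerance` share a line.
            if rows and abs(top - rows[-1][0]) <= tolerance:
                rows[-1][1].append((left, content))
            else:
                rows.append([top, [(left, content)]])
        for _, items in rows:
            lines.append(" ".join(c for _, c in sorted(items)))
    return lines


print(lines_from_fragments(SAMPLE))  # ['Hello world', 'Second line']
```

Even this only recovers lines. Paragraphs, columns, tables and headings all
have to be guessed from the geometry, which is why a general PDF-to-Word
module is hard to write.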
regards
Steve