> Are there any perl modules that will allow you to convert MS-Word docs
> to plain text?
You can use Win32::OLE to do this.
Unlike the earlier formats, the DOCX format is an open XML format, and
information about it is available on the Microsoft website. I don't know
whether anyone has already written a parser for it, but at least it should
be possible now.
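To make the "open XML format" point concrete: a .docx file is a zip archive whose main body text lives in the member word/document.xml. Below is a minimal sketch, not a real parser, that pulls that member out with the core module IO::Uncompress::Unzip and strips the markup with regular expressions. The function names docx_xml_to_text and docx_to_text are made up for this illustration. It assumes the text you want is in the main document part, so headers, footers and footnotes (stored in other members) are ignored, and numeric character references are not decoded.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Reduce the WordprocessingML of word/document.xml to plain text.
# (Hypothetical helper name; regex stripping is a shortcut, not a parser.)
sub docx_xml_to_text {
    my ($xml) = @_;
    $xml =~ s{<w:tab\s*/>}{\t}g;        # tab characters inside runs
    $xml =~ s{<w:br\b[^>]*/>}{\n}g;     # explicit line and page breaks
    $xml =~ s{</w:p>}{\n}g;             # each paragraph ends a line
    $xml =~ s{<[^>]+>}{}g;              # drop every remaining tag
    # Decode the five predefined XML entities in a single pass.
    my %entity = (lt => '<', gt => '>', quot => '"', apos => "'", amp => '&');
    $xml =~ s{&(lt|gt|quot|apos|amp);}{$entity{$1}}g;
    return $xml;
}

# Pull word/document.xml out of the .docx zip archive and convert it.
sub docx_to_text {
    my ($path) = @_;
    my $xml;
    unzip($path => \$xml, Name => 'word/document.xml')
        or die "Cannot read $path: $UnzipError\n";
    return docx_xml_to_text($xml);
}

# Usage: perl docx2txt.pl file.docx > file.txt
# The output is the UTF-8 bytes found in the XML, printed unchanged.
print docx_to_text($ARGV[0]) if @ARGV;
```

For anything beyond a quick extraction, a proper XML parser would be the safer choice, since regular expressions will not cope with every construct Word can emit.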
jue
The docx format only applies to Word 2007. Microsoft have also made their
binary formats for other versions of Word public:
http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx
But deciphering these is only necessary if you want to read the files on a
non-Windows system or you don't have a working copy of Microsoft Word. If
you have Microsoft Word available, the most sensible thing to do is to use
Win32::OLE. Here is a quick program to save a Word file as text:
#! perl
use warnings;
use strict;
use Win32::OLE;
# Import Word's constants, including wdFormatText.
use Win32::OLE::Const 'Microsoft.Word';

my $dir = "C:/Documents and Settings/bkb/My Documents/scripts/tests/";
my $doc = $dir."test2.doc";
my $txt = $dir."test.txt";

# Start Word; the second argument makes it quit when $word is destroyed.
my $word = Win32::OLE->new('Word.Application', 'Quit')
    or die "Word problem: ",Win32::OLE->LastError();
my $document = $word->Documents->Open($doc)
    or die "Word problem: ",Win32::OLE->LastError();
# Save a plain-text copy of the document.
$document->SaveAs($txt, wdFormatText);
AFAIK there is no integrated Perl solution for this, but there
are several Perl bridges to external software that does this, such
as:
- Win32::OLE, if you happen to be doing this on an MS Windows system.
- OpenOffice::UNO supposedly lets you control OpenOffice to do
almost anything, including opening a Word document and saving it as a
text file or extracting the text directly. I have used OpenOffice UNO
from Java before, but I am not sure how much of UNO is implemented in
the Perl module.
- SWISH::Filters uses the external command catdoc to extract the text
out of MS Word documents.
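
The "bridge to an external program" idea in the list above can be sketched in a few lines of plain Perl. This assumes a converter such as catdoc or antiword is installed and on the PATH, and that it prints the extracted text to standard output when given a file name. The function name text_from_command is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run an external converter on a file and return whatever it prints.
# (Hypothetical helper name; the converter itself must already be installed.)
sub text_from_command {
    my ($command, $file) = @_;
    # List-form pipe open: no shell is involved, so spaces in $file are safe.
    open(my $out, '-|', $command, $file)
        or die "Cannot run $command: $!\n";
    my $text = do { local $/; <$out> };     # slurp all of the output
    close($out)
        or die "$command failed on $file (exit status $?)\n";
    return $text;
}

# Usage: perl doc2txt.pl file.doc
# Uses antiword here; pass 'catdoc' instead if that is what you have.
print text_from_command('antiword', $ARGV[0]) if @ARGV;
```

Because the command name is just a parameter, the same function makes it easy to run two converters on the same file and compare their output.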
--
(stephan paul) Arif Sahari Wibowo
_____ _____ _____ _____
/____ /____/ /____/ /____
_____/ / / / _____/ http://www.arifsaha.com/
I wonder how the result compares to the (non-perl, I think)
program "antiword"?
Any experiences?
David
Has anyone compared these two with "antiword" as well?
Thanks
David
On May 29, 9:54 pm, dkco...@panix.com (David Combs) wrote:
> In article <alpine.OSX.1.10.0805170909340.5...@imac2006.local>,
> In article <g09frv$9i$1...@ml.accsnet.ne.jp>,
> Ben Bullock <benkasmi...@gmail.com> wrote:
>>But deciphering these is only necessary if you want to read the
>>files on a non-Windows system or you don't have a working copy of
>>Microsoft Word. If you have Microsoft Word available, the most
>>sensible thing to do is to use Win32::OLE. Here is a quick program
>>to save a Word file as text:
> I wonder how the result compares to the (non-perl, I think)
> program "antiword"?
If antiword does something drastically different from Microsoft Word
itself, then that would be a bug. But the script I posted has faults
such as the failure to save the text in text boxes. To do that requires
further work.
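
As a rough idea of what that further work might look like, here is a sketch that walks the document's Shapes collection and collects the text of each shape that has any. It relies only on Shapes, Count, Item, TextFrame, HasText and TextRange->Text from the Word object model. The function name text_box_texts is made up for this illustration, and the sketch assumes top-level floating shapes only, so text boxes nested inside groups are not handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect the text of every top-level shape (text boxes included) in a
# Word document object. (Hypothetical helper name, not part of Win32::OLE.)
sub text_box_texts {
    my ($document) = @_;
    my $shapes = $document->Shapes or return;
    my @texts;
    for my $i (1 .. $shapes->Count) {          # Word collections are 1-based
        my $frame = $shapes->Item($i)->TextFrame or next;
        next unless $frame->HasText;           # skip shapes without text
        push @texts, $frame->TextRange->Text;
    }
    return @texts;
}

# Only meaningful on Windows with Word installed; pass a full path.
if ($^O eq 'MSWin32' && @ARGV) {
    require Win32::OLE;
    my $word = Win32::OLE->new('Word.Application', 'Quit')
        or die "Word problem: ", Win32::OLE->LastError();
    my $document = $word->Documents->Open($ARGV[0])
        or die "Word problem: ", Win32::OLE->LastError();
    print "$_\n" for text_box_texts($document);
}
```

The strings returned this way could be appended to the text file written by SaveAs, though where each text box belongs relative to the body text is a separate question.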