Convert MS-Word to plain text

backpack

May 9, 2008, 1:29:33 PM
Are there any perl modules that will allow you to convert MS-Word docs
to plain text?

Ben Bullock

May 9, 2008, 7:09:40 PM
On Fri, 09 May 2008 10:29:33 -0700, backpack wrote:

> Are there any perl modules that will allow you to convert MS-Word docs
> to plain text?

You can use Win32::OLE to do this.

Jürgen Exner

May 9, 2008, 9:10:36 PM
backpack <curt...@gmail.com> wrote:
>Are there any perl modules that will allow you to convert MS-Word docs
>to plain text?

Unlike the earlier formats, DOCX is an open XML format, and information
about it is available on the Microsoft website. I don't know whether
anyone has already written a parser for it, but at least it should be
possible now.
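
As a rough illustration of what such a parser might involve, here is a minimal sketch rather than a tested converter. It relies on the fact that a .docx file is a ZIP archive whose main body is stored in the member word/document.xml, and it uses only the core module IO::Uncompress::Unzip. The helper names docx_xml_to_text and docx_to_text are made up for this example, and the regex-based tag stripping is an assumption chosen for brevity: it ignores headers, footnotes, text boxes and numeric character references.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Hypothetical helper: turn the XML of word/document.xml into plain text.
# Regex-based on purpose; a real converter should use an XML parser.
sub docx_xml_to_text {
    my ($xml) = @_;
    $xml =~ s{</w:p>}{\n}g;            # end of paragraph -> newline
    $xml =~ s{<w:br\b[^>]*/>}{\n}g;    # explicit line break -> newline
    $xml =~ s{<w:tab\s*/>}{\t}g;       # tab character (not tab-stop settings)
    $xml =~ s{<[^>]+>}{}g;             # strip all remaining tags
    # Decode the five predefined XML entities in a single pass.
    my %entity = (lt => '<', gt => '>', quot => '"', apos => "'", amp => '&');
    $xml =~ s{&(lt|gt|quot|apos|amp);}{$entity{$1}}g;
    return $xml;
}

# Hypothetical helper: read the main document part out of the ZIP container.
sub docx_to_text {
    my ($path) = @_;
    my $xml;
    unzip($path => \$xml, Name => 'word/document.xml')
        or die "Cannot read word/document.xml from $path: $UnzipError\n";
    return docx_xml_to_text($xml);
}

# Convert every file named on the command line.
print docx_to_text($_) for @ARGV;
```

Saved as, say, docx2txt.pl, it would be run as "perl docx2txt.pl report.docx > report.txt". The output is the UTF-8 text exactly as stored in the document.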

jue

Ben Bullock

May 12, 2008, 9:14:38 AM

"Jurgen Exner" <jurg...@hotmail.com> wrote in message
news:n7t924pig28pcl1kg...@4ax.com...

The docx format only applies to Word 2007. Microsoft have also made their
binary formats for other versions of Word public:

http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx

But deciphering these is only necessary if you want to read the files on a
non-Windows system or you don't have a working copy of Microsoft Word. If
you have Microsoft Word available, the most sensible thing to do is to use
Win32::OLE. Here is a quick program to save a Word file as text:

#! perl
use warnings;
use strict;
use Win32::OLE;
# Import Word's constants, including wdFormatText.
use Win32::OLE::Const 'Microsoft.Word';

my $dir = "C:/Documents and Settings/bkb/My Documents/scripts/tests/";
my $doc = $dir."test2.doc";
my $txt = $dir."test.txt";

# Start Word; the 'Quit' argument makes Word exit when $word is destroyed.
my $word = Win32::OLE->new('Word.Application', 'Quit')
    or die "Word problem: ",Win32::OLE->LastError();
my $document = $word->Documents->Open($doc)
    or die "Word problem: ",Win32::OLE->LastError();
# Save a plain-text copy of the document.
$document->SaveAs($txt, wdFormatText);

S P Arif Sahari Wibowo

May 17, 2008, 9:18:52 AM
On Fri, 9 May 2008, backpack wrote:
> Are there any perl modules that will allow you to convert
> MS-Word docs to plain text?

AFAIK there is no integrated Perl solution for this, but there
are several Perl bridges to external software that can do it,
such as:

- Win32::OLE, if you happen to be doing this on an MS Windows system.

- OpenOffice::UNO, which supposedly lets you control OpenOffice to
do almost anything, including opening a Word document and saving
it as a text file or extracting the text directly. I have used
OpenOffice UNO from Java before, but I am not sure how much of UNO
is implemented in the Perl module.

- SWISH::Filters, which uses the external command catdoc to extract
the text from MS Word documents.
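
For the external-command route in the last item, calling a converter directly from Perl might look like the sketch below. It assumes that catdoc is installed, is on the PATH, and writes the extracted text to standard output when given a file name. text_from_command is a hypothetical helper written for this example; it is not part of SWISH::Filters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run an external converter on one file and return
# whatever it writes to standard output. The list form of open bypasses
# the shell, so spaces and quotes in file names need no escaping.
sub text_from_command {
    my ($file, @command) = @_;
    open(my $out, '-|', @command, $file)
        or die "Cannot run $command[0]: $!\n";
    local $/;                          # read all output in one go
    my $text = <$out> // '';
    close($out)
        or die "$command[0] failed on $file (exit status " . ($? >> 8) . ")\n";
    return $text;
}

# Convert every file named on the command line with catdoc.
print text_from_command($_, 'catdoc') for @ARGV;
```

Because the command is passed in as a list, the same helper could be pointed at another converter that prints to standard output, for example text_from_command($file, 'antiword').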

--
(stephan paul) Arif Sahari Wibowo
_____ _____ _____ _____
/____ /____/ /____/ /____
_____/ / / / _____/ http://www.arifsaha.com/

David Combs

May 29, 2008, 9:52:29 PM
In article <g09frv$9i$1...@ml.accsnet.ne.jp>,

I wonder how the result compares to the (non-perl, I think)
program "antiword"?

Any experiences?

David


David Combs

May 29, 2008, 9:54:59 PM
In article <alpine.OSX.1.10.0...@imac2006.local>,

Has anyone compared these two with "antiword" as well?

Thanks

David


backpack

Jun 3, 2008, 5:29:00 PM
I actually just ended up using Python with the win32 extensions. The
only downside is that you have to work on a Windows machine with
Microsoft Word installed. At first I tried antiword, and it partially
worked: it converted about a third of the documents (30,000 of them),
but I'm not sure exactly why it didn't convert the rest. antiword
claimed the remaining documents weren't Word docs, though they
obviously were. Maybe an unsupported version of Word? For the hell of
it I also tried changing the file extensions to .rtf and running UnRTF,
but that didn't work either. The point of the original post was that I
was looking for a Perl module I could run in a Linux environment, which
renders Win32::OLE useless. Should've specified...
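
One way to investigate files that antiword rejects is to look at their first few bytes rather than at the extension. The sketch below is only a diagnostic aid under stated assumptions: sniff_format is a hypothetical helper, and it recognises just three well-known signatures (the OLE2 compound-file header used by binary .doc files, the ZIP header used by .docx, and the "{\rtf" prefix of RTF). Anything else, including very old Word formats, is reported as unknown.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: guess a document's container format from its
# leading bytes, ignoring the file extension.
sub sniff_format {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!\n";
    my $head = '';
    read($fh, $head, 8);               # the longest signature is 8 bytes
    close($fh);
    # OLE2 compound file: the container used by binary .doc files
    return 'ole2' if $head eq "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";
    # ZIP local file header: .docx files are ZIP archives
    return 'zip'  if $head =~ /\APK\x03\x04/;
    # RTF documents start with {\rtf
    return 'rtf'  if $head =~ /\A\{\\rtf/;
    return 'unknown';
}

# Report the guessed format of every file named on the command line.
print "$_: ", sniff_format($_), "\n" for @ARGV;
```

Running this over the documents that failed would show whether they are really binary Word files or something else saved with a .doc extension.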


On May 29, 9:54 pm, dkco...@panix.com (David Combs) wrote:
> In article <alpine.OSX.1.10.0805170909340.5...@imac2006.local>,

Ben Bullock

Jun 20, 2008, 9:25:33 AM
On Fri, 30 May 2008 01:52:29 +0000, David Combs wrote:

> In article <g09frv$9i$1...@ml.accsnet.ne.jp>,
> Ben Bullock <benkasmi...@gmail.com> wrote:

>>But deciphering these is only necessary if you want to read the
>>files on a non-Windows system or you don't have a working copy of
>>Microsoft Word. If you have Microsoft Word available, the most
>>sensible thing to do is to use Win32::OLE. Here is a quick program
>>to save a Word file as text:

> I wonder how the result compares to the (non-perl, I think)
> program "antiword"?

If antiword does something drastically different from Microsoft Word
itself, then that would be a bug. But the script I posted has faults
such as the failure to save the text in text boxes. To do that requires
further work.