
Code to recognize MS-Word document files?


Grant Edwards

Mar 4, 2003, 11:26:43 AM
I'm looking for a snippet of python that I can use to determine
if a file is a MS-Word document. People around here seem to
have gotten into the habit of attaching MS-Word files without a
".doc" on the name.

Even with the .doc, the mimetypes module doesn't seem to get it
right. :(

--
Grant Edwards                   grante             Yow!  Hey, LOOK!! A pair of
                                  at               SIZE 9 CAPRI PANTS!! They
                               visi.com            probably belong to SAMMY
                                                   DAVIS, JR.!!
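
A note on the mimetypes complaint: as documented, `mimetypes.guess_type` looks only at the file name (or URL) and never opens the file, so an attachment with no extension gives it nothing to work with. A minimal sketch, with made-up file names:

```python
import mimetypes

# guess_type() maps the name's extension to a type; the file is never read,
# and these two files do not need to exist.
with_ext = mimetypes.guess_type("report.doc")
without_ext = mimetypes.guess_type("report")

print(with_ext)     # result depends on the platform's type map
print(without_ext)  # no extension, so there is nothing to guess from
```

Recognizing an unnamed attachment therefore has to rely on the file's contents rather than its name.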

"Martin v. Löwis"

Mar 4, 2003, 11:51:53 AM
Grant Edwards wrote:
> I'm looking for a snippet of python that I can use to determine
> if a file is a MS-Word document. People around here seem to
> have gotten into the habit of attaching MS-Word files without a
> ".doc" on the name.

The GNU file command can do this recognition, at least partially. I'm not
aware of a Python wrapper around it, but it shouldn't be too difficult.

GNU file will report MS-Word files as "Microsoft Office Document".
Whether it is possible to infer them as "Word", I don't know.

Regards,
Martin
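
One way such a wrapper could look, as a rough sketch rather than an existing module: it shells out to the `file` command and searches the description for the string mentioned above. The names `describe_file`, `looks_like_office_document`, and `file_cmd` are hypothetical, and the exact wording that `file` prints depends on its version and magic database, so the substring test is an assumption.

```python
import shutil
import subprocess

def describe_file(path, file_cmd="file"):
    """Return the description printed by the `file` command, or None."""
    exe = shutil.which(file_cmd)
    if exe is None:
        return None  # no `file` command on this machine (e.g. plain win32)
    # -b ("brief") leaves the "filename:" prefix out of the output.
    result = subprocess.run(
        [exe, "-b", path], capture_output=True, text=True, errors="replace"
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()

def looks_like_office_document(path):
    """True if `file` describes the path as a Microsoft Office document."""
    description = describe_file(path)
    # Assumption: the description contains this exact phrase.
    return description is not None and "Microsoft Office Document" in description
```

Calling `looks_like_office_document("attachment.bin")` returns False when `file` is not installed, so this only helps on systems that ship the command.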

Grant Edwards

Mar 4, 2003, 11:59:40 AM
In article <b42lj9$b2q$05$2...@news.t-online.com>, Martin v. Löwis wrote:

>> I'm looking for a snippet of python that I can use to determine
>> if a file is a MS-Word document.
>

> The GNU file command can do this recognition, at least partially.

I should have thought of that!

> I'm not aware of a Python wrapper around it, but it shouldn't
> be too difficult.

The code needs to run on a win32 platform, but I can probably
glean enough information from /usr/share/magic to figure it
out.

> GNU file will report MS-Word files as "Microsoft Office Document".
> Whether it is possible to infer them as "Word", I don't know.

Almost all of the "Office" documents that I receive are from
word, so it's good enough for a first order solution.

--
Grant Edwards                   grante             Yow!  .. my NOSE is NUMB!
                                  at
                               visi.com

WP

Mar 4, 2003, 12:01:33 PM
Grant Edwards wrote:
> I'm looking for a snippet of python that I can use to determine
> if a file is a MS-Word document. People around here seem to
> have gotten into the habit of attaching MS-Word files without a
> ".doc" on the name.
>
> Even with the .doc, the mimetypes module doesn't seem to get it
> right. :(
>

Extremely Bogus Hack: (probably no help, sorry!)

f = open(wordfilename, 'rb')
str1 = f.read(10)  # arbitrary # of bytes
f.close()
hex = ''
for i in bytearray(str1):  # bytearray yields integers on Python 2 and 3
    hex = hex + ('%02x ' % i)
if hex == 'd0 cf 11 e0 a1 b1 1a e1 00 00 ':
    print('match')
# My Word files all seem to start like this
# (MS Office 2000 Word files):
# d0 cf 11 e0 a1 b1 1a e1 00 00 00 00 00 00
# 00 00 00 00 00 00 00 00 00 00
# Your mileage may vary.

Warren
--
--------------------------------------
warren...@adaptivenetworks.on.ca
Toronto Ontario Canada


WP

Mar 4, 2003, 12:02:50 PM
WP wrote:
>
> Extremely Bogus Hack: (probably no help, sorry!)
>
> f=open(wordfilename,'rb')
> str1=f.read(10) # arbitrary # of bytes
> for i in str1:
> hex = hex + ( '%02x ' % Ord(i) )
> if hex='d0 cf 11 e0 a1 b1 1a e1 00 00 ':
> print 'match'
I meant if hex==
This is what always happens when one types something pedantic and doesn't try
running it first.
Probably this code still doesn't run. (That's how bogus it is.)

Thomas Wana

Mar 4, 2003, 1:07:52 PM
WP wrote:

> // My word files all seem to start like this
> // (MS Office 2000 WORD files)
> // d0 cf 11 e0 a1 b1 1a e1 00 00 00 00 00 00
> // 00 00 00 00 00 00 00 00 00 00
> // Your mileage may vary.

MS Excel files have the same signature; it seems
that all office documents have it.

(I know that because I had to manually recover
some lost excel files from a vfat partition)

Thomas

>
> Warren
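
Since Word and Excel files share the same leading bytes, telling them apart needs a look further into the file. A crude sketch, under the assumption that the file's internal directory stores stream names as UTF-16LE text and that Word uses a stream called "WordDocument", Excel (97 and later) "Workbook", and PowerPoint "PowerPoint Document". The helper `guess_office_kind` is hypothetical and simply scans the raw bytes; a proper parser would walk the directory instead.

```python
# The eight signature bytes quoted above.
OLE2_SIGNATURE = bytes.fromhex("d0cf11e0a1b11ae1")

# Assumed stream names and the application each one suggests.
STREAM_HINTS = (
    ("WordDocument", "Word"),
    ("Workbook", "Excel"),
    ("PowerPoint Document", "PowerPoint"),
)

def guess_office_kind(data):
    """Guess the producing application from raw file bytes, or None."""
    if not data.startswith(OLE2_SIGNATURE):
        return None  # not this container format at all
    for stream_name, kind in STREAM_HINTS:
        # Plain byte search: a heuristic, not a real directory lookup.
        if stream_name.encode("utf-16-le") in data:
            return kind
    return "unknown OLE2 container"
```

A Word document with an embedded spreadsheet could contain both names; this sketch checks for Word first, which suits the "mostly Word attachments" case but is a judgment call.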

Nick Vargish

Mar 4, 2003, 1:20:40 PM
gra...@visi.com (Grant Edwards) writes:

> Almost all of the "Office" documents that I receive are from
> word, so it's good enough for a first order solution.

That's bad enough, but when you start getting Excel files in e-mail,
it's time to run screaming for the hills.

We get our office phone list in Excel for some reason, and it really
gets my panties in a knot. So to speak.

Nick

--
# sigmask.py || version 0.2 || 2003-01-07 || Feed this to your Python.
print reduce(lambda x,y:x+chr(ord(y)-1),'Ojdl!Wbshjti!=obwAqbusjpu/ofu?','')

Grant Edwards

Mar 4, 2003, 1:41:00 PM
In article <10467994...@newsmaster-03.atnet.at>, Thomas Wana wrote:
> WP wrote:
>
>> // My word files all seem to start like this
>> // (MS Office 2000 WORD files)
>> // d0 cf 11 e0 a1 b1 1a e1 00 00 00 00 00 00
>> // 00 00 00 00 00 00 00 00 00 00
>> // Your mileage may vary.

Yup. That looks like one of the recipes in /usr/share/magic:

0       string  \376\067\0\043              Microsoft Office Document
0       string  \320\317\021\340\241\261    Microsoft Office Document
0       string  \333\245-\0\0\0             Microsoft Office Document

The second one seems to match the files I've got lying around
also.

--
Grant Edwards                   grante             Yow!  FEELINGS are
                                  at               cascading over me!!!
                               visi.com
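
The three magic recipes above translate almost directly into Python, which would also run on win32 without the `file` command. A minimal sketch: the octal escapes are copied from the recipes into bytes literals, and the names `OFFICE_MAGIC` and `matches_office_magic` are made up for illustration.

```python
# The three "0 string ..." patterns, written as bytes literals.
# Python bytes literals accept the same octal escapes as the magic file.
OFFICE_MAGIC = (
    b"\376\067\0\043",
    b"\320\317\021\340\241\261",   # d0 cf 11 e0 a1 b1
    b"\333\245-\0\0\0",
)

def matches_office_magic(path):
    """True if the file starts with one of the patterns (all at offset 0)."""
    longest = max(len(magic) for magic in OFFICE_MAGIC)
    with open(path, "rb") as f:
        head = f.read(longest)
    # bytes.startswith() accepts a tuple of candidate prefixes.
    return head.startswith(OFFICE_MAGIC)
```

As with the `file` output, a match only says "some Office document", not specifically Word.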

Christos TZOTZIOY

Mar 4, 2003, 6:46:06 PM
On Tue, 04 Mar 2003 17:51:53 +0100, rumours say that "Martin v. Löwis"
<mar...@v.loewis.de> might have written:

>The GNU file command can do this recognition, at least partially. I'm not
>aware of a Python wrapper around it, but it shouldn't be too difficult.

I have a module magic.py that is available from
<URL:http://www.sil-tec.gr/~tzot/python/>. It provides a file_magic
function that uses a copy of /etc/magic (or /usr/share/magic on Linux,
I think). The code needs cleaning, but is usable (the only functionality
I did not implement is offset > 0 peeking). I also have this file.py in
my win2k path:

import sys, os
from tzot.magic import file_magic
from glob import glob

# Expand each command-line pattern and report what every match looks like.
for arg in sys.argv[1:]:
    for filename in glob(arg):
        if os.path.isdir(filename):
            print("%s: folder" % filename)
        else:
            print("%s: %s" % (filename, file_magic(filename)))

Usual disclaimers apply.
--
TZOTZIOY, I speak England very best,
bo...@sil-tec.gr
(I'm a postmaster luring spammers; please spam me!
...and my users won't ever see your messages again...)

Christos TZOTZIOY

Mar 4, 2003, 6:50:43 PM
On 04 Mar 2003 16:59:40 GMT, rumours say that gra...@visi.com (Grant
Edwards) might have written:

>The code needs to run on a win32 platform, but I can probably
>glean enough information from /usr/share/magic to figure it
>out.

I think all MS Office documents have "0xd0" as their first byte and
"0xc?" as their second, where ? is a digit I don't remember, mimicking
"doc" in hex-talk (0xd0c?)

Mark Hammond

Mar 5, 2003, 1:56:51 AM
Grant Edwards wrote:
> I'm looking for a snippet of python that I can use to determine
> if a file is a MS-Word document. People around here seem to
> have gotten into the habit of attaching MS-Word files without a
> ".doc" on the name.
>
> Even with the .doc, the mimetypes module doesn't seem to get it
> right. :(

This code, from ppw32, will extract information from MS Office
documents. I'm not sure what other properties are available.

Mark.

# DumpStorage.py - Dumps some user defined properties
# of a COM Structured Storage file.

import pythoncom
from win32com import storagecon  # constants related to storage functions.

# These come from ObjIdl.h
FMTID_UserDefinedProperties = "{F29F85E0-4FF9-1068-AB91-08002B27B3D9}"

PIDSI_TITLE = 0x00000002
PIDSI_SUBJECT = 0x00000003
PIDSI_AUTHOR = 0x00000004
PIDSI_CREATE_DTM = 0x0000000c

def PrintStats(filename):
    if not pythoncom.StgIsStorageFile(filename):
        print("The file is not a storage file!")
        return
    # Open the file.
    flags = storagecon.STGM_READ | storagecon.STGM_SHARE_EXCLUSIVE
    stg = pythoncom.StgOpenStorage(filename, None, flags)

    # Now see if the storage object supports Property Information.
    try:
        pss = stg.QueryInterface(pythoncom.IID_IPropertySetStorage)
    except pythoncom.com_error:
        print("No summary information is available")
        return
    # Open the user defined properties.
    ps = pss.Open(FMTID_UserDefinedProperties)
    props = PIDSI_TITLE, PIDSI_SUBJECT, PIDSI_AUTHOR, PIDSI_CREATE_DTM
    data = ps.ReadMultiple(props)
    # Unpack the result into the items.
    title, subject, author, created = data
    print("Title:", title)
    print("Subject:", subject)
    print("Author:", author)
    print("Created:", created.Format())

if __name__ == '__main__':
    import sys
    if len(sys.argv) < 2:
        print("Please specify a file name")
    else:
        PrintStats(sys.argv[1])
