Even with the .doc extension, the mimetypes module doesn't seem to get it
right. :(
--
Grant Edwards                   grante             Yow!  Hey, LOOK!! A pair of
                                  at               SIZE 9 CAPRI PANTS!! They
                               visi.com            probably belong to SAMMY
                                                   DAVIS, JR.!!
The GNU file command can do this recognition, at least partially. I'm not
aware of a Python wrapper around it, but it shouldn't be too difficult.
GNU file will report MS-Word files as "Microsoft Office Document".
Whether it is possible to infer them as "Word", I don't know.
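A wrapper can be as thin as a subprocess call. The sketch below is only that, a sketch: it assumes a `file` binary is on the PATH, uses its `-b` (brief) option, and is written for current Python 3. The helper names `file_description` and `is_office_description` are made up for illustration, and the exact description text depends on the version of `file` and its magic database, so the marker string is a parameter.

```python
import subprocess


def file_description(path, file_cmd="file"):
    """Return the description printed by `file -b path`, or None on failure."""
    try:
        result = subprocess.run(
            [file_cmd, "-b", path],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # The command is missing or exited with an error.
        return None
    return result.stdout.strip()


def is_office_description(description, marker="Microsoft Office Document"):
    """True if the description text contains the expected marker."""
    return description is not None and marker in description
```

Usage would be something like `is_office_description(file_description("report.doc"))`, where the file name is of course just an example.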
Regards,
Martin
>> I'm looking for a snippet of python that I can use to determine
>> if a file is a MS-Word document.
>
> The GNU file command can do this recognition, at least partially.
I should have thought of that!
> I'm not aware of a Python wrapper around it, but it shouldn't
> be too difficult.
The code needs to run on a win32 platform, but I can probably
glean enough information from /usr/share/magic to figure it
out.
> GNU file will report MS-Word files as "Microsoft Office Document".
> Whether it is possible to infer them as "Word", I don't know.
Almost all of the "Office" documents that I receive are from
word, so it's good enough for a first order solution.
--
Grant Edwards                   grante             Yow!  .. my NOSE is NUMB!
                                  at
                               visi.com
Extremely Bogus Hack: (probably no help, sorry!)
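f = open(wordfilename, 'rb')
str1 = f.read(10)  # arbitrary # of bytes
f.close()
hexstr = ''  # renamed from "hex" so the builtin isn't shadowed
for b in bytearray(str1):  # bytearray yields ints on Python 2 and 3
    hexstr = hexstr + ('%02x ' % b)
if hexstr == 'd0 cf 11 e0 a1 b1 1a e1 00 00 ':
    print('match')
// My word files all seem to start like this
// (MS Office 2000 WORD files)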
// d0 cf 11 e0 a1 b1 1a e1 00 00 00 00 00 00
// 00 00 00 00 00 00 00 00 00 00
// Your mileage may vary.
Warren
--
--------------------------------------
warren...@adaptivenetworks.on.ca
Toronto Ontario Canada
> // My word files all seem to start like this
> // (MS Office 2000 WORD files)
> // d0 cf 11 e0 a1 b1 1a e1 00 00 00 00 00 00
> // 00 00 00 00 00 00 00 00 00 00
> // Your mileage may vary.
MS Excel files have the same signature; it seems
that all Office documents have it.
(I know that because I had to manually recover
some lost Excel files from a vfat partition.)
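Since the signature only says "this is an OLE2 compound file", telling Word from Excel means looking at the stream names stored inside the container. As far as I know, Word's main stream is called "WordDocument" and Excel's is "Workbook", and the directory stores names as UTF-16LE. The sketch below scans the raw bytes for those names instead of parsing the directory properly, so treat it as a heuristic (an Excel file with an embedded Word object would fool it). `guess_office_kind` is a made-up name, and the code is written for Python 3.

```python
OLE2_SIGNATURE = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"

# Stream names as they would appear in the compound file directory (UTF-16LE).
# Order matters: the first marker found wins.
STREAM_MARKERS = [
    ("Word", "WordDocument".encode("utf-16-le")),
    ("Excel", "Workbook".encode("utf-16-le")),
]


def guess_office_kind(data):
    """Guess the producing application from the raw bytes of a file.

    Returns "Word", "Excel", "Office (unknown)", or None when the
    OLE2 signature is missing.
    """
    if not data.startswith(OLE2_SIGNATURE):
        return None
    for kind, marker in STREAM_MARKERS:
        if marker in data:
            return kind
    return "Office (unknown)"
```

Feed it the file contents, e.g. `guess_office_kind(open(path, "rb").read())`.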
Thomas
>
> Warren
> Almost all of the "Office" documents that I receive are from
> word, so it's good enough for a first order solution.
That's bad enough, but when you start getting Excel files in e-mail,
it's time to run screaming for the hills.
We get our office phone list in Excel for some reason, and it really
gets my panties in a knot. So to speak.
Nick
--
# sigmask.py || version 0.2 || 2003-01-07 || Feed this to your Python.
print reduce(lambda x,y:x+chr(ord(y)-1),'Ojdl!Wbshjti!=obwAqbusjpu/ofu?','')
Yup. That looks like one of the recipes in /usr/share/magic:
0       string  \376\067\0\043             Microsoft Office Document
0       string  \320\317\021\340\241\261   Microsoft Office Document
0       string  \333\245-\0\0\0            Microsoft Office Document
The second one also seems to match the files I've got lying
around.
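Those entries are easy to use without the file binary, which matters on win32. A sketch, assuming Python 3 and assuming the escapes in the three entries above are plain octal byte values (that is how I read the magic format); `MAGIC_STRINGS`, `decode_magic` and `matches_office_magic` are names invented here.

```python
import re

# The three "Microsoft Office Document" test strings quoted above.
MAGIC_STRINGS = [
    r"\376\067\0\043",
    r"\320\317\021\340\241\261",
    r"\333\245-\0\0\0",
]


def decode_magic(text):
    """Turn a magic string with \\ooo octal escapes into raw bytes."""
    out = bytearray()
    pos = 0
    for match in re.finditer(r"\\([0-7]{1,3})", text):
        # Copy any literal characters that precede this escape.
        out += text[pos:match.start()].encode("latin-1")
        # Convert the octal digits to a single byte value.
        out.append(int(match.group(1), 8))
        pos = match.end()
    out += text[pos:].encode("latin-1")
    return bytes(out)


SIGNATURES = [decode_magic(s) for s in MAGIC_STRINGS]


def matches_office_magic(data):
    """True if data (the first bytes of a file) starts with any signature."""
    return any(data.startswith(sig) for sig in SIGNATURES)
```

Reading the first 16 bytes or so of the file and passing them to `matches_office_magic` is enough, since all three tests are at offset 0.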
--
Grant Edwards                   grante             Yow!  FEELINGS are
                                  at               cascading over me!!!
                               visi.com
>The GNU file command can do this recognition, at least partially. I'm not
>aware of a Python wrapper around it, but it shouldn't be too difficult.
I have a module, magic.py, that is available from
<URL:http://www.sil-tec.gr/~tzot/python/>. It provides a file_magic
function that uses a copy of /etc/magic (or /usr/share/magic, I think, on
Linux). The code needs cleaning, but is usable (the only functionality
I did not implement is peeking at offsets > 0). I also have this file.py in
my win2k path:
import sys, os
from tzot.magic import file_magic
from glob import glob

for arg in sys.argv[1:]:
    for filename in glob(arg):
        if os.path.isdir(filename):
            print("%s: folder" % filename)
        else:
            print("%s: %s" % (filename, file_magic(filename)))
Usual disclaimers apply.
--
TZOTZIOY, I speak England very best,
bo...@sil-tec.gr
(I'm a postmaster luring spammers; please spam me!
...and my users won't ever see your messages again...)
>The code needs to run on a win32 platform, but I can probably
>glean enough information from /usr/share/magic to figure it
>out.
I think all MS Office documents have first byte "0xd0" and second
byte "0xc?", where ? means I don't remember, simulating hex-talk for
"doc" (0xd0c?).
This code, from ppw32, will extract information from MS Office
documents. I'm not sure what other properties are available.
Mark.
# DumpStorage.py - Dumps some user defined properties
# of a COM Structured Storage file.
import pythoncom
from win32com import storagecon # constants related to storage functions.

# These come from ObjIdl.h
# (This GUID is the one the Windows SDK names FMTID_SummaryInformation.)
FMTID_UserDefinedProperties = "{F29F85E0-4FF9-1068-AB91-08002B27B3D9}"
PIDSI_TITLE = 0x00000002
PIDSI_SUBJECT = 0x00000003
PIDSI_AUTHOR = 0x00000004
PIDSI_CREATE_DTM = 0x0000000c

def PrintStats(filename):
    if not pythoncom.StgIsStorageFile(filename):
        print("The file is not a storage file!")
        return
    # Open the file.
    flags = storagecon.STGM_READ | storagecon.STGM_SHARE_EXCLUSIVE
    stg = pythoncom.StgOpenStorage(filename, None, flags)
    # Now see if the storage object supports Property Information.
    try:
        pss = stg.QueryInterface(pythoncom.IID_IPropertySetStorage)
    except pythoncom.com_error:
        print("No summary information is available")
        return
    # Open the user defined properties.
    ps = pss.Open(FMTID_UserDefinedProperties)
    props = PIDSI_TITLE, PIDSI_SUBJECT, PIDSI_AUTHOR, PIDSI_CREATE_DTM
    data = ps.ReadMultiple(props)
    # Unpack the result into the items.
    title, subject, author, created = data
    print("Title: %s" % title)
    print("Subject: %s" % subject)
    print("Author: %s" % author)
    print("Created: %s" % created.Format())

if __name__ == '__main__':
    import sys
    if len(sys.argv) < 2:
        print("Please specify a file name")
    else:
        PrintStats(sys.argv[1])