Reading corrupt Excel Files with xlrd

2,643 views
Skip to first unread message

jbell

unread,
Sep 6, 2007, 2:36:46 PM9/6/07
to python-excel
Here is a list of the debug errors that I get when I try to read the
corrupted xls files with xlrd:

WARNING *** file size (109919) not 512 + multiple of sector size mem
data: 109407 bytes == 0 sectors SAT_tot_secs=538976258,
dir_first_sec_sid=538976468,
min_size_std_stream=538972192
SSAT_first_sec_sid=538976466, SSAT_tot_secs=538976257
MSAT_first_sec_sid=-2, MSAT_tot_secs=538976288
MSAT: len = 109
WARNING *** File is truncated, or OLE2 MSAT is corrupt!!
INFO: Trying to access sector 538976288 but only 0 available OLE2
stream 'directory': sector allocation table invalid entry (538976468)

When I resave the file in Excel or OpenOffice xlrd is able to read the
file.

Is there any other way to "fix" thes xls file before xlrd reads it?

Thanks in advance,


Jeff

Message has been deleted

John Machin

unread,
Sep 6, 2007, 5:58:04 PM9/6/07
to python-excel

On Sep 7, 4:36 am, jbell <nagual3....@gmail.com> wrote:
> Here is a list of the debug errors that I get when I try to read the
> corrupted xls files with xlrd:
>
> WARNING *** file size (109919) not 512 + multiple of sector size mem

There is output (one-and-a-bit lines) missing between "sector size"
and "mem data". Please copy/paste from a file; don't re-type.

> data: 109407 bytes == 0 sectors SAT_tot_secs=538976258,
> dir_first_sec_sid=538976468,
> min_size_std_stream=538972192
> SSAT_first_sec_sid=538976466, SSAT_tot_secs=538976257
> MSAT_first_sec_sid=-2, MSAT_tot_secs=538976288
> MSAT: len = 109
> WARNING *** File is truncated, or OLE2 MSAT is corrupt!!
> INFO: Trying to access sector 538976288 but only 0 available
> OLE2
> stream 'directory': sector allocation table invalid entry (538976468)

You are certainly correct when you describe it as corrupt. It looks
like something has sprayed spaces around it at some stage --
MSAT_tot_secs=538976288, which is the integer interpretation of four
space bytes:

| >>> import struct; struct.unpack('i', ' ' * 4)
| (538976288,)
| >>>

The default (and 99.9999% usual) value for min_size_std_stream is 4096
(bytes); 538972192 is 538976288 MINUS 4096. The mind boggles.

What created [or last saved] this/these file(s)? What has been
molesting it/them since?

>
> When I resave the file in Excel or OpenOffice xlrd is able to read the
> file.

You don't mention whether what you see when those programs silently
open the file is (a) empty worksheet (b) utter garbage (c) some of
what you expect but not all (d) what you expect but parts are mangled
(e) exactly and completely what you expect (f) you don't know what to
expect (g) you don't know (h) none of the above.

> Is there any other way to "fix" thes xls file before xlrd reads it?

Your subject says "files"; is "thes file" one or many?

If by "fix", you mean recover the data from the file:
(a) It may be possible that overriding the sector size and changing
one or more exceptions to warnings and blithely continuing [mimicking
the MO of MS etc] will do the trick, but I doubt it.
(b) It may be possible after examining the file to patch up the
corrupted header info with a hex editor
(c) There may be free/commercial data recovery software/services out
there ... google("fix corrupt excel file") or somesuch.

In any case, please send me a copy of one or more of your files -- at
the very least I would want to spruce up xlrd's corruption detection
and avoidance [it could for example put out more warning messages
about some of those preposterous numbers, especially the sector size
which you omitted but must be huge because the calculated number of
sectors is zero (mem data: 109407 bytes == 0 sectors)].
If I can do (a) above, I will. If (b) seems possible, I'll give you a
free quote.

HTH,
John


jbell

unread,
Sep 7, 2007, 10:14:19 AM9/7/07
to python-excel
What created [or last saved] this/these file(s)? What has been
molesting it/them since?

It is a third party program that exports an Excel report.

So far all of the files that this program exports are corrupt.

I'm trying to make a converter that extracts the data from these files
and puts it into csv format. .

> When I resave the file in Excel or OpenOffice xlrd is able to read the
> file.

You don't mention whether what you see when those programs silently
open the file is (a) empty worksheet (b) utter garbage (c) some of
what you expect but not all (d) what you expect but parts are mangled
(e) exactly and completely what you expect (f) you don't know what to
expect (g) you don't know (h) none of the above.

When I open in Excel or OpenOffice it looks like (e) exactly and
completely what you expect. When I resave in Excel or OpenOffice it
fixes the file and xlrd can then read it without any problem.

> Is there any other way to "fix" thes xls file before xlrd reads it?

Your subject says "files"; is "thes file" one or many?

At least several hundred files.

If by "fix", you mean recover the data from the file:
(a) It may be possible that overriding the sector size and changing
one or more exceptions to warnings and blithely continuing [mimicking
the MO of MS etc] will do the trick, but I doubt it.
(b) It may be possible after examining the file to patch up the
corrupted header info with a hex editor
(c) There may be free/commercial data recovery software/services out
there ... google("fix corrupt excel file") or somesuch.

In any case, please send me a copy of one or more of your files -- at
the very least I would want to spruce up xlrd's corruption detection
and avoidance [it could for example put out more warning messages
about some of those preposterous numbers, especially the sector size
which you omitted but must be huge because the calculated number of
sectors is zero (mem data: 109407 bytes == 0 sectors)].
If I can do (a) above, I will. If (b) seems possible, I'll give you a
free quote.

I've uploaded an example of a corrupt file named "compliance
corrupt.xls". Let me know if you need more examples.

Thanks,

Jeff


jbell

unread,
Sep 7, 2007, 10:38:39 AM9/7/07
to python-excel
I uploaded a second file "compliance corrupt.xls (2)" which is a more
typical example of the errors that I'm getting.

Here is the debug output for this file:

CompDoc format: version=0x0003 revision=0x003e
WARNING *** file size (72806) not 512 + multiple of sector size (512)
sec sizes 9 6 512 64
mem data: 72294 bytes == 141 sectors
SAT_tot_secs=2, dir_first_sec_sid=1, min_size_std_stream=4096
SSAT_first_sec_sid=2, SSAT_tot_secs=1
MSAT_first_sec_sid=-2, MSAT_tot_secs=0
MSAT: len = 109
[0, 128, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1]
SAT [-3, -2, -2, 4, 5, 6, 7, 8, 9, 10, 20, 12, 13, 14, 15, 16, 17, 18,
19, 3, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70,
71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87,
88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103,
104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117,
118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 129, -3, 130, 131,
132, 133, 134, 135, 136, 137, 138, 139, 140, 141, -2, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1]

DID=0 name=u'Root Entry' etype=5 DIDs(left=-1 right=-1 root=1
parent=-1 kids=[1]) first_SID=-2 tot_size=0
DID=1 name=u'WORKBOOK' etype=2 DIDs(left=-1 right=-1 root=-1 parent=0
kids=[]) first_SID=11 tot_size=70246
DID=2 name=u'' etype=0 DIDs(left=-1 right=-1 root=-1 parent=-1
kids=[]) first_SID=0 tot_size=0
DID=3 name=u'' etype=0 DIDs(left=-1 right=-1 root=-1 parent=-1
kids=[]) first_SID=0 tot_size=0
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-
zero
SSAT []

I know this error doesn't look like the one I originally posted.

Maybe easier to fix?

Thanks,

Jeff

jbell

unread,
Sep 7, 2007, 2:18:41 PM9/7/07
to python-excel
I think I may have answered my own question, but I don't understand
why it works.

If I change line number 605 in formatting.py from:

name = unpack_unicode(data, 2, lenlen=2)

to:

name = unpack_unicode(data, 2, lenlen=1)

It is able to read the xls file and extract the data, but it still
gives warnings which is acceptable.

Thank you very much for your help.


Jeff

John Machin

unread,
Sep 7, 2007, 8:51:39 PM9/7/07
to python-excel

That file is irretrievably stuffed. In the first 24KB (approx), every
occurrence of a null byte ("\x00") has been changed to a space
("\x20").

I thought maybe it would be recoverable by throwing away the OLE
header stuff (the first 0x600 bytes) AND hoping that the remainder was
written in ascending sector order (i.e. not like a snakes-and-ladders
game, needing the (corrupted) sector allocation table to unscramble
it) AND treating it like a pre-Excel-5.0 file, but alas the molester
got way past the 0x600 point.

Contrary to what you said, when one tries to open it with UI programs,
these are the results:
Excel (and Excel viewer) say "can't access", maybe read-only or server
not responding
Openoffice opens it as a word-processing document
Gnumeric says "unsupported format"

Inspection reveals that the last saving user is recorded as "Calc"
i.e. it was last saved by the OpenOffice.org Calc program -- again
contrary to what you said.

I have added a couple of tests on sector sizes; here are the results
for your file:

WARNING: sector size (2**8201) is preposterous; assuming 512 and
continuing ...
WARNING: short stream sector size (2**8198) is preposterous; assuming
64 and continuing ...

John Machin

unread,
Sep 7, 2007, 9:34:16 PM9/7/07
to python-excel

On Sep 8, 12:38 am, jbell <nagual3....@gmail.com> wrote:
> I uploaded a second file "compliance corrupt.xls (2)" which is a more
> typical example of the errors that I'm getting.
>
> Here is the debug output for this file:
>
> CompDoc format: version=0x0003 revision=0x003e
> WARNING *** file size (72806) not 512 + multiple of sector size (512)
> sec sizes 9 6 512 64
> mem data: 72294 bytes == 141 sectors
> SAT_tot_secs=2, dir_first_sec_sid=1, min_size_std_stream=4096
> SSAT_first_sec_sid=2, SSAT_tot_secs=1
> MSAT_first_sec_sid=-2, MSAT_tot_secs=0
> MSAT: len = 109

[snip utterly irrelevant details of OLE compound document header guff]

> I know this error doesn't look like the one I originally posted.

What error? I see no error here, only two warnings.

As I asked earlier, post the ACTUAL untrammeled output that you got.
This would have been something very much like this:

WARNING *** file size (72806) not 512 + multiple of sector size (512)

WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-
zero

*** Open failed ***
Traceback (most recent call last):
File "\python25\scripts\runxlrd.py", line 284, in main
formatting_info=fmt_opt,
File "c:\python25\Lib\site-packages\xlrd\__init__.py", line 386, in
open_workbook
bk.parse_globals()
File "c:\python25\Lib\site-packages\xlrd\__init__.py", line 1284, in
parse_globals
self.handle_style(data)
File "c:\python25\Lib\site-packages\xlrd\formatting.py", line 605,
in handle_style


name = unpack_unicode(data, 2, lenlen=2)

File "c:\python25\Lib\site-packages\xlrd\biffh.py", line 241, in
unpack_unicode
options = ord(data[pos])
IndexError: string index out of range

The warnings are indicative of your third-party software inheriting
sloppiness from the perl spreadsheet package :-)

What's causing xlrd to crash here is:

There are 6 built-in style records that appear by convention in any
spreadsheet.
Their data is 4 bytes of the form 0x8fff 0xnn 0xvv
where 0xfff is an XF number (yours are all zero, which is silly)
and 0xnn is one of 3,6,4,7,5,0 [0 means Normal]
and 0xvv is a level number (255 by convention; yours are zero)
and the 0x8000 bit being set indicates that the style is built-in.

However your 6th record, intended to be the Normal one, the most
important one, contains 0x0000 0x0000.
This actually indicates a user-defined style (not built-in), with a
zero XF number (silly), and a zero length name (pointless) -- but the
0x00 length is not followed by an options byte.

I was under the impression that a Unicode string representation in an
XLS file had to have a options byte anyway (however pointless it might
be), hence there is no soft failure in the unpack_unicode function.

So much for the diagnosis. For a range of palliatives and cures, see
my response to your next message.

John Machin

unread,
Sep 7, 2007, 10:06:05 PM9/7/07
to python-excel
On Sep 8, 4:18 am, jbell <nagual3....@gmail.com> wrote:
> I think I may have answered my own question, but I don't understand
> why it works.

Uh-huh. You have waved a dead chicken at the volcano and it has
stopped erupting :-)

>
> If I change line number 605 in formatting.py from:
>
> name = unpack_unicode(data, 2, lenlen=2)
>
> to:
>
> name = unpack_unicode(data, 2, lenlen=1)

The name-length field on a user-defined STYLE record is TWO bytes, not
ONE. What you have done "works" only by accident, and only in the case
where the name is zero-length, which as I said is pointless -- the
whole purpose of a STYLE record is to supply a name by which a user
can refer to a style aka XF.

Here are the fixes that I'm making:

1. zero-length Unicode allowed not to have an option byte:

def unpack_unicode(data, pos, lenlen=2):
"Return unicode_strg"
nchars = unpack('<' + 'BH'[lenlen-1], data[pos:pos+lenlen])[0]
if not nchars:
# Ambiguous whether 0-length string should have an "options"
byte.
# Avoid crash if missing.
return u""
pos += lenlen


options = ord(data[pos])

and a similar change to the unpack_unicode_update_pos function.

2. Warn about user-defined styles with no names:

# user-defined style
if bv >= 80:


name = unpack_unicode(data, 2, lenlen=2)

else:
name = unpack_string(data, 2, book.encoding, lenlen=1)
if not name:
print >> book.logfile, \
"*** A user-defined style has a zero-length name"

3. First line says it all:-)

SPECIAL_KLUDGE_FOR_JEFF_BELL = 1

def handle_style(book, data):
blah = DEBUG or book.verbosity >= 2
bv = book.biff_version
xf_index, built_in_id, level = unpack('<HBB', data[:4])
if (SPECIAL_KLUDGE_FOR_JEFF_BELL
and data == "\0\0\0\0"
and "Normal" not in book.style_name_map):
built_in = 1
built_in_id = 0
xf_index = 0
name = "Normal"
level = 255
elif xf_index & 0x8000:
# built-in style

HTH,
John

jbell

unread,
Sep 10, 2007, 2:49:00 PM9/10/07
to python-excel

On Sep 7, 10:06 pm, John Machin <sjmac...@lexicon.net> wrote:
> On Sep 8, 4:18 am, jbell <nagual3....@gmail.com> wrote:
>
> > I think I may have answered my own question, but I don't understand
> > why it works.
>
> Uh-huh. You have waved a dead chicken at the volcano and it has
> stopped erupting :-)

I tried your suggestions and xlrd can now read the files.

Thanks again for your great help.


Jeff

Reply all
Reply to author
Forward
0 new messages