Thanks for the testimonial :-)
> While handling a spreadsheet that is sent to me that has emails pasted
> into cells by some arcane process, I get problems with an error
> opening the worksheet. I diagnosed the problem fairly quickly to be
> dodgy unicode in the data
No "unicode" is expected. I presume that you mean UTF-16. Please share
the details of your diagnosis, including what leads you to point the
finger at the data rather than (1) xlrd's failure to understand the data
layout (2) the file-creating software's failure ditto.
Random musings while awaiting evidence: Note that while it's very likely
that random_gibberish.decode('utf8') will raise an exception, that's not
the case for UTF-16 (provided the length in bytes is even!). All there
is to complain about with UTF-16 is a malformed surrogate pair. Perhaps
a valid surrogate pair has been spilt over CONTINUE records ... how does
Excel handle that?
> and a friend pointed out the fix to be to
> change __init__.py in xlrd:
> def unpack_SST_table(datatab, nstrings):
> "Return list of strings"
> :
> :
> while 1:
> charsneed = nchars - charsgot
> if options & 0x01:
> # Uncompressed UTF-16
> charsavail = local_min((datalen - pos) >> 1,
> charsneed)
> rawstrg = data[pos:pos+2*charsavail]
> # if DEBUG: print "SST U16: nchars=%d pos=%d rawstrg=
> %r" % (nchars, pos, rawstrg)
> try:
> accstrg += unicode(rawstrg,
> "utf_16_le",errors='replace') #<------------here
> :
> adding the "errors='replace'" stopped the unicode exception being
> raised when the dodgy unicode is found in the spreadsheet.
Please send a copy of the offending file. Please supply the output that
you got (error message and traceback). What version of xlrd are you
running? What Python? What OS? What software created the file? What is
the actual data in that SST entry? What is the expected data? What
happens when you try to open a copy of the file with (1) Excel (2)
OpenOffice.org's calc (3) Gnumeric: (a) crash (b) opens OK, see nonsense
(c) opens OK, see expected data (d) other? Can you open the file with
one of those UIs, save it, and get an xlrd-readable file? Is this the
only file with this problem? Is the first piece of "dodgy unicode" the
ONLY piece? Does the file or the data originate from an environment
(e.g. Hong Kong) where there's a strong possibility of code points
outside the BMP (and thus requiring surrogates to be expressed in UTF-16)?
> I guess the
> point of this message is that was there a particular reason for
> leaving it as "strict" in this function,
Of course there's a particular reason for "strict": it's so that an
exception is raised if the file is invalid (or I've missed some nuance
of how SST tables are split over CONTINUE records) rather than hushing
it up and pretending all is OK.
> could you add some way of
> choosing the mode (strict, replace, remove) by an init parameter,
Yes, of course, but "could" doesn't imply "would", based on the evidence
so far.
Cheers,
John
and the two-byte sequence there represents a high surrogate and the next
two-byte sequence doesn't represent a low surrogate so splat ...
>
> I used the term "unicode" because of the penultimate line in the
> traceback - I'm not very savvy on character encoding, so I apologise
> if I used the wrong term!
> I assumed a wierd data problem, because I thought that failing to open
> the workbook would be a major problem you'd have come across before if
> it wasn't data related, and because its failing while unpacking the
> SST.
>
>> Please send a copy of the offending file.
> I've sent a file that causes this to you by email since it's 188k and
> potentially commercially sensitive.
>
>> What version of xlrd are you running?
> 0.7.1
>
>> What Python?
> 2.5.4
>
>> What OS?
> The above error was on WinXP but the same thing happens on the Linux
> server
>
>> What software created the file?
> Excel, probably from MS Office 2003 as that's what the company has
>
>> What is the actual data in that SST entry?
> I printed it out and 'rawstrg' is huge, but you can get it easy enough
> from the test sheet I've mailed you!
Yes, at 7736 characters, it's preposterously huge in the context of
being meaningfully displayed as the contents of a single cell by Excel.
After you printed it out, did you read the printout?
When I look at repr(rawstrg), I see that it starts off like this:
'p\x00r\x00o\x00o\x00f\x00 \x00a\x00r\x00e\x00 \x00l\x00o\x00w\x00'
That is the UTF-16LE representation of the ASCII characters 'proof are
low'. The understandable part concludes with a URL that identifies your
organisation.
The remainder looks like random rubbish from a process's memory. Here
are some short strings that you could have observed in the
allegedly-UTF-16-encoded 'rawstrg':
PLE\x00ASE NOTE\xa0: THE
IMPORTF
BODY{\xdd\x8427\x8aqHTML\x87\xb0\x02}
The likelihood of taking meaningful unicode text, encoding it in UTF-16,
and producing bytes that are meaningful ASCII text is in
monkeys/typwriters/works-of-Shakespeare territory.
Of course decoding that under the impression that it is a UTF-16
encoding of something meaningful is not a good idea:
>>> import unicodedata as ucd
>>> [ucd.name(u) for u in 'IMPORT'.decode('utf_16_le')]
['CJK UNIFIED IDEOGRAPH-4D49', 'CJK UNIFIED IDEOGRAPH-4F50', 'CJK
NIFIED IDEOGRAPH-5452']
>>> [ucd.name(u) for u in 'MPORTF'.decode('utf_16_le')]
['CJK UNIFIED IDEOGRAPH-504D', 'CJK UNIFIED IDEOGRAPH-524F', 'CJK
UNIFIED IDEOGRAPH-4654']
>>>
>
>> What happens when you try to open a copy of the file with (1) Excel (2)
>> OpenOffice.org's calc (3) Gnumeric: (a) crash (b) opens OK, see nonsense
>> (c) opens OK, see expected data (d) other?
> I've only got access to Excel, and it opens and shows the data without
> complaining
You didn't distinguish between "see nonsense" and "see expected data".
If you expect to see nonsense, please say so explicitly.
The offending SST entry (index 552) ends up in cell T4. When I select
cell T4, it shows normal-looking emailly stuff, with one row of CJK
ideographs, Hangul syllables, etc at the bottom of the screen,
presumably all it can/will display.
The SST entry with index 558 doesn't cause an exception, but it also has
rubbish at the end; it's in cell T6, and this time there's a lot more
pseudo-East-Asian text visible at the end.
You may wish to investigate what is muddying the water upstream; looks
like a buffer overflow problem to me.
>> Can you open the file with one of those UIs, save it, and get an xlrd-readable file? Is this the
>> only file with this problem?
> I've had the problem in other files where the problem has gone as
> people have edited, but this is the one that always fails, because the
> data is put into it and left there untouched.
>
>> Is the first piece of "dodgy unicode" the
>> ONLY piece? Does the file or the data originate from an environment
>> (e.g. Hong Kong) where there's a strong possibility of code points
>> outside the BMP (and thus requiring surrogates to be expressed in UTF-16)?
> The sheet contains content extracted from emails from all over the
> world, so I'd say that was a yes.
As it turned out, the problem emanated from the cp1252 comfort zone; it
was nothing to do with failed handling of intentional surrogates.
> If there is only one in the file or
> more I cannot say, as it fails on opening the sheet and therefore goes
> no further, unless I hack it to ignore the errors in which case I
> never see them :)
Instead of enumerating failure modes, you might have considered doing
something that would answer the question:
# pseudocode
try:
x = blah.decode('utf_16_le')
except:
print_useful_debugging_info()
x = blah.decode('utf_16_le', 'replace')
# and keep on going
That was the only exception-causing problem; one more record had rubbish
in it.
By the way, I suggest that you don't hack the installed copy of xlrd,
but instead monkey-patch it:
=== your_app.py ===
from struct import unpack
def my_unpack_SST_table(datatab, nstrings):
# your hacked version
import xlrd
xlrd.unpack_SST_table = my_unpack_SST_table
book = xlrd.open_workbook(...)
I'm sending you a script sst_debug.py which monkey-patches a debug
version of that function into xlrd, plus the results of running it on
your test file.
>> Of course there's a particular reason for "strict": it's so that an
>> exception is raised if the file is invalid (or I've missed some nuance
>> of how SST tables are split over CONTINUE records) rather than hushing
>> it up and pretending all is OK.
>>
> Well hopefully if it is the latter the sheet I've sent you will help
> you decide!
It's very definitely crappy data.
Cheers,
John
>
> Thanks for all the effort you've put in. I agree that patching xlrd
> itself is probably not a good idea in the long term, and the above
> technique is what I will do in the app, because it's likely that at
> some point the server will get upgraded/rebuilt and it'll stop working
> again. I'll just have to make sure I keep an eye out for changes in
> unpack_SST_table (although I guess by now that's unlikely).
I'll probably change it a little, but only to provide more information
before re-raising the exception.
> I'm not
> sure how the data gets into the spreadsheets, so a buffer overflow
> problem is entirely possible. I'll let the guys "upstream" know.
> Personally I think handling data from emails via Access and a
> spreadsheet seems a little overcomplicated but it wasn't my
> decision :)
The bits that boggle me are (1) putting a whole email message in one
cell in a spreadsheet (2) how come the users don't notice all the
pseudo-East-Asian gumpf at the end of the messages
> Given that it's crummy data, then my "feature request" is probably no
> longer required, I should get the data sorted out...
Yep, the data can be cruddy even when an exception is not raised.
Good luck with "upstream".
Cheers,
John