Unicode conversion in xlrd

IanKRolfe

unread,

Nov 10, 2009, 8:06:58 AM11/10/09

to python-excel

First of all, thanks for xlrd, it's saved the day for me, I would
never have believed that after a couple of days experimenting I would
find extracting data from an Excel spreadsheet could be so easy!!

While handling a spreadsheet that is sent to me that has emails pasted
into cells by some arcane process, I get problems with an error
opening the worksheet. I diagnosed the problem fairly quickly to be
dodgy unicode in the data and a friend pointed out the fix to be to
change __init__.py in xlrd:

def unpack_SST_table(datatab, nstrings):
"Return list of strings"
:
:
while 1:
charsneed = nchars - charsgot
if options & 0x01:
# Uncompressed UTF-16
charsavail = local_min((datalen - pos) >> 1,
charsneed)
rawstrg = data[pos:pos+2*charsavail]
# if DEBUG: print "SST U16: nchars=%d pos=%d rawstrg=
%r" % (nchars, pos, rawstrg)
try:
accstrg += unicode(rawstrg,
"utf_16_le",errors='replace') #<------------here
:
adding the "errors='replace'" stopped the unicode exception being
raised when the dodgy unicode is found in the spreadsheet. I guess the
point of this message is that was there a particular reason for
leaving it as "strict" in this function, could you add some way of
choosing the mode (strict, replace, remove) by an init parameter, or
is there a better way to do this that I've missed? With the "hack" in
place everything is fine, but I can see that if someone upgrades xlrd
on the server (it's in a Django web application) it's going to get
broken again!

Ta, Ian

John Machin

unread,

Nov 10, 2009, 4:51:52 PM11/10/09

to python...@googlegroups.com

On 11/11/2009 12:06 AM, IanKRolfe wrote:
> First of all, thanks for xlrd, it's saved the day for me, I would
> never have believed that after a couple of days experimenting I would
> find extracting data from an Excel spreadsheet could be so easy!!

Thanks for the testimonial :-)

> While handling a spreadsheet that is sent to me that has emails pasted
> into cells by some arcane process, I get problems with an error
> opening the worksheet. I diagnosed the problem fairly quickly to be
> dodgy unicode in the data

No "unicode" is expected. I presume that you mean UTF-16. Please share
the details of your diagnosis, including what leads you to point the
finger at the data rather than (1) xlrd's failure to understand the data
layout (2) the file-creating software's failure ditto.

Random musings while awaiting evidence: Note that while it's very likely
that random_gibberish.decode('utf8') will raise an exception, that's not
the case for UTF-16 (provided the length in bytes is even!). All there
is to complain about with UTF-16 is a malformed surrogate pair. Perhaps
a valid surrogate pair has been spilt over CONTINUE records ... how does
Excel handle that?

> and a friend pointed out the fix to be to
> change __init__.py in xlrd:

> def unpack_SST_table(datatab, nstrings):
> "Return list of strings"
> :
> :
> while 1:
> charsneed = nchars - charsgot
> if options & 0x01:
> # Uncompressed UTF-16
> charsavail = local_min((datalen - pos) >> 1,
> charsneed)
> rawstrg = data[pos:pos+2*charsavail]
> # if DEBUG: print "SST U16: nchars=%d pos=%d rawstrg=
> %r" % (nchars, pos, rawstrg)
> try:
> accstrg += unicode(rawstrg,
> "utf_16_le",errors='replace') #<------------here
> :
> adding the "errors='replace'" stopped the unicode exception being
> raised when the dodgy unicode is found in the spreadsheet.

Please send a copy of the offending file. Please supply the output that
you got (error message and traceback). What version of xlrd are you
running? What Python? What OS? What software created the file? What is
the actual data in that SST entry? What is the expected data? What
happens when you try to open a copy of the file with (1) Excel (2)
OpenOffice.org's calc (3) Gnumeric: (a) crash (b) opens OK, see nonsense
(c) opens OK, see expected data (d) other? Can you open the file with
one of those UIs, save it, and get an xlrd-readable file? Is this the
only file with this problem? Is the first piece of "dodgy unicode" the
ONLY piece? Does the file or the data originate from an environment
(e.g. Hong Kong) where there's a strong possibility of code points
outside the BMP (and thus requiring surrogates to be expressed in UTF-16)?

> I guess the
> point of this message is that was there a particular reason for
> leaving it as "strict" in this function,

Of course there's a particular reason for "strict": it's so that an
exception is raised if the file is invalid (or I've missed some nuance
of how SST tables are split over CONTINUE records) rather than hushing
it up and pretending all is OK.

> could you add some way of
> choosing the mode (strict, replace, remove) by an init parameter,

Yes, of course, but "could" doesn't imply "would", based on the evidence
so far.

Cheers,
John

IanKRolfe

unread,

Nov 11, 2009, 9:16:30 AM11/11/09

to python-excel

On Nov 10, 9:51 pm, John Machin <sjmac...@lexicon.net> wrote:
> No "unicode" is expected. I presume that you mean UTF-16. Please share
> the details of your diagnosis, including what leads you to point the
> finger at the data rather than (1) xlrd's failure to understand the data
> layout (2) the file-creating software's failure ditto.

Ok, here's what happens:
U:\Feedback>python
Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit
(Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import xlrd
>>> wb=xlrd.open_workbook('Test5.xls',on_demand=True, formatting_info=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\python25\lib\site-packages\xlrd\__init__.py", line 454, in
open_workbook
bk.parse_globals()
File "C:\python25\lib\site-packages\xlrd\__init__.py", line 1473, in
parse_globals
self.handle_sst(data)
File "C:\python25\lib\site-packages\xlrd\__init__.py", line 1446, in
handle_sst
self._sharedstrings = unpack_SST_table(strlist, uniquestrings)
File "C:\python25\lib\site-packages\xlrd\__init__.py", line 1684, in
unpack_SST_table
accstrg += unicode(rawstrg, "utf_16_le")
File "c:\python25\lib\encodings\utf_16_le.py", line 16, in decode
return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position
786-787: illegal encoding
>>>

I used the term "unicode" because of the penultimate line in the
traceback - I'm not very savvy on character encoding, so I apologise
if I used the wrong term!
I assumed a wierd data problem, because I thought that failing to open
the workbook would be a major problem you'd have come across before if
it wasn't data related, and because its failing while unpacking the
SST.

> Please send a copy of the offending file.

I've sent a file that causes this to you by email since it's 188k and
potentially commercially sensitive.

> What version of xlrd are you running?

0.7.1

> What Python?
2.5.4

> What OS?
The above error was on WinXP but the same thing happens on the Linux
server

> What software created the file?

Excel, probably from MS Office 2003 as that's what the company has

> What is the actual data in that SST entry?

I printed it out and 'rawstrg' is huge, but you can get it easy enough
from the test sheet I've mailed you!

> What happens when you try to open a copy of the file with (1) Excel (2)
> OpenOffice.org's calc (3) Gnumeric: (a) crash (b) opens OK, see nonsense
> (c) opens OK, see expected data (d) other?

I've only got access to Excel, and it opens and shows the data without
complaining

> Can you open the file with one of those UIs, save it, and get an xlrd-readable file? Is this the
> only file with this problem?

I've had the problem in other files where the problem has gone as
people have edited, but this is the one that always fails, because the
data is put into it and left there untouched.

> Is the first piece of "dodgy unicode" the
> ONLY piece? Does the file or the data originate from an environment
> (e.g. Hong Kong) where there's a strong possibility of code points
> outside the BMP (and thus requiring surrogates to be expressed in UTF-16)?

The sheet contains content extracted from emails from all over the
world, so I'd say that was a yes. If there is only one in the file or
more I cannot say, as it fails on opening the sheet and therefore goes
no further, unless I hack it to ignore the errors in which case I
never see them :)

> Of course there's a particular reason for "strict": it's so that an
> exception is raised if the file is invalid (or I've missed some nuance
> of how SST tables are split over CONTINUE records) rather than hushing
> it up and pretending all is OK.
>

Well hopefully if it is the latter the sheet I've sent you will help
you decide!

Ian

John Machin

unread,

Nov 11, 2009, 8:09:01 PM11/11/09

to python...@googlegroups.com

and the two-byte sequence there represents a high surrogate and the next
two-byte sequence doesn't represent a low surrogate so splat ...

>
> I used the term "unicode" because of the penultimate line in the
> traceback - I'm not very savvy on character encoding, so I apologise
> if I used the wrong term!
> I assumed a wierd data problem, because I thought that failing to open
> the workbook would be a major problem you'd have come across before if
> it wasn't data related, and because its failing while unpacking the
> SST.
>
>> Please send a copy of the offending file.
> I've sent a file that causes this to you by email since it's 188k and
> potentially commercially sensitive.
>
>> What version of xlrd are you running?
> 0.7.1
>
>> What Python?
> 2.5.4
>
>> What OS?
> The above error was on WinXP but the same thing happens on the Linux
> server
>
>> What software created the file?
> Excel, probably from MS Office 2003 as that's what the company has
>
>> What is the actual data in that SST entry?
> I printed it out and 'rawstrg' is huge, but you can get it easy enough
> from the test sheet I've mailed you!

Yes, at 7736 characters, it's preposterously huge in the context of
being meaningfully displayed as the contents of a single cell by Excel.
After you printed it out, did you read the printout?

When I look at repr(rawstrg), I see that it starts off like this:

'p\x00r\x00o\x00o\x00f\x00 \x00a\x00r\x00e\x00 \x00l\x00o\x00w\x00'

That is the UTF-16LE representation of the ASCII characters 'proof are
low'. The understandable part concludes with a URL that identifies your
organisation.

The remainder looks like random rubbish from a process's memory. Here
are some short strings that you could have observed in the
allegedly-UTF-16-encoded 'rawstrg':

PLE\x00ASE NOTE\xa0: THE

IMPORTF

BODY{\xdd\x8427\x8aqHTML\x87\xb0\x02}

The likelihood of taking meaningful unicode text, encoding it in UTF-16,
and producing bytes that are meaningful ASCII text is in
monkeys/typwriters/works-of-Shakespeare territory.

Of course decoding that under the impression that it is a UTF-16
encoding of something meaningful is not a good idea:

>>> import unicodedata as ucd
>>> [ucd.name(u) for u in 'IMPORT'.decode('utf_16_le')]
['CJK UNIFIED IDEOGRAPH-4D49', 'CJK UNIFIED IDEOGRAPH-4F50', 'CJK
NIFIED IDEOGRAPH-5452']
>>> [ucd.name(u) for u in 'MPORTF'.decode('utf_16_le')]
['CJK UNIFIED IDEOGRAPH-504D', 'CJK UNIFIED IDEOGRAPH-524F', 'CJK
UNIFIED IDEOGRAPH-4654']
>>>

>
>> What happens when you try to open a copy of the file with (1) Excel (2)
>> OpenOffice.org's calc (3) Gnumeric: (a) crash (b) opens OK, see nonsense
>> (c) opens OK, see expected data (d) other?
> I've only got access to Excel, and it opens and shows the data without
> complaining

You didn't distinguish between "see nonsense" and "see expected data".
If you expect to see nonsense, please say so explicitly.

The offending SST entry (index 552) ends up in cell T4. When I select
cell T4, it shows normal-looking emailly stuff, with one row of CJK
ideographs, Hangul syllables, etc at the bottom of the screen,
presumably all it can/will display.

The SST entry with index 558 doesn't cause an exception, but it also has
rubbish at the end; it's in cell T6, and this time there's a lot more
pseudo-East-Asian text visible at the end.

You may wish to investigate what is muddying the water upstream; looks
like a buffer overflow problem to me.

>> Can you open the file with one of those UIs, save it, and get an xlrd-readable file? Is this the
>> only file with this problem?
> I've had the problem in other files where the problem has gone as
> people have edited, but this is the one that always fails, because the
> data is put into it and left there untouched.
>
>> Is the first piece of "dodgy unicode" the
>> ONLY piece? Does the file or the data originate from an environment
>> (e.g. Hong Kong) where there's a strong possibility of code points
>> outside the BMP (and thus requiring surrogates to be expressed in UTF-16)?
> The sheet contains content extracted from emails from all over the
> world, so I'd say that was a yes.

As it turned out, the problem emanated from the cp1252 comfort zone; it
was nothing to do with failed handling of intentional surrogates.

> If there is only one in the file or
> more I cannot say, as it fails on opening the sheet and therefore goes
> no further, unless I hack it to ignore the errors in which case I
> never see them :)

Instead of enumerating failure modes, you might have considered doing
something that would answer the question:
# pseudocode
try:
x = blah.decode('utf_16_le')
except:
print_useful_debugging_info()
x = blah.decode('utf_16_le', 'replace')
# and keep on going

That was the only exception-causing problem; one more record had rubbish
in it.

By the way, I suggest that you don't hack the installed copy of xlrd,
but instead monkey-patch it:

=== your_app.py ===
from struct import unpack

def my_unpack_SST_table(datatab, nstrings):
# your hacked version

import xlrd
xlrd.unpack_SST_table = my_unpack_SST_table

book = xlrd.open_workbook(...)

I'm sending you a script sst_debug.py which monkey-patches a debug
version of that function into xlrd, plus the results of running it on
your test file.

>> Of course there's a particular reason for "strict": it's so that an
>> exception is raised if the file is invalid (or I've missed some nuance
>> of how SST tables are split over CONTINUE records) rather than hushing
>> it up and pretending all is OK.
>>
> Well hopefully if it is the latter the sheet I've sent you will help
> you decide!

It's very definitely crappy data.

Cheers,
John

IanKRolfe

unread,

Nov 12, 2009, 5:14:27 AM11/12/09

to python-excel

On 12 Nov, 01:09, John Machin <sjmac...@lexicon.net> wrote:
>
> === your_app.py ===
> from struct import unpack
>
> def my_unpack_SST_table(datatab, nstrings):
> # your hacked version
>
> import xlrd
> xlrd.unpack_SST_table = my_unpack_SST_table
>
> book = xlrd.open_workbook(...)
>
> I'm sending you a script sst_debug.py which monkey-patches a debug
> version of that function into xlrd, plus the results of running it on
> your test file.
>

Thanks for all the effort you've put in. I agree that patching xlrd
itself is probably not a good idea in the long term, and the above
technique is what I will do in the app, because it's likely that at
some point the server will get upgraded/rebuilt and it'll stop working
again. I'll just have to make sure I keep an eye out for changes in
unpack_SST_table (although I guess by now that's unlikely). I'm not
sure how the data gets into the spreadsheets, so a buffer overflow
problem is entirely possible. I'll let the guys "upstream" know.
Personally I think handling data from emails via Access and a
spreadsheet seems a little overcomplicated but it wasn't my
decision :)

Given that it's crummy data, then my "feature request" is probably no
longer required, I should get the data sorted out...

John Machin

unread,

Nov 12, 2009, 6:37:59 AM11/12/09

to python...@googlegroups.com

On 12/11/2009 9:14 PM, IanKRolfe wrote:

>
> Thanks for all the effort you've put in. I agree that patching xlrd
> itself is probably not a good idea in the long term, and the above
> technique is what I will do in the app, because it's likely that at
> some point the server will get upgraded/rebuilt and it'll stop working
> again. I'll just have to make sure I keep an eye out for changes in
> unpack_SST_table (although I guess by now that's unlikely).

I'll probably change it a little, but only to provide more information
before re-raising the exception.

> I'm not
> sure how the data gets into the spreadsheets, so a buffer overflow
> problem is entirely possible. I'll let the guys "upstream" know.
> Personally I think handling data from emails via Access and a
> spreadsheet seems a little overcomplicated but it wasn't my
> decision :)

The bits that boggle me are (1) putting a whole email message in one
cell in a spreadsheet (2) how come the users don't notice all the
pseudo-East-Asian gumpf at the end of the messages

> Given that it's crummy data, then my "feature request" is probably no
> longer required, I should get the data sorted out...

Yep, the data can be cruddy even when an exception is not raised.

Good luck with "upstream".

Cheers,
John

Reply all

Reply to author

Forward