[xlwt]: Error in opening Excel file with special chars

383 views
Skip to first unread message

jacky wang

unread,
Apr 10, 2012, 7:33:45 AM4/10/12
to python-excel
Hi,

I'm using xlwt to generate excel files, and it's a great lib. But
recently I meet some problem when opening excel files which contain
some special characters.

OS: Ubuntu 10.10
Python version: 2.6.6
xlwt version: 0.7.2 (installed from the official repository)

Sample code like this:

#!/usr/bin/python
import xlwt
encoding = 'utf_16_le'
wb = xlwt.Workbook(encoding=encoding)
ws = wb.add_sheet("demo")
#Mathematical Alphanumeric Symbols
uc = u" & ".join([unichr(0x1D400 + i) for i in xrange(5)])
uc = u' TEST ' + uc
ws.write(0, 0, uc)

#Miscellaneous Symbols and Pictographs
uc = u" & ".join([unichr(0x1f300 + i) for i in xrange(5)])
uc = u' TEST ' + uc
ws.write(0, 1, uc)
wb.save("test.xls")


When I open the file by using MS Excel (both in OSX and Windows
platforms), I get a warning about "Excel can not open test.xls because
some content is unreadable ...", and need to repair the file. After
choosing "Open and Repair", it actually doesn't work: the cell
content is actually empty.

I tried with Open Office (v3.2.0), it seems a little bit better
thans MS Excel, but only the first cell displaying well, and the
second one doesn't work: just some weird chars displaying, even the
'TEST ' is not displayed at beginning in the cell.

Could someone help me for this?

Regards,
Jacky

John Machin

unread,
Apr 10, 2012, 6:59:49 PM4/10/12
to python...@googlegroups.com
I can't reproduce this. I had to do an end-run around Python narrow-unicode deficiencies on Windows, but both Excel 2003 and Excel 2007 opened the resultant file happily. Funny characters means no font support. A1 displayed OK for me in Excel 2007. Lose the 'encoding='' thing (see comments below).

Please send me a copy of your output file immediately after creation by xlwt.

My code follows:

import sys

#  File "sjm_test.py", line 21, in <module>
#    uc = u" & ".join([unichr(0x1D400 + i) for i in xrange(5)])
# ValueError: unichr() arg not in range(0x10000) (narrow Python build)

try:
    unichr(0x10000)
    unichr_dammit = unichr
except ValueError:
    assert sys.maxunicode == 0xFFFF
    def unichr_dammit(ordinal):
        if ordinal <= 0xFFFF:
            return unichr(ordinal)
        # ordinal == 0x10000 + (HS - 0xD800) * 0x400 + LS - 0xDC00
        hi, lo = divmod(ordinal - 0x10000, 0x400)
        return unichr(hi + 0xD800) + unichr(lo + 0xDC00)
   
x = 0x1D404; u = u'\U0001D404'; print repr(u), repr(unichr_dammit(x))
x = 0x1F304; u = u'\U0001F304'; print repr(u), repr(unichr_dammit(x))
   
import xlwt
# encoding = 'utf_16_le'
# wb = xlwt.Workbook(encoding=encoding)
# that encoding is for DEcoding INPUT text
# it makes a sheet name that looks like Han characters instead of "demo"!!
wb = xlwt.Workbook()

ws = wb.add_sheet("demo")
#Mathematical Alphanumeric Symbols
uc = u" & ".join([unichr_dammit(0x1D400 + i) for i in xrange(5)])

uc = u' TEST ' + uc
ws.write(0, 0, uc)

#Miscellaneous Symbols and Pictographs
uc = u" & ".join([unichr_dammit(0x1f300 + i) for i in xrange(5)])

jacky wang

unread,
Apr 11, 2012, 3:38:11 AM4/11/12
to python-excel


On Apr 11, 6:59 am, John Machin <sjmac...@lexicon.net> wrote:
> I can't reproduce this. I had to do an end-run around Python narrow-unicode
> deficiencies on Windows, but both Excel 2003 and Excel 2007 opened the
> resultant file happily. Funny characters means no font support. A1
> displayed OK for me in Excel 2007. Lose the 'encoding='' thing (see
> comments below).
>
> Please send me a copy of your output file immediately after creation by
> xlwt.
>
> My code follows:
>
> import sys
>
> #  File "sjm_test.py", line 21, in <module>
> #    uc = u" & ".join([unichr(0x1D400 + i) for i in xrange(5)])
> # ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Forget to mention, my python build is UCS4 version:

In [1]: import sys

In [2]: sys.maxunicode
Out[2]: 1114111

In [3]: int('0x10FFFF', 16)
Out[3]: 1114111

according to: http://docs.python.org/library/functions.html#unichr
It seems your code works well for my environment as well.

that's weird:

From the code above, seems the trick is the unichr and your
unichr_dammit function. Considering I'm using the UCS4 build, it won't
be a problem for unichr() to support unicode larger than 0x10000 (and
also it didn't raise a value error in my environment), so it's only
the trick that your unichr_dammit implementation is different with the
buildin unichr.

I will dig into the python source code of unichr to understand
what's the difference.

I can not include attachment since using web interface, but I will
send u another email with the corrupt excel file.

Thanks very much!

BR
Jacky Wang

John Machin

unread,
Apr 11, 2012, 5:35:29 PM4/11/12
to python...@googlegroups.com

unichr is not the problem, it's a sideshow.

>
> I can not include attachment since using web interface, but I will
> send u another email with the corrupt excel file.

Received. Thanks.

The problem is that each string written needs to be prefixed by a length
field. This is variously described as "number of characters" (OOo) and
"the count of characters in the string" (MS). MS goes on to refer to these
characters as "double-byte" characters.

What is actually required is:

len(unicode_string.encode('utf_16_le')) // 2

Currently xlwt uses len(unicode_string)

On a narrow-unicode Python, len(u'\U0001D400') == 2 which is
(accidentally) the right answer. On a wide-unicode build, the result is 1,
which results in a corrupted SST (shared string table) in the file.

This will require changes to xlwt. ETA: this weekend.

Cheers,
John

jacky wang

unread,
Apr 12, 2012, 6:06:25 AM4/12/12
to python-excel


On Apr 12, 5:35 am, "John Machin" <sjmac...@lexicon.net> wrote:
> On Wed, April 11, 2012 5:38 pm, jacky wang wrote:
>
> > On Apr 11, 6:59 am, John Machin <sjmac...@lexicon.net> wrote:
> > It seems your code works well for my environment as well.
>
> > that's weird:
>
> >    From the code above, seems the trick is the unichr  and your
> > unichr_dammit function. Considering I'm using the UCS4 build, it won't
> > be a problem for unichr() to support unicode larger than 0x10000 (and
> > also it didn't raise a value error in my environment), so it's only
> > the trick that your unichr_dammit implementation is different with the
> > buildin unichr.
>
> >   I will dig into the python source code of unichr to understand
> > what's the difference.
>
> unichr is not the problem, it's a sideshow.
>
>
>
> > I can not include attachment since using web interface, but I will
> > send u another email with the corrupt excel file.
>
> Received. Thanks.


welcome.

>
> The problem is that each string written needs to be prefixed by a length
> field. This is variously described as "number of characters" (OOo) and
> "the count of characters in the string" (MS). MS goes on to refer to these
> characters as "double-byte" characters.
>
> What is actually required is:
>
>     len(unicode_string.encode('utf_16_le')) // 2
>
> Currently xlwt uses len(unicode_string)
>
> On a narrow-unicode Python, len(u'\U0001D400') == 2 which is
> (accidentally) the right answer. On a wide-unicode build, the result is 1,
> which results in a corrupted SST (shared string table) in the file.
>

Cool, thanks for the explanation!
Reply all
Reply to author
Forward
0 new messages