Error in XLS created with xlwt

575 views
Skip to first unread message

Ketil Froyn

unread,
May 8, 2012, 4:33:46 AM5/8/12
to python...@googlegroups.com
Hi,

I have written a python script that pulls content out of an sqlite3
database and writes this into an Excel spreadsheet using xlwt.

I discovered that information in a column in the xls was missing after
a certain point. I decided that this could be caused either by
problems reading the sqlite file, or from writing the xls, so I
decided to write all the data to a CSV as well using pythons "csv"
module.

After testing this, I see that all the expected data is in the csv,
but it's still missing from the xls. Looking more closely at the xls,
I see that one of the cells in the relevant column in the xls has lots
and lots of chinese characters, and way more data than this cell
should have. Before this cell, everything seems to be ok, but after
this cell, lots of data is missing, though some cells have data.

Has nyone else experienced anything like this? Unfortunately, my case
has quite a lot of data, and I'm not at liberty to share it. But if
anyone has any pointers on what I can do to isolate the problem, which
could let me create a test case that I can share or fix, I'd be happy
to give that a stab.

I am using the xlwt that is bundled with my Ubuntu distribution
(0.7.2-2), but after checking out the lates subversion code and
reading the changelog, I didn't see that anything relevant to this
appeared to be fixed. I'll go ahead and try the latest version as
well, just in case I missed something.

Cheers, Ketil

John Machin

unread,
May 8, 2012, 6:59:17 AM5/8/12
to python...@googlegroups.com


On Tuesday, May 8, 2012 6:33:46 PM UTC+10, Ketil Froyn wrote:
Hi,

I have written a python script that pulls content out of an sqlite3
database and writes this into an Excel spreadsheet using xlwt.

I discovered that information in a column in the xls was missing after
a certain point.

Discovered how: opening it with (what version of) Excel or (Open|Libre)Office Calc or Gnumeric or xlrd?

 
I decided that this could be caused either by
problems reading the sqlite file, or from writing the xls, so I
decided to write all the data to a CSV as well using pythons "csv"
module.

After testing this, I see that all the expected data is in the csv,
but it's still missing from the xls.

Using the Python csv module doesn't always give you 100% of your data:

| >>> import csv
| >>> import StringIO
| >>> f = StringIO.StringIO()
| >>> wtr = csv.writer(f)
| >>> wtr.writerow(['x', 'abc\x00def', 'y'])
| >>> f.getvalue()
| 'x,abc,y\r\n'

 
Looking more closely at the xls,
I see that one of the cells in the relevant column in the xls has lots
and lots of chinese characters, and way more data than this cell
should have. Before this cell, everything seems to be ok, but after
this cell, lots of data is missing, though some cells have data.

Has nyone else experienced anything like this?

It is possible for plain old ASCII data to look like CJK characters if it's encoded in UTF-16 and you insert or delete an odd number of bytes, or decode starting at an odd byte offset:

| >>> import unicodedata as ucd
| >>> ucd.name('c\x00'.decode('utf_16_le'))
| 'LATIN SMALL LETTER C'
| >>> ucd.name('\x00c'.decode('utf_16_le'))
| 'CJK UNIFIED IDEOGRAPH-6300'

Unfortunately, my case
has quite a lot of data, and I'm not at liberty to share it.

How much is "quite a lot"? What about an NDA? Please be aware that if the problem turns out to be in xlwt writing the shared string table, that can't be debugged without access to the whole file.
 
But if
anyone has any pointers on what I can do to isolate the problem, which
could let me create a test case that I can share or fix, I'd be happy
to give that a stab.

Assuming you are writing to the xlwt workbook like this:

    sheet.write(rowx, colx, value)

and the bad column is badcolx and the big lump of "Chinese characters" appears in badrowx, precede your write line by:

    if -2 <= rowx - badrowx <= 5 and -1 <= colx - badcolx <= 1:
        print rowx, colx, type(value), len(value) if isinstance(value, basestring) else "?", repr(value)

Do you see what you expect? What language are you expecting?

Please consider sharing your code. Of particular interest will be what manipulations are done between getting the data from sqlite3 and writing it with xlwt. Do you do workbook = xlwt.Workbook(encoding="something") ? Do you decode or encode anywhere in your code?


I am using the xlwt that is bundled with my Ubuntu distribution
(0.7.2-2)

What is the -2 on the end?
 
, but after checking out the lates subversion code and

xlwt is on github now ...
 

Ketil Froyn

unread,
May 8, 2012, 9:07:04 AM5/8/12
to python...@googlegroups.com
Thanks for getting back to me so fast! Seems some weird unicode
characters in different cells were responsible, more details below...

On Tue, May 8, 2012 at 12:59 PM, John Machin <sjma...@lexicon.net> wrote:
>
> On Tuesday, May 8, 2012 6:33:46 PM UTC+10, Ketil Froyn wrote:
>>
>> Hi,
>>
>> I have written a python script that pulls content out of an sqlite3
>> database and writes this into an Excel spreadsheet using xlwt.
>>
>> I discovered that information in a column in the xls was missing after
>> a certain point.
>
> Discovered how: opening it with (what version of) Excel or
> (Open|Libre)Office Calc or Gnumeric or xlrd?

Excel and LibreOffice. Actually, Excel reports that it finds
unreadable content, and fixing it removes lots of data. LibreOffice
apparently handles this silently. LibreOffice shows all the chinese
symbols (which does contain proper text), while the field is just
blank in Excel after it has "corrected" the data.

>> I decided that this could be caused either by
>> problems reading the sqlite file, or from writing the xls, so I
>> decided to write all the data to a CSV as well using pythons "csv"
>> module.
>>
>> After testing this, I see that all the expected data is in the csv,
>> but it's still missing from the xls.
>
> Using the Python csv module doesn't always give you 100% of your data:

This isn't the problem. I can also view the DB directly, and see data
there that was missing from the xls. All the data I've checked is
available in the csv, so it appears my sqlite reading is ok.

>> Looking more closely at the xls,
>> I see that one of the cells in the relevant column in the xls has lots
>> and lots of chinese characters, and way more data than this cell
>> should have. Before this cell, everything seems to be ok, but after
>> this cell, lots of data is missing, though some cells have data.
>>
>> Has nyone else experienced anything like this?
>
> It is possible for plain old ASCII data to look like CJK characters if it's
> encoded in UTF-16 and you insert or delete an odd number of bytes, or decode
> starting at an odd byte offset:
>
> | >>> import unicodedata as ucd
> | >>> ucd.name('c\x00'.decode('utf_16_le'))
> | 'LATIN SMALL LETTER C'
> | >>> ucd.name('\x00c'.decode('utf_16_le'))
> | 'CJK UNIFIED IDEOGRAPH-6300'
>
>> Unfortunately, my case
>> has quite a lot of data, and I'm not at liberty to share it.
>
> How much is "quite a lot"? What about an NDA? Please be aware that if the
> problem turns out to be in xlwt writing the shared string table, that can't
> be debugged without access to the whole file.

My sqlite file is about 100MB, and in the resulting xls (spread across
multiple sheets) there's about 230k rows. In some cases I have to
split data across multiple sheets because of the 65k limitation in
xlwt. I also have to split/truncate cells if there's more than 32k
characters.

>> But if
>> anyone has any pointers on what I can do to isolate the problem, which
>> could let me create a test case that I can share or fix, I'd be happy
>> to give that a stab.
>
>
> Assuming you are writing to the xlwt workbook like this:
>
>     sheet.write(rowx, colx, value)
>
> and the bad column is badcolx and the big lump of "Chinese characters"
> appears in badrowx, precede your write line by:
>
>     if -2 <= rowx - badrowx <= 5 and -1 <= colx - badcolx <= 1:
>         print rowx, colx, type(value), len(value) if isinstance(value,
> basestring) else "?", repr(value)

Yes, that's how I write data to the workbook. I added your print to
everything and saved it all as a debug log. The first cell that fails,
and that shows all the chinese characters in LibreOffice, is a plain
string with 55 characters, matching [a-z0-9\s.?]. There are some other
strings in the data that aren't plain ascii, but they are all utf-8
(otherwise csv would barf). However, there were a couple of odd
unicode characters embedded, and one was two rows before the error,
and in the same column. They were U+1F366 and U+1F48B. I removed those
from a copy of the sqlite database and tried again, and now my
spreadsheet loads without issues! Thanks for helping me nail the
issue, but do you know if there is a way to fix this? I create my
workbook with:

xlwt.Workbook(encoding='utf8')

> Do you see what you expect? What language are you expecting?

English and norwegian. I usually see what I expect.

> Please consider sharing your code. Of particular interest will be what
> manipulations are done between getting the data from sqlite3 and writing it
> with xlwt. Do you do workbook = xlwt.Workbook(encoding="something") ? Do you
> decode or encode anywhere in your code?

I replace blobs from the DB with text. I also encode all text as
utf-8, because the csv module needs that. I didn't encode or decode
before, but had the same issue before.

>> I am using the xlwt that is bundled with my Ubuntu distribution
>> (0.7.2-2)
>
> What is the -2 on the end?

Ubuntu's numbering scheme, not sure why they tack that on there.

>> , but after checking out the lates subversion code and
>
> xlwt is on github now ...

Ah, that explains why I didn't find a tag for version 0.7.4! Thanks,
I'll fetch that.

Cheers, Ketil

Ketil Froyn

unread,
May 8, 2012, 9:24:20 AM5/8/12
to python...@googlegroups.com
By the way, my code does:

print "DEBUG: ", row, col, type(e), len(e) if isinstance(e,
basestring) else "?", repr(e)
ws_t.write(row, col, e)
cvsrow.append(e) # later added to the csv with tablesWriter.writerow(cvsrow)

The unicode character U+1F366 is shown in my debug output as a string
of length 4: "\xf0\x9f\x8d\xa6"
The unicode character U+1F48B is shown in my debug output as a string
of length 4: "\xf0\x9f\x92\x8b"

Both are correct utf8 encodings, according to http://www.utf8-chartable.de/

Cheers, Ketil
--
Ketil Froyn
ke...@froyn.name
http://ketil.froyn.name/

Ketil Froyn

unread,
May 8, 2012, 11:34:25 AM5/8/12
to python...@googlegroups.com
Argh, never mind. This is fixed in the latest version from git. This
is the last commit, and I'm guessing this was what caused my problem:

commit ce581c1cd8063feb2b5a1c59962d849c8562b124
Author: sjmachin <sjma...@lexicon.net>
Date: Sun Apr 15 05:13:13 2012 -0700

Combination of character(s) outside the BMP (ord > 0xFFFF) with a
wide-unicode Python build was causing a corrupt SST (shared string
table). Rich text still needs investigation.

Now my (newly written) test case, and my original problem case, seem
to be working fine.

Thanks for the help anyway!

Cheers, Ketil

Yizhe Wang

unread,
Jun 20, 2016, 11:35:20 AM6/20/16
to python-excel
Hi Ketil,

I'm quite new to Python and what I'm doing is fetch HTML data to excel using xlwt.

I encountered exactly same problem with you - data in some columns with Chinese characters just missing starting from a certain point.

May I know how did you fix this by applying the latest version from git? Can you please elaborate it a bit as the problem has confused me for quite a while? Thank a lot.

Adrian Klaver

unread,
Jun 20, 2016, 5:37:01 PM6/20/16
to python...@googlegroups.com
On 06/20/2016 08:34 AM, Yizhe Wang wrote:
> Hi Ketil,
>
> I'm quite new to Python and what I'm doing is fetch HTML data to excel
> using xlwt.

What version of xlwt are you using?

>
> I encountered exactly same problem with you - data in some columns with
> Chinese characters just missing starting from a certain point.
>
> May I know how did you fix this by applying the latest version from git?
> Can you please elaborate it a bit as the problem has confused me for
> quite a while? Thank a lot.

This commit is over four years old, so I would expect any recent version
of xlwt to contain the fix, hence my question about what version of xlwt
you are using.

>
>
> On Tuesday, May 8, 2012 at 11:34:25 PM UTC+8, Ketil Froyn wrote:
>
> Argh, never mind. This is fixed in the latest version from git. This
> is the last commit, and I'm guessing this was what caused my problem:
>
> commit ce581c1cd8063feb2b5a1c59962d849c8562b124
> Author: sjmachin <sjma...@lexicon.net <javascript:>>
> Date: Sun Apr 15 05:13:13 2012 -0700
>
> Combination of character(s) outside the BMP (ord > 0xFFFF) with a
> wide-unicode Python build was causing a corrupt SST (shared string
> table). Rich text still needs investigation.
>
> Now my (newly written) test case, and my original problem case, seem
> to be working fine.
>
> Thanks for the help anyway!
>
> Cheers, Ketil
>
> --
> You received this message because you are subscribed to the Google
> Groups "python-excel" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to python-excel...@googlegroups.com
> <mailto:python-excel...@googlegroups.com>.
> To post to this group, send email to python...@googlegroups.com
> <mailto:python...@googlegroups.com>.
> Visit this group at https://groups.google.com/group/python-excel.
> For more options, visit https://groups.google.com/d/optout.


--
Adrian Klaver
adrian...@aklaver.com
Reply all
Reply to author
Forward
0 new messages