Thanks for getting back to me so fast! Seems some weird unicode
characters in different cells were responsible, more details below...
On Tue, May 8, 2012 at 12:59 PM, John Machin <
sjma...@lexicon.net> wrote:
>
> On Tuesday, May 8, 2012 6:33:46 PM UTC+10, Ketil Froyn wrote:
>>
>> Hi,
>>
>> I have written a python script that pulls content out of an sqlite3
>> database and writes this into an Excel spreadsheet using xlwt.
>>
>> I discovered that information in a column in the xls was missing after
>> a certain point.
>
> Discovered how: opening it with (what version of) Excel or
> (Open|Libre)Office Calc or Gnumeric or xlrd?
Excel and LibreOffice. Actually, Excel reports that it finds
unreadable content, and fixing it removes lots of data. LibreOffice
apparently handles this silently. LibreOffice shows all the chinese
symbols (which does contain proper text), while the field is just
blank in Excel after it has "corrected" the data.
>> I decided that this could be caused either by
>> problems reading the sqlite file, or from writing the xls, so I
>> decided to write all the data to a CSV as well using pythons "csv"
>> module.
>>
>> After testing this, I see that all the expected data is in the csv,
>> but it's still missing from the xls.
>
> Using the Python csv module doesn't always give you 100% of your data:
This isn't the problem. I can also view the DB directly, and see data
there that was missing from the xls. All the data I've checked is
available in the csv, so it appears my sqlite reading is ok.
>> Looking more closely at the xls,
>> I see that one of the cells in the relevant column in the xls has lots
>> and lots of chinese characters, and way more data than this cell
>> should have. Before this cell, everything seems to be ok, but after
>> this cell, lots of data is missing, though some cells have data.
>>
>> Has nyone else experienced anything like this?
>
> It is possible for plain old ASCII data to look like CJK characters if it's
> encoded in UTF-16 and you insert or delete an odd number of bytes, or decode
> starting at an odd byte offset:
>
> | >>> import unicodedata as ucd
> | >>>
ucd.name('c\x00'.decode('utf_16_le'))
> | 'LATIN SMALL LETTER C'
> | >>>
ucd.name('\x00c'.decode('utf_16_le'))
> | 'CJK UNIFIED IDEOGRAPH-6300'
>
>> Unfortunately, my case
>> has quite a lot of data, and I'm not at liberty to share it.
>
> How much is "quite a lot"? What about an NDA? Please be aware that if the
> problem turns out to be in xlwt writing the shared string table, that can't
> be debugged without access to the whole file.
My sqlite file is about 100MB, and in the resulting xls (spread across
multiple sheets) there's about 230k rows. In some cases I have to
split data across multiple sheets because of the 65k limitation in
xlwt. I also have to split/truncate cells if there's more than 32k
characters.
>> But if
>> anyone has any pointers on what I can do to isolate the problem, which
>> could let me create a test case that I can share or fix, I'd be happy
>> to give that a stab.
>
>
> Assuming you are writing to the xlwt workbook like this:
>
> sheet.write(rowx, colx, value)
>
> and the bad column is badcolx and the big lump of "Chinese characters"
> appears in badrowx, precede your write line by:
>
> if -2 <= rowx - badrowx <= 5 and -1 <= colx - badcolx <= 1:
> print rowx, colx, type(value), len(value) if isinstance(value,
> basestring) else "?", repr(value)
Yes, that's how I write data to the workbook. I added your print to
everything and saved it all as a debug log. The first cell that fails,
and that shows all the chinese characters in LibreOffice, is a plain
string with 55 characters, matching [a-z0-9\s.?]. There are some other
strings in the data that aren't plain ascii, but they are all utf-8
(otherwise csv would barf). However, there were a couple of odd
unicode characters embedded, and one was two rows before the error,
and in the same column. They were U+1F366 and U+1F48B. I removed those
from a copy of the sqlite database and tried again, and now my
spreadsheet loads without issues! Thanks for helping me nail the
issue, but do you know if there is a way to fix this? I create my
workbook with:
xlwt.Workbook(encoding='utf8')
> Do you see what you expect? What language are you expecting?
English and norwegian. I usually see what I expect.
> Please consider sharing your code. Of particular interest will be what
> manipulations are done between getting the data from sqlite3 and writing it
> with xlwt. Do you do workbook = xlwt.Workbook(encoding="something") ? Do you
> decode or encode anywhere in your code?
I replace blobs from the DB with text. I also encode all text as
utf-8, because the csv module needs that. I didn't encode or decode
before, but had the same issue before.
>> I am using the xlwt that is bundled with my Ubuntu distribution
>> (0.7.2-2)
>
> What is the -2 on the end?
Ubuntu's numbering scheme, not sure why they tack that on there.
>> , but after checking out the lates subversion code and
>
> xlwt is on github now ...
Ah, that explains why I didn't find a tag for version 0.7.4! Thanks,
I'll fetch that.
Cheers, Ketil