Since this is presumably a once-off conversion operation, I'd make sure
that the data coming from the source database was converted into Python
unicode objects before passing it to Django. That way, any errors will
be caught.
Now you might well be able to have this happen automatically using the
use_unicode option to MySQLdb -- it knows how to convert between
various server-side encodings and Python unicode. So look at that
parameter (and the accompanying charset parameter) to the connect()
call. It's fairly well done in MySQLdb (it and PostgreSQL were almost
trivial to make work when we added Unicode support to Django).
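Something along these lines ought to do it (the connection details and
table name here are made up; use_unicode is the real MySQLdb
parameter):

import MySQLdb

# Hypothetical connection details; the relevant part is use_unicode=True,
# which makes MySQLdb hand back unicode objects rather than raw bytestrings.
conn = MySQLdb.connect(host='localhost', user='legacy_user', passwd='secret',
                       db='legacy_db', use_unicode=True)
cursor = conn.cursor()
cursor.execute("SELECT id, title FROM legacy_table")
for pk, title in cursor.fetchall():
    print pk, repr(title)   # title comes back as a unicode object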
Alternatively, if you're getting bytestrings back, run them through a
decode() call:
data = original_data.decode('cp1252')
Since cp1252 is a superset of latin1 (iso-8859-1), you can safely
decode both with the same cp1252 codec: valid latin1 won't contain any
data in the extra codepoints that cp1252 assigns. Once your data is in
Unicode, passing it to Django's ORM will Just Work(tm). However, I'd
definitely call this Plan B and first see whether passing
use_unicode=True to MySQLdb.connect() works, since that might be a
one-line solution.
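If you do end up with Plan B, a tiny helper keeps the decoding in one
place (the function name is just a suggestion, not anything in Django):

def to_unicode(value, encoding='cp1252'):
    # Bytestrings from the legacy database get decoded once, up front;
    # anything that's already unicode (or not text at all) passes through.
    if isinstance(value, str):
        return value.decode(encoding)
    return value

# e.g. title = to_unicode(row['title']) before handing it to the ORM.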
> To further complicate things, once the data is in the new UTF8
> database, I have a script that exports to a CSV file for a client to
> use a subset of the data. And right now this is all sorts of fail for
> me. I tried using the Django snippet here: http://www.djangosnippets.org/snippets/993/
> but am essentially getting what the first commenter says unless I
> import as Windows 1252 — then the boxes turn into quotes and
> apostrophes that look right.
I can't help there. It sounds like Excel on Windows is ignoring the fact
that the data is UTF-16 and treating it as cp1252, which is, of course,
totally broken. I don't completely understand that snippet, but the
bits that confuse me (it writes to both a writer and a stream, for
example) are probably because I haven't opened up the csv writer class
to see what should be subclassed.
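For what it's worth, the general technique is essentially the
UnicodeWriter recipe from the Python 2 csv documentation: the csv
module only deals in bytestrings, so each row is written to a scratch
buffer as UTF-8 and then re-encoded to the target encoding on its way
out to the real stream. A trimmed-down sketch (not the snippet's exact
code):

import csv
import codecs
import cStringIO

class UnicodeWriter(object):
    # Write rows of unicode strings as CSV in the given encoding.
    def __init__(self, f, dialect=csv.excel, encoding='utf-16', **kwds):
        self.queue = cStringIO.StringIO()          # UTF-8 scratch buffer
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f                            # the real output file
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        # csv only understands bytestrings, so encode to UTF-8 first...
        self.writer.writerow([s.encode('utf-8') for s in row])
        data = self.queue.getvalue().decode('utf-8')
        # ...then re-encode to the target encoding (the incremental
        # encoder writes the UTF-16 BOM once, at the start).
        self.stream.write(self.encoder.encode(data))
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)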
Just for laughs, though, try running "file" on the csv file you generate
and make sure it, at least, detects that it is a UTF-16 file.
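If you don't have "file" handy, checking from Python is just a matter
of looking at the first two bytes (the filename here is a placeholder);
a little-endian UTF-16 file starts with the '\xff\xfe' BOM:

>>> open('export.csv', 'rb').read(2)
'\xff\xfe'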
> Character encodings are a big confusion for me.
Working with things through Django should be relatively straightforward.
Django will give you Unicode strings (type "unicode"). You call the
encode() method to convert them to whichever encoding you like (unicode
objects on their own can't be written out -- you need to pick an
encoding). If your target requires UTF-16, you need to start off with a
byte order mark (BOM) to indicate whether the two-byte output of each
character is in little-endian or big-endian order. If you
call .encode('utf-16'), Python writes out the BOM for you (the
'\xff\xfe' sequence at the start).
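For example, on a little-endian machine (a big-endian machine writes
'\xfe\xff' instead):

>>> u'Hi'.encode('utf-16')
'\xff\xfeH\x00i\x00'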
Regards,
Malcolm
[snip debug info]
Now instead of \x95 I get \u2022 (which is a bullet).
From here I'm not sure what the best way to proceed is... do I want
the \u2022 version instead? If so, should I leave out use_unicode=True
and manually decode each column?
latin1 is the same as the Windows cp1252 character set. This means it
is the same as the official ISO 8859-1 or IANA (Internet Assigned
Numbers Authority) latin1, except that IANA latin1 treats the code
points between 0x80 and 0x9f as "undefined," whereas cp1252, and
therefore MySQL's latin1, assign characters for those positions. For
example, 0x80 is the Euro sign.
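You can check that last point in a Python shell: the iso-8859-1 codec
just passes 0x80 through as an unassigned control character, while
cp1252 gives you the Euro sign:

>>> '\x80'.decode('cp1252')
u'\u20ac'
>>> '\x80'.decode('iso-8859-1')
u'\x80'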
I'm partly thinking that since this is a one-time operation (actually,
it's a one-time operation we'll repeat several times until we're ready
to switch over to the new site), I could scan for any "\x" characters
and manually replace them. There are likely only a handful, as in the
example above. But how does one scan for and replace these so the
output comes out correctly?
You can see the difference between the cp1252 and latin1 codecs in a
Python shell:
>>> x = 'Bullet ->\x95<- and curly apostrophe ->\x92<- in a cp1252 bytestring'
>>> ulatin1 = x.decode('latin1')
>>> ulatin1
u'Bullet ->\x95<- and curly apostrophe ->\x92<- in a cp1252 bytestring'
>>> print ulatin1
Bullet ->•<- and curly apostrophe ->’<- in a cp1252 bytestring
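Compare that with decoding the same bytestring as cp1252, which is what
it really is. And if some values have already been (mis)decoded as
latin1, one way to repair them is to round-trip, since latin1 maps
every byte value straight onto the first 256 code points:

>>> ucp1252 = x.decode('cp1252')
>>> ucp1252
u'Bullet ->\u2022<- and curly apostrophe ->\u2019<- in a cp1252 bytestring'
>>> ulatin1.encode('latin1').decode('cp1252') == ucp1252
True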