Character encoding... latin1 to utf8?


Rob Hudson

Dec 4, 2008, 7:34:07 PM
to Django users
I'm migrating a site to Django. The old site was PHP/MySQL with MySQL
having a default encoding of latin1. It seems like there are also
Windows 1252 encodings but I'm not sure.

I have the old database and the new Django UTF8 one side by side and
have a migration script that uses raw MySQLdb to connect to the old,
and Django's ORM to connect to the new. Is there anything I can do to
ensure the data going into the new is UTF8?

To further complicate things, once the data is in the new UTF8
database, I have a script that exports to a CSV file for a client to
use a subset of the data. And right now this is all sorts of fail for
me. I tried using the Django snippet here: http://www.djangosnippets.org/snippets/993/
but am essentially getting what the first commenter says unless I
import as Windows 1252 — then the boxes turn into quotes and
apostrophes that look right.

Character encodings are a big confusion for me.

Thanks for any help,
Rob

Malcolm Tredinnick

Dec 4, 2008, 9:12:41 PM
to django...@googlegroups.com

On Thu, 2008-12-04 at 16:34 -0800, Rob Hudson wrote:
> I'm migrating a site to Django. The old site was PHP/MySQL with MySQL
> having a default encoding of latin1. It seems like there are also
> Windows 1252 encodings but I'm not sure.
>
> I have the old database and the new Django UTF8 one side by side and
> have a migration script that uses raw MySQLdb to connect to the old,
> and Django's ORM to connect to the new. Is there anything I can do to
> ensure the data going into the new is UTF8?

Since this is presumably a once-off conversion operation, I'd make sure
that the data coming from the source database was converted into Python
unicode objects before passing it to Django. That way, any errors will
be caught.

Now you might well be able to have this happen automatically using the
"unicode" option to MySQLdb -- it knows how to convert between various
server-side encodings and Python unicode. So look at that parameter to
the connect() call. It's fairly well done in MySQLdb (it and PostgreSQL
were almost trivial to make work when we added Unicode support to
Django).

Alternatively, if you're getting bytestrings back, run them through a
decode() call:

data = original_data.decode('cp1252')

Since cp1252 is a superset of latin1 (ISO-8859-1), you can safely use
cp1252 to decode either: valid latin1 won't contain any data in the
extra code points that cp1252 assigns. Once your data is in Unicode, passing it to
Django's ORM will Just Work(tm). However, I'd definitely call this Plan
B and see if passing the unicode=True option to MySQLdb.connect() works,
since that might just be a one-line solution.
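For readers following along today: the thread is Python 2 era, but the decode step Malcolm describes looks like this in modern Python 3 terms (a minimal sketch; the byte values are illustrative):

```python
# Bytes as they might come back from the old database, containing a
# cp1252 bullet (0x95) and curly apostrophe (0x92).
raw = b"Tips\x95 it\x92s stuck"

# Decoding as cp1252 maps the Windows-specific bytes to the proper
# Unicode code points (U+2022 bullet, U+2019 right single quote).
text = raw.decode("cp1252")
assert text == "Tips\u2022 it\u2019s stuck"

# Plain ASCII (and any valid latin1 text outside 0x80-0x9f) decodes
# identically under latin1 and cp1252, which is why cp1252 is a safe
# choice for both.
assert b"plain text".decode("latin1") == b"plain text".decode("cp1252")
```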

> To further complicate things, once the data is in the new UTF8
> database, I have a script that exports to a CSV file for a client to
> use a subset of the data. And right now this is all sorts of fail for
> me. I tried using the Django snippet here: http://www.djangosnippets.org/snippets/993/
> but am essentially getting what the first commenter says unless I
> import as Windows 1252 — then the boxes turn into quotes and
> apostrophes that look right.

I can't help there. It sounds like Excel on Windows is ignoring the fact
that the data is UTF-16 and treating it as cp1252, which is, of course,
totally broken. I don't completely understand that fragment, but the
bits that are confusing to me (it writes to both a writer and a stream,
for example) are probably because I haven't opened up the csv writer
class to see what should be subclassed.

Just for laughs, though, try running "file" on the csv file you generate
and make sure it, at least, detects that it is a UTF-16 file.

> Character encodings are a big confusion for me.

Working with things through Django should be relatively straightforward.
Django will give you Unicode strings (type "unicode"). You call the
encode() method to convert it to whichever encoding you like (unicode
objects on their own can't be written out -- you need to pick an
encoding). If your target requires UTF-16, you need to start off with a
byte order mark (BOM) to indicate whether the two-byte output of each
character is in little-endian or big-endian order. If you
call .encode('utf-16'), Python writes out the BOM for you (the
'\xff\xfe' sequence at the start).
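In Python 3 terms, the BOM behaviour described above can be checked directly (a small sketch; the BOM byte order depends on the platform's native endianness):

```python
# Encoding a str as UTF-16 prepends a byte order mark (BOM) so that
# readers can tell whether the two-byte units are little- or big-endian.
encoded = "hi".encode("utf-16")

# The first two bytes are the BOM: \xff\xfe (little-endian) or
# \xfe\xff (big-endian).
assert encoded[:2] in (b"\xff\xfe", b"\xfe\xff")

# Decoding consumes the BOM and restores the original text.
assert encoded.decode("utf-16") == "hi"
```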


Regards,
Malcolm


Rob Hudson

Dec 6, 2008, 11:10:57 AM
to Django users
Thanks Malcolm,

On Dec 4, 6:12 pm, Malcolm Tredinnick <malc...@pointy-stick.com>
wrote:
> Now you might well be able to have this happen automatically using the
> "unicode" option to MySQLdb -- it knows how to convert between various
> server-side encodings and Python unicode. So look at that parameter to
> the connect() call. It's fairly well done in MySQLdb (it and PostgreSQL
> were almost trivial to make work when we added Unicode support to
> Django).

I actually had that set up already. I'm trying to look at it a little
more closely. Here's a dpaste of a SQL call and a few columns. Look
at the "fdescr" column output... it's showing the string is unicode
but it has some characters in it like \x95 and \x92.
http://dpaste.com/96601/

> Alternatively, if you're getting bytestrings back, run them through a
> decode() call:
>
>         data = original_data.decode('cp1252')

I tried this at the bottom of the above dpaste just to see... I know
I'm not getting bytestrings back. So I tried it also without the
unicode=True flag to connect and it produces different output than
above:

>>> row['fdescr'].decode('cp1252')
u'Lefty Kreh is one of the most experienced, well-prepared, and
thoughtful anglers in the world. In <i>101 Fly-Fishing Tips</i>, he
shares this wealth of experience with a variety of common-sense
solutions to the problems that anglers face. Included are tips on:
<br /> \u2022how to pacify a fish<br /> \u2022which hook-sharpening tools
to use and when<br /> \u2022how to take a rod apart when it\u2019s
stuck<br /> \u2022what to do when a fish runs under your boat<br />
\u2022how to dry waders and find leaks<br /> \u2022why long hat brims
affect casting accuracy<br /> \u2022and much more<br /><br />Sure to
improve a fly fisher\u2019s success, comfort, and enjoyment while on
the water. A must for any angler.<br /><br /><b>ABOUT THE AUTHOR</b>
<br />Lefty Kreh is an internationally known and respected master in
the field of fly fishing, and the author of numerous articles and
books on the subject. He lives in Maryland.'

Now instead of \x95 I get \u2022 (which is a bullet).

From here I'm not sure what the best way to proceed is... do I want
the \u2022 version instead, in which case, should I not pass in
unicode=True and manually decode each column?

I'm partly thinking that since this is a one-time operation (actually,
it's a many one-time operation until we're ready to switch over to the
new site), I could scan for any "\x" characters and manually replace
them. There are likely only a handful as in the above. But how does
one scan and replace these so the output is correct?

> Just for laughs, though, try running "file" on the csv file you generate
> and make sure it, at least, detects that it is a UTF-16 file.

It actually tells me nothing...
$ file export.csv
export.csv:

Thanks,
Rob

Karen Tracey

Dec 6, 2008, 1:36:46 PM
to django...@googlegroups.com
On Sat, Dec 6, 2008 at 11:10 AM, Rob Hudson <trebor...@gmail.com> wrote:
[snip debug info]


> Now instead of \x95 I get \u2022 (which is a bullet).
>
> From here I'm not sure what the best way to proceed is... do I want
> the \u2022 version instead, in which case, should I not pass in
> unicode=True and manually decode each column?

What you've got in your DB is actually cp1252 (although MySQL calls it latin1) data.  The values assigned to 0x95 and 0x92 in cp1252 are bullet and curly apostrophe.  What you want in your unicode strings are the \u2022 and \u2019 versions, since these are the correct code point assignments for bullet and curly apostrophe in Unicode.  (Unicode \x95 is the 'message waiting' control character and \x92 is the 'private use two' control character.)

In case you care: why it is not working when you specify unicode=True to MySQLdb is, I believe, a combination of two factors: first, the Python codec MySQLdb chooses to use to decode the data coming from MySQL, and second, how that codec behaves in the face of technically invalid data.

First, MySQLdb apparently decides to use the latin1 Python codec to decode the data coming from MySQL into a unicode string.  On the face of it this seems like a reasonable choice, since after all MySQL reports that the data is 'latin1'.  However, if you read the MySQL docs, what they call 'latin1' is really 'cp1252' (http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html):

MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as "undefined," whereas cp1252, and therefore MySQL's latin1, assign characters for those positions. For example, 0x80 is the Euro sign.

So, MySQL allows bytes in 'latin1' strings that are technically 'not assigned', and assumes they have their cp1252-assigned meanings.  MySQLdb uses the Python latin1 codec for data MySQL reports to be latin1 (though MySQLdb might have better chosen cp1252 here, I think, given MySQL clearly documents that they really mean cp1252 when they say latin1). The Python latin1 codec, however, does not assume 'unassigned' latin1 code points have their cp1252-assigned values.  Rather it assumes they have their unicode-assigned values, and passes them through unscathed and without error.  So your cp1252 bullets turn into unicode message waiting control characters because MySQL assumes the unassigned latin1 \x95 byte has its cp1252-assigned meaning while Python assumes it has its unicode-assigned meaning.

Given your data is really cp1252, you need to use the cp1252 codec to decode it.  You can see how this works better than the latin1 codec in a Python shell:

>>> x = 'Bullet ->\x95<- and curly apostrophe ->\x92<- in a cp1252 bytestring'
>>> ulatin1 = x.decode('latin1')
>>> ulatin1
u'Bullet ->\x95<- and curly apostrophe ->\x92<- in a cp1252 bytestring'
>>> print ulatin1
Bullet ->•<- and curly apostrophe ->’<- in a cp1252 bytestring
>>> ucp1252 = x.decode('cp1252')
>>> ucp1252
u'Bullet ->\u2022<- and curly apostrophe ->\u2019<- in a cp1252 bytestring'
>>> print ucp1252
Bullet ->•<- and curly apostrophe ->’<- in a cp1252 bytestring
>>>


> I'm partly thinking that since this is a one-time operation (actually,
> it's a many one-time operation until we're ready to switch over to the
> new site), I could scan for any "\x" characters and manually replace
> them.  There are likely only a handful as in the above.  But how does
> one scan and replace these so the output is correct?

You could also just convert the character set used on the MySQL side:

http://dev.mysql.com/doc/refman/5.0/en/charset-conversion.html

Presumably since MySQL knows it really means cp1252 for stuff it calls latin1, it would convert properly to utf-8 when you told it to.  You'd sidestep the issues you've hit with 'latin1' meaning different things to different pieces of software.

Karen

Karen Tracey

Dec 6, 2008, 1:51:23 PM
to django...@googlegroups.com
On Sat, Dec 6, 2008 at 1:36 PM, Karen Tracey <kmtr...@gmail.com> wrote:

> You can see how this works better than the latin1 codec in a Python shell:
>
> >>> x = 'Bullet ->\x95<- and curly apostrophe ->\x92<- in a cp1252 bytestring'
> >>> ulatin1 = x.decode('latin1')
> >>> ulatin1
> u'Bullet ->\x95<- and curly apostrophe ->\x92<- in a cp1252 bytestring'
> >>> print ulatin1
> Bullet ->•<- and curly apostrophe ->’<- in a cp1252 bytestring

Hah, this looks right in the posted version.  It looked wrong (some weird char, not a bullet, and a blank for the apostrophe) in the Python shell itself and in the version pasted into the composition window (where the two chars turned into boxes).  I haven't the foggiest idea how they came out looking correct when posted...

Karen

Rob Hudson

Dec 6, 2008, 2:59:25 PM
to Django users
Wow, thanks so much Karen, for slicing and dicing the problem like
that.

On Dec 6, 10:36 am, "Karen Tracey" <kmtra...@gmail.com> wrote:
> You could also just convert the character set used on the MySQL side:
>
> http://dev.mysql.com/doc/refman/5.0/en/charset-conversion.html
>
> Presumably since MySQL knows it really means cp1252 for stuff it calls
> latin1, it would convert properly to utf-8 when you told it to.  You'd
> sidestep the issues you've hit with 'latin1' meaning different things to
> different pieces of software.

The problem is, there is no cp1252 character set in MySQL as far as I
can tell, since cp1252 == latin1 to MySQL. And passing
charset='cp1252' to connect() raised a MySQL error.

My data migration script is working now, though, when I don't specify
'use_unicode=True' and manually run .decode('cp1252') on the columns I
need to.
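[For anyone following along, the per-row decode step can be factored into a small helper. This is a sketch with hypothetical column data, shown in modern Python 3 syntax; the original thread is Python 2 era and used manual .decode('cp1252') calls per column.]

```python
def decode_row(row, encoding="cp1252"):
    """Return a copy of a DB row dict with every bytestring value
    decoded to Unicode using the given encoding; non-bytes values
    (ints, dates, None) pass through untouched."""
    return {
        key: value.decode(encoding) if isinstance(value, bytes) else value
        for key, value in row.items()
    }

# Hypothetical row as a DictCursor might return it, with cp1252 bytes.
row = {"fid": 1, "fdescr": b"Lefty Kreh\x92s \x95 tips"}
clean = decode_row(row)
assert clean["fdescr"] == "Lefty Kreh\u2019s \u2022 tips"
```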

Much thanks to both you and Malcolm for helping me get this cleared
up.

Thanks,
Rob

Karen Tracey

Dec 6, 2008, 4:33:41 PM
to django...@googlegroups.com

What I meant was you could use the mysql program and commands like:

ALTER TABLE book CONVERT TO CHARACTER SET utf8;

to get MySQL to do the conversion itself.  Presumably it would convert what it calls latin1 to utf8 correctly. (I'd first try such a command on a test DB loaded from a dump before running it against a production DB.)

But if you've got your migration script working, then there's no point in experimenting to see if you could get MySQL to do it correctly itself.

Karen
