Googling the web, I didn't find a solution to the following
encoding problem with TurboGears/SQLAlchemy:
Environment: see below
The test data model is fairly simple:
class Project(ActiveMapper):
    "test model"
    class mapping:
        id = column(Integer, primary_key=True)
        title = column(Unicode(255))
        title_key = column(Unicode(30))
test data (edited via phpMyAdmin):
- title = "The german ö is an o-umlaut"
- title_key = 'testkey'
running from tg-admin shell:
>>> from model import Project
>>> proj = Project.select_by(title_key="testkey")
(...)
Traceback (most recent call last):
File "<console>", line 1, in ?
File "build/bdist.linux-i686/egg/sqlalchemy/ext/assignmapper.py",
line 7, in do
(...)
File "build/bdist.linux-i686/egg/sqlalchemy/engine/base.py",
line 632, in _get_col
File "build/bdist.linux-i686/egg/sqlalchemy/types.py",
line 190, in convert_result_value
File "encodings/utf_8.py", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode bytes
in position 11-14: invalid data
OK, why is this? Why a "decode" error, when everything should be utf8?
Let's do a simple access to MySQL on raw console:
>>> import MySQLdb
>>> con = MySQLdb.connect(host="127.0.0.1", port=...etc.)
>>> cur = con.cursor()
>>> sql = "select title from project where title_key = 'testkey'";
>>> cur.execute(sql);
1L
>>> t = cur.fetchall()[0][0]
>>> con.close()
>>> t
'The german \xf6 is an o-umlaut'
>>> t.decode("utf8")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "encodings/utf_8.py", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode bytes
in position 11-14: invalid data
The decode failed because the string is not valid UTF-8 (although it
was supposed to be UTF-8 encoded):
>>> isinstance(t, unicode)
False
It should run this way:
>>> s = u'The german \xf6 is an o-umlaut'
>>> isinstance(s,unicode)
True
>>> print s
The german ö is an o-umlaut
But what is t, if it's not a unicode string like s?
>>> s.encode("latin1")
'The german \xf6 is an o-umlaut'
>>> s.encode("latin1") == t
True
It is a latin-1 encoded version of the unicode string. What process
wrongly encoded the already correct unicode string using latin-1?
Is this a MySQLdb-failure?
Do I miss something?
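For anyone retracing this on a modern interpreter, the failure mode can be
sketched in Python 3 syntax (the thread itself uses Python 2; variable names
here are mine):

```python
# The text, then the bytes as MySQLdb delivered them: latin-1 encoded.
text = "The german \xf6 is an o-umlaut"   # \xf6 is ö
raw = text.encode("latin-1")              # b'The german \xf6 is an o-umlaut'

# Decoding those bytes as UTF-8 fails, just like in the traceback above:
# 0xf6 at position 11 is not a valid UTF-8 byte sequence.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)

# Decoding with the encoding that was actually used round-trips cleanly:
assert raw.decode("latin-1") == text
```

So the bytes themselves are fine; it's only the assumption that they are
UTF-8 that breaks.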
My environment:
Linux Ubuntu 6.10
TurboGears 1.0b1
SQLAlchemy 0.3.0 (release) using ActiveMapper
MySQL 5.0.24a-Debian
- started by root: /usr/bin/mysqld_safe --character-set-server=utf8 &
mysql> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
mysql> SHOW VARIABLES LIKE 'collation%';
+----------------------+-----------------+
| Variable_name | Value |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database | utf8_general_ci |
| collation_server | utf8_general_ci |
+----------------------+-----------------+
Any hints are welcome:-)
TIA,
Stefan
--
Start here: www.meretz.de
If this means "I put in my test data with phpMyAdmin", then I'd guess
there's your trouble. PHP doesn't support Unicode, MySQL doesn't support
Unicode, and in my experience PMA's unicode-awareness is patchy. I'd say
PMA has caused your unicode problem.
-Rob
Tnx, Rob, for this idea.
Both PMA and MySQL support Unicode.
Today I solved it using the 0.3.1 update of SQLAlchemy, where Mike
Bayer made it possible to pass some additional custom DBAPI
connect arguments:
http://www.sqlalchemy.org/docs/dbengine.myt#dbengine_establishing_custom
I found out that MySQLdb needs use_unicode=True for the database
connection.
So I simply put into my dev.cfg
sqlalchemy.dburi="mysql://user:password@host/database?use_unicode=1"
and that's it.
Cool!
Ciao,
Good to hear you fixed your problem, although I'd be wary of how much
Unicode support MySQL has.
Example:
Create a table with a column defined as varchar(4), with one of the
Unicode collations.
Insert 'öööö' into the column, then select it out again.
You'll find your select result only contains 'öö'. This is because MySQL
varchar(4) means 4 *bytes*, not characters (and ö is two bytes). It's
not multibyte character aware.
This can cause lots of exceptions in Python when multibyte characters
are truncated at the size limit of a column. When Python sees half a
multibyte character, an exception is raised.
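The truncation failure Rob describes can be reproduced without any database:
cut a UTF-8 byte string in the middle of a multibyte character and the
strict decode blows up (a Python 3 sketch; the byte limit of 7 is mine):

```python
data = ("\xf6" * 4).encode("utf-8")   # 'öööö' → 8 bytes, each ö is b'\xc3\xb6'
truncated = data[:7]                  # a 7-byte column limit splits the last ö

# The dangling lead byte b'\xc3' has no continuation byte, so:
try:
    truncated.decode("utf-8")
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)

# Truncating on a character boundary instead is harmless:
assert data[:6].decode("utf-8") == "\xf6" * 3
```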
-Rob
This is standard behaviour even with Oracle. As the dimensions of a text
column have to have an upper limit for efficiency reasons, you can
either use a fixed-width encoding like UCS-2 or even UCS-4, wasting
an awful lot of bytes, or you have to cap the size.
> This can cause lots of exceptions in Python when multibyte characters
> are truncated at the size limit of a column. When Python sees half a
> multibyte character, an exception is raised.
Not necessarily: you can alter the decoding so that it silently discards
unknown bytes. I'm not sure how that works with a DB adapter, though.
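What Diez describes is the errors argument of decode; a Python 3 sketch
(whether a given DB adapter lets you reach this knob is another question):

```python
broken = ("\xf6\xf6").encode("utf-8")[:3]   # 'öö' truncated mid-character

# Strict decoding would raise; "ignore" silently drops the dangling byte,
# "replace" substitutes U+FFFD (the replacement character) instead.
print(broken.decode("utf-8", errors="ignore"))    # just the first ö
print(broken.decode("utf-8", errors="replace"))   # ö plus U+FFFD
```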
Diez
> This is standard behaviour even with oracle. As the dimensions of a text
> column have to have an upper limit for efficiency reasons, you can either
> use a fixed-width encoding like ucs2 or even ucs4, wasting awfully lots of
> bytes, or you have to cap the size.
The size of the columns should be determined by the database encoding. If I
create a UTF-8 database, then char(1) should be able to store *any* char that
can be represented in UTF-8.
--
Jorge Godoy <jgo...@gmail.com>
regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden
> ... and if wishes were horses then beggars would ride :)
This is a problem with MySQL. Other databases work just fine, Steve. There's
no wishing here: this is a *bug* in MySQL. Look at how PostgreSQL behaves
correctly (I'd say, then, that their beggars are more productive because they
*DO* have horses...):
teste=# SHOW server_encoding ;
server_encoding
-----------------
UTF8
(1 registro)
teste=# show client_encoding ;
client_encoding
-----------------
UTF8
(1 registro)
teste=# create table char_test (test_col char(1));
CREATE TABLE
teste=# insert into char_test values ('ç');
INSERT 0 1
teste=# insert into char_test values ('ã');
INSERT 0 1
teste=# insert into char_test values ('²');
INSERT 0 1
teste=# insert into char_test values ('¤');
INSERT 0 1
teste=# insert into char_test values ('¼');
INSERT 0 1
teste=# insert into char_test values ('¾');
INSERT 0 1
teste=# insert into char_test values ('ð');
INSERT 0 1
teste=# insert into char_test values ('®');
INSERT 0 1
teste=# select * from char_test;
test_col
----------
ç
ã
²
¤
¼
¾
ð
®
(8 registros)
Try the same with MySQL and recover these chars with Python.
teste=# alter table char_test add column id serial not null;
NOTICE: ALTER TABLE will create implicit sequence "char_test_id_seq" for serial column "char_test.id"
ALTER TABLE
teste=# select * from char_test;
test_col | id
----------+----
ç | 1
ã | 2
² | 3
¤ | 4
¼ | 5
¾ | 6
ð | 7
® | 8
µ | 9
(9 registros)
>>> from sqlobject import *
>>> uri = 'postgres://localhost/teste'
>>> sqlhub.processConnection = connectionForURI(uri)
>>> class CharTest(SQLObject):
... testCol = UnicodeCol()
...
>>> list(CharTest.select())
[<CharTest 1 testCol=u'\xe7'>, <CharTest 2 testCol=u'\xe3'>, <CharTest 3 testCol=u'\xb2'>, <CharTest 4 testCol=u'\xa4'>, <CharTest 5 testCol=u'\xbc'>, <CharTest 6 testCol=u'\xbe'>, <CharTest 7 testCol=u'\xf0'>, <CharTest 8 testCol=u'\xae'>, <CharTest 9 testCol=u'\xb5'>]
>>> ct = CharTest.select()
>>> for ct_instance in ct:
... print ct_instance
...
<CharTest 1 testCol=u'\xe7'>
<CharTest 2 testCol=u'\xe3'>
<CharTest 3 testCol=u'\xb2'>
<CharTest 4 testCol=u'\xa4'>
<CharTest 5 testCol=u'\xbc'>
<CharTest 6 testCol=u'\xbe'>
<CharTest 7 testCol=u'\xf0'>
<CharTest 8 testCol=u'\xae'>
<CharTest 9 testCol=u'\xb5'>
>>> for ct_instance in ct:
... print ct_instance.testCol
...
ç
ã
²
¤
¼
¾
ð
®
µ
>>>
It's all there, all UTF-8. And the same holds true when I specify the size of
the column in Python as well:
>>> from sqlobject import *
>>> uri = 'postgres://localhost/teste'
>>> sqlhub.processConnection = connectionForURI(uri)
>>> class CharTest2(SQLObject):
... class sqlmeta:
... table = 'char_test'
... testCol = UnicodeCol(length=1)
...
>>> list(CharTest2.select())
[<CharTest2 1 testCol=u'\xe7'>, <CharTest2 2 testCol=u'\xe3'>, <CharTest2 3 testCol=u'\xb2'>, <CharTest2 4 testCol=u'\xa4'>, <CharTest2 5 testCol=u'\xbc'>, <CharTest2 6 testCol=u'\xbe'>, <CharTest2 7 testCol=u'\xf0'>, <CharTest2 8 testCol=u'\xae'>, <CharTest2 9 testCol=u'\xb5'>]
>>> for ct_instance in CharTest2.select():
... print ct_instance.testCol
...
ç
ã
²
¤
¼
¾
ð
®
µ
>>>
Maybe it's time for MySQL to learn horse riding...
--
Jorge Godoy <jgo...@gmail.com>
Hm, I cannot confirm that. I did some tests, and MySQL always delivered
the complete data, e.g. 'öööö'.
I tested it on two machines and got two slightly different results
(double-byte vs. single-byte representation?) after reading from the
database:
1st machine:
>>> p.test
u'\xc3\xb6\xc3\xb6\xc3\xb6\xc3\xb6'
>>> print p.test
öööö
2nd machine:
>>> p.test
u'\xf6\xf6\xf6\xf6'
>>> print p.test
öööö
Any explanation for that difference?
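A plausible explanation (my guess, shown in Python 3 syntax): on one machine
the driver hands back UTF-8 bytes that then get decoded as latin-1, so each ö
arrives as the two characters \xc3\xb6 (a classic double encoding). Undoing
the wrong decode and applying the right one recovers the text:

```python
# The mangled result: UTF-8 bytes that were mistakenly decoded as latin-1.
mangled = "\xc3\xb6\xc3\xb6\xc3\xb6\xc3\xb6"

# Re-encode with the wrong codec to get the raw bytes back,
# then decode them as the UTF-8 they always were:
fixed = mangled.encode("latin-1").decode("utf-8")
assert fixed == "\xf6" * 4   # u'öööö', the other machine's result
```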
> This can cause lots of exceptions in Python when multibyte characters
> are truncated at the size limit of a column. When Python sees half a
> multibyte character, an exception is raised.
Maybe it depends on the MySQL version?
But anyway, good to know that such behavior can happen.
> I tested it on two machines, and got two slightly different results
> (double byte vs. single byte representation?) after reading from
> database:
(...)
> Any explanation for that difference?
Can you show us the DDL used on the table and the declaration on SQL Object /
SQL Alchemy? If we can reproduce your tests or see how you've done it we can
give better answers.
> Maybe dependent from the MySQL version?
I believe it was more a problem of MySQLdb and older versions of MySQL,
yes, as you can see in this list's archives. The problems didn't occur
with other supported database servers.
Be seeing you,
--
Jorge Godoy <jgo...@gmail.com>
model.py:
class Testtable(ActiveMapper):
    class mapping:
        pkey = column(Unicode(10), primary_key=True)
        test = column(Unicode(4))
dev.cfg:
sqlalchemy.dburi="mysql://test:test@localhost/testdb?use_unicode=1"
MySQL started with (to make sure that everything is utf8):
mysqld_safe --character-set-server=utf8
Entered the data via PMA since it was readily available (I am a TG/SA
beginner).
1st machine: Select from tg-admin:
>>> from model import *
>>> t = Testtable.selectone()
>>> t.test
u'\xf6\xf6\xf6\xf6'
>>> print t.test
öööö
2nd machine: Select from tg-admin (don't know why it looks different):
In [1]: from model import *
In [2]: t = Testtable.selectone()
In [3]: t.test
Out[3]: u'\xc3\xb6\xc3\xb6\xc3\xb6\xc3\xb6'
In [4]: print t.test
öööö
"tg-admin info" gives on both machines:
TurboGears Complete Version Information
TurboGears requires:
* TurboGears 1.0b1
* nose 0.9.1
* configobj 4.3.2
* RuleDispatch 0.5a0.dev-r2115
* setuptools 0.6c3
* FormEncode 0.6
* cElementTree 1.0.5-20051216
* PasteScript 1.0
* elementtree 1.2.6-20050316
* simplejson 1.4
* SQLObject 0.7.2b1
* CherryPy 2.2.1
* TurboKid 0.9.9
* TurboCheetah 0.9.5
* TurboJson 0.9.9
* PyProtocols 1.0a0
* Cheetah 2.0rc7
* PasteDeploy 1.0
* Paste 1.0
* FormEncode 0.6
* kid 0.9.3
* Cheetah 2.0rc7
* elementtree 1.2.6-20050316
Identity Providers
* sqlobject (TurboGears 1.0b1)
* sqlalchemy (TurboGears 1.0b1)
tg-admin Commands
* info (TurboGears 1.0b1)
* shell (TurboGears 1.0b1)
* quickstart (TurboGears 1.0b1)
* update (TurboGears 1.0b1)
* sql (TurboGears 1.0b1)
* i18n (TurboGears 1.0b1)
* toolbox (TurboGears 1.0b1)
Visit Managers
* sqlobject (TurboGears 1.0b1)
* sqlalchemy (TurboGears 1.0b1)
Template Engines
* kid (TurboKid 0.9.9)
* cheetah (TurboCheetah 0.9.5)
* json (TurboJson 0.9.9)
Widget Packages
TurboGears Extensions
* visit (TurboGears 1.0b1)
* identity (TurboGears 1.0b1)
Python 1st:
Python 2.4.4c1 (#2, Oct 11 2006, 21:51:02)
[GCC 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)] on linux2
Python 2nd:
Python 2.4.4c0 (#2, Jul 30 2006, 18:20:12)
[GCC 4.1.2 20060715 (prerelease) (Debian 4.1.1-9)] on linux2
MySQLdb 1st:
MySQLdb.__version__
1.2.1_p2
MySQLdb 2nd:
MySQLdb.__version__
1.2.1g2
> > Maybe dependent from the MySQL version?
>
> I believe it was more a problem of mysqldb and older versions of
> MySQL, yes, as you can see in this list archives. The problems
> didn't occur for other supported database servers.
Yes, there was also a thread on the SQLAlchemy mailing list, but I am not
sure whether this is the "double encoding" problem mentioned here:
http://groups.google.com/group/sqlalchemy/browse_thread/thread/333c7162cc41400e
And there is a corresponding entry at MySQLdb tracker:
http://sourceforge.net/tracker/index.php?func=detail&aid=1592353&group_id=22307&atid=374932
HTH.
Thanks,
I'm not suggesting this isn't an error. Simply that it's an error that
is also implemented in (among others) Oracle and MS Access, IIRC, and
that the vendors clearly aren't that concerned about it.
Very nice demonstration of the problem, by the way.
> I'm not suggesting this isn't an error. Simply that it's an error that
> is also implemented in (among others) Oracle and MS Access, IIRC, and
> that the vendors clearly aren't that concerned about it.
We have a saying here in Brazil that translates to something like "two wrongs
don't make a right". I don't think that just because other vendors don't
get it right, MySQL shouldn't get it right either... MS Access is nothing I
would use for commercial systems (even though there are a lot of
semi-standalone systems that use it here in Brazil, unfortunately). Oracle
has so many configuration options that I wouldn't be surprised if there were
some w.x.y.z.something.encoding that activated the correct behavior and that
came with an ISO-8859-1 default so that it preserves backwards compatibility
with Oracle 2... :-P :-)
> Very nice demonstration of the problem, by the way.
Oh well... I tried to give you the minimal work needed to reproduce the
behavior on MySQL. It is important that somebody shows that I am wrong, or
that there are other parameters needed in the connection string, database
setup, etc., so that other people can fix their code.
--
Jorge Godoy <jgo...@gmail.com>
concerning this thread:
http://groups.google.com/group/turbogears/browse_thread/thread/151483c63981a239
I was able to verify the cause of the encoding problem I observed.
On 2006-11-15 14:34, Stefan Meretz wrote:
> MySQLdb 1st:
> MySQLdb.__version__
> 1.2.1_p2
>
> MySQLdb 2nd:
> MySQLdb.__version__
> 1.2.1g2
You have to be aware of the MySQLdb version: the first (1.2.1_p2) works, the
second doesn't. That's all :-)
Thanks for all hints!
Ciao,