convert_unicode=True results in double encoding


Shannon -jj Behrens

Nov 3, 2006, 4:25:44 PM
to sqlal...@googlegroups.com, Todd Agulnick, al...@nefud.org
I'm using convert_unicode=True. Everything is fine as long as I'm the
one reading and writing the data. However, if I look at what's
actually being stored in the database, it's like the data has been
encoded twice. If I switch to use_unicode=True, which I believe is
MySQL-specific, things work just fine, and what's being stored in the
database looks correct.
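(For concreteness, here is what that double encoding looks like at the
byte level -- a minimal sketch, assuming UTF-8 data gets re-encoded
through a latin-1 misreading; modern bytes/str spelling:)

```python
# 'ç' is U+00E7; encoded once as UTF-8 it becomes two bytes.
once = u'\xe7'.encode('utf-8')                   # b'\xc3\xa7'
# If a second layer mistakes those bytes for latin-1 text and encodes
# again, each byte becomes its own character and four bytes get stored.
twice = once.decode('latin-1').encode('utf-8')   # b'\xc3\x83\xc2\xa7'
```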

I started looking through the SQLAlchemy code, and I came across this:

def convert_bind_param(self, value, dialect):
    if not dialect.convert_unicode or value is None or \
            not isinstance(value, unicode):
        return value
    else:
        return value.encode(dialect.encoding)

def convert_result_value(self, value, dialect):
    if not dialect.convert_unicode or value is None or \
            isinstance(value, unicode):
        return value
    else:
        return value.decode(dialect.encoding)

The logic looks backwards. It says, "If it's not a unicode object,
return it. Otherwise, encode it." Later, "If it is a unicode object,
return it. Otherwise decode it."

Am I correct that this is backwards? If so, this is going to be
*painful* to update all the databases out there!

Thanks,
-jj

--
http://jjinux.blogspot.com/

Shannon -jj Behrens

Nov 3, 2006, 9:21:32 PM
to sqlal...@googlegroups.com, Todd Agulnick, al...@nefud.org

Ok, MySQLdb doesn't have a mailing list, so I can't ask there. Here
are some things I've learned:

Changing from convert_unicode=True to use_unicode=True doesn't do what
you'd expect. SQLAlchemy is passing keyword arguments all over the
place, and use_unicode actually gets ignored. <minor rant>I
personally think that you should be strict *somewhere* when you're
passing around keyword arguments. I've been bitten in this way too
many times. Unknown keyword arguments should result in
exceptions.</minor rant>

Anyway, I'm still a bit worried about that code above like I said.
However, here's what's even scarier. If I use the following code:

import MySQLdb


for use_unicode in (True, False):
    connection = MySQLdb.connect(host="localhost", user="user",
                                 passwd='dataase', db="users",
                                 use_unicode=use_unicode)
    cursor = connection.cursor()
    cursor.execute("select firstName from users where username='test'")
    row = cursor.fetchone()
    print "use_unicode:%s %r" % (use_unicode, row)

I get

use_unicode:True (u'test \xc3\xa7',)
use_unicode:False ('test \xc3\xa7',)

Notice the result is the same, but one has a unicode object and the
other doesn't. Notice that it's \xc3\xa7 each time? It shouldn't be.
Consider:

>>> s = 'test \xc3\xa7'
>>> s.decode('utf-8')
u'test \xe7'

*It's creating a unicode object without actually doing any decoding!*
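(A sketch of what may be going on: a driver can produce a unicode
object that still carries the raw bytes by decoding with a
one-byte-per-character codec such as latin-1, which never fails and
never changes byte values. Modern bytes/str spelling:)

```python
raw = b'test \xc3\xa7'
properly = raw.decode('utf-8')    # u'test \xe7' -- one accented character
smuggled = raw.decode('latin-1')  # u'test \xc3\xa7' -- bytes mapped 1:1
```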

This is somewhere low level. Like I said, this is lower level than
SQLAlchemy, but I don't have anywhere else to turn.

SQLAlchemy: 0.2.8
MySQLdb: 1.36.2.4
mysql client and server: 5.0.22
Ubuntu: 6.06

Help!
-jj

--
http://jjinux.blogspot.com/

Andrija Zarić

Nov 4, 2006, 9:07:54 AM
to sqlal...@googlegroups.com, Shannon -jj Behrens
The logic is just fine... I guess you forgot to consider the first
condition, "not dialect.convert_unicode".
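To see it, here is a standalone paraphrase of the two converters (in
modern Python terms, where str plays the role of Python 2's unicode
and bytes the role of its str; names follow the snippet above, but
this is an illustrative sketch, not SQLAlchemy code):

```python
def convert_bind_param(value, convert_unicode=True, encoding='utf-8'):
    # Going into the DB: only text objects need encoding; anything
    # else (None, already-encoded bytes) passes through untouched.
    if not convert_unicode or value is None or not isinstance(value, str):
        return value
    return value.encode(encoding)

def convert_result_value(value, convert_unicode=True, encoding='utf-8'):
    # Coming out of the DB: already-text values pass through; raw
    # bytes get decoded back into text.
    if not convert_unicode or value is None or isinstance(value, str):
        return value
    return value.decode(encoding)
```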

As a side note, the doc string
(http://www.sqlalchemy.org/docs/docstrings.myt#docstrings_sqlalchemy.engine)
says that:

convert_unicode=False : True if unicode conversion should be applied
to all str types

which is a little misleading -- the conversion will be applied to
String types (or similar), but not to _all_ strs. E.g. this will fail
on both assert tests:

db.engine.dialect.convert_unicode = True
rawdata = ('Alors vous imaginez ma surprise, au lever du jour, quand '
           'une dr\xc3\xb4le de petit voix m\xe2\x80\x99a r\xc3\xa9veill\xc3\xa9. '
           'Elle disait: \xc2\xab S\xe2\x80\x99il vous pla\xc3\xaet\xe2\x80\xa6 '
           'dessine-moi un mouton! \xc2\xbb\n')
unicodedata = rawdata.decode('utf-8')
unicode_table.insert().execute(unicode_data=unicodedata, plain_data=rawdata)
conn = db.engine.connect()
x = conn.execute("select * from unicode_table").fetchone()
self.assert_(isinstance(x['unicode_data'], unicode)
             and x['unicode_data'] == unicodedata)
self.assert_(isinstance(x['plain_data'], unicode)
             and x['plain_data'] == unicodedata)

so if you are using the engine/connection directly, you must do the
conversion "manually" if you need one.
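A minimal helper for that manual conversion might look like this
(illustrative, not SQLAlchemy code; modern bytes/str spelling):

```python
def decode_row(row, encoding='utf-8'):
    # Decode any raw byte strings in a result row; leave other values
    # (numbers, None, already-decoded text) alone.
    return tuple(v.decode(encoding) if isinstance(v, bytes) else v
                 for v in row)
```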

For the MySQL problem, consider this:

>>> s = u'test \xc3\xa7'
>>> s.decode('utf-8')
u'test \xc3\xa7'

so the question is: how did you put the data into the database?

Michael Bayer

Nov 4, 2006, 11:58:14 AM
to sqlalchemy

Shannon -jj Behrens wrote:
> I'm using convert_unicode=True. Everything is fine as long as I'm the
> one reading and writing the data. However, if I look at what's
> actually being stored in the database, it's like the data has been
> encoded twice. If I switch to use_unicode=True, which I believe is
> MySQL specific, things work just fine and what's being stored in the
> database looks correct.

Yes, if the MySQL client lib is encoding and SA is also encoding, the
data will get encoded twice. I'm not familiar with how I could look at
the encoded data to tell if it was already encoded (and I'm not sure I
should be... the unicode encoding option should only be enabled in one
place, not two).

>
> I started looking through the SQLAlchemy code, and I came across this:
>
> def convert_bind_param(self, value, dialect):
>     if not dialect.convert_unicode or value is None or \
>             not isinstance(value, unicode):
>         return value
>     else:
>         return value.encode(dialect.encoding)
>
> def convert_result_value(self, value, dialect):
>     if not dialect.convert_unicode or value is None or \
>             isinstance(value, unicode):
>         return value
>     else:
>         return value.decode(dialect.encoding)
>
> The logic looks backwards. It says, "If it's not a unicode object,
> return it. Otherwise, encode it." Later, "If it is a unicode object,
> return it. Otherwise decode it."

Sending unicode values to databases whose client APIs don't handle
unicode involves taking a Python unicode object from the application,
encoding it into a series of bytes, and sending it to the database.
Receiving a result value involves taking the encoded series of bytes
and decoding it into a unicode object. So you have *non*-unicode
instances going into the DB, and *non*-unicode instances coming out --
the DBAPI is assumed to not have any idea what a Python unicode object
is (such as psycopg's).

We've been doing the unicode thing for a while now, and you should
notice that we have unit tests for just about every function in SA,
especially important ones like this. The unicode unit test runs
unicode and raw encoded values in and out in numerous ways, and passes
for at least MySQL, SQLite, Postgres, Oracle, and MS-SQL. We have had
some people having issues with MySQL specifically, which seems to be
because some folks have a MySQL config that is stuck in "convert
unicode" mode, and they experience the double-encoding issue. The one
improvement that could be made here is for MySQL to provide a
subclassed unicode type that disables conversion if the dialect is
known to have convert_unicode=True already... then again, I sort of
like that this forces people to understand their database config.
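That idea might look roughly like this (hypothetical names, not actual
SQLAlchemy code; modern str/bytes spelling, with a made-up
dbapi_handles_unicode flag standing in for "the driver already
converts"):

```python
class DriverAwareUnicode(object):
    # Hypothetical sketch: skip SA-level encoding when the dialect
    # reports that the DBAPI already converts unicode itself.
    def convert_bind_param(self, value, dialect):
        if getattr(dialect, 'dbapi_handles_unicode', False):
            return value
        if value is None or not isinstance(value, str):
            return value
        return value.encode(dialect.encoding)
```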

Michael Bayer

Nov 4, 2006, 12:15:43 PM
to sqlalchemy

Shannon -jj Behrens wrote:
> Changing from convert_unicode=True to use_unicode=True doesn't do what
> you'd expect. SQLAlchemy is passing keyword arguments all over the
> place, and use_unicode actually gets ignored. <minor rant>I
> personally think that you should be strict *somewhere* when you're
> passing around keyword arguments. I've been bitten in this way too
> many times. Unknown keyword arguments should result in
> exceptions.</minor rant>

Uhhhhhh..... where is use_unicode=True documented as a create_engine
keyword? If used as DBAPI-specific keywords are documented (in the
FAQ, in the docs), it works just fine. We do try to be strict about
kwargs as much as possible, but this is a case where it's difficult,
since create_engine() is passing along kwargs to any number of dialect
implementations. I'll add a ticket to put in more kwargs.pop() calls
and remove **kwargs from the base dialect class.

from sqlalchemy import *

import MySQLdb as mysql

class FakeMySQLDBAPI(object):
    def __init__(self):
        self.paramstyle = mysql.paramstyle

    def connect(self, *args, **kwargs):
        print "Args:" + repr(args) + "Kwargs:" + repr(kwargs)
        return mysql.connect(*args, **kwargs)

e = create_engine('mysql://scott:tiger@localhost/test',
                  connect_args={'use_unicode': True}, module=FakeMySQLDBAPI())
c = e.connect()

Args:()Kwargs:{'passwd': 'tiger', 'host': 'localhost', 'db': 'test',
'user': 'scott', 'use_unicode': True}

> for use_unicode in (True, False):
>     connection = MySQLdb.connect(host="localhost", user="user",
>                                  passwd='dataase', db="users",
>                                  use_unicode=use_unicode)
>     cursor = connection.cursor()
>     cursor.execute("select firstName from users where username='test'")
>     row = cursor.fetchone()
>     print "use_unicode:%s %r" % (use_unicode, row)
>
> I get
>
> use_unicode:True (u'test \xc3\xa7',)
> use_unicode:False ('test \xc3\xa7',)
>
> Notice the result is the same, but one has a unicode object and the
> other doesn't. Notice that it's \xc3\xa7 each time? It shouldn't be.

Yeah, I'd say screw using use_unicode: make sure your MySQL isn't
doing any unicode conversion (check out all your my.cnf files, etc.),
and use SA's convert_unicode or Unicode type instead.

Michael Bayer

Nov 4, 2006, 1:13:05 PM
to sqlalchemy
Shannon -

I'm trying to figure out a way to make create_engine() behave the way
you want, but let me show you a highly simplified version of how
create_engine() works. How would you change it?

def create_engine(url, **kwargs):
    pool = pool.Pool(**kwargs)
    dialect = module.Dialect(url, **kwargs)
    engine = Engine(dialect, pool, **kwargs)
    return engine

As you can see, the returned Engine is composed of the Engine itself,
a connection pool, and a Dialect. The **kwargs you pass to
create_engine() can be used by any of those three objects. Also, there
are many different types of Pool, Dialect, and Engine which take
different arguments... so placing explicit awareness of every argument
within create_engine itself is excessively brittle.

Since the constructors of all the classes declare their keyword
arguments as normal inline arguments, we need to pass them around
using "**", and therefore the kwargs dictionary cannot be "consumed".
If they could be consumed, we could then check at the end that the
kwargs dict is in fact empty, so we'd know that no unknown arguments
were sent.

But as it is, the full dict gets passed to all three types of objects,
where they have to just "ignore" the kwargs they don't know about.
This is not unlike WSGI, which passes around an "environ" dictionary;
it's entirely likely that many erroneous keyword arguments could be
present in that dictionary as well.

The only way I can see to fix this is to modify the constructors of
all the objects to take a single dictionary argument from which they
"consume" their arguments, so that we don't pass the arguments with
the ** operator anymore. That means the nice constructors we have now,
which clearly lay out the arguments they take, would just have one
non-self-documenting "arguments" dictionary. Is that the only option?
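To make the trade-off concrete, the "consumable dict" alternative
would look something like this (all names illustrative, not
SQLAlchemy's actual classes):

```python
class Pool(object):
    def __init__(self, opts):
        # Each component pops only the options it understands.
        self.size = opts.pop('pool_size', 5)

class Dialect(object):
    def __init__(self, url, opts):
        self.url = url
        self.encoding = opts.pop('encoding', 'utf-8')

def create_engine(url, **kwargs):
    opts = dict(kwargs)
    pool = Pool(opts)
    dialect = Dialect(url, opts)
    # Anything left over was consumed by no component, so it must be
    # a typo or an unsupported option.
    if opts:
        raise TypeError("unknown arguments: %s" % ", ".join(sorted(opts)))
    return pool, dialect
```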

Shannon -jj Behrens

Nov 7, 2006, 2:58:02 PM
to sqlal...@googlegroups.com
On 11/4/06, Michael Bayer <zzz...@gmail.com> wrote:
> Shannon -jj Behrens wrote:
> > I'm using convert_unicode=True. Everything is fine as long as I'm the
> > one reading and writing the data. However, if I look at what's
> > actually being stored in the database, it's like the data has been
> > encoded twice. If I switch to use_unicode=True, which I believe is
> > MySQL specific, things work just fine and what's being stored in the
> > database looks correct.
>
> yes, if mysql client lib is encoding, and SA is also encoding, the data
> will get encoded twice. im not familiar with how i could look at the
> encoded data to tell if it was already encoded (and not sure if i
> should be...the unicode encoding option should only be enabled in one
> place, not two)

If it's a unicode object, you should encode it. If it's a str object,
you should assume it's already encoded. If it's a str object and
.encode gets called on it, nothing happens if it's ASCII; if it's
anything else, you'll get an exception. All of this is fine. I just
don't understand why I'm getting something from MySQLdb (not from
SQLAlchemy) that says it's a unicode object but wasn't actually
decoded :-/

> > I started looking through the SQLAlchemy code, and I came across this:
> >
> > def convert_bind_param(self, value, dialect):
> >     if not dialect.convert_unicode or value is None or \
> >             not isinstance(value, unicode):
> >         return value
> >     else:
> >         return value.encode(dialect.encoding)
> >
> > def convert_result_value(self, value, dialect):
> >     if not dialect.convert_unicode or value is None or \
> >             isinstance(value, unicode):
> >         return value
> >     else:
> >         return value.decode(dialect.encoding)
> >
> > The logic looks backwards. It says, "If it's not a unicode object,
> > return it. Otherwise, encode it." Later, "If it is a unicode object,
> > return it. Otherwise decode it."
>
> sending unicode values to databases whose client APIs dont handle
> unicode involves taking a python unicode object from the application,
> encoding it into an encoded series of bytes, and sending it to the
> database. receieving a result value involves taking the encoded series
> of bytes and decoding into a unicode object. so you have *non* unicode
> instances going into the DB, and *non* unicode coming out - the DBAPI
> is assumed to not have any idea what a python unicode object is (such
> as pscopg's).

That sounds fine.

> We've been doing the unicode thing for a while now, and you should
> notice that we have unit tests for just about every function in SA,

Yeah, I definitely think this is not SA's fault.

> especially important ones like this. the unicode unit test runs
> unicode and raw encoded values in and out in numerous ways, which pass
> for at least mysql,sqlite, postgres, oracle, and ms-sql. we have had
> some people having issues with MySQL specifically, which seems to be
> because some folks have a mysql config that is stuck in "convert
> unicode" mode and experience the double-encoding issue.

I think you're onto something.

> the one
> improvement that could be made here is for MySQL to provide a
> subclassed unicode type that disables conversion if the dialect is
> known to have convert_unicode=True already....then again i sort of like
> that this forces people to understand their database config.

Thanks, Mike.

-jj

--
http://jjinux.blogspot.com/

Shannon -jj Behrens

Nov 7, 2006, 3:20:23 PM
to sqlal...@googlegroups.com
On 11/4/06, Michael Bayer <zzz...@gmail.com> wrote:
>

The one bad thing about overloading the kwargs in this way is that
it's really easy for there to be a namespace collision. What happens
if suddenly the dialect and the engine both accept a keyword argument
named foo? As a general rule, I think it's safe to wrap one function
and pass through the kwargs, but you're asking for trouble if you wrap
more than one. I don't think it's asking too much for the user to have
to do something like:

create_engine(url, Pool(arg="foo"), dialect=Dialect(otherarg="bar"))

However, you can't change the API now. Hence, how can we make it
"safer"? I was playing around in the shell. Here's a start:

>>> def f(a=None, b="foo"):
...     pass
...
>>> f.func_code.co_varnames
('a', 'b')

Now, write a function:

def pass_and_strip_kwargs(f, kwargs):
    """Pass the correct keyword arguments to f and remove them from kwargs."""
    ...

Now you can do:

def create_engine(url, **kwargs):
    pool = pass_and_strip_kwargs(pool.Pool, kwargs)
    kwargs['url'] = url
    dialect = pass_and_strip_kwargs(module.Dialect, kwargs)
    kwargs['dialect'] = dialect
    kwargs['pool'] = pool
    engine = pass_and_strip_kwargs(Engine, kwargs)
    return engine

Without too much work, I'm sure you could update it so that you can
also pass positional arguments which will make things cleaner. I'm
sure you get the idea.
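A working version of that helper might be (a sketch;
inspect.signature is the modern replacement for poking at func_code):

```python
import inspect

def pass_and_strip_kwargs(f, kwargs):
    """Call f with the keyword arguments it declares, removing them
    from kwargs so later consumers don't see them."""
    accepted = set(inspect.signature(f).parameters)
    taken = {k: kwargs.pop(k) for k in list(kwargs) if k in accepted}
    return f(**taken)
```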

By the way, don't let this one little nitpick take away from the fact
that I really like SQLAlchemy as a whole! ;)

Best Regards,
-jj

--
http://jjinux.blogspot.com/

Shannon -jj Behrens

Nov 7, 2006, 5:47:13 PM
to sqlal...@googlegroups.com
The following results in correct data going into and coming out of the
database, but the data in the database itself looks double encoded:

import MySQLdb


connection = MySQLdb.connect(host="fmapp03", user="foxmarks",
                             passwd='ChunkyBacon', db="users")
cursor = connection.cursor()
cursor.execute("""
    INSERT INTO users
    VALUES (12345678, 'jjtest1234', 'f...@foo.com', 'pass', %s,
            'asdf', 'N/A', 'N/A', 0, NOW(), NOW())
""", ('\xc3\xa7',))
cursor.execute("SELECT * FROM users WHERE id = 12345678")
row = cursor.fetchone()
print `row`
connection.commit()

The following results in correct data going into and out of the
database, but does not result in the data in the database itself being
double encoded:

import MySQLdb


connection = MySQLdb.connect(host="fmapp03", user="foxmarks",
                             passwd='ChunkyBacon', db="users",
                             charset='utf8')
cursor = connection.cursor()
cursor.execute("""
    INSERT INTO users
    VALUES (12345678, 'jjtest1234', 'f...@foo.com', 'pass', %s,
            'asdf', 'N/A', 'N/A', 0, NOW(), NOW())
""", (u'\xe7',))
cursor.execute("SELECT * FROM users WHERE id = 12345678")
row = cursor.fetchone()
print `row`
connection.commit()

It looks like a lot of this stuff has changed in the version of
MySQLdb I'm using, 1.2.1p2. If you don't let MySQLdb take care of
encoding and decoding, it ends up double encoding things in the
database. This must be a bug in MySQLdb. The clear way to work around
the bug is to let the driver take care of encoding and decoding
instead of SQLAlchemy.

Yuck,
-jj

Michael Bayer

Nov 7, 2006, 6:30:16 PM
to sqlalchemy
Yeah, or use introspection to consume the args; I thought of that too.
I guess we can do that. Of course I don't like having to go there, but
I guess it's not a big deal.

As far as kwargs collisions, yeah, that's a potential issue too. But
the number of dialects/pools is not *that* varied; they're generally
pretty conservative with their kwargs. If we broke out create_engine()
to take in a dialect, pool, etc., well, I'd probably just create a new
function for that first off, so create_engine() can just remain... I'm
not sure I want to force users to be that exposed to the details, as
people might get a little intimidated by all that.

Shannon -jj Behrens

Nov 7, 2006, 7:33:49 PM
to sqlal...@googlegroups.com

BTW, it also turned out that I needed to run the version of SQLAlchemy
from trunk. Otherwise, this setup results in the following exception:

  File "/usr/lib/python2.4/site-packages/PasteScript-1.0-py2.4.egg/paste/script/serve.py", line 184, in command
    relative_to=base)
  File "/usr/lib/python2.4/site-packages/PasteScript-1.0-py2.4.egg/paste/script/serve.py", line 204, in loadapp
    return loadapp(app_spec, name=name, relative_to=relative_to)
  File "/usr/lib/python2.4/site-packages/PasteDeploy-1.0-py2.4.egg/paste/deploy/loadwsgi.py", line 193, in loadapp
    return loadobj(APP, uri, name=name, **kw)
  File "/usr/lib/python2.4/site-packages/PasteDeploy-1.0-py2.4.egg/paste/deploy/loadwsgi.py", line 214, in loadobj
    return context.create()
  File "/usr/lib/python2.4/site-packages/PasteDeploy-1.0-py2.4.egg/paste/deploy/loadwsgi.py", line 596, in create
    return self.object_type.invoke(self)
  File "/usr/lib/python2.4/site-packages/PasteDeploy-1.0-py2.4.egg/paste/deploy/loadwsgi.py", line 99, in invoke
    return fix_call(context.object, context.global_conf, **context.local_conf)
  File "/usr/lib/python2.4/site-packages/PasteDeploy-1.0-py2.4.egg/paste/deploy/util/fixtypeerror.py", line 57, in fix_call
    val = callable(*args, **kw)
  File "/usr/lib/python2.4/site-packages/acctmgr-0.0.0dev_r529-py2.4.egg/acctmgr/config/middleware.py", line 36, in make_app
    g=app_globals.Globals)
  File "/usr/lib/python2.4/site-packages/Pylons-0.9.3-py2.4.egg/pylons/wsgiapp.py", line 302, in __init__
    g = g(config.global_conf, config.app_conf, config=config)
  File "/usr/lib/python2.4/site-packages/acctmgr-0.0.0dev_r529-py2.4.egg/acctmgr/lib/app_globals.py", line 34, in __init__
    init_model(app_conf)
  File "/usr/lib/python2.4/site-packages/acctmgr-0.0.0dev_r529-py2.4.egg/acctmgr/models/__init__.py", line 39, in init_model
    users_table = Table('users', metadata, autoload=True)
  File "build/bdist.linux-i686/egg/sqlalchemy/schema.py", line 138, in __call__
  File "build/bdist.linux-i686/egg/sqlalchemy/engine/base.py", line 491, in reflecttable
  File "build/bdist.linux-i686/egg/sqlalchemy/databases/mysql.py", line 358, in reflecttable
TypeError: __init__() keywords must be strings

Debugging produced:

args: [20]
kargs: {u'unsigned': True}

Anyway, it's fixed on trunk.

Shannon -jj Behrens

Nov 7, 2006, 7:40:33 PM
to sqlal...@googlegroups.com

By the way, as if to prove my point, I had another problem. I was
trying to pass a create_args keyword argument per the example
<http://www.sqlalchemy.org/docs/dbengine.myt#dbengine_establishing_custom>.
SQLAlchemy didn't complain. My code didn't even complain when I used
a cedilla (ç). My code crashed when I used Japanese. It turns out
that MySQLdb was defaulting to Latin-1 or something like that.
SQLAlchemy was ignoring my create_args keyword argument. It turns out
that that example is wrong (should I file a bug?). The documentation
above it is correct; the correct keyword argument is named
connect_args.

Michael Bayer

Nov 7, 2006, 8:23:19 PM
to sqlal...@googlegroups.com
Ugh, well, I fixed that one. Random doc bugs you can file tickets for,
or I'll just give you commit access if you want to knock some out
yourself.


Michael Bayer

Nov 12, 2006, 2:56:10 PM
to sqlalchemy
Since create_engine() deals with class constructors, I went with this
approach:

def get_cls_kwargs(cls):
    """return the full set of legal kwargs for the given cls"""
    kw = []
    for c in cls.__mro__:
        cons = c.__init__
        if hasattr(cons, 'func_code'):
            for vn in cons.func_code.co_varnames:
                if vn != 'self':
                    kw.append(vn)
    return kw

So now you get these luxurious TypeErrors if you send any combination
of invalid arguments:

TypeError: Invalid argument(s) 'lala','hoho' sent to create_engine(),
using configuration PGDialect/QueuePool/Engine. Please check that the
keyword arguments are appropriate for this combination of components.

Shannon -jj Behrens

Nov 13, 2006, 1:29:44 PM
to sqlal...@googlegroups.com
On 11/12/06, Michael Bayer <zzz...@gmail.com> wrote:
> since create_engine deals with class constructors, i went with this
> approach:
>
> def get_cls_kwargs(cls):
>     """return the full set of legal kwargs for the given cls"""
>     kw = []
>     for c in cls.__mro__:
>         cons = c.__init__
>         if hasattr(cons, 'func_code'):
>             for vn in cons.func_code.co_varnames:
>                 if vn != 'self':
>                     kw.append(vn)
>     return kw
>
> so now you get these luxurious TypeErrors if you send any combination
> of invalid arguments:

Thanks! You'll probably never hear about this again, but I bet this
will save many people hours of frustration. :)

Philip Jenvey

Nov 13, 2006, 4:13:15 PM
to sqlal...@googlegroups.com

On Nov 12, 2006, at 11:56 AM, Michael Bayer wrote:

>
> since create_engine deals with class constructors, i went with this
> approach:
>
> def get_cls_kwargs(cls):
>     """return the full set of legal kwargs for the given cls"""
>     kw = []
>     for c in cls.__mro__:
>         cons = c.__init__
>         if hasattr(cons, 'func_code'):
>             for vn in cons.func_code.co_varnames:
>                 if vn != 'self':
>                     kw.append(vn)
>     return kw
>
> so now you get these luxurious TypeErrors if you send any combination
> of invalid arguments:
>

Unless I'm missing something, you don't want to gather args from the
constructors of the classes in the MRO -- they don't apply to cls's.

Here's a cleaner version:

def get_cls_kwargs(cls):
    """return the full set of legal kwargs for the given cls"""

    import inspect
    return inspect.getargspec(cls.__init__)[0][1:]


--
Philip Jenvey

Michael Bayer

Nov 13, 2006, 5:03:34 PM
to sqlalchemy

That approach fails to take into account arguments consumed from
**kwargs by superclasses. The particular method I have makes some
assumptions about the construction of the class being inspected, i.e.
that it looks like this:

class Foo(object):
    def __init__(self, arg1, arg2):
        ...

class Bar(Foo):
    def __init__(self, arg3, arg4, **kwargs):
        super(Bar, self).__init__(**kwargs)
        ...

so that the full set of keyword arguments for Bar is arg1, arg2,
arg3, arg4.
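For reference, the same MRO walk in later Python versions can be
sketched as follows (slicing to co_argcount keeps local variables out
of the list, which co_varnames alone would include):

```python
def get_cls_kwargs(cls):
    """Return the full set of legal kwargs for the given cls, walking
    the MRO so arguments consumed via **kwargs by superclass
    constructors are included too."""
    kw = set()
    for c in cls.__mro__:
        init = c.__dict__.get('__init__')
        code = getattr(init, '__code__', None)
        if code is not None:
            # Only real positional/keyword argument names, not locals.
            kw.update(code.co_varnames[1:code.co_argcount])
    return kw
```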
