Unicode + memcache = bug


Jeremy Dunck

Jul 12, 2007, 6:34:53 AM
to django-d...@googlegroups.com
When using the low-level cache and memcache as the backend, you're
likely to run into this stack trace:

...
File "/pegasus/code/current/django/core/cache/backends/memcached.py" in set
48. self._cache.set(key, value, timeout or self.default_timeout)
File "/usr/lib/python2.5/site-packages/memcache.py" in set
305. return self._set("set", key, val, time)
File "/usr/lib/python2.5/site-packages/memcache.py" in _set
328. fullcmd = "%s %s %d %d %d\r\n%s" % (cmd, key, flags, time, len(val), val)

UnicodeDecodeError at /
'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)

What's going on here is that the memcache.py library does this with
the passed parameters:

fullcmd = "%s %s %d %d %d\r\n%s" % (cmd, key, flags, time, len(val), val)

Since "key" is often a unicode string, it infects, as it were, the
rest of the line, forcing "val" to be encoded, then decoded.

It may be that only the memcache backend has this problem, but the
general solution I'd suggest is to use smart_str on the key given to
each low-level cache's backend set method. Works-for-me.

It may also make sense to run on the value, but I imagine that has a
significant overhead, and I haven't had a problem with it yet....
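
A minimal sketch of the proposed fix, written for Python 3, where the str/unicode split discussed in this thread maps to bytes/str. `to_bytes_key` is a hypothetical stand-in for smart_str (not Django's implementation), and `build_set_command` only mimics the format line quoted above; neither name comes from Django or python-memcache.

```python
import pickle


def to_bytes_key(key, encoding="utf-8"):
    """Hypothetical stand-in for smart_str: text becomes bytes, bytes pass through."""
    if isinstance(key, bytes):
        return key
    return str(key).encode(encoding)


def build_set_command(cmd, key, flags, time, val):
    """Mimic the quoted format line, keeping every string argument as bytes."""
    return b"%s %s %d %d %d\r\n%s" % (
        cmd, to_bytes_key(key), flags, time, len(val), val)


# The value is an opaque pickle; only the key is coerced.
val = pickle.dumps({"answer": 42}, 2)
fullcmd = build_set_command(b"set", "caf\u00e9", 1, 0, val)
```

Because the key is encoded before formatting, the pickled value is never reinterpreted as text.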

Jeremy Dunck

Jul 12, 2007, 6:42:21 AM
to django-d...@googlegroups.com
On 7/12/07, Jeremy Dunck <jdu...@gmail.com> wrote:
...

> It may be that only the memcache backend has this problem, but the
> general solution I'd suggest is to use smart_str on the key given to
> each low-level cache's backend set method. Works-for-me.


To be clear, if this is accepted as a solution, I'm happy to make a
ticket and patch.

Malcolm Tredinnick

Jul 12, 2007, 9:32:58 AM
to django-d...@googlegroups.com
On Thu, 2007-07-12 at 05:34 -0500, Jeremy Dunck wrote:
> When using the low-level cache and memcache as the backend, you're
> likely to run into this stack trace:
>
> ...
> File "/pegasus/code/current/django/core/cache/backends/memcached.py" in set
> 48. self._cache.set(key, value, timeout or self.default_timeout)
> File "/usr/lib/python2.5/site-packages/memcache.py" in set
> 305. return self._set("set", key, val, time)
> File "/usr/lib/python2.5/site-packages/memcache.py" in _set
> 328. fullcmd = "%s %s %d %d %d\r\n%s" % (cmd, key, flags, time, len(val), val)
>
> UnicodeDecodeError at /
> 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
>
> What's going on here is that the memcache.py library does this with
> the passed parameters:
>
> fullcmd = "%s %s %d %d %d\r\n%s" % (cmd, key, flags, time, len(val), val)
>
> Since "key" is often a unicode string, it infects, as it were, the
> rest of the line, forcing "val" to be encoded, then decoded.

I thought I understood the problem until I read this sentence. Now my
brain hurts. I fully understand that the whole string is treated as
Unicode as soon as one argument is Unicode. Why is "val" the problem
here then? What sort of object is "val" and why doesn't unicode(val)
work (aah ... is it going via str(val) and val is non-ASCII? That could
do it).

The error in the traceback suggests it is trying to treat something
*not* as Unicode. I'm a little fuzzy on what's going on.

> It may be that only the memcache backend has this problem, but the
> general solution I'd suggest is to use smart_str on the key given to
> each low-level cache's backend set method. Works-for-me.

Hasn't actually occurred to me to check previously: can memcache handle
non-ASCII data there, because even converting to UTF-8 is going to give
values that are not always understandable to the ascii codec.

> It may also make sense to run on the value, but I imagine that has a
> significant overhead, and I haven't had a problem with it yet....

Assuming the missing key part of this sentence is force_unicode(), it
should not really be worse than running smart_str() (about one extra
function call), at first glance. However, as indicated above, I'll
admit to still being sketchy about the real problem.

If you can guarantee that str(val) will always make sense and be encoded
as UTF-8, then your proposed solution sounds fine. The encoding of
str(val) is important, because we have to be able to understand it when we
pull it out from the cache again later.

Regards,
Malcolm

--
Works better when plugged in.
http://www.pointy-stick.com/blog/

Jeremy Dunck

Jul 12, 2007, 9:55:03 AM
to django-d...@googlegroups.com
On 7/12/07, Malcolm Tredinnick <mal...@pointy-stick.com> wrote:
>
> On Thu, 2007-07-12 at 05:34 -0500, Jeremy Dunck wrote:
...

> > What's going on here is that the memcache.py library does this with
> > the passed parameters:
> >
> > fullcmd = "%s %s %d %d %d\r\n%s" % (cmd, key, flags, time, len(val), val)
> >
> > Since "key" is often a unicode string, it infects, as it were, the
> > rest of the line, forcing "val" to be encoded, then decoded.
>
> I thought I understood the problem until I read this sentence. Now my
> brain hurts. I fully understand that the whole string is treated as
> Unicode as soon as one argument is Unicode. Why is "val" the problem
> here then? What sort of object is "val" and why doesn't unicode(val)
> work (aah ... is it going via str(val) and val is non-ASCII? That could
> do it).

Sorry for not giving more context.

In that quoted line, cmd is a str (created by the library itself), key
is whatever the low-level django API passes in (very likely a
Unicode), and val is a pickled object (that is, arbitrary binary).

When key is Unicode, it forces val to be decoded into Unicode, which
fails, since it's a binary.

At least, I'm pretty darn sure. I *think* I understand this bit-pushing. :)
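
The failure can be reproduced in isolation. Python 3 no longer performs Python 2's implicit ASCII decode when a byte string is mixed with unicode, so this sketch emulates that step with an explicit `.decode("ascii")`; `implicit_decode` is a hypothetical helper name used only for illustration.

```python
import pickle

# Protocol 2 pickles begin with the byte 0x80, which is outside ASCII.
val = pickle.dumps({"answer": 42}, 2)


def implicit_decode(value):
    """Emulate what Python 2 does to a str that is mixed with a unicode object."""
    return value.decode("ascii")


try:
    implicit_decode(val)
    error = None
except UnicodeDecodeError as exc:
    error = exc
```

The captured exception reports byte 0x80 at position 0, matching the traceback at the top of the thread.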

> Hasn't actually occurred to me to check previously: can memcache handle
> non-ASCII data there, because even converting to UTF-8 is going to give
> values that are not always understandable to the ascii codec.
>

/me checks python-memcache code.

python-memcache assumes a str key with no control characters (ord(c)
>= 33) and len(key) < 250.
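
As a rough illustration of those key constraints, here is a sketch of a check on a Python 3 bytes key. `is_valid_key` is a hypothetical helper that follows the description above; it is not python-memcache's own check, which may differ in detail.

```python
def is_valid_key(key):
    """Return True if key is bytes, shorter than 250 bytes, and every byte is >= 33."""
    if not isinstance(key, bytes):
        return False
    if len(key) >= 250:
        return False
    # Bytes below 33 are control characters and the space character.
    return all(byte >= 33 for byte in key)
```

Note that a UTF-8 encoded key passes this check, since every byte of a multi-byte UTF-8 sequence is 0x80 or above.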

The stored value can be any object, but there are a few optimizations.
This is how the marshalling is done:

if isinstance(val, types.StringTypes):
    pass
elif isinstance(val, int):
    flags |= Client._FLAG_INTEGER
    val = "%d" % val
elif isinstance(val, long):
    flags |= Client._FLAG_LONG
    val = "%d" % val
else:
    flags |= Client._FLAG_PICKLE
    val = pickle.dumps(val, 2)

fullcmd = "%s %s %d %d %d\r\n%s" % (cmd, key, flags, time, len(val), val)

The result, fullcmd, is then sent over the wire.
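
For readers on Python 3, the same marshalling idea can be sketched with bytes. This is a loose adaptation rather than python-memcache's code: `marshal` is a hypothetical function, the flag values are assumptions chosen for illustration, and the `long` branch is dropped because Python 3 has a single int type.

```python
import pickle

# Assumed flag values, for illustration only.
FLAG_PICKLE = 1 << 0
FLAG_INTEGER = 1 << 1


def marshal(val):
    """Return (flags, payload) where payload is always bytes."""
    flags = 0
    if isinstance(val, bytes):
        pass  # already an opaque byte string, stored as-is
    elif isinstance(val, int):
        flags |= FLAG_INTEGER
        val = b"%d" % val
    else:
        flags |= FLAG_PICKLE
        val = pickle.dumps(val, 2)
    return flags, val
```

Whatever branch is taken, the payload is a byte string, which is why the key is the only argument that can drag the format operation into unicode.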

So, my assertion is that key is the only possible unicode value, and
that it better be coercible to str using sys.getdefaultencoding(),
because otherwise the string format will die.

cmd and val must both be str, and flags, time, and len(val) are ints
(it's odd that they have StringTypes there, when they clearly don't
handle a Unicode value in the general sense).

My understanding is that smart_str forces a unicode value to str using
encoding='utf-8', and is a no-op when passed a str.

I want to make sure that all parameters there are str; I'm pretty
confident "key" is the only non-str object.

> The encoding of
> str(val) is important, because we have to be able to understand it when we
> pull it out from the cache again later.

I agree, but I don't want to mess with val; I want to force encoding of "key".

Clearer?

Malcolm Tredinnick

Jul 12, 2007, 10:06:43 AM
to django-d...@googlegroups.com
On Thu, 2007-07-12 at 08:55 -0500, Jeremy Dunck wrote:
> On 7/12/07, Malcolm Tredinnick <mal...@pointy-stick.com> wrote:
> >
> > On Thu, 2007-07-12 at 05:34 -0500, Jeremy Dunck wrote:
> ...
> > > What's going on here is that the memcache.py library does this with
> > > the passed parameters:
> > >
> > > fullcmd = "%s %s %d %d %d\r\n%s" % (cmd, key, flags, time, len(val), val)
> > >
> > > Since "key" is often a unicode string, it infects, as it were, the
> > > rest of the line, forcing "val" to be encoded, then decoded.
> >
> > I thought I understood the problem until I read this sentence. Now my
> > brain hurts. I fully understand that the whole string is treated as
> > Unicode as soon as one argument is Unicode. Why is "val" the problem
> > here then? What sort of object is "val" and why doesn't unicode(val)
> > work (aah ... is it going via str(val) and val is non-ASCII? That could
> > do it).
>
> Sorry for not giving more context.
>
> In that quoted line, cmd is a str (created by the library itself), key
> is whatever the low-level django API passes in (very likely a
> Unicode), and val is a pickled object (that is, arbitrary binary).

Okay. That makes things clearer. Memcache is expecting to handle val as
an opaque sequence of bytes here (they are using the binary pickling
format), which is the key point. So your proposed fix looks right to me.

> When key is Unicode, it forces val to be decoded into Unicode, which
> fails, since it's a binary.

Yes, I agree.

Regards,
Malcolm

--
Honk if you love peace and quiet.
http://www.pointy-stick.com/blog/

Chad Maine

Jul 12, 2007, 10:29:07 AM
to django-d...@googlegroups.com
I did notice this bug, but it went away when I switched to cmemcache (a much faster alternative if it's available to you).


Bryan

Jul 12, 2007, 1:50:08 PM
to Django developers
I ran into a similar issue when accessing the memcached python api.

When running django unicode, the value returned from the database was
valid. However, when running the non-unicode version of django, it'd
blow up in my face.

Trying to do a .set with bytestrings that contain non-ASCII char
values doesn't work. I had to do a .encode('UTF-8') on the string I
was attempting to push into memcached, and likewise on pulling it
back out I had to do a .decode('UTF-8').
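
The encode-on-set, decode-on-get round trip described here might look like the following sketch. A plain dict stands in for the memcached client, and `cache_set_text` / `cache_get_text` are hypothetical wrapper names, not part of any memcached API.

```python
# A dict stands in for the cache; a real client would be used instead.
fake_cache = {}


def cache_set_text(key, text):
    """Store text as UTF-8 bytes so the cache only ever sees byte strings."""
    fake_cache[key] = text.encode("utf-8")


def cache_get_text(key):
    """Decode the stored bytes back to text with the same encoding."""
    return fake_cache[key].decode("utf-8")


cache_set_text(b"greeting", "na\u00efve caf\u00e9")
```

The two directions have to agree on the encoding, which is the point Malcolm raises earlier about understanding the value when it comes back out.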

Jeremy Dunck

Jul 12, 2007, 2:24:49 PM
to django-d...@googlegroups.com
On 7/12/07, Bryan <kae...@gmail.com> wrote:
> Trying to do a .set with bytestrings that contain non-ASCII char
> values doesn't work. I had to do a .encode('UTF-8') on the string I
> was attempting to push into memcached, and likewise on pulling it
> back out I had to do a .decode('UTF-8').


When you do Unicode.encode('utf-8'), you are creating a bytestring
with non-ascii char values. I'm not sure how your statements can be
simultaneously true.

At any rate, I'll get a patch done soon.

Bryan

Jul 12, 2007, 4:10:26 PM
to Django developers
Probably my headache is limiting my ability to rationally
articulate. :)


Simon G.

Jul 12, 2007, 7:30:14 PM
to Django developers
#4845 [1] is probably related here in some way, giving this traceback:

PythonHandler django.core.handlers.modpython:
MemcachedStringEncodingError: Keys must be str()'s, not unicode.
Convert your unicode strings using mystring.encode(charset)!

There are a few patches there which force the keys to ASCII, but this
may not be the best solution.
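
One way forcing keys to ASCII can go wrong is sketched below. The patches on the ticket are not reproduced here, so this is only an assumed illustration: `ascii_key_lossy` and `utf8_key` are hypothetical helpers, and dropping non-ASCII characters is just one possible reading of "force to ASCII".

```python
def ascii_key_lossy(key):
    """Drop any character that cannot be represented in ASCII."""
    return key.encode("ascii", "ignore")


def utf8_key(key):
    """Encode the full key as UTF-8 bytes, keeping distinct keys distinct."""
    return key.encode("utf-8")


# Two different text keys that a lossy ASCII conversion maps to the same bytes.
first, second = "caf\u00e9", "caf"
```

With the lossy conversion both keys become the same byte string, so one cache entry would overwrite the other; the UTF-8 versions stay distinct.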

--Simon

[1] http://code.djangoproject.com/ticket/4845

Jeremy Dunck

Jul 16, 2007, 4:57:09 AM
to django-d...@googlegroups.com
On 7/12/07, Simon G. <d...@simon.net.nz> wrote:
>
> #4845 is probably related here in some way, giving this traceback:


I've attached my patch and tests to that ticket.
