svn trunk broken since [4919], encoding problems

Michael Radziej

Apr 5, 2007, 10:06:57 AM

to Django developers

Hi,

changeset [4919] made the escape filter tag return unicode strings instead
of bytestrings, which seems to create problems as soon as you use non-ASCII
characters.

If you see any encoding problems, especially with oldforms or the admin
interface, please take a look at ticket #3924. The exceptions typically
look like this:

'ascii' codec can't decode byte ...

http://code.djangoproject.com/ticket/3924

Michael

--
noris network AG - Deutschherrnstraße 15-19 - D-90429 Nürnberg -
Tel +49-911-9352-0 - Fax +49-911-9352-100
http://www.noris.de - The IT-Outsourcing Company

Vorstand: Ingo Kraupa (Vorsitzender), Joachim Astel, Hansjochen Klenk -
Vorsitzender des Aufsichtsrats: Stefan Schnabel - AG Nürnberg HRB 17689

Malcolm Tredinnick

Apr 5, 2007, 10:55:15 AM

to django-d...@googlegroups.com

On Thu, 2007-04-05 at 16:06 +0200, Michael Radziej wrote:
> Hi,
>
> changeset [4919] made the escape filter tag return unicode strings instead
> of bytestrings, which seems to create problems as soon as you use non-ASCII
> characters.

I've reverted that portion of the change. There's a whole rat's nest of
interacting problems there that it's going to take me a couple of days
to untangle.

Malcolm

Michael Radziej

Apr 5, 2007, 11:06:48 AM

to django-d...@googlegroups.com

Hi Malcolm,

I'm trying to produce a small test case ... but it's hard to nail it down. I
haven't found the exact circumstances that make the bug appear. The test
suite of my own application finds it, but it's too deep to be a useful test
case.

Malcolm Tredinnick

Apr 5, 2007, 11:27:39 AM

to django-d...@googlegroups.com

Hey Michael,

On Thu, 2007-04-05 at 17:06 +0200, Michael Radziej wrote:
> Hi Malcolm,
>
> On Fri, Apr 06, Malcolm Tredinnick wrote:
>
> >
> > On Thu, 2007-04-05 at 16:06 +0200, Michael Radziej wrote:
> > > Hi,
> > >
> > > changeset [4919] made the escape filter tag return unicode strings instead
> > > of bytestrings, which seems to create problems as soon as you use non-ASCII
> > > characters.
> >
> > I've reverted that portion of the change. There's a whole rat's nest of
> > interacting problems there that it's going to take me a couple of days
> > to untangle.
>
> I'm trying to produce a small test case ... but it's hard to nail it down. I
> haven't found the exact circumstances that make the bug appear. The test
> suite of my own application finds it, but it's too deep to be a useful test
> case.

I actually understand the problem now that it's been pointed out. Don't
waste too much time looking into this if you don't want. Your diagnosis
in the ticket is essentially correct. You can see the same problem at
the Python prompt. We have the equivalent of this going on:

>>> l = ['\xc3\x85', 'a', u'\u20ac', 'b']
>>> ''.join(l)
** Kaboom! **

To see it in practice, return non-ASCII bytestrings from a model's
__str__ method, for example, and then use that model in a template. That
should fail with the [4919] change in place.
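
Spelled out, the failure comes from the implicit ASCII decode that Python 2
applies to each bytestring when a join mixes bytestrings and unicode objects.
The sketch below makes that step explicit. It is written for Python 3 (an
assumption, since this thread predates it), and py2_style_join is a
hypothetical helper for illustration, not anything in Django.

```python
def py2_style_join(items):
    """Join mixed bytes/str items, mimicking Python 2's implicit coercion.

    Hypothetical helper: every bytes item is decoded as ASCII first,
    which is the step that blows up on non-ASCII data.
    """
    decoded = [
        item.decode("ascii") if isinstance(item, bytes) else item
        for item in items
    ]
    return "".join(decoded)


# ASCII-only bytestrings go through without complaint.
print(ascii(py2_style_join([b"a", "\u20ac", b"b"])))

# A UTF-8 encoded bytestring (here U+00C5) triggers the decode error.
try:
    py2_style_join([b"\xc3\x85", b"a", "\u20ac", b"b"])
except UnicodeDecodeError as exc:
    print(exc)
```

The message printed by the second call starts with the same
"'ascii' codec can't decode byte" text reported in the ticket.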

The problem is the UTF-8 encoded bytestring at the start there; such
bytestrings need to be decoded back to unicode before the joining. Which is
impossible to do correctly in the general case, because the encoding
could be anything -- we need to assume they are UTF-8 encoded or
something like that. And it would be handy if performance didn't take a
beating with lots of extra decoding. I need to try out a few options.
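
One of the options could look roughly like the sketch below: decode only the
bytestrings, using a single assumed encoding, and leave unicode objects alone.
This is a minimal illustration written for Python 3, not the fix that went
into Django; join_assuming_utf8 and its encoding parameter are hypothetical
names.

```python
def join_assuming_utf8(items, encoding="utf-8"):
    """Decode any bytes items with one assumed encoding, then join.

    The encoding is a convention chosen by the caller; the bytes
    themselves cannot tell us which encoding produced them.
    """
    parts = []
    for item in items:
        if isinstance(item, bytes):
            # Only bytestrings pay the decoding cost; str passes through.
            item = item.decode(encoding)
        parts.append(item)
    return "".join(parts)


result = join_assuming_utf8([b"\xc3\x85", b"a", "\u20ac", b"b"])
print(ascii(result))
```

Bytestrings in any other encoding will either raise UnicodeDecodeError or
decode to the wrong characters, which is the cost of the convention.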

I thought I was being sufficiently careful in my changes and testing,
but, of course, I forgot a whole slab of code -- oldforms! Stupid
brain ... not remembering all the bits. Still, that's why I checked some
stuff in so that I could hear about any inadvertent problems (and it's
why I've backed it out).

Sorry for breaking admin for a day or so. Any other (new) i18n-related
oddities you see, by all means file new tickets. All the care in the
world isn't going to protect me or anybody else against accidental
oversights and situations I haven't thought of.

Regards,
Malcolm

Michael Radziej

Apr 5, 2007, 11:42:49 AM

to django-d...@googlegroups.com

On Fri, Apr 06, Malcolm Tredinnick wrote:

> I actually understand the problem now that it's been pointed out. Don't
> waste too much time looking into this if you don't want. Your diagnosis
> in the ticket is essentially correct. You can see the same problem at
> the Python prompt. We have the equivalent of this going on:
>
> >>> l = ['\xc3\x85', 'a', u'\u20ac', 'b']
> >>> ''.join(l)
> ** Kaboom! **

Yeah, but I don't see how the unicode string enters the template system.

> Sorry for breaking admin for a day or so. Any other (new) i18n-related
> oddities you see, by all means file new tickets. All the care in the
> world isn't going to protect me or anybody else against accidental
> oversights and situations I haven't thought of.

Really, the automatic conversions from unicode to bytestrings and back are
a nuisance in Python. They seem to work as long as you only use ASCII data,
and suddenly fail when exposed to non-ASCII character sets. If joining
bytestrings and unicode strings failed even with ASCII data, it would be
so much easier to spot the problems! I absolutely understand that it's
so hard to find them when your native language only needs ASCII.
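
For comparison, Python 3 behaves the way wished for here: joining bytes with
str raises TypeError regardless of whether the data is ASCII. A small sketch,
assuming a Python 3 interpreter; try_join is a hypothetical helper.

```python
def try_join(items):
    """Return the joined string, or the name of the exception raised."""
    try:
        return "".join(items)
    except TypeError as exc:
        return type(exc).__name__


print(try_join(["a", "b"]))          # all str: joins normally
print(try_join([b"a", "b"]))         # ASCII bytes mixed in: rejected
print(try_join([b"\xc3\x85", "b"]))  # non-ASCII bytes mixed in: rejected too
```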

Just for the record, the problematic operations are:

unicode(s)
s.decode()
s.encode()
str(s)
s.__cmp__()
s.join([stringlist])
s.__add__()

and probably more ...

Malcolm Tredinnick

Apr 5, 2007, 11:50:12 AM

to django-d...@googlegroups.com

On Thu, 2007-04-05 at 17:42 +0200, Michael Radziej wrote:
> On Fri, Apr 06, Malcolm Tredinnick wrote:
>
> > I actually understand the problem now that it's been pointed out. Don't
> > waste too much time looking into this if you don't want. Your diagnosis
> > in the ticket is essentially correct. You can see the same problem at
> > the Python prompt. We have the equivalent of this going on:
> >
> > >>> l = ['\xc3\x85', 'a', u'\u20ac', 'b']
> > >>> ''.join(l)
> > ** Kaboom! **
>
> Yeah, but I don't see how the unicode string enters the template system.
>
> > Sorry for breaking admin for a day or so. Any other (new) i18n-related
> > oddities you see, by all means file new tickets. All the care in the
> > world isn't going to protect me or anybody else against accidental
> > oversights and situations I haven't thought of.
>
> Really, the automatic conversions from unicode to bytestrings and back are
> a nuisance in Python. They seem to work as long as you only use ASCII data,
> and suddenly fail when exposed to non-ASCII character sets. If joining
> bytestrings and unicode strings failed even with ASCII data, it would be
> so much easier to spot the problems! I absolutely understand that it's
> so hard to find them when your native language only needs ASCII.

The problem from a developer point of view is that internationalized
code shouldn't be using bytestrings except at the last step (for strings
intended for output). Python bytestrings carry no information around
with them about their encoding -- they are literally just strings of
bytes. So you can only guess how to turn them back into unicode objects
and Python (correctly) won't guess, preferring to raise an error
instead. This is why joining them is a problem. It's why we are going to
need to internally have a convention such as "all bytestrings are
assumed to be UTF-8" and why somebody using bytestrings in, say, a
KOI8-R encoded Python source file is going to lose. There's no avoiding
that problem. People should only be using u"..." strings in such files,
not bytestrings.
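
To illustrate why the bytes alone are not enough, the sketch below encodes a
Russian word as KOI8-R and then tries to read the bytes back under different
guesses. It is written for Python 3; the sample word and the try_decode helper
are assumptions made for the example.

```python
text = "\u041f\u0440\u0438\u0432\u0435\u0442"  # a Russian word ("Privet")
raw = text.encode("koi8-r")  # bytes as they would sit in a KOI8-R source file


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Only the guess that matches the real encoding recovers the original text.
for guess in ("koi8-r", "utf-8", "latin-1"):
    decoded = try_decode(raw, guess)
    print(guess, ascii(decoded), decoded == text)
```

Nothing in raw records that it came from KOI8-R, so a "bytestrings are UTF-8"
convention cannot recover the original text from it.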

I'm familiar with the problems. Fixing all occurrences is harder than
understanding them, sadly. :-)

We'll get there, though. It's annoying me how often people are reporting
encoding problems, so I accidentally got side-tracked this week into
working on that.

Regards,
Malcolm

Michael Radziej

Apr 5, 2007, 12:02:53 PM

to django-d...@googlegroups.com

On Fri, Apr 06, Malcolm Tredinnick wrote:

> I'm familiar with the problems. Fixing all occurrences is harder than
> understanding them, sadly. :-)

;-)

Well, I wasn't successful with the test case, and I don't know how much exposure
to computers I'll get during Easter, but I've got to leave the office now
...

Have a lot of eggs!
