python_2_unicode_compatible pitfalls

648 views

Skip to first unread message

Mikhail Korobov

unread,

Dec 27, 2012, 2:20:24 PM12/27/12

to django-d...@googlegroups.com

Hi there,

First of all, many kudos for the Python 3.x support in upcoming django 1.5, and for the way it is handled (the approach, the docs, etc)!

I think there are some pitfalls with @python_2_unicode_compatible decorator as it currently implemented in django (and __str__/__repr__ in general), and want to share the thoughts before the 1.5 release. I'm sorry that this message is pretty vague; it points to some problems with the current approach (some of them are real, some would occur very rarely) but it doesn't propose the solution for django other than "please review the code once more".

1) @python_2_unicode_compatible doesn't handle __repr__.

For example, this affects django.db.models.options.Options,

django.core.files.base.File (and ContentFile),

django.contrib.admin.models.LogEntry, django.template.base.Variable

and probably many others (their __repr__ incorrectly returns unicode).

It also may be the cause why django.db.models.Model.__repr__ doesn't

follow Python conventions ("__repr__ should be information-rich and

unambiguous" - unicode values are replaced with "[Bad Unicode data]").

By the way, the way django detects whether value needs replacing

is not correct and doesn't prevent all errors because what

"u = six.text_type(self)" do for bytestring is decode data using

sys.getdefaultencoding() while repr is (most?) often used in console,

where sys.stdout.encoding matters.

2) under Python 2.x __str__ is implemented as __unicode__

encoded to utf8. This breaks 'print django_obj' when sys.stdout.encoding

is not utf8 because print uses __str__ (not __unicode__) for custom objects,

and the terminal expects the result to be encoded in sys.stdout.encoding

(print encodes unicode strings to sys.stdout.encoding, but doesn't

use __unicode__ of objects; this is hard-coded in Python 2.x).

This may affect REPL in Windows consoles and printing/writing to stdout

in management commands.

3) @python_2_unicode_compatible produces incorrect results

when applied twice (__str__ is patched by previous decorator application

and returns bytestring because of that).

This is easy to oversight e.g. when applying this decorator to a

subclass of a class which is wrapped to @python_2_unicode_compatible

and deleting the overridden __str__ afterwards.

4) __str__ is not always properly implemented for this decorator in django

code. To work properly with @python_2_unicode_compatible,

__str__ must return unicode string. This is quite subtle.

For example, take a look at django.contrib.gis.maps.google.GEvent.

__str__ is implemented as

"return mark_safe('"%s", %s' %(self.event, self.action))",

but "from __future__ import unicode_literals" is not applied to the file.

This means that if event and action are Python objects with both __str__

and __unicode__ methods defined (e.g. object of class wrapped with

python_2_unicode_compatible) then __str__ would be called for these objects,

not __unicode__ (because the format string is a bytestring). Generally,

"%s" % something is a good and correct pattern for __str__ implementation

(it does the right thing under both Python 2.x and 3.x when

unicode_literals future import is there), but it is incorrect under Python

2.x if unicode_literals is not imported.

5) %r is very tricky. If unicode_literals are in effect, or some

arguments for string formatting are unicode,

"%r" % obj would trigger bytes decoding using sys.getdefaultencoding() under

Python 2.x (unless obj is an unicode string), and if obj.__repr__ returns

non-ascii text or obj is a bytestring, exception would be raised

(because sys.getdefaultencoding() is usually ascii).

This format specifier is used, for example, in a default_error_messages

for django.db.models.fields.Field; after switching to unicode_literals

this may start raising UnicodeDecodeExceptions for non-ascii choices

if they are custom objects (not unicode strings).

Another example is django.http.response.HttpResponseBase._convert_to_charset

where BadHeaderError exception is raised: after switching to unicode_literals

%r format specifier start triggering decoding of "value" using sys.getdefaultencoding()

which is incorrect because "value" is a bytestring of 'charset' encoding under

Python 2.x. Another example is django.utils.datastructures.SortedDict:

its __repr__ uses '%r: %r' % (k, v) for k, v in six.iteritems(self)

which may fail if key is an unicode string and a value is a bytestring

or an object with __repr__ returning non-ascii text. Another example

is django.utils.encoding.DjangoUnicodeDecodeError

(it has incorrect __str__ by the way because it returns unicode) -

it uses "%r" for self.obj, with unicode string formatter,

and this would blow up if __repr__ of obj returns non-ascii text.

There are other places where %r is used and they all are fragile.

I've implemented an another python_2_unicode_compatible decorator (inspired by django's, the idea is cool) for NLTK: https://github.com/nltk/nltk/blob/2and3/nltk/compat.py#L122 which resolves some of issues above (it handles __repr__, limits __str__ and __repr__ to ascii and supports subclassing better). The article (rather lengthy, with some django bashing :) that provides motivation for the decorator used in NLTK: http://kmike.ru/python-with-strings-attached/ (the code in the article is a bit outdated, it is not the code used in NLTK; NLTK version was improved, but I didn't update the article yet).

Mikhail Korobov

unread,

Dec 27, 2012, 3:26:14 PM12/27/12

to django-d...@googlegroups.com

Oh, my description of (5) is not totally correct: u'%r' % bytestring_value is fine because repr(non_ascii_bytestring) is an escaped 7bit ascii; this mean HttpResponseBase._convert_to_charset is almost fine (bytes would be of incorrect encoding, but this won't raise an exception). The argument about "%r" % object_with_non_ascii__repr__ still apply.

пятница, 28 декабря 2012 г., 1:20:24 UTC+6 пользователь Mikhail Korobov написал:

Aymeric Augustin

unread,

Dec 27, 2012, 7:15:06 PM12/27/12

to django-d...@googlegroups.com

Hi Mikhail,

Thanks for your feedback! Your email touches many topics; can we try

to extract the most important ones and identify what really needs fixing

before the 1.5 release?

1) @python_2_unicode_compatible doesn't handle __repr__.

Indeed, this decorator isn't designed to handle __repr__ — I haven't

even tried.

At first sight, I'd say this isn't a sufficient reason to delay the 1.5 release,

but it's a feature we should consider for 1.6.

For example, this affects django.db.models.options.Options,
django.core.files.base.File (and ContentFile),
django.contrib.admin.models.LogEntry, django.template.base.Variable
and probably many others (their __repr__ incorrectly returns unicode).

Is it a regression in 1.5? Or did the problem already exist?

2) under Python 2.x __str__ is implemented as __unicode__
encoded to utf8.

Yes, this is a legacy behavior that I strongly disagree with. It

makes __str__ / __unicode__ handling for Model subclasses

exceedingly sketchy.* The only reason why I didn't remove

it is backwards compatibility.

* I expect future maintainers, including myself, to hate this:

https://github.com/django/django/commit/2ea80b94

This breaks 'print django_obj' when sys.stdout.encoding
is not utf8 because print uses __str__ (not __unicode__) for custom objects,
and the terminal expects the result to be encoded in sys.stdout.encoding
(print encodes unicode strings to sys.stdout.encoding, but doesn't
use __unicode__ of objects; this is hard-coded in Python 2.x).
This may affect REPL in Windows consoles and printing/writing to stdout
in management commands.

Django often defaults to utf-8 rather than try to figure out the

most appropriate charset. Such code generally dates back

to the unicode branch. So, yeah, this problem exists but it's

not really new.

3) @python_2_unicode_compatible produces incorrect results
when applied twice (__str__ is patched by previous decorator application
and returns bytestring because of that).
This is easy to oversight e.g. when applying this decorator to a
subclass of a class which is wrapped to @python_2_unicode_compatible
and deleting the overridden __str__ afterwards.

This is a good point. I looked at the `_was_fixed` technique you're

proposing below. It does work in more cases, but I'm pretty sure

you can still break it by overriding __str__ or __unicode__ in a

sufficiently complex class hierarchy.

4) __str__ is not always properly implemented for this decorator in django
code. To work properly with @python_2_unicode_compatible,
__str__ must return unicode string. This is quite subtle.

This is destabilizing for Python 2 developers, but once you get it,

you get the whole "write Python 3 code that also works under

Python 2" philosophy, and it's documented. I stand by this decision.

For example, take a look at django.contrib.gis.maps.google.GEvent.
__str__ is implemented as
"return mark_safe('"%s", %s' %(self.event, self.action))",
but "from __future__ import unicode_literals" is not applied to the file.

django.contrib.gis is a large piece of code that hasn't received as

much attention as core. There's certainly a bunch of bugs like this one.

This means that if event and action are Python objects with both __str__
and __unicode__ methods defined (e.g. object of class wrapped with
python_2_unicode_compatible) then __str__ would be called for these objects,
not __unicode__ (because the format string is a bytestring). Generally,
"%s" % something is a good and correct pattern for __str__ implementation
(it does the right thing under both Python 2.x and 3.x when
unicode_literals future import is there), but it is incorrect under Python
2.x if unicode_literals is not imported.

I don't understand. Unless you're using @python_2_unicode_compatible,

__str__ must return a string. `"%s" % foo` works in Python 2 and 3 *unless*

you enable unicode_literals; if you do, the right pattern is `str("%s") % foo`.

5) %r is very tricky. If unicode_literals are in effect, or some
arguments for string formatting are unicode,
"%r" % obj would trigger bytes decoding using sys.getdefaultencoding() under
Python 2.x (unless obj is an unicode string), and if obj.__repr__ returns
non-ascii text or obj is a bytestring, exception would be raised
(because sys.getdefaultencoding() is usually ascii).
This format specifier is used, for example, in a default_error_messages
for django.db.models.fields.Field; after switching to unicode_literals
this may start raising UnicodeDecodeExceptions for non-ascii choices
if they are custom objects (not unicode strings).
Another example is django.http.response.HttpResponseBase._convert_to_charset
where BadHeaderError exception is raised: after switching to unicode_literals
%r format specifier start triggering decoding of "value" using sys.getdefaultencoding()
which is incorrect because "value" is a bytestring of 'charset' encoding under
Python 2.x. Another example is django.utils.datastructures.SortedDict:
its __repr__ uses '%r: %r' % (k, v) for k, v in six.iteritems(self)
which may fail if key is an unicode string and a value is a bytestring
or an object with __repr__ returning non-ascii text. Another example
is django.utils.encoding.DjangoUnicodeDecodeError
(it has incorrect __str__ by the way because it returns unicode) -
it uses "%r" for self.obj, with unicode string formatter,
and this would blow up if __repr__ of obj returns non-ascii text.
There are other places where %r is used and they all are fragile.

If I understand correctly, you're saying there's a regression in

modules that define __repr__ methods and where unicode_literals

were enabled?

I've implemented an another python_2_unicode_compatible decorator (inspired by django's, the idea is cool) for NLTK: https://github.com/nltk/nltk/blob/2and3/nltk/compat.py#L122 which resolves some of issues above (it handles __repr__, limits __str__ and __repr__ to ascii and supports subclassing better). The article (rather lengthy, with some django bashing :) that provides motivation for the decorator used in NLTK: http://kmike.ru/python-with-strings-attached/ (the code in the article is a bit outdated, it is not the code used in NLTK; NLTK version was improved, but I didn't update the article yet).

Obviously your version is more elaborate and handles more cases.

The concept of `_was_fixed` is interesting but as explained above

I'm not sure we can make it fully reliable. I'm also concerned about

the transliteration / 7 bit stuff — it smell like something people are

going to complain about or bikeshed forever.

The real question here is -- how much complexity is a core dev willing

to introduce into Django, and defend in front of tens of thousands of

developers for the next few years? As far as I'm concerned, the answer

is "very little", because I'm not very smart and my days only have so

many hours.

To sum up, I think most of your remarks are valid. But taken all together

they aren't easily actionable. You're mixing existing behaviors that we

must keep for backwards-compatibility (even if they're debatable) and

new behaviors that we still have a chance to fix before 1.5 sets them in

stone. And you're requesting a massive lot of changes.

To move forward, I'd suggest to divide and classify all this into smaller

sets of problems, and describe them precisely:

- regressions introduced by Python 3 support (release blockers),

- limitations of the current Python 3 support,

- str / repr method that don't return the correct type,

- legacy design decisions that appear suboptimal,

- etc.

Thanks,

Aymeric.

Karen Tracey

unread,

Dec 27, 2012, 8:09:47 PM12/27/12

to django-d...@googlegroups.com

On Thu, Dec 27, 2012 at 7:15 PM, Aymeric Augustin <aymeric....@polytechnique.org> wrote:

2) under Python 2.x __str__ is implemented as __unicode__
encoded to utf8.

Yes, this is a legacy behavior that I strongly disagree with. It
makes __str__ / __unicode__ handling for Model subclasses
exceedingly sketchy.* The only reason why I didn't remove

it is backwards compatibility.

I'm curious what would you do instead, if backward-incompatibility were not a concern?

Karen

Aymeric Augustin

unread,

Dec 28, 2012, 3:37:46 AM12/28/12

to django-d...@googlegroups.com

2012/12/28 Karen Tracey <kmtr...@gmail.com>

In Python __unicode__ is __str__ + decode(sys.getdefaultencoding()):

http://docs.python.org/2/reference/datamodel.html#object.__unicode__

Django reverses this behavior for Model subclasses and uses utf-8.

I was first exposed to unicode handling in Python via Django and

it took me a long time to discover that writing a __unicode__ method

wasn't the canonical way to define an object's string representation.

If we set aside backwards-compatibility, I would probably

define a conventional, Django-specific method (eg. __display__)

and derive __unicode__ and __str__ from there, according to

Django's rules. Not using Python's default names makes it explicit

that Django adds specific behavior. This vastly simplifies the

definition of __str__ and __unicode__ in Python 2 and 3, and also

avoids the semantic issue of __str__ returning unicode on Python 2

under @python_2_unicode_compatible.

An alternative would be to provide an explicit way to change

Python's behavior -- either a decorator or a class attribute.

Though this isn't going to be very DRY.

I haven't had issues with using utf-8 instead of the system charset,

but Mikhail seems to suggest it can be a problem on Windows.

Using the system charset, like Python does, might be less surprising.

I don't have enough experience to tell. This won't be a problem for

displaying objects in templates, because they're rendered in unicode.

Aymeric.

Mikhail Korobov

unread,

Dec 28, 2012, 4:07:33 AM12/28/12

to django-d...@googlegroups.com

Hi Aymeric,

FYI: I didn't check anything in my message manually, this is a "code review" based on grepping over django source code for common problems.

пятница, 28 декабря 2012 г., 6:15:06 UTC+6 пользователь Aymeric Augustin написал:

Hi Mikhail,

Thanks for your feedback! Your email touches many topics; can we try
to extract the most important ones and identify what really needs fixing
before the 1.5 release?

1) @python_2_unicode_compatible doesn't handle __repr__.

Indeed, this decorator isn't designed to handle __repr__ — I haven't
even tried.

At first sight, I'd say this isn't a sufficient reason to delay the 1.5 release,
but it's a feature we should consider for 1.6.

For example, this affects django.db.models.options.Options,
django.core.files.base.File (and ContentFile),
django.contrib.admin.models.LogEntry, django.template.base.Variable
and probably many others (their __repr__ incorrectly returns unicode).

Is it a regression in 1.5? Or did the problem already exist?

This is a regression. It is not very serious (the unicode would contain only ascii characters in all the cases above), but it may have some subtle consequences, and it seems that these methods was written as returning unicode intentionally (e.g. LogEntry.__repr__ returns smart_text(self.action_time), not str(self.action_time)).

Oh, and my late night review of Model.__repr__ was not correct :) six.text_type(self) has nothing to do with encoding/decoding, it calls __unicode__ method and __repr__ returns utf8-encoded text as it supposed.

2) under Python 2.x __str__ is implemented as __unicode__
encoded to utf8.

Yes, this is a legacy behavior that I strongly disagree with. It
makes __str__ / __unicode__ handling for Model subclasses
exceedingly sketchy.* The only reason why I didn't remove
it is backwards compatibility.

* I expect future maintainers, including myself, to hate this:
https://github.com/django/django/commit/2ea80b94

This breaks 'print django_obj' when sys.stdout.encoding
is not utf8 because print uses __str__ (not __unicode__) for custom objects,
and the terminal expects the result to be encoded in sys.stdout.encoding
(print encodes unicode strings to sys.stdout.encoding, but doesn't
use __unicode__ of objects; this is hard-coded in Python 2.x).
This may affect REPL in Windows consoles and printing/writing to stdout
in management commands.

Django often defaults to utf-8 rather than try to figure out the
most appropriate charset. Such code generally dates back
to the unicode branch. So, yeah, this problem exists but it's
not really new.

Defaulting to utf8 is better than trying to figure out the most appropriate charset, I fully agree with that (but prefer 7bit ascii myself which is not guessing). However, this makes REPL and printing not usable in environments where sys.stdout.encoding is not utf8. User usually expects to be able to type an object in console to check it, and this won't work (it may raise an exception or produce mangled output) under Windows (and some older *nixes) when there is a non-ascii data. Isn't it a bug? If it is not a bug, and it is an intended behavior that wouldn't be fixed (is backwards compatibility the only issue?), then maybe it should be documented ("sorry, we don't support cmd.exe under Python 2.x") to save developers time? Or maybe there are workarounds that would fix this? Currently REPL may fail because of django behavior, not because of buggy evil Windows or Python 2.x bugs, so IMHO it is django responsibility to either fix it or tell users what to do.

3) @python_2_unicode_compatible produces incorrect results
when applied twice (__str__ is patched by previous decorator application
and returns bytestring because of that).
This is easy to oversight e.g. when applying this decorator to a
subclass of a class which is wrapped to @python_2_unicode_compatible
and deleting the overridden __str__ afterwards.

This is a good point. I looked at the `_was_fixed` technique you're
proposing below. It does work in more cases, but I'm pretty sure
you can still break it by overriding __str__ or __unicode__ in a
sufficiently complex class hierarchy.

__unicode__ becomes pretty internal with @python_2_unicode_compatible, and overriding it may break things. I was about to introduce checking for overridden __unicode__ (by something like adding _nltk_is_generated attribute), but decided not to be too smart (there may be reasons to override __unicode__ I don't know about).

On the other hand, overriding __str__ is fully legit. Can you please create an example on how can __str__ be broken in complex class hierarchy with _was_fixed?

4) __str__ is not always properly implemented for this decorator in django
code. To work properly with @python_2_unicode_compatible,
__str__ must return unicode string. This is quite subtle.

This is destabilizing for Python 2 developers, but once you get it,
you get the whole "write Python 3 code that also works under
Python 2" philosophy, and it's documented. I stand by this decision.

I mean that sometimes __str__ is not properly implemented in django code itself :) @python_2_unicode_compatible + __str__ returning unicode is a good decision in my view. Check all __str__ methods in files where unicode_literals is not in effect (django.contrib.comments.Comment.__str__ and CommentFlag has the same subtle, almost theoretical issue which won't occur often because self.name and self.comment would be unicode strings most of time).

For example, take a look at django.contrib.gis.maps.google.GEvent.
__str__ is implemented as
"return mark_safe('"%s", %s' %(self.event, self.action))",
but "from __future__ import unicode_literals" is not applied to the file.

django.contrib.gis is a large piece of code that hasn't received as
much attention as core. There's certainly a bunch of bugs like this one.

This means that if event and action are Python objects with both __str__
and __unicode__ methods defined (e.g. object of class wrapped with
python_2_unicode_compatible) then __str__ would be called for these objects,
not __unicode__ (because the format string is a bytestring). Generally,
"%s" % something is a good and correct pattern for __str__ implementation
(it does the right thing under both Python 2.x and 3.x when
unicode_literals future import is there), but it is incorrect under Python
2.x if unicode_literals is not imported.

I don't understand. Unless you're using @python_2_unicode_compatible,
__str__ must return a string. `"%s" % foo` works in Python 2 and 3 *unless*
you enable unicode_literals; if you do, the right pattern is `str("%s") % foo`.

I mean "correct pattern for __str__" for classes that use @python_2_unicode_compatible.

No, the regression is in modules that use "%r" format specifier and the string containing this specifier becomes unicode. "byte string %r" % obj calls obj.__repr__() and uses the result directly, while u"unicode_string %r" % obj calls obj.__repr__() and decodes the result using sys.getdefaultencoding().

I've implemented an another python_2_unicode_compatible decorator (inspired by django's, the idea is cool) for NLTK: https://github.com/nltk/nltk/blob/2and3/nltk/compat.py#L122 which resolves some of issues above (it handles __repr__, limits __str__ and __repr__ to ascii and supports subclassing better). The article (rather lengthy, with some django bashing :) that provides motivation for the decorator used in NLTK: http://kmike.ru/python-with-strings-attached/ (the code in the article is a bit outdated, it is not the code used in NLTK; NLTK version was improved, but I didn't update the article yet).

Obviously your version is more elaborate and handles more cases.
The concept of `_was_fixed` is interesting but as explained above
I'm not sure we can make it fully reliable. I'm also concerned about
the transliteration / 7 bit stuff — it smell like something people are
going to complain about or bikeshed forever.

The real question here is -- how much complexity is a core dev willing
to introduce into Django, and defend in front of tens of thousands of
developers for the next few years? As far as I'm concerned, the answer
is "very little", because I'm not very smart and my days only have so
many hours.

I agree that this all is too elaborate (awfully elaborate). But I don't know how to fix REPL issues in another way. The alternative (as far as I can tell) is not fixing an real issue in order to simplify the library implementation. This may be a legit approach if users of library are aware of the limitation though.

Mikhail Korobov

unread,

Dec 28, 2012, 4:14:17 AM12/28/12

to django-d...@googlegroups.com

пятница, 28 декабря 2012 г., 14:37:46 UTC+6 пользователь Aymeric Augustin написал:

There is no a single "system charset" and that's the problem :)

>>> import sys
>>> import locale
>>> sys.getdefaultencoding()
'ascii'
>>> locale.getpreferredencoding()
'cp1251'
>>> sys.stdout.encoding
'cp866'

(REPL in Windows XP with Russian locale with all defaults; sys.stdout.encoding would be None for scripts). It is not possible (as far as I can tell) to encode __str__ or __repr__ results to some encoding (except 7bit ascii) and expect it to work reliably.