python_2_unicode_compatible pitfalls

648 views
Skip to first unread message

Mikhail Korobov

unread,
Dec 27, 2012, 2:20:24 PM12/27/12
to django-d...@googlegroups.com
Hi there,

First of all, many kudos for the Python 3.x support in upcoming django 1.5, and for the way it is handled (the approach, the docs, etc)! 

I think there are some pitfalls with @python_2_unicode_compatible decorator as it currently implemented in django (and __str__/__repr__ in general), and want to share the thoughts before the 1.5 release. I'm sorry that this message is pretty vague; it points to some problems with the current approach (some of them are real, some would occur very rarely) but it doesn't propose  the solution for django other than "please review the code once more".

1) @python_2_unicode_compatible doesn't handle __repr__.
   For example, this affects django.db.models.options.Options,
   django.core.files.base.File (and ContentFile),
   django.contrib.admin.models.LogEntry, django.template.base.Variable
   and probably many others (their __repr__ incorrectly returns unicode).

   It also may be the cause why django.db.models.Model.__repr__ doesn't
   follow Python conventions ("__repr__ should be information-rich and
   unambiguous" - unicode values are replaced with "[Bad Unicode data]").
   By the way, the way django detects whether value needs replacing 
   is not correct and doesn't prevent all errors because what
   "u = six.text_type(self)" do for bytestring is decode data using
   sys.getdefaultencoding() while repr is (most?) often used in console,
   where sys.stdout.encoding matters.

2) under Python 2.x __str__ is implemented as __unicode__
   encoded to utf8. This breaks 'print django_obj' when sys.stdout.encoding
   is not utf8 because print uses __str__ (not __unicode__) for custom objects,
   and the terminal expects the result to be encoded in sys.stdout.encoding
   (print encodes unicode strings to sys.stdout.encoding, but doesn't
   use __unicode__ of objects; this is hard-coded in Python 2.x). 
   This may affect REPL in Windows consoles and printing/writing to stdout 
   in management commands.

3) @python_2_unicode_compatible produces incorrect results 
   when applied twice (__str__ is patched by previous decorator application 
   and returns bytestring because of that).
   This is easy to oversight e.g. when applying this decorator to a
   subclass of a class which is wrapped to @python_2_unicode_compatible
   and deleting the overridden __str__ afterwards.

4) __str__ is not always properly implemented for this decorator in django
   code. To work properly with @python_2_unicode_compatible,
   __str__ must return unicode string. This is quite subtle.
   For example, take a look at django.contrib.gis.maps.google.GEvent.
   __str__ is implemented as 
   "return mark_safe('"%s", %s' %(self.event, self.action))",
   but "from __future__ import unicode_literals" is not applied to the file.
   This means that if event and action are Python objects with both __str__
   and __unicode__ methods defined (e.g. object of class wrapped with
   python_2_unicode_compatible) then __str__ would be called for these objects,
   not __unicode__ (because the format string is a bytestring). Generally,
   "%s" % something is a good and correct pattern for __str__ implementation
   (it does the right thing under both Python 2.x and 3.x when
   unicode_literals future import is there), but it is incorrect under Python
   2.x if unicode_literals is not imported.

5) %r is very tricky. If unicode_literals are in effect, or some 
   arguments for string formatting are unicode,
   "%r" % obj would trigger bytes decoding using sys.getdefaultencoding() under
   Python 2.x (unless obj is an unicode string), and if obj.__repr__ returns
   non-ascii text or obj is a bytestring, exception would be raised
   (because sys.getdefaultencoding() is usually ascii).
   This format specifier is used, for example, in a default_error_messages
   for django.db.models.fields.Field; after switching to unicode_literals
   this may start raising UnicodeDecodeExceptions for non-ascii choices
   if they are custom objects (not unicode strings).
   Another example is django.http.response.HttpResponseBase._convert_to_charset
   where BadHeaderError exception is raised: after switching to unicode_literals
   %r format specifier start triggering decoding of "value" using sys.getdefaultencoding()
   which is incorrect because "value" is a bytestring of 'charset' encoding under
   Python 2.x. Another example is django.utils.datastructures.SortedDict:
   its __repr__ uses '%r: %r' % (k, v) for k, v in six.iteritems(self)
   which may fail if key is an unicode string and a value is a bytestring
   or an object with __repr__ returning non-ascii text. Another example
   is django.utils.encoding.DjangoUnicodeDecodeError
   (it has incorrect __str__ by the way because it returns unicode) -
   it uses "%r" for self.obj, with unicode string formatter,
   and this would blow up if __repr__ of obj returns non-ascii text.
   There are other places where %r is used and they all are fragile.

I've implemented an another python_2_unicode_compatible decorator (inspired by django's, the idea is cool) for NLTK: https://github.com/nltk/nltk/blob/2and3/nltk/compat.py#L122 which resolves some of issues above (it handles __repr__, limits __str__ and __repr__ to ascii and supports subclassing better). The article (rather lengthy, with some django bashing :) that provides motivation for the decorator used in NLTK: http://kmike.ru/python-with-strings-attached/ (the code in the article is a bit outdated, it is not the code used in NLTK; NLTK version was improved, but I didn't update the article yet).

Mikhail Korobov

unread,
Dec 27, 2012, 3:26:14 PM12/27/12
to django-d...@googlegroups.com
Oh, my description of (5) is not totally correct: u'%r' % bytestring_value is fine because repr(non_ascii_bytestring) is an escaped 7bit ascii; this mean HttpResponseBase._convert_to_charset is almost fine (bytes would be of incorrect encoding, but this won't raise an exception). The argument about "%r" % object_with_non_ascii__repr__ still apply.

пятница, 28 декабря 2012 г., 1:20:24 UTC+6 пользователь Mikhail Korobov написал:

Aymeric Augustin

unread,
Dec 27, 2012, 7:15:06 PM12/27/12
to django-d...@googlegroups.com
Hi Mikhail,

Thanks for your feedback! Your email touches many topics; can we try
to extract the most important ones and identify what really needs fixing
before the 1.5 release?

1) @python_2_unicode_compatible doesn't handle __repr__.

Indeed, this decorator isn't designed to handle __repr__ — I haven't
even tried.

At first sight, I'd say this isn't a sufficient reason to delay the 1.5 release,
but it's a feature we should consider for 1.6.

   For example, this affects django.db.models.options.Options,
   django.core.files.base.File (and ContentFile),
   django.contrib.admin.models.LogEntry, django.template.base.Variable
   and probably many others (their __repr__ incorrectly returns unicode).

Is it a regression in 1.5? Or did the problem already exist?


2) under Python 2.x __str__ is implemented as __unicode__
   encoded to utf8.

Yes, this is a legacy behavior that I strongly disagree with. It
makes __str__ / __unicode__ handling for Model subclasses
exceedingly sketchy.* The only reason why I didn't remove
it is backwards compatibility.

* I expect future maintainers, including myself, to hate this:

This breaks 'print django_obj' when sys.stdout.encoding
   is not utf8 because print uses __str__ (not __unicode__) for custom objects,
   and the terminal expects the result to be encoded in sys.stdout.encoding
   (print encodes unicode strings to sys.stdout.encoding, but doesn't
   use __unicode__ of objects; this is hard-coded in Python 2.x). 
   This may affect REPL in Windows consoles and printing/writing to stdout 
   in management commands.

Django often defaults to utf-8 rather than try to figure out the
most appropriate charset. Such code generally dates back
to the unicode branch. So, yeah, this problem exists but it's
not really new.


3) @python_2_unicode_compatible produces incorrect results 
   when applied twice (__str__ is patched by previous decorator application 
   and returns bytestring because of that).
   This is easy to oversight e.g. when applying this decorator to a
   subclass of a class which is wrapped to @python_2_unicode_compatible
   and deleting the overridden __str__ afterwards.

This is a good point. I looked at the `_was_fixed` technique you're
proposing below. It does work in more cases, but I'm pretty sure
you can still break it by overriding __str__ or __unicode__ in a
sufficiently complex class hierarchy.


4) __str__ is not always properly implemented for this decorator in django
   code. To work properly with @python_2_unicode_compatible,
   __str__ must return unicode string. This is quite subtle.

This is destabilizing for Python 2 developers, but once you get it,
you get the whole "write Python 3 code that also works under
Python 2" philosophy, and it's documented. I stand by this decision.

   For example, take a look at django.contrib.gis.maps.google.GEvent.
   __str__ is implemented as 
   "return mark_safe('"%s", %s' %(self.event, self.action))",
   but "from __future__ import unicode_literals" is not applied to the file.

django.contrib.gis is a large piece of code that hasn't received as
much attention as core. There's certainly a bunch of bugs like this one.

   This means that if event and action are Python objects with both __str__
   and __unicode__ methods defined (e.g. object of class wrapped with
   python_2_unicode_compatible) then __str__ would be called for these objects,
   not __unicode__ (because the format string is a bytestring). Generally,
   "%s" % something is a good and correct pattern for __str__ implementation
   (it does the right thing under both Python 2.x and 3.x when
   unicode_literals future import is there), but it is incorrect under Python
   2.x if unicode_literals is not imported.

I don't understand. Unless you're using @python_2_unicode_compatible,
__str__ must return a string. `"%s" % foo` works in Python 2 and 3 *unless*
you enable unicode_literals; if you do, the right pattern is `str("%s") % foo`.

5) %r is very tricky. If unicode_literals are in effect, or some 
   arguments for string formatting are unicode,
   "%r" % obj would trigger bytes decoding using sys.getdefaultencoding() under
   Python 2.x (unless obj is an unicode string), and if obj.__repr__ returns
   non-ascii text or obj is a bytestring, exception would be raised
   (because sys.getdefaultencoding() is usually ascii).
   This format specifier is used, for example, in a default_error_messages
   for django.db.models.fields.Field; after switching to unicode_literals
   this may start raising UnicodeDecodeExceptions for non-ascii choices
   if they are custom objects (not unicode strings).
   Another example is django.http.response.HttpResponseBase._convert_to_charset
   where BadHeaderError exception is raised: after switching to unicode_literals
   %r format specifier start triggering decoding of "value" using sys.getdefaultencoding()
   which is incorrect because "value" is a bytestring of 'charset' encoding under
   Python 2.x. Another example is django.utils.datastructures.SortedDict:
   its __repr__ uses '%r: %r' % (k, v) for k, v in six.iteritems(self)
   which may fail if key is an unicode string and a value is a bytestring
   or an object with __repr__ returning non-ascii text. Another example
   is django.utils.encoding.DjangoUnicodeDecodeError
   (it has incorrect __str__ by the way because it returns unicode) -
   it uses "%r" for self.obj, with unicode string formatter,
   and this would blow up if __repr__ of obj returns non-ascii text.
   There are other places where %r is used and they all are fragile.

If I understand correctly, you're saying there's a regression in
modules that define __repr__ methods and where unicode_literals
were enabled?

I've implemented an another python_2_unicode_compatible decorator (inspired by django's, the idea is cool) for NLTK: https://github.com/nltk/nltk/blob/2and3/nltk/compat.py#L122 which resolves some of issues above (it handles __repr__, limits __str__ and __repr__ to ascii and supports subclassing better). The article (rather lengthy, with some django bashing :) that provides motivation for the decorator used in NLTK: http://kmike.ru/python-with-strings-attached/ (the code in the article is a bit outdated, it is not the code used in NLTK; NLTK version was improved, but I didn't update the article yet).

Obviously your version is more elaborate and handles more cases.
The concept of `_was_fixed` is interesting but as explained above
I'm not sure we can make it fully reliable. I'm also concerned about
the transliteration / 7 bit stuff — it smell like something people are
going to complain about or bikeshed forever.

The real question here is -- how much complexity is a core dev willing
to introduce into Django, and defend in front of tens of thousands of
developers for the next few years? As far as I'm concerned, the answer
is "very little", because I'm not very smart and my days only have so
many hours.


To sum up, I think most of your remarks are valid. But taken all together
they aren't easily actionable. You're mixing existing behaviors that we
must keep for backwards-compatibility (even if they're debatable) and
new behaviors that we still have a chance to fix before 1.5 sets them in
stone. And you're requesting a massive lot of changes.

To move forward, I'd suggest to divide and classify all this into smaller
sets of problems, and describe them precisely:
- regressions introduced by Python 3 support (release blockers),
- limitations of the current Python 3 support,
- str / repr method that don't return the correct type,
- legacy design decisions that appear suboptimal,
- etc.

Thanks,

-- 
Aymeric.



Karen Tracey

unread,
Dec 27, 2012, 8:09:47 PM12/27/12
to django-d...@googlegroups.com
On Thu, Dec 27, 2012 at 7:15 PM, Aymeric Augustin <aymeric....@polytechnique.org> wrote:
2) under Python 2.x __str__ is implemented as __unicode__
   encoded to utf8.

Yes, this is a legacy behavior that I strongly disagree with. It
makes __str__ / __unicode__ handling for Model subclasses
exceedingly sketchy.* The only reason why I didn't remove
it is backwards compatibility.

I'm curious what would you do instead, if backward-incompatibility were not a concern?

Karen

Aymeric Augustin

unread,
Dec 28, 2012, 3:37:46 AM12/28/12
to django-d...@googlegroups.com
2012/12/28 Karen Tracey <kmtr...@gmail.com>
In Python __unicode__ is __str__ + decode(sys.getdefaultencoding()):

Django reverses this behavior for Model subclasses and uses utf-8.

I was first exposed to unicode handling in Python via Django and
it took me a long time to discover that writing a __unicode__ method
wasn't the canonical way to define an object's string representation.

If we set aside backwards-compatibility, I would probably
define a conventional, Django-specific method (eg. __display__)
and derive __unicode__ and __str__ from there, according to
Django's rules. Not using Python's default names makes it explicit
that Django adds specific behavior. This vastly simplifies the
definition of __str__ and __unicode__ in Python 2 and 3, and also
avoids the semantic issue of __str__ returning unicode on Python 2
under @python_2_unicode_compatible.

An alternative would be to provide an explicit way to change
Python's behavior -- either a decorator or a class attribute.
Though this isn't going to be very DRY.

I haven't had issues with using utf-8 instead of the system charset,
but Mikhail seems to suggest it can be a problem on Windows.
Using the system charset, like Python does, might be less surprising.
I don't have enough experience to tell. This won't be a problem for
displaying objects in templates, because they're rendered in unicode.

-- 
Aymeric.

Mikhail Korobov

unread,
Dec 28, 2012, 4:07:33 AM12/28/12
to django-d...@googlegroups.com
Hi Aymeric,

FYI: I didn't check anything in my message manually, this is a "code review" based on grepping over django source code for common problems. 

пятница, 28 декабря 2012 г., 6:15:06 UTC+6 пользователь Aymeric Augustin написал:
Hi Mikhail,

Thanks for your feedback! Your email touches many topics; can we try
to extract the most important ones and identify what really needs fixing
before the 1.5 release?

1) @python_2_unicode_compatible doesn't handle __repr__.

Indeed, this decorator isn't designed to handle __repr__ — I haven't
even tried.

At first sight, I'd say this isn't a sufficient reason to delay the 1.5 release,
but it's a feature we should consider for 1.6.

   For example, this affects django.db.models.options.Options,
   django.core.files.base.File (and ContentFile),
   django.contrib.admin.models.LogEntry, django.template.base.Variable
   and probably many others (their __repr__ incorrectly returns unicode).

 
Is it a regression in 1.5? Or did the problem already exist?


This is a regression. It is not very serious (the unicode would contain only ascii characters in all the cases above), but it may have some subtle consequences, and it seems that these methods was written as returning unicode intentionally (e.g. LogEntry.__repr__ returns smart_text(self.action_time), not str(self.action_time)). 

Oh, and my late night review of Model.__repr__ was not correct :) six.text_type(self) has nothing to do with encoding/decoding, it calls __unicode__ method and __repr__ returns utf8-encoded text as it supposed.

 
 
2) under Python 2.x __str__ is implemented as __unicode__
   encoded to utf8.

Yes, this is a legacy behavior that I strongly disagree with. It
makes __str__ / __unicode__ handling for Model subclasses
exceedingly sketchy.* The only reason why I didn't remove
it is backwards compatibility.

* I expect future maintainers, including myself, to hate this:

This breaks 'print django_obj' when sys.stdout.encoding
   is not utf8 because print uses __str__ (not __unicode__) for custom objects,
   and the terminal expects the result to be encoded in sys.stdout.encoding
   (print encodes unicode strings to sys.stdout.encoding, but doesn't
   use __unicode__ of objects; this is hard-coded in Python 2.x). 
   This may affect REPL in Windows consoles and printing/writing to stdout 
   in management commands.

Django often defaults to utf-8 rather than try to figure out the
most appropriate charset. Such code generally dates back
to the unicode branch. So, yeah, this problem exists but it's
not really new.


Defaulting to utf8 is better than trying to figure out the most appropriate charset, I fully agree with that (but prefer 7bit ascii myself which is not guessing). However, this makes REPL and printing not usable in environments where sys.stdout.encoding is not utf8. User usually expects to be able to type an object in console to check it, and this won't work (it may raise an exception or produce mangled output) under Windows (and some older *nixes) when there is a non-ascii data. Isn't it a bug? If it is not a bug, and it is an intended behavior that wouldn't be fixed (is backwards compatibility the only issue?), then maybe it should be documented ("sorry, we don't support cmd.exe under Python 2.x") to save developers time? Or maybe there are workarounds that would fix this? Currently REPL may fail because of django behavior, not because of buggy evil Windows or Python 2.x bugs, so IMHO it is django responsibility to either fix it or tell users what to do. 

 

3) @python_2_unicode_compatible produces incorrect results 
   when applied twice (__str__ is patched by previous decorator application 
   and returns bytestring because of that).
   This is easy to oversight e.g. when applying this decorator to a
   subclass of a class which is wrapped to @python_2_unicode_compatible
   and deleting the overridden __str__ afterwards.

This is a good point. I looked at the `_was_fixed` technique you're
proposing below. It does work in more cases, but I'm pretty sure
you can still break it by overriding __str__ or __unicode__ in a
sufficiently complex class hierarchy.



__unicode__ becomes pretty internal with @python_2_unicode_compatible, and overriding it may break things. I was about to introduce checking for overridden __unicode__ (by something like adding _nltk_is_generated attribute), but decided not to be too smart (there may be reasons to override __unicode__ I don't know about).

On the other hand, overriding __str__ is fully legit. Can you please create an example on how can __str__ be broken in complex class hierarchy with _was_fixed?

4) __str__ is not always properly implemented for this decorator in django
   code. To work properly with @python_2_unicode_compatible,
   __str__ must return unicode string. This is quite subtle.

This is destabilizing for Python 2 developers, but once you get it,
you get the whole "write Python 3 code that also works under
Python 2" philosophy, and it's documented. I stand by this decision.


I mean that sometimes __str__ is not properly implemented in django code itself :)  @python_2_unicode_compatible + __str__ returning unicode is a good decision in my view. Check all __str__ methods in files where unicode_literals is not in effect (django.contrib.comments.Comment.__str__ and CommentFlag has the same subtle, almost theoretical issue which won't occur often because self.name and self.comment would be unicode strings most of time). 
   For example, take a look at django.contrib.gis.maps.google.GEvent.
   __str__ is implemented as 
   "return mark_safe('"%s", %s' %(self.event, self.action))",
   but "from __future__ import unicode_literals" is not applied to the file.

django.contrib.gis is a large piece of code that hasn't received as
much attention as core. There's certainly a bunch of bugs like this one.

   This means that if event and action are Python objects with both __str__
   and __unicode__ methods defined (e.g. object of class wrapped with
   python_2_unicode_compatible) then __str__ would be called for these objects,
   not __unicode__ (because the format string is a bytestring). Generally,
   "%s" % something is a good and correct pattern for __str__ implementation
   (it does the right thing under both Python 2.x and 3.x when
   unicode_literals future import is there), but it is incorrect under Python
   2.x if unicode_literals is not imported.

I don't understand. Unless you're using @python_2_unicode_compatible,
__str__ must return a string. `"%s" % foo` works in Python 2 and 3 *unless*
you enable unicode_literals; if you do, the right pattern is `str("%s") % foo`.

I mean "correct pattern for __str__" for classes that use @python_2_unicode_compatible.
No, the regression is in modules that use "%r" format specifier and the string containing this specifier becomes unicode. "byte string %r" % obj calls obj.__repr__() and uses the result directly, while u"unicode_string %r" % obj calls obj.__repr__() and decodes the result using sys.getdefaultencoding().
 

I've implemented an another python_2_unicode_compatible decorator (inspired by django's, the idea is cool) for NLTK: https://github.com/nltk/nltk/blob/2and3/nltk/compat.py#L122 which resolves some of issues above (it handles __repr__, limits __str__ and __repr__ to ascii and supports subclassing better). The article (rather lengthy, with some django bashing :) that provides motivation for the decorator used in NLTK: http://kmike.ru/python-with-strings-attached/ (the code in the article is a bit outdated, it is not the code used in NLTK; NLTK version was improved, but I didn't update the article yet).

Obviously your version is more elaborate and handles more cases.
The concept of `_was_fixed` is interesting but as explained above
I'm not sure we can make it fully reliable. I'm also concerned about
the transliteration / 7 bit stuff — it smell like something people are
going to complain about or bikeshed forever.


The real question here is -- how much complexity is a core dev willing
to introduce into Django, and defend in front of tens of thousands of
developers for the next few years? As far as I'm concerned, the answer
is "very little", because I'm not very smart and my days only have so
many hours.


I agree that this all is too elaborate (awfully elaborate). But I don't know how to fix REPL issues in another way. The alternative (as far as I can tell) is not fixing an real issue in order to simplify the library implementation. This may be a legit approach if users of library are aware of the limitation though.

Mikhail Korobov

unread,
Dec 28, 2012, 4:14:17 AM12/28/12
to django-d...@googlegroups.com


пятница, 28 декабря 2012 г., 14:37:46 UTC+6 пользователь Aymeric Augustin написал:
There is no a single "system charset" and that's the problem :) 

>>> import sys
>>> import locale
>>> sys.getdefaultencoding()
'ascii'
>>> locale.getpreferredencoding()
'cp1251'
>>> sys.stdout.encoding
'cp866'
(REPL in Windows XP with Russian locale with all defaults; sys.stdout.encoding would be None for scripts). It is not possible (as far as I can tell) to encode __str__ or __repr__ results to some encoding (except 7bit ascii) and expect it to work reliably.

 
-- 
Aymeric.

Reply all
Reply to author
Forward
0 new messages