Unicodification of Django

Andrey Golovizin

Jun 28, 2006, 6:42:02 AM
to django-d...@googlegroups.com
Hi,

I have been using Django for about half a year and it rocks. Indeed, it would
rock even more if it switched from UTF-8 bytestrings to unicode strings
everywhere.

The main drawbacks of using UTF-8 strings are:
- regexps won't work on utf-8 bytestrings containing non-ASCII characters;
- lower(), upper(), etc. won't work on utf-8 bytestrings containing non-ASCII
characters;
- slicing, like s[0:3], won't work on utf-8 bytestrings containing non-ASCII
characters.
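
To make the three drawbacks concrete, here is a minimal sketch. It assumes
Python 3 syntax, where `bytes` stands in for the UTF-8 bytestrings discussed
here and `str` for unicode strings; the sample word is made up for
illustration.

```python
import re

text = "Привет"              # sample word, six Cyrillic letters (illustrative data)
raw = text.encode("utf-8")   # the same word as a UTF-8 bytestring, two bytes per letter

# Regexps: on a bytestring "." matches a single byte, not a character.
byte_matches = re.findall(rb".", raw)
char_matches = re.findall(r".", text)

# Case conversion: bytes.upper() only changes ASCII letters.
raw_upper = raw.upper()
text_upper = text.upper()

# Slicing: the first three bytes end in the middle of the second letter.
raw_slice = raw[0:3]
text_slice = text[0:3]
```

On the bytestring the regexp finds 12 matches instead of 6, upper() leaves
the data unchanged, and the slice ends on half a character; the unicode
string behaves as expected in all three cases.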

So, what's stopping Django from switching to unicode? Is someone working on
it? And finally, what should I do to see my sweet Django fully
unicode-aware? :)

Andrey

hugo

Jun 28, 2006, 6:57:41 AM
to Django developers
Hi,

> So, what's stopping Django from switching to unicode? Is someone working on
> it? And finally, what should I do to see my sweet Django fully
> unicode-aware? :)

Well, as a start, take a look at the impact analysis page at
http://code.djangoproject.com/wiki/UnicodeInDjango and contribute
updates to that list with regard to m-r. :)

I started collecting that stuff to see what would actually have to
be done for unicodefication of Django, but due to other work-related
projects that didn't include Django stuff, I am a bit out of the loop.

The problem with unicodefication is that it will most definitely not be
backward compatible in all circumstances, so it really would be a good
idea to have a list of things that are affected by it.

bye, Georg

Gábor Farkas

Jun 28, 2006, 7:07:20 AM
to django-d...@googlegroups.com

what i think we are missing the most is the opinion of the "main"
developers (project owners?) (adrian, malcolm, jacob etc.) about
unicode-ification. if they think we should switch django completely to
unicode, then fine. but if they think that django should still support
bytestrings, i really don't see how we could do the unicode-ification
without breaking backwards compatibility.


gabor

Jacob Kaplan-Moss

Jun 28, 2006, 8:20:24 AM
to django-d...@googlegroups.com
On Jun 28, 2006, at 6:07 AM, Gábor Farkas wrote:
> what i think we are missing the most is the opinion of the "main"
> developers (project owners?) (adrian, malcolm, jacob etc.) about
> unicode-ification. if they think we should switch django completely to
> unicode, then fine. but if they think that django should still support
> bytestrings, i really don't see how we could do the unicode-ification
> without breaking backwards compatibility.

In a nutshell: I think it's too much work, with too many backwards-
incompatible changes, with too little payoff.

Let me expand a bit on each of those points:

"Too much work..." -- there's quite a bit that would need to be
changed, and a number of sticky problems to be solved. Just one
example is the issue of template encodings -- do we need to start
indicating that a certain template is UTF-8 or whatever?

"... with too many backwards-incompatible changes ..." -- as Hugo
points out, this will break a lot of existing code. My experience is
that Unicode issues are the worst types of bugs since they only crop
up when dealing with particular data.

"... with too little payoff." -- right now it's completely possible
to nicely handle Unicode data in Django as long as you're careful.
Yes, it's not as easy as it might be, but the net result of a Unicode-
ification would be an incremental improvement at best.

So I think -- for now -- there are more important places to spend our
energy.

Jacob

Ivan Sagalaev

Jun 28, 2006, 8:53:11 AM
to django-d...@googlegroups.com
Jacob Kaplan-Moss wrote:
> Just one
> example is the issue of template encodings -- do we need to start
> indicating that a certain template is UTF-8 or whatever?

Maybe I don't understand what you mean... But this problem is not
related to whether the internals are in unicode or not. Most templates are
fed to the browser and they have to be encoded in DEFAULT_CHARSET already
(which is UTF-8 in most cases, unless someone has to deal with legacy
code). So this issue is already here because you have to feed string
data to templates also in UTF-8.

> "... with too little payoff." -- right now it's completely possible
> to nicely handle Unicode data in Django as long as you're careful.

With this I agree (with the small addition that I still think we should
commit a patch fixing many template filters for non-ascii letters).

Gábor Farkas

Jun 28, 2006, 8:58:56 AM
to django-d...@googlegroups.com

thanks a lot for the clarification.
i understand all the points you raised, and think that they all are
valid points. i personally think that it still would be better to switch
django to unicode, but i can live also with django being in bytestring,
no problem :-)

btw. regarding your last sentence:

1. i think there never will be a better situation for such a change.
after 1.0 is released, there will be no way to do it (well, except doing
it in 2.0)

2. 'to spend our energy'. i think it's a little more complicated. if
someone is willing to help-with/work-on django-unicode, it does not mean
that otherwise he would work on let's say model-validation. maybe other
django tasks do not interest him etc. what i want to say is that imho
it's not that a developer has his django-time that he spends on whatever
is django-related. people usually work on things that are fun for them to
work on.

but as i said. if the devels say no unicode-django, then no
unicode-django, no problem :-)

gabor

Gábor Farkas

Jun 28, 2006, 9:08:18 AM
to django-d...@googlegroups.com
Gábor Farkas wrote:

> Jacob Kaplan-Moss wrote:
>>
>> So I think -- for now -- there are more important places to spend our
>> energy.
>
>
> 2. 'to spend our energy'. i think it's a little more complicated. if
> someone is willing to help-with/work-on django-unicode, it does not mean
> that otherwise he would work on let's say model-validation. maybe other
> django tasks do not interest him etc. what i want to say is that imho
> it's not that a developer has his django-time that he spends on whatever
> is django-related. people usually work on things that are fun for them to
> work on.

maybe i can formulate this part simpler:

the question is:

let's imagine for a second that the unicode-django patch is done and
available (it's not, but let's imagine it is)

would there be a chance to get it applied?

:)

thanks,
gabor

Andrey Golovizin

Jun 28, 2006, 9:43:28 AM
to django-d...@googlegroups.com
Jacob Kaplan-Moss wrote:
> "... with too many backwards-incompatible changes ..." -- as Hugo
> points out, this will break a lot of existing code.
Well, some day Django will have to switch to unicode anyway (even Python-3000
is going to use unicode strings everywhere). Right now is a good time for it.
After 1.0 switching will be much more problematic.

> My experience is
> that Unicode issues are the worst types of bugs since they only crop
> up when dealing with particular data.

That's true. But don't you think Django can avoid Unicode-related bugs by not
using unicode strings? :) There _are_ bugs already (#924 being a good
example).

> "... with too little payoff." -- right now it's completely possible
> to nicely handle Unicode data in Django as long as you're careful.
> Yes, it's not as easy as it might be, but the net result of a Unicode-
> ification would be an incremental improvement at best.

I'd say handling Unicode data is _quite_ awkward in the current state of
things. Instead of, say, s.upper() one has to do
unicode(s,DEFAULT_CHARSET).upper().encode(DEFAULT_CHARSET).

In other words, currently one has to bother with manually converting strings to
Unicode and back for every simple operation. If one forgets about it and just
processes raw strings, it is definitely a BUG. It will work for ASCII but not
for UTF-8. And sad to say, there are plenty of such bugs, in Django itself
and in Django-based software.
When using Unicode strings inside Django, such conversion has to be done only
on input and output. Moreover, it will be done automatically, without any
need to handle it explicitly. So I'm sure that switching to Unicode will not
increase the number of those nasty Unicode issues, but will only help to
avoid them.
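
The difference between the two approaches can be sketched as follows. This
is only an illustration in Python 3 syntax (an assumption; `bytes` stands in
for the bytestrings of this thread), and the function names are hypothetical,
not Django APIs. The module-level DEFAULT_CHARSET merely stands in for the
Django setting of the same name.

```python
DEFAULT_CHARSET = "utf-8"  # stand-in for the Django setting of the same name


def upper_bytestring(s: bytes) -> bytes:
    """Bytestrings inside: every operation needs a decode/encode round trip."""
    return s.decode(DEFAULT_CHARSET).upper().encode(DEFAULT_CHARSET)


def handle_request(raw_input: bytes) -> bytes:
    """Unicode inside: decode once on input, encode once on output."""
    title = raw_input.decode(DEFAULT_CHARSET)   # input boundary
    title = title.strip().upper()[:3]           # ordinary string operations
    return title.encode(DEFAULT_CHARSET)        # output boundary
```

In the second function the code between the two boundaries never mentions
an encoding, which is the point being argued here.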

> So I think -- for now -- there are more important places to spend our
> energy.

Unicode awareness may not seem a big issue for English-speakers (for whom
plain ASCII is perfectly enough :)), but for others (like me) it's of crucial
importance. So if you need my energy for Unicode-ification, it's yours. :)

Andrey

Simon Willison

Jun 28, 2006, 10:03:59 AM
to django-d...@googlegroups.com
On 28 Jun 2006, at 14:43, Andrey Golovizin wrote:

> Unicode awareness may not seem a big issue for English-speakers
> (for whom
> plain ASCII is perfectly enough :)), but for others (like me) it's
> of crucial
> importance.

I don't think that's true. On today's Web there's no guarantee at all
that you won't get comments (for example) posted by someone with a
non-ascii character in their name. If you want to consume data from
other services (RSS feeds are a particularly good example here)
character encoding stuff is also bound to turn up.

I seem to remember that last time people looked at unicode with
Django one of the sticking points was database stuff - some of the
adapters are unicode-string aware, others choke and burn. That
shouldn't be an insurmountable problem though, it would just require
a bit more logic in the database adapters.

If we're going to add unicode support it really should happen before
1.0. One point that's worth considering is how much of a marketing
coup out-of-the-box unicode would be, especially in comparison to
Ruby and Rails, neither of which are very good at this stuff.

As far as engineering goes, developing a water-tight test suite seems
like a critical component for confidently adding unicode support.

Cheers,

Simon

Jacob Kaplan-Moss

Jun 28, 2006, 10:12:50 AM
to django-d...@googlegroups.com
On Jun 28, 2006, at 8:08 AM, Gábor Farkas wrote:
> let's imagine for a second that the unicode-django patch is done and
> available (it's not, but let's imagine it is)
>
> would there be a chance to get it applied?

Obviously that would depend on the quality of the patch and the
ramifications of its application, but I'd think it's pretty likely
the answer is "yes".

I'm not trying to tell you or anyone else what to work on; I'm just
explaining why *I* don't view this as a top priority.

Yes, as an English speaker it *is* easy for me to discount the need
for bulletproof Unicode support, but frankly that's not really
figuring into it. It's simply that I have *far* too many other things
on my plate to find room for this.

Again, if it's something *you* feel strongly about, please don't
think I'm trying to hold you back!

Jacob

Jacob Kaplan-Moss

Jun 28, 2006, 10:17:23 AM
to django-d...@googlegroups.com
On Jun 28, 2006, at 9:03 AM, Simon Willison wrote:
> As far as engineering goes, developing a water-tight test suite seems
> like a critical component for confidently adding unicode support.

I couldn't agree more strongly.

Sucks that writing good tests is so damn hard :)

Jacob

Adrian Holovaty

Jun 28, 2006, 10:32:15 AM
to django-d...@googlegroups.com
On 6/28/06, Andrey Golovizin <golo...@gmail.com> wrote:
> I have been using Django for about half a year and it rocks. Indeed, it would
> rock even more if it switched from UTF-8 bytestrings to unicode strings
> everywhere.

Some quick thoughts --

I think we should do this.

We are, after all, perfectionists.

Not only do we want to show even more love toward the international
community, I just like the idea of passing Unicode strings everywhere.
It seems so clean.

The only big problem I see is that it could confuse the (unfortunately
large) mass of programmers who don't understand Unicode yet. That is a
big potential pitfall.

If we're going to do it, we should do it before 1.0, and we'd need
extensive tests.

Adrian

--
Adrian Holovaty
holovaty.com | djangoproject.com

Simon Willison

Jun 28, 2006, 10:40:25 AM
to django-d...@googlegroups.com
On 28 Jun 2006, at 15:32, Adrian Holovaty wrote:

> The only big problem I see is that it could confuse the (unfortunately
> large) mass of programmers who don't understand Unicode yet. That is a
> big potential pitfall.

That's very true. The documentation overhead will be considerable. It
should be possible to get most things to Just Work, but the edge
cases will be hard to explain. A classic example is working with a
Python interactive prompt and trying to "print article.title" when
title contains higher order unicode characters - this causes ascii
encode errors in terminals that haven't been correctly configured.
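
A rough sketch of that edge case, in Python 3 syntax (an assumption) and
with a made-up title: an in-memory stream with an ASCII encoding stands in
for the misconfigured terminal.

```python
import io

title = "G\u00e1bor's article"   # hypothetical article title with one non-ASCII letter

# A text stream that can only encode ASCII, standing in for the terminal.
ascii_terminal = io.TextIOWrapper(io.BytesIO(), encoding="ascii")

try:
    print(title, file=ascii_terminal)
    print_failed = False
except UnicodeEncodeError:
    # The unicode string itself is fine; only the output encoding cannot represent it.
    print_failed = True

# One possible workaround: encode explicitly and replace what does not fit.
safe_bytes = title.encode("ascii", errors="replace")
```

The string is valid throughout; the failure happens only at the output
boundary, which is what makes this case hard to explain to newcomers.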

Having been bitten by unicode problems in the past while dealing with
RSS feeds (from feedparser, which always returns unicode strings) I'd
definitely love to see this feature in Django, and I'd be willing to
help out with the effort.

Shane McChesney

Jun 28, 2006, 11:07:54 AM
to django-d...@googlegroups.com
Hi all, longtime listener, first time caller...

We're learning Django now and hope to use it for next gen online surveys
later this year. We're doing an 8-language app now in Unicode in Active
Server Pages and while so far it hasn't been as painful as I'd expected,
I'd be even more drawn to Django if it made this sort of thing even
easier than it does already (helping me get the heck out of ASP town).

As both a Django and Unicode newbie, I was surprised to see this topic
come up here, I didn't realize that the current Unicode-byte strings (or
whatever they're called) weren't "good enough" i18n... from the docs
I'd thought Django would take care of all this for me.

I'd vote with the perfectionists, please find a way to get this in
before 1.0 and it'll be one more advantage Django has over other
frameworks. The world is flat, etc.

Thanks all for a terrific tool and excuse my ignorance as a Django wannabe.


Shane McChesney
Nooro Online Research: Your Online Research Department
http://www.nooro.com

hugo

Jun 28, 2006, 11:31:11 AM
to Django developers
Hi,

> I think we should do this.
>
> We are, after all, perfectionists.
>
> Not only do we want to show even more love toward the international
> community, I just like the idea of passing Unicode strings everywhere.
> It seems so clean.

I whole-heartedly agree! It's just much cleaner and actually the
current way to handle stuff _is_ problematic, like already pointed out.
The constant need to juggle from bytestrings to unicode strings for
some stuff - or the decision to ignore the problems of applying regular
expressions to utf-8 bytestrings for example - is neither perfectionist
nor DRY :)

bye, Georg

Bill de hÓra

Jun 29, 2006, 12:37:16 PM
to django-d...@googlegroups.com
Jacob Kaplan-Moss wrote:
> On Jun 28, 2006, at 6:07 AM, Gábor Farkas wrote:
>> what i think we are missing the most is the opinion of the "main"
>> developers (project owners?) (adrian, malcolm, jacob etc.) about
>> unicode-ification. if they think we should switch django completely to
>> unicode, then fine. but if they think that django should still support
>> bytestrings, i really don't see how we could do the unicode-ification
>> without breaking backwards compatibility.
>
> In a nutshell: I think it's too much work, with too many backwards-
> incompatible changes, with too little payoff.
>
> Let me expand a bit on each of those points:
>
> "Too much work..." -- there's quite a bit that would need to be
> changed, and a number of sticky problems to be solved. Just one
> example is the issue of template encodings -- do we need to start
> indicating that a certain template is UTF-8 or whatever?

And then there's letting the database know wtf is going into it. And
rich text editors, and third party libs, and... Unicode is just hard work.

Why just the other day :) I had to fix up an FCKEditor installation -
every time you entered a question mark it got converted to an omega or a
euro symbol after being saved. Merit points to anyone who can figure out
what happened there... *


> "... with too many backwards-incompatible changes ..." -- as Hugo
> points out, this will break a lot of existing code. My experience is
> that Unicode issues are the worst types of bugs since they only crop
> up when dealing with particular data.

My experience is similar, but also that Unicode/Encoding issues crop up
where you have libraries that have different approaches (or assumptions)
about either the encoding or whether the thing being passed in is a str
or unicode object. Managing inter-module clashes is harder than
scrubbing incoming data - I put it down to the cost of doing business
with Python at this point.


> "... with too little payoff." -- right now it's completely possible
> to nicely handle Unicode data in Django as long as you're careful.
> Yes, it's not as easy as it might be, but the net result of a Unicode-
> ification would be an incremental improvement at best.
>
> So I think -- for now -- there are more important places to spend our
> energy.

Actually, now's a good time to do it. So long as Django is a closed
world, it's a manageable problem. I suspect being full stack is one
reason why this is not biting people hard atm. Once people start
building modules and plugins on top, it'll be damn hard to do the right
thing later on. However there will be a lot of people with incentive to
help out at that point :)

cheers
Bill

* The FCKEditor file should have been stored as UTF-16 (ff fe) to handle
things like the euro symbol, but it had been down-converted to 8bit at
some point - all the symbols were remapped to '?' so question marks were
replaced with whatever html entity got pulled out of the lookup.

Bjørn Stabell

Jun 30, 2006, 7:07:25 AM
to Django developers
Jacob Kaplan-Moss wrote:
> On Jun 28, 2006, at 8:08 AM, Gábor Farkas wrote:
> > let's imagine for a second that the unicode-django patch is done and
> > available (it's not, but let's imagine it is)
> >
> > would there be a chance to get it applied?
>
> Obviously that would depend on the quality of the patch and the
> ramifications of its application, but I'd think it's pretty likely
> the answer is "yes".

What if the patch required everything to be Unicode, meaning:

* all programmers would have to become aware of Unicode to some extent
* all code would suffer the (minor) performance penalty of encoding
and decoding all text

If we're not willing to make those two trade-offs, we'll have to
support both Unicode and bytestrings, and then we're potentially in a
whole different kind of hell that I've seen many systems go through.

Unicode is an all-or-nothing thing.

Rgds,
Bjorn

Ivan Sagalaev

Jun 30, 2006, 1:57:11 PM
to django-d...@googlegroups.com
Bjørn Stabell wrote:
> What if the patch required everything to be Unicode, meaning:
>
> * all programmers would have to become aware of Unicode to some extent
> * all code would suffer the (minor) performance penalty of encoding
> and decoding all text

The second point is arguable. Currently there are many cases where
conversion is done twice. For example, to check the length of any string
field in a validator it should be decoded to unicode, counted and then
encoded back to bytes because it's what users expect in views. Same with
string processing filters in templates. By converting internals to
unicode we are requiring conversion everywhere between outside and
inside but removing these double conversions at the same time. So this
may be a performance gain as much as a loss. But anyway this seems to me
absolutely negligible in practice.
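
The length check mentioned above can be sketched like this. It is a
hypothetical illustration in Python 3 syntax (an assumption); the function
names are made up and are not Django validators.

```python
def fits_counting_bytes(value: bytes, limit: int) -> bool:
    """Naive check on a bytestring: counts bytes, not characters."""
    return len(value) <= limit


def fits_counting_chars(value: bytes, limit: int, charset: str = "utf-8") -> bool:
    """Correct check on a bytestring: needs a decode just to count."""
    return len(value.decode(charset)) <= limit


def fits_unicode(value: str, limit: int) -> bool:
    """With unicode internals the check needs no conversion at all."""
    return len(value) <= limit


name = "G\u00e1bor"              # five characters
raw_name = name.encode("utf-8")  # six bytes, because the accented letter takes two
```

With a limit of 5 the naive byte count rejects the name, while both
character-based checks accept it; only the bytestring variant pays for an
extra decode.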

Filipe

Jul 5, 2006, 10:05:11 AM
to Django developers
Andrey Golovizin wrote:
> Jacob Kaplan-Moss wrote:
> > "... with too many backwards-incompatible changes ..." -- as Hugo
> > points out, this will break a lot of existing code.
> Well, some day Django will have to switch to unicode anyway (even Python-3000
> is going to use unicode strings everywhere). Right now is a good time for it.
> After 1.0 switching will be much more problematic.

and the first changes towards that will start appearing in python 2.5:
http://www.python.org/dev/peps/pep-0332/

Jacob Kaplan-Moss

Jul 5, 2006, 10:21:26 AM
to django-d...@googlegroups.com
On Jul 5, 2006, at 9:05 AM, Filipe wrote:
> and the first changes towards that will start appearing in python 2.5:
> http://www.python.org/dev/peps/pep-0332/

Er, no -- that PEP was actually rejected. My impression is that the
str/unicode distinction won't be eliminated until Py3k.

Jacob

Jeremy Dunck

Jul 5, 2006, 10:31:38 AM
to django-d...@googlegroups.com
On 7/5/06, Filipe <fcor...@gmail.com> wrote:
> and the first changes towards that will start appearing in python 2.5:
> http://www.python.org/dev/peps/pep-0332/

That's rejected, so not actually in 2.5, right?

Filipe

Jul 5, 2006, 10:42:24 AM
to Django developers

you're right. I managed to miss that somehow, sorry

Simon Willison

Jul 6, 2006, 4:58:34 AM
to django-d...@googlegroups.com

On 5 Jul 2006, at 15:21, Jacob Kaplan-Moss wrote:

> Er, no -- that PEP was actually rejected. My impression is that the
> str/unicode distinction won't be eliminated until Py3k.

Guido's keynote at EuroPython confirmed that - one of the big changes
in Py3K (which is now planned for an initial release within two
years) is unicode strings throughout.

Doesn't help us very much though :/
