Unicode vs UTF-8 redux


Cliff Wells

Apr 3, 2008, 12:09:30 AM
to breve-...@googlegroups.com
I've been leaning toward leaving the status quo (not forcing UTF-8
internally, just using Python Unicode objects), but then today I came
across this article, and it's got me leaning the other way (yeah, it
shows how little I know of the topic):

http://hsivonen.iki.fi/producing-xml/#utf

Any thoughts on this? Olivier, are you still maintaining your patches?

Overall this seems to be a clash between Python practices (Unicode) and
XML practices (UTF-8). Breve, of course, lies at the crux of the two,
so pardon me for feeling uncertain =)

Regards,
Cliff


Cliff Wells

Apr 3, 2008, 3:13:06 AM
to breve-...@googlegroups.com

Based on Olivier's suggestions from long ago, I've added a small patch
to Breve's flatten function:

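def flatten ( o ):
    # Plain byte strings are assumed to be UTF-8 and decoded to Unicode
    if type ( o ) == type ( '' ):
        return unicode ( o, 'utf-8' )
    try:
        # Otherwise use the flattener registered for this type
        return __registry [ type ( o ) ] ( o )
    except KeyError:
        return unicode ( o )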


This enabled me to render the following template without error:

html [
    head [
        title [ 'Unicode' ]
    ],

    body [
        'Brev\xc3\xa9 converts plain strings', br,
        u'Brev\xe9 handles unicode strings', br,
        div [ "äåå? ▸ ", em["я не понимаю"], "▸ 3 km²" ]
    ]
]

This seems to address Olivier's concerns over being forced to use u""
rather than just "", and doesn't seem to break anything else. Thoughts?
I am somewhat concerned that part of the flattener returns utf-8 and the
rest doesn't.
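
For anyone reading this on Python 3, where `bytes` corresponds to Python 2's `str` and `str` to `unicode`, the patched behaviour can be sketched roughly as follows. The `_registry` dict and the `flatten_sketch` name are hypothetical stand-ins for illustration, not Breve's actual code.

```python
# Hypothetical Python 3 sketch of the patched flatten(); not Breve's actual code.
# Python 3 `bytes` plays the role of Python 2 `str`, and `str` that of `unicode`.

_registry = {}  # maps a type to a function that turns it into text


def flatten_sketch(o):
    # Byte strings are assumed to hold UTF-8 and are decoded up front,
    # bypassing the registry entirely.
    if isinstance(o, bytes):
        return o.decode('utf-8')
    try:
        return _registry[type(o)](o)
    except KeyError:
        # Fall back to the object's default text representation.
        return str(o)


plain = flatten_sketch(b'Brev\xc3\xa9')   # UTF-8 encoded bytes
text = flatten_sketch('Brev\xe9')         # already text
```

Both calls should produce the same text, which is the property the template above relies on.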

Regards,
Cliff

Cliff Wells

Apr 3, 2008, 3:49:11 AM
to breve-...@googlegroups.com
Okay, I think I've found the "correct" solution. I'm not sure why this
was so hard to discover, but for people who don't want to use u"" you
can simply add a file named sitecustomize.py
to /usr/lib/python2.x/site-packages (or anywhere on your PYTHONPATH,
actually) with the following in it:

import sys
sys.setdefaultencoding('utf-8')

This neatly solves the issue (and probably lots of others too). Python's
default encoding is apparently 'ascii', which is the root of this
issue.
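
The failure caused by an ASCII default can be reproduced with an explicit decode. This is only a small Python 3 illustration (Python 3 has no `sys.setdefaultencoding`, so the codec is passed by hand), and the `decode_or_error` helper is made up for the example.

```python
# Illustration only: the same UTF-8 bytes decoded with two different codecs.
def decode_or_error(raw, encoding):
    """Return the decoded text, or the name of the exception raised."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


raw = b'Brev\xc3\xa9'                       # "Brevé" encoded as UTF-8
as_ascii = decode_or_error(raw, 'ascii')    # 0xc3 is outside the ASCII range
as_utf8 = decode_or_error(raw, 'utf-8')     # decodes cleanly
```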

Regards,
Cliff



cbrain

Apr 4, 2008, 6:38:28 AM
to brevé template engine
Hello Cliff,

This solution in fact looks very sensible to me. Byte strings are
interpreted as UTF-8-encoded Unicode strings, and Unicode strings simply
stay Unicode strings. This seems to solve the problems that we have.

--
Regards,
Sven

cbrain

Apr 4, 2008, 6:40:02 AM
to brevé template engine
Hello Cliff,

I'm very much against this approach because it affects the entire
Python system, not just Brevé. Furthermore, in some shared hosting
situations, people may not have access to the file that you
talked about (or at least no write permission).

--
Regards,
Sven

Cliff Wells

Apr 4, 2008, 6:18:38 PM
to breve-...@googlegroups.com

On Fri, 2008-04-04 at 03:40 -0700, cbrain wrote:
> Hello Cliff,
>
> I'm very much against this approach because it affects the entire
> Python system, not just Brevé. Furthermore, in some shared hosting
> situations, people may not have access to the file that you
> talked about (or at least no write permission).

Actually you can put it anywhere on your PYTHONPATH. If you put it in a
Python application's directory it would affect only that application.

Cliff

Cliff Wells

Apr 4, 2008, 6:22:15 PM
to breve-...@googlegroups.com

On Fri, 2008-04-04 at 03:38 -0700, cbrain wrote:
> Hello Cliff,
>
> This solution in fact looks very sensible to me. Byte strings are
> interpreted as UTF-8-encoded Unicode strings, and Unicode strings simply
> stay Unicode strings. This seems to solve the problems that we have.

Well, since you were the main voice of dissent in the previous argument,
and now we all seem to agree that this will work, I'll commit this.

(Although I still think sitecustomize.py is the *right* solution, I'll
agree this is practical enough).

Regards,
Cliff

Cliff Wells

Apr 5, 2008, 12:56:58 PM
to breve-...@googlegroups.com
Okay, I think this is preferable:

# breve/tags/__init__.py

register_flattener ( str, lambda s: escape ( unicode ( s, 'utf-8' ) ) )

This is better than my previous solution as:
1) it doesn't short-circuit the flattener
2) it can be easily overridden by a user who doesn't like this
behaviour.
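
To illustrate point 2, here is a rough Python 3 sketch of a type-keyed registry in which a later registration replaces the default. Only the `register_flattener ( type, callable )` call shape comes from the line above; the registry internals are assumed, `html.escape` stands in for Breve's own `escape`, and `bytes` stands in for Python 2's `str`.

```python
# Hypothetical Python 3 sketch of a type-keyed flattener registry.
# html.escape stands in for Breve's escape(); bytes stands in for Python 2 str.
from html import escape

_registry = {}


def register_flattener(kind, func):
    # A later registration for the same type replaces the earlier one.
    _registry[kind] = func


def flatten(o):
    try:
        return _registry[type(o)](o)
    except KeyError:
        return str(o)


# Default behaviour: byte strings are decoded as UTF-8, then escaped.
register_flattener(bytes, lambda s: escape(s.decode('utf-8')))
default_out = flatten(b'caf\xc3\xa9 <b>')

# A user whose byte strings are Latin-1 can override the default.
register_flattener(bytes, lambda s: escape(s.decode('latin-1')))
latin1_out = flatten(b'caf\xe9 <b>')
```

Because the byte-string handling lives in the registry rather than in `flatten` itself, replacing it needs no change to the flattener.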

I'm committing this and will review Olivier's patch for doing something
similar to attributes.

Regards,
Cliff

