Unicode strings go wrong?


cbrain

Sep 26, 2007, 8:51:05 AM
to brevé template engine
Hello,

First off, thanks for the brilliant engine that Brevé is!

I believe I've found a bug. Doing something like the following
produces an error:

import sys
from breve.tags.html import tags as T

sys.stdout.write(unicode(T.html [
    T.head [
        T.title [
            'Hello'
        ],
    ],
    T.body [
        T.h1 [
            'Title'
        ],
        u'Some \u20ac text'
    ]
]).encode("utf-8"))

This will produce the error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in
position 66: ordinal not in range(128)

As far as I can tell, this is because the file
"breve/tags/__init__.py" defines a __str__ method for the "Tag" class
which returns a unicode string. HOWEVER, that does not work as one
might expect, because a call to str() or unicode() does NOT simply
return whatever the __str__ method returns. It converts the result to
a str string first (using the default encoding, which is ASCII in my
case).

The "breve/tags/__init__.py" file should define a __unicode__ method
for the "Tag" class as well, which should do exactly the same as the
__str__ method. Then, when one calls the unicode() function on a Tag
object, it will use the __unicode__ method instead, which DOES return
a unicode string instead of a str string.
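
A minimal sketch of that suggestion, written so that it runs on both Python 2 and Python 3. The Tag class here is a hypothetical stand-in, not Brevé's actual class, and having __str__ return UTF-8 bytes on Python 2 is an assumption made only for this example:

```python
import sys

PY2 = sys.version_info[0] == 2
# Python 2 calls its text type `unicode`; Python 3 calls it `str`.
text_type = unicode if PY2 else str


class Tag(object):
    """Hypothetical stand-in for Brevé's Tag class."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = children

    def __unicode__(self):
        # Always build and return a text (unicode) string.
        inner = u''.join(text_type(child) for child in self.children)
        return u'<%s>%s</%s>' % (self.name, inner, self.name)

    if PY2:
        def __str__(self):
            # Assumption: the byte-string form is encoded explicitly as UTF-8.
            return self.__unicode__().encode('utf-8')
    else:
        # Python 3 has no __unicode__ hook; str() uses __str__.
        __str__ = __unicode__


page = Tag('body', [Tag('h1', ['Title']), u'Some \u20ac text'])
rendered = text_type(page)          # text, euro sign intact
encoded = rendered.encode('utf-8')  # explicit encode at the very end
```

On Python 2, unicode(page) finds __unicode__ and uses its result directly, so the only encoding step left is the explicit encode("utf-8") at the end.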

I hope this helps!
Sven

Cliff Wells

Sep 27, 2007, 4:16:44 PM
to breve-...@googlegroups.com
I'm working on improving Unicode support but frankly it's not an area I
understand fully (or use enough to catch problems). I'll try your
suggestions. Thanks for the feedback!

Regards,
Cliff

Olivier Verdier

Sep 28, 2007, 5:53:42 PM
to brevé template engine
I had a similar problem with unicode decoding. Basically I would like
to avoid using the u"" format everywhere. I modified the escape
function in the following way:
    return unicode(s.replace ( "&", "&amp;"
                   ).replace ( ">", "&gt;"
                   ).replace ( "<", "&lt;" ), 'utf-8')
so that all the strings will be considered as encoded in utf-8.
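
For illustration, here is a sketch of the same idea as a standalone function. The name escape_text and the encoding parameter are assumptions for this example, not Brevé's real code, and unlike the snippet above it decodes first and escapes afterwards:

```python
def escape_text(s, encoding='utf-8'):
    """Return s as escaped text; byte strings are assumed to be in `encoding`."""
    if isinstance(s, bytes):  # on Python 2, bytes is the plain str type
        s = s.decode(encoding)
    # '&' must be replaced first, or the other entities get escaped twice.
    return (s.replace(u'&', u'&amp;')
             .replace(u'>', u'&gt;')
             .replace(u'<', u'&lt;'))


# A UTF-8 byte string goes in, escaped text comes out.
escaped = escape_text(u'1 < 2 & \u20ac'.encode('utf-8'))
```

Strings that are already unicode skip the decoding step, so u"" strings and plain UTF-8 strings can be mixed.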

== Olivier

Cliff Wells

Sep 28, 2007, 11:02:39 PM
to breve-...@googlegroups.com
On Fri, 2007-09-28 at 14:53 -0700, Olivier Verdier wrote:
> I had a similar problem with unicode decoding. Basically i would like
> to avoid using the u"" format everywhere. I modified the escape
> function in the following way:
> return unicode(s.replace ( "&", "&amp;"
> ).replace ( ">", "&gt;"
> ).replace ( "<", "&lt;" ), 'utf-8')
> so that all the strings will be considered as encoded in utf-8.

What's not clear to me is what happens (or rather, should happen) when
unicode strings with a different encoding are embedded in a template
(whether directly or via a variable/function). Is it safe to assume
utf-8? What if the default encoding is something else?

Regards,
Cliff

Olivier Verdier

Sep 29, 2007, 12:33:02 PM
to brevé template engine
IMHO, yes, it is safe to assume utf-8 as a default encoding. Of course
I suppose it would also be possible to allow Brevé to be configured to
support other encodings, but I think that it's better to ask users to
convert their files to utf-8, which is the most used unicode encoding.

Besides, if you only allow utf-8 you can give users a helpful error
message when the unicode function fails, because in most cases a
document encoded in something other than utf-8 cannot be read as
utf-8. This is not the case for 8-bit encodings, for example, where
almost any byte sequence decodes without error, so if you add support
for non-utf-8 encodings, be prepared for lots of questions from
confused users seeing strange signs appearing instead of the expected
diacritics! :-)
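
A small demonstration of that asymmetry, using only standard codecs (the word used is just an example):

```python
text = u'caf\u00e9'                     # "café"
latin1_bytes = text.encode('latin-1')   # the e-acute is the single byte 0xE9
utf8_bytes = text.encode('utf-8')       # the e-acute is the two bytes 0xC3 0xA9

# Latin-1 data read as utf-8: the decoder notices and raises an error.
try:
    latin1_bytes.decode('utf-8')
    utf8_rejects_latin1 = False
except UnicodeDecodeError:
    utf8_rejects_latin1 = True

# utf-8 data read as latin-1: no error, just two wrong characters
# where the e-acute should be.
mojibake = utf8_bytes.decode('latin-1')
```

The first mistake can be reported to the user right away; the second one only shows up when somebody looks at the rendered page.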

regards,

== Olivier

Olivier Verdier

Sep 30, 2007, 10:02:50 AM
to brevé template engine
Here is a fix to get unicode (utf-8) working with inherits and
override: replace the __str__ methods of override and inherits with
__unicode__ and change flatten into this:
def flatten ( o ):
    try:
        return __registry [ type ( o ) ] ( o )
    except KeyError:
        try:
            # decode a utf-8 encoded str into a unicode string
            return unicode ( o, 'utf-8' )
        except TypeError:
            # not a str: use the unicode method of the object o
            return unicode ( o )

This will make sure that all your template strings will be treated as
unicode (even without the u"").

I'm not sure whether my solution is the best or the simplest but it
works for me.
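
A self-contained variant of this fallback that can be run outside Brevé, on Python 2 or 3. The registry dict, register_flattener and the encoding parameter are hypothetical names for this sketch, and it tests for byte strings with isinstance instead of catching TypeError:

```python
import sys

# Python 2 calls its text type `unicode`; Python 3 calls it `str`.
text_type = unicode if sys.version_info[0] == 2 else str

_registry = {}  # hypothetical: maps a type to the function that flattens it


def register_flattener(type_, func):
    _registry[type_] = func


def flatten(o, encoding='utf-8'):
    func = _registry.get(type(o))
    if func is not None:
        return func(o)
    if isinstance(o, bytes):
        # Plain byte string: assume it holds text in `encoding`.
        return o.decode(encoding)
    # Anything else (numbers, text strings, other objects).
    return text_type(o)


# Lists are flattened item by item and joined.
register_flattener(list, lambda items: u''.join(flatten(i) for i in items))

result = flatten([b'Some ', u'\u20ac'.encode('utf-8'), u' text ', 42])
```

Looking the flattener up with .get() also avoids hiding a KeyError raised inside a registered flattener, which the try/except version would swallow.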

cheers,

== Olivier

cbrain

Sep 30, 2007, 11:41:23 AM
to brevé template engine
Hello Olivier,

I personally would not recommend doing it this way, because it is much
cleaner to just inject real unicode strings (not str representations
of unicode strings) into the template. That means that if you have a
component that produces UTF-8 strings, you need to decode that UTF-8
string into a unicode object using its decode() method before handing
it to Brevé. For example:

result = something_that_produces_utf_8()
result = result.decode("utf-8")

I think that Brevé should treat any 'str' strings as ASCII strings, so
that it assumes no encoding at all, and if an encoding IS used by
mistake, that that mistake will be caught immediately.

Do you agree that using unicode everywhere would be the cleaner
option?

--
With kind regards,
Sven


On Sep 29, 6:33 pm, Olivier Verdier <Olivier.Verd...@gmail.com> wrote:

cbrain

Sep 30, 2007, 11:49:21 AM
to brevé template engine
Sorry, I already sent a reply, but sent it only to Cliff by mistake. I
hope I can reproduce the text that I sent him.

Cliff,

My view on this is that Brevé should accept unicode strings and str
strings that only contain ASCII characters. That way, Brevé does not
assume anything (which is very Pythonic: in case of ambiguity, Python
always raises an exception). This can be accomplished by adding a
flattener for the 'str' type that does:

return the_str_string.decode("us-ascii")

That way, an exception will be raised if str strings are passed in with
any encoding other than ASCII (which I regard as a kind of
null-encoding).
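
As a sketch, such a flattener could look like this (flatten_str is a hypothetical name, and how it would be registered with Brevé is left out):

```python
def flatten_str(the_str_string):
    # Strict: any byte outside the ASCII range raises UnicodeDecodeError.
    return the_str_string.decode("us-ascii")


ok = flatten_str(b'Title')  # pure ASCII passes through as text

try:
    flatten_str(u'Some \u20ac text'.encode('utf-8'))
    caught = False
except UnicodeDecodeError:
    # The encoded euro sign is rejected instead of being guessed at.
    caught = True
```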

The programmer is the only one who knows for sure in what encoding
components outside of Brevé produce their strings. So, let the
programmer make sure that the encoding is handled correctly by
demanding that the encoding be undone before handing the result to
Brevé.

Just my view,
Sven


On Sep 29, 5:02 am, Cliff Wells <cl...@twisty-industries.com> wrote:

Cliff Wells

Oct 8, 2007, 5:23:09 PM
to breve-...@googlegroups.com
On Wed, 2007-09-26 at 12:51 +0000, cbrain wrote:
> import sys
> from breve.tags.html import tags as T
>
> sys.stdout.write(unicode(T.html [
>     T.head [
>         T.title [
>             'Hello'
>         ],
>     ],
>     T.body [
>         T.h1 [
>             'Title'
>         ],
>         u'Some \u20ac text'
>     ]
> ]).encode("utf-8"))


Sven,

Out of curiosity, what version of Breve are you using? 1.1.6 or SVN?
The above code appears to work under 1.1.6, but not SVN so I'm assuming
the latter.

Cliff

Cliff Wells

Oct 8, 2007, 5:26:05 PM
to breve-...@googlegroups.com

I stand corrected. It appears to work under 1.1.7 (an unreleased
version that lies somewhere between 1.1.6 and SVN head).

Would you mind testing against 1.1.6? SVN is known to be broken in a
couple places (I'm probably going to revert it back to a previous state
if I can't track down the issues).

Regards,
Cliff

cbrain

Oct 10, 2007, 3:47:54 AM
to brevé template engine
Hello Cliff,

I seem to be using version 1.1.7 according to:
>>> breve.__version__
'1.1.7'

I installed it on my Red Hat box using easy_install, which grabbed it
from the PyPI module repository. I just installed 1.1.6 on my FreeBSD
machine using its ports tree.

For some reason, I can't seem to reproduce the problem, either with
version 1.1.6 or with 1.1.7, on Linux or on FreeBSD. Using the code
snippet from my first posting works in both versions. I don't get
it :-(

--
Regards,
Sven

chtito

Oct 10, 2007, 11:42:46 AM
to brevé template engine
I certainly disagree with this point of view. You can't force people
to write u"string" instead of "string" everywhere in their code.
Assuming utf-8 still lets ascii users use Brevé with no modification
at all. This is something that people tend to overlook. Allowing
utf-8 will not change anything at all for ascii users. Why just spite
utf-8 users?

I'm trying Brevé on a server now with utf-8 strings, and it is so
annoying that it doesn't just work.

Please allow support for non-ascii strings for all of us using more
than the 128 ascii characters (unicode allows you to use tens of
thousands of characters!!). Again, the ascii users *won't see any
difference*.

Thanks a lot!

== Olivier

Cliff Wells

Oct 10, 2007, 8:29:19 PM
to breve-...@googlegroups.com
On Wed, 2007-10-10 at 08:42 -0700, chtito wrote:
> I certainly disagree with this point of view. You can't force people
> to have u"string" instead of "string" everywhere in their code.
> Assuming utf-8 will allow ascii users to use Brevé anyway, with no
> modification at all. This is something that people tend to look over.
> Allowing utf-8 will not change anything at all to ascii users. Why
> just spite utf-8 users?
>
> I'm trying Brevé on a server now with utf-8 string, and it is so
> annoying that it doesn't just work.
>
> Please allow support for non ascii strings for all of us using more
> than the 127 ascii characters (unicode allows you to use tens of
> thousands of characters!!). Again, the ascii users *won't see any
> difference*.

What I actually have in mind is that the global encoding will define
this "assumption". That is, if you set encoding='us-ascii' then it will
work like Sven suggests, if it's set to 'utf-8' then it works as you
suggest (and 'utf-8' will be the default).
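
One way this could look, as a rough sketch only. The Renderer class and its names are made up for this example and say nothing about how Brevé will actually implement it:

```python
class Renderer(object):
    """Hypothetical holder for the global encoding setting."""

    def __init__(self, encoding='utf-8'):
        self.encoding = encoding

    def flatten_str(self, value):
        # Byte strings are decoded with the configured encoding;
        # text (unicode) strings pass through untouched.
        if isinstance(value, bytes):
            return value.decode(self.encoding)
        return value


euro_bytes = u'Some \u20ac text'.encode('utf-8')

lenient = Renderer()                    # default: assume utf-8
strict = Renderer(encoding='us-ascii')  # reject anything that is not ASCII

decoded = lenient.flatten_str(euro_bytes)

try:
    strict.flatten_str(euro_bytes)
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
```

With 'us-ascii' a non-ASCII str raises immediately, and with 'utf-8' it is decoded, so both positions in this thread are covered by one setting.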

If anyone sees an issue with this, please speak up =)

Regards,
Cliff

chtito

Oct 11, 2007, 5:00:29 AM
to brevé template engine
That sounds great to me!

Thanks a lot, Cliff, working with Brevé is really a treat. It's a
really clever way of doing templates. I don't think that i will touch
html code ever again. :-)

cheers,

== Olivier

Olivier Verdier

Oct 21, 2007, 2:54:04 PM
to brevé template engine
There is one more unicode issue with the django adapter. The function
flatten_string shouldn't do anything. In my case it just returns obj.

cheers!

== Olivier

Olivier Verdier

Oct 22, 2007, 1:26:23 PM
to brevé template engine
One more adjustment has to be done: in breve.util.quoteattrs replace
"v = str(v)" by "v = unicode(v, 'utf-8')".

I hope that all those changes will somehow be implemented in brevé in
the not too distant future. ;-)

Thanks!

cheers,

== Olivier

Cliff Wells

Oct 22, 2007, 4:17:37 PM
to breve-...@googlegroups.com
On Mon, 2007-10-22 at 10:26 -0700, Olivier Verdier wrote:
> One more adjustment has to be done: in breve.util.quoteattrs replace
> "v = str(v)" by "v = unicode(v, 'utf-8')".

Or rather:

def quote_attrs ( attrs, default_encoding = 'utf-8'):
    ...
    v = unicode ( v, default_encoding )
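
For anyone who wants to try the idea outside Brevé, here is a standalone sketch. quote_attrs_sketch is a hypothetical function, not the real breve.util code: the output format, the sorting, the escaping of double quotes and the handling of values that are not byte strings are all assumptions made for this example.

```python
import sys

# Python 2 calls its text type `unicode`; Python 3 calls it `str`.
text_type = unicode if sys.version_info[0] == 2 else str


def quote_attrs_sketch(attrs, default_encoding='utf-8'):
    quoted = []
    # Sorted only to make the output order predictable in this example.
    for name, value in sorted(attrs.items()):
        if value is None:
            continue
        if isinstance(value, bytes):
            # Byte strings are assumed to be in default_encoding.
            value = value.decode(default_encoding)
        else:
            # Numbers, text strings and other objects.
            value = text_type(value)
        value = (value.replace(u'&', u'&amp;')
                      .replace(u'>', u'&gt;')
                      .replace(u'<', u'&lt;')
                      .replace(u'"', u'&quot;'))
        quoted.append(u' %s="%s"' % (name, value))
    return u''.join(quoted)


attrs_html = quote_attrs_sketch({
    'href': b'/a?x=1&y=2',
    'title': u'\u20ac',
    'id': None,
    'n': 3,
})
```

The isinstance check matters because unicode ( v, default_encoding ) raises a TypeError when v is already a unicode string or is not a string at all.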


I haven't been able to devote much time to working on these fixes of
late, but I'm hoping to get to them soon.


Cliff

Olivier Verdier

Oct 25, 2007, 3:26:57 PM
to brevé template engine
I realised that I was using a very old version of brevé (although I
had downloaded it using easy_install...). Here is the complete diff
with respect to the new (svn rev 267) version. Note that these changes
make it possible to use plain strings in the templates and that the
assumed encoding is utf-8 (as I explained earlier, this is by no means
a limitation for ascii users).

Index: loaders.py
===================================================================
--- loaders.py (revision 267)
+++ loaders.py (working copy)
@@ -9,4 +9,4 @@
         return uid, timestamp

     def load ( self, uid ):
-        return unicode ( file ( uid, 'U' ).read ( ) )
+        return unicode ( file ( uid, 'U' ).read ( ), 'utf-8' )
Index: template.py
===================================================================
--- template.py (revision 267)
+++ template.py (working copy)
@@ -155,7 +155,7 @@

         try:
             bytecode = _cache.compile ( filename, T.root, T.loaders [ -1 ] )
-            output = flatten ( eval ( bytecode, _g, { } ) ).encode ( T.encoding )
+            output = flatten ( eval ( bytecode, _g, { } ) )
             T.xml_encoding = kw.get ( 'xml_encoding',
                                       '''<?xml version="%s" encoding="%s"?>''' % ( T.xml_version, T.encoding ) )
         except:
Index: util.py
===================================================================
--- util.py (revision 267)
+++ util.py (working copy)
@@ -40,7 +40,7 @@
     quoted = [ ]
     for a, v in attrs.items ( ):
         if v is None: continue
-        v = str ( v )
+        v = unicode ( v, 'utf-8' )
         v = '"' + v.replace ( "&", "&amp;"
                   ).replace ( ">", "&gt;"
                   ).replace ( "<", "&lt;"

Index: plugin/django_adapter.py
===================================================================
--- plugin/django_adapter.py (revision 267)
+++ plugin/django_adapter.py (working copy)
@@ -9,7 +9,7 @@
 BREVE_ROOT = settings.BREVE_ROOT

 def flatten_string ( obj ):
-    return unicode ( obj ).encode ( settings.DEFAULT_CHARSET )
+    return obj

 class _loader ( object ):
     def __init__ ( self, root, breve_opts = None ):
@@ -40,6 +40,7 @@
         self.breve_opts = breve_opts

     def render ( self, vars = None ):
+        import os # why??
         if vars == None:
             vars = { }
         elif isinstance ( vars, Context ):
