
Unicode again ... default codec ...


Stef Mientki

Oct 20, 2009, 4:13:52 PM
to pytho...@python.org
hello,

As someone else already said,
"every time I think : now I understand it completely, and a few weeks
later ..."

From the thread "how to write a unicode string to a file ?"
and my specific situation:

- reading data from Excel, Delphi and other Windows programs and unicode
Python
- using wxPython, which forces unicode
- writing to Excel and other Windows programs

almost all answers pointed to the following solution:
- in the python program, turn every string as soon as possible into unicode
- in Python all processing is done in unicode
- at the end, translate unicode into the windows specific character set
(if necessary)

The above approach seems to work nicely,
but when you manipulate string-like objects heavily, it's a crime.
It's impossible to change all my modules from strings to unicode at once,
and it's very tempting to do just the opposite: convert everything
into strings!

# adding unicode string and windows strings, results in an error:
my_u = u'my_u'
my_w = 'my_w' + chr ( 246 )
x = my_s + my_u

# to correctly handle the above (in my situation), I need to write the
# following code (which makes my code quite unreadable):
my_u = u'my_u'
my_w = 'my_w' + chr ( 246 )
x = unicode ( my_s, 'windows-1252' ) + my_u

# converting to strings gives much better readable code:
my_u = u'my_u'
my_w = 'my_w' + chr ( 246 )
x = my_s + str(my_u)

until I found this website:
http://diveintopython.org/xml_processing/unicode.html

By setting the default encoding,
I can now move to unicode much more elegantly and almost fully automatically
(and I guess the writing to a file problem is also solved):
# now the manipulations of strings and unicode works OK:
my_u = u'my_u'
my_w = 'my_w' + chr ( 246 )
x = my_s + my_u

The only disadvantage is that you have to put a specially named file into
the Python directory!
So if someone knows a more elegant way to set the default codec,
I would be much obliged.

cheers,
Stef

Gabriel Genellina

Oct 20, 2009, 11:50:57 PM
to pytho...@python.org
On Tue, 20 Oct 2009 17:13:52 -0300, Stef Mientki <stef.m...@gmail.com>
wrote:

> From the thread "how to write a unicode string to a file ?"
> and my specific situation:
>
> - reading data from Excel, Delphi and other Windows programs and unicode
> Python
> - using wxPython, which forces unicode
> - writing to Excel and other Windows programs
>
> almost all answers pointed to the following solution:
> - in the python program, turn every string as soon as possible into
> unicode
> - in Python all processing is done in unicode
> - at the end, translate unicode into the windows specific character set
> (if necessary)

Yes. That's the way to go; if you follow the above guidelines when working
with character data, you should not encounter big unicode problems.
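
A minimal sketch of that decode-early, encode-late pattern follows. It is
written in Python 3 syntax (where bytes plays the role of the Python 2 str,
and str that of unicode) so it can be run today. The function names and the
choice of windows-1252 are assumptions made for illustration, not taken from
the original program.

```python
# Python 3 syntax: bytes ~ Python 2 str, str ~ Python 2 unicode.
WINDOWS_ENCODING = "windows-1252"  # assumed encoding of the incoming data


def read_field(raw):
    """Input boundary: turn bytes into text as soon as they arrive."""
    return raw.decode(WINDOWS_ENCODING)


def process(name):
    """Internal processing only ever sees text, never bytes."""
    return name.upper()


def write_field(text):
    """Output boundary: encode only when handing data back out."""
    return text.encode(WINDOWS_ENCODING)


raw_in = b"my_w\xf6"                  # 0xF6 is 'ö' in windows-1252
text = read_field(raw_in)             # text from here on
raw_out = write_field(process(text))  # bytes again, only at the very end
```

Only the two boundary functions know about the encoding; everything in
between can be converted module by module without touching them.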

> The above approach seems to work nicely,
> but when you manipulate string-like objects heavily, it's a crime.
> It's impossible to change all my modules from strings to unicode at once,
> and it's very tempting to do just the opposite: convert everything
> into strings!

Wide is the road to hell...

> # adding unicode string and windows strings, results in an error:
> my_u = u'my_u'
> my_w = 'my_w' + chr ( 246 )
> x = my_s + my_u

(I guess you meant my_w + my_u). Formally:

x = my_w.decode('windows-1252') + my_u # [1]

but why are you using a byte string in the first place? Why not:

my_w = u'my_w' + u'ö'

so you can compute my_w + my_u directly?

> # to correctly handle the above (in my situation), I need to write the
> # following code (which makes my code quite unreadable):
> my_u = u'my_u'
> my_w = 'my_w' + chr ( 246 )
> x = unicode ( my_s, 'windows-1252' ) + my_u
>
> # converting to strings gives much better readable code:
> my_u = u'my_u'
> my_w = 'my_w' + chr ( 246 )
> x = my_s + str(my_u)

But it's not the same thing: in the former case x is a unicode
object, in the latter x is a byte string. Also, str(my_u) only works if it
contains just ascii characters. The counterpart of my code [1] above would
be:

x = my_w + my_u.encode('windows-1252')

That is, you use some_unicode_object.encode("desired-encoding") to do the
unicode->bytestring conversion, and
some_string_object.decode("known-encoding") to convert in the opposite
sense.
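
Both directions can be tried out in a few lines. This is only a sketch, in
Python 3 syntax (bytes for the byte string, str for the unicode object), and
the variable names as_text, as_bytes and ascii_ok are made up for the example:

```python
my_u = "my_u"                     # text (Python 2: u'my_u')
my_w = b"my_w" + bytes([246])     # bytes (Python 2: 'my_w' + chr(246))

# bytes -> text: decode with the encoding the bytes are known to be in
as_text = my_w.decode("windows-1252") + my_u

# text -> bytes: encode with the encoding the consumer expects
as_bytes = my_w + my_u.encode("windows-1252")

# The implicit ascii conversion only works for pure-ascii text
try:
    "\xf6".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

Note that the two results have different types; which one is wanted depends
on whether the value stays inside the program or is about to leave it.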

> until I found this website:
> http://diveintopython.org/xml_processing/unicode.html
>
> By setting the default encoding,
> I can now move to unicode much more elegantly and almost fully automatically
> (and I guess the writing to a file problem is also solved)
> # now the manipulations of strings and unicode works OK:
> my_u = u'my_u'
> my_w = 'my_w' + chr ( 246 )
> x = my_s + my_u
>
> The only disadvantage is that you have to put a specially named file into
> the Python directory!
> So if someone knows a more elegant way to set the default codec,
> I would be much obliged.

DON'T do that. Really. Changing the default encoding is a horrible,
horrible hack and causes a lot of problems. 'Dive into Python' is a great
book, but suggesting to alter the default character encoding is very, very
bad advice:

- site.py and sitecustomize.py contain *global* settings, affecting *all*
users and *all* scripts running on that machine. Other users may get very
angry at you when their own programs break or give incorrect results when
run with a different encoding.
- you must have administrative rights to alter those files.
- you won't be able to distribute your code, since almost everyone else
in the world won't be using *your* default encoding.
- what if another library/package/application wants to set a different
default encoding?
- the default encoding for Python>=3.0 is now 'utf-8' instead of 'ascii'

More reasons:
http://tarekziade.wordpress.com/2008/01/08/syssetdefaultencoding-is-evil/
See also this recent thread in python-dev:
http://comments.gmane.org/gmane.comp.python.devel/106134
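
For the "writing to a file" case, the encoding can be named at the point of
I/O instead of being changed globally. A minimal sketch, assuming the target
program expects windows-1252 and using a temporary file as a stand-in for the
real output; io.open comes from the io module (available since Python 2.6)
and is the same function as the built-in open in Python 3:

```python
import io
import os
import tempfile

text = u"my_w\xf6"  # text containing 'ö'

# Temporary file used only for this illustration
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # The file object encodes on write; no default encoding is involved
    with io.open(path, "w", encoding="windows-1252") as f:
        f.write(text)

    # Raw bytes as another program would see them
    with io.open(path, "rb") as f:
        raw = f.read()

    # Reading back with the same encoding restores the text
    with io.open(path, "r", encoding="windows-1252") as f:
        round_trip = f.read()
finally:
    os.remove(path)
```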

--
Gabriel Genellina

Lele Gaifax

Oct 21, 2009, 5:24:55 AM
to pytho...@python.org
"Gabriel Genellina" <gags...@yahoo.com.ar> writes:

> DON'T do that. Really. Changing the default encoding is a horrible,
> horrible hack and causes a lot of problems.

> ...

This is a problem that appears quite often, against which I have yet to
see a general workaround, or even a "safe pattern". I must confess that
most often I just give up and change the "if 0:" line in
sitecustomize.py to enable a reasonable default...

A week ago I met another incarnation of the problem that I finally
solved by reloading the sys module, a very ugly way, don't tell me, and
I really would like to know a better way of doing it.

The case is simple enough: a unit test started failing miserably, with a
really strange traceback, and a quick pdb session revealed that the
culprit was nosetest, when it prints out the name of the test, using
some variant of "print testfunc.__doc__": since the latter happened to
be a unicode string containing some accented letters, that piece of
nosetest's code raised an encoding error, that went untrapped...

I tried to understand the issue, until I found that I was inside a fresh
new virtualenv with python 2.6 and the sitecustomize wasn't even
there. So, even if my shell environ was UTF-8 (the system being a Ubuntu
Jaunty), within that virtualenv Python's stdout encoding was
'ascii'. Rightly so, nosetest failed to encode the accented letters to
that.

I could just rephrase the test __doc__, or remove it, but to avoid
future noise I decided to go with the deprecated "reload(sys)" trick,
done as early as possible... damn, it's just a test suite after all!

Is there a "correct" way of dealing with this? What should nosetest
eventually do to initialize its sys.stdout.encoding reflecting the
system's settings? And on the user side, how could I otherwise fix it (I
mean, without resorting to the reload())?

Thank you,
ciao, lele.
--
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
le...@nautilus.homeip.net | -- Fortunato Depero, 1929.

Gabriel Genellina

Oct 21, 2009, 8:52:39 PM
to pytho...@python.org
On Wed, 21 Oct 2009 06:24:55 -0300, Lele Gaifax <le...@metapensiero.it>
wrote:

That seems to imply that in your "normal" environment you altered the
default encoding to utf-8 -- if so: don't do that!

> I could just rephrase the test __doc__, or remove it, but to avoid
> future noise I decided to go with the deprecated "reload(sys)" trick,
> done as early as possible... damn, it's just a test suite after all!
>
> Is there a "correct" way of dealing with this? What should nosetest
> eventually do to initialize its sys.stdout.encoding reflecting the
> system's settings? And on the user side, how could I otherwise fix it (I
> mean, without resorting to the reload())?

nosetest should do nothing special. You should configure the environment
so Python *knows* that your console understands utf-8. Once Python is
aware of the *real* encoding your console is using, sys.stdout.encoding
will be utf-8 automatically and your problem is solved. I don't know how
to do that within virtualenv, but the answer certainly does NOT involve
sys.setdefaultencoding()

On Windows, a "normal" console window on my system uses cp850:


D:\USERDATA\Gabriel>chcp
Tabla de códigos activa: 850

D:\USERDATA\Gabriel>python
Python 2.6.3 (r263rc1:75186, Oct 2 2009, 20:40:30) [MSC v.1500 32 bit
(Intel)]
on win32
Type "help", "copyright", "credits" or "license" for more information.
py> import sys
py> sys.getdefaultencoding()
'ascii'
py> sys.stdout.encoding
'cp850'
py> u = u"áñç"
py> print u
áñç
py> u
u'\xe1\xf1\xe7'
py> u.encode("cp850")
'\xa0\xa4\x87'
py> import unicodedata
py> unicodedata.name(u[0])
'LATIN SMALL LETTER A WITH ACUTE'

I opened another console, changed the code page to 1252 (the one used in
Windows applications; `chcp 1252`) and invoked Python again:

py> import sys
py> sys.getdefaultencoding()
'ascii'
py> sys.stdout.encoding
'cp1252'
py> u = u"áñç"
py> print u
áñç
py> u
u'\xe1\xf1\xe7'
py> u.encode("cp1252")
'\xe1\xf1\xe7'
py> import unicodedata
py> unicodedata.name(u[0])
'LATIN SMALL LETTER A WITH ACUTE'

As you can see, everything works fine without any need to change the
default encoding... Just make sure Python *knows* which encoding is being
used in the console on which it runs. On Ubuntu you may need to set the
LANG environment variable.
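
The two sessions can be condensed into a short script. This is a sketch in
Python 3 syntax; the encoded byte values are the ones shown in the sessions
above, while the two printed encodings depend entirely on the system the
script runs on:

```python
import locale
import sys

text = u"\xe1\xf1\xe7"  # 'áñç', the string from the sessions above

# Same text, different bytes, depending on the console code page
as_cp850 = text.encode("cp850")
as_cp1252 = text.encode("cp1252")

# What Python believes about the current environment (varies by system)
print("stdout encoding:   ", sys.stdout.encoding)
print("preferred encoding:", locale.getpreferredencoding())
print("cp850: ", as_cp850)
print("cp1252:", as_cp1252)
```

If the first two lines do not match what the terminal really uses, that is
the setting to fix, rather than the default encoding.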

--
Gabriel Genellina

Lele Gaifax

Oct 22, 2009, 4:25:16 AM
to pytho...@python.org
"Gabriel Genellina" <gags...@yahoo.com.ar> writes:

> On Wed, 21 Oct 2009 06:24:55 -0300, Lele Gaifax <le...@metapensiero.it>
> wrote:
>
>> "Gabriel Genellina" <gags...@yahoo.com.ar> writes:
>>
> nosetest should do nothing special. You should configure the environment
> so Python *knows* that your console understands utf-8. Once Python is
> aware of the *real* encoding your console is using, sys.stdout.encoding
> will be utf-8 automatically and your problem is solved. I don't know how
> to do that within virtualenv, but the answer certainly does NOT involve
> sys.setdefaultencoding()
>
> On Windows, a "normal" console window on my system uses cp850:
>
>
> D:\USERDATA\Gabriel>chcp
> Tabla de códigos activa: 850
>
> D:\USERDATA\Gabriel>python
> Python 2.6.3 (r263rc1:75186, Oct 2 2009, 20:40:30) [MSC v.1500 32 bit
> (Intel)]
> on win32
> Type "help", "copyright", "credits" or "license" for more information.
> py> import sys
> py> sys.getdefaultencoding()
> 'ascii'
> py> sys.stdout.encoding
> 'cp850'
> py> u = u"áñç"
> py> print u
> áñç

This is the same on my virtualenv:

$ python -c "import sys; print sys.getdefaultencoding(), sys.stdout.encoding"
ascii UTF-8
$ python -c "print u'\xe1\xf1\xe7'"
áñç

But look at this:

$ cat test.py
# -*- coding: utf-8 -*-

class TestAccents(object):
    u'\xe1\xf1\xe7'

    def test_simple(self):
        u'cioè'

        pass


$ nosetests test.py
.
----------------------------------------------------------------------
Ran 1 test in 0.002s

OK

$ nosetests -v test.py
ERROR

======================================================================
Traceback (most recent call last):
  File "/tmp/env/bin/nosetests", line 8, in <module>
    load_entry_point('nose==0.11.1', 'console_scripts', 'nosetests')()
  File "/tmp/env/lib/python2.6/site-packages/nose-0.11.1-py2.6.egg/nose/core.py", line 113, in __init__
    argv=argv, testRunner=testRunner, testLoader=testLoader)
  File "/usr/lib/python2.6/unittest.py", line 817, in __init__
    self.runTests()
  File "/tmp/env/lib/python2.6/site-packages/nose-0.11.1-py2.6.egg/nose/core.py", line 192, in runTests
    result = self.testRunner.run(self.test)
  File "/tmp/env/lib/python2.6/site-packages/nose-0.11.1-py2.6.egg/nose/core.py", line 63, in run
    result.printErrors()
  File "/tmp/env/lib/python2.6/site-packages/nose-0.11.1-py2.6.egg/nose/result.py", line 81, in printErrors
    _TextTestResult.printErrors(self)
  File "/usr/lib/python2.6/unittest.py", line 724, in printErrors
    self.printErrorList('ERROR', self.errors)
  File "/usr/lib/python2.6/unittest.py", line 730, in printErrorList
    self.stream.writeln("%s: %s" % (flavour,self.getDescription(test)))
  File "/usr/lib/python2.6/unittest.py", line 665, in writeln
    if arg: self.write(arg)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 10: ordinal not in range(128)

Who is the culprit here?

The fact is, encodings are the real Y2k problem, and they are here to
stay for a while!

thank you,

Gabriel Genellina

Oct 22, 2009, 6:17:15 AM
to pytho...@python.org
On Thu, 22 Oct 2009 05:25:16 -0300, Lele Gaifax <le...@metapensiero.it>
wrote:

> "Gabriel Genellina" <gags...@yahoo.com.ar> writes:
>
>> On Wed, 21 Oct 2009 06:24:55 -0300, Lele Gaifax <le...@metapensiero.it>
>> wrote:
>>
>>> "Gabriel Genellina" <gags...@yahoo.com.ar> writes:
>>>
>> nosetest should do nothing special. You should configure the environment
>> so Python *knows* that your console understands utf-8.

> This is the same on my virtualenv:
>
> $ python -c "import sys; print sys.getdefaultencoding(),
> sys.stdout.encoding"
> ascii UTF-8
> $ python -c "print u'\xe1\xf1\xe7'"
> áñç

Good, so stdout's encoding isn't really the problem.

> But look at this:
>
> File "/usr/lib/python2.6/unittest.py", line 730, in printErrorList
> self.stream.writeln("%s: %s" %
> (flavour,self.getDescription(test)))
> File "/usr/lib/python2.6/unittest.py", line 665, in writeln
> if arg: self.write(arg)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in
> position 10: ordinal not in range(128)
>
> Who is the culprit here?

unittest, or ultimately, this bug: http://bugs.python.org/issue4947

This is not specific to nosetest; unittest in verbose mode fails in the
same way.

fix: add this method to the _WritelnDecorator class in unittest.py (near
line 664):

    def write(self, arg):
        if isinstance(arg, unicode):
            arg = arg.encode(self.stream.encoding, "replace")
        self.stream.write(arg)
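
The effect of that method can be seen in isolation. The sketch below is in
Python 3 syntax and uses a hypothetical EncodingWriter class standing in for
unittest's _WritelnDecorator, with an in-memory byte stream playing the part
of an ascii-only terminal; none of these names come from unittest itself:

```python
import io


class EncodingWriter(object):
    """Hypothetical wrapper: encodes text before it reaches a byte stream."""

    def __init__(self, stream, encoding):
        self.stream = stream
        self.encoding = encoding

    def write(self, arg):
        # Text is encoded explicitly; characters the target encoding
        # cannot represent are replaced with '?' instead of raising.
        if isinstance(arg, str):
            arg = arg.encode(self.encoding, "replace")
        self.stream.write(arg)


raw = io.BytesIO()                   # stand-in for an ascii-only terminal
out = EncodingWriter(raw, "ascii")
out.write(u"cio\xe8 ... ok\n")       # the docstring that broke the run
result = raw.getvalue()
```

The accented letter is lost in the output, but the test run survives, which
is the trade-off the "replace" error handler makes.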

> The fact is, encodings are the real Y2k problem, and they are here to
> stay for a while!

Ok, but the idea is to solve the problem (or not let it happen in the
first place!), not hide it under the rug :)

--
Gabriel Genellina

Lele Gaifax

Oct 22, 2009, 7:59:00 AM
to pytho...@python.org
"Gabriel Genellina" <gags...@yahoo.com.ar> writes:

> On Thu, 22 Oct 2009 05:25:16 -0300, Lele Gaifax <le...@metapensiero.it>
> wrote:
>
>> Who is the culprit here?
>
> unittest, or ultimately, this bug: http://bugs.python.org/issue4947

Thank you. In particular I found
http://bugs.python.org/issue4947#msg87637 as the best fit, I think
that may be what's happening here.

> fix: add this method to the _WritelnDecorator class in unittest.py
> (near line 664):
>
>     def write(self, arg):
>         if isinstance(arg, unicode):
>             arg = arg.encode(self.stream.encoding, "replace")
>         self.stream.write(arg)

Uhm, that's almost as dirty as my reload(), you must admit! :-)

bye, lele.

Wolodja Wentland

Oct 22, 2009, 9:00:20 PM
to pytho...@python.org
On Thu, Oct 22, 2009 at 13:59 +0200, Lele Gaifax wrote:
> "Gabriel Genellina" <gags...@yahoo.com.ar> writes:

>> unittest, or ultimately, this bug: http://bugs.python.org/issue4947

You might also want to have a look at:

http://bugs.python.org/issue1293741

I hope this helps and that these bugs will be solved soon.

Wolodja


zooko

Oct 30, 2009, 12:40:14 PM
On Oct 20, 9:50 pm, "Gabriel Genellina" <gagsl-...@yahoo.com.ar>
wrote:

> DON'T do that. Really. Changing the default encoding is a horrible,
> horrible hack and causes a lot of problems.

I'm not convinced. I've read all of the posts and web pages and blog
entries decrying this practice over the last several years, but as far
as I can tell the actual harm that can result is limited (as long as
you set it to utf-8) and the practical benefits are substantial. This
is a pattern that I have no problem using:

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

The reason this doesn't cause too much harm is that anything that
would have worked with the original default encoding ('ascii') will
also work with the new utf-8 default encoding. As far as I've seen
from the aforementioned mailing list threads and blog posts and so on,
the worst thing that has ever happened as a result of this technique
is that something works for you but fails for someone else who doesn't
have this stanza
(http://tarekziade.wordpress.com/2008/01/08/syssetdefaultencoding-is-evil/).
That's bad, but probably just
including this stanza at the top of the file that you are sharing with
that other person instead of doing it in a sitecustomize.py file will
avoid that problem.

Regards,

Zooko

Gabriel Genellina

Oct 30, 2009, 6:33:41 PM
to pytho...@python.org
On Fri, 30 Oct 2009 13:40:14 -0300, zooko <zoo...@gmail.com> wrote:
> On Oct 20, 9:50 pm, "Gabriel Genellina" <gagsl-...@yahoo.com.ar>
> wrote:
>
>> DON'T do that. Really. Changing the default encoding is a horrible,
>> horrible hack and causes a lot of problems.
>
> I'm not convinced. I've read all of the posts and web pages and blog
> entries decrying this practice over the last several years, but as far
> as I can tell the actual harm that can result is limited (as long as
> you set it to utf-8) and the practical benefits are substantial. This
> is a pattern that I have no problem using:
>
> import sys
> reload(sys)
> sys.setdefaultencoding("utf-8")
>
> The reason this doesn't cause too much harm is that anything that
> would have worked with the original default encoding ('ascii') will
> also work with the new utf-8 default encoding.

Wrong. Dictionaries may start behaving incorrectly, for example. Normally,
two keys that compare equal cannot coexist in the same dictionary:

py> 1 == 1.0
True
py> d = {}
py> d[1] = '*'
py> d[1.0]
'*'
py> d[1.0] = '$'
py> d
{1: '$'}

1 and 1.0 are the same key, as far as the dictionary is concerned. For
this to work, both keys must have the same hash:

py> hash(1) == hash(1.0)
True

Now, let's set the default encoding to utf-8:

py> import sys
py> reload(sys)
<module 'sys' (built-in)>
py> sys.setdefaultencoding('utf-8')
py> x = u'á'
py> y = u'á'.encode('utf-8')
py> x
u'\xe1'
py> y
'\xc3\xa1'

(same as y='á' if the source encoding is set to utf-8, but I don't want to
depend on that). Just to be sure we're dealing with the right character:

py> import unicodedata
py> unicodedata.name(x)
'LATIN SMALL LETTER A WITH ACUTE'
py> unicodedata.name(y.decode('utf-8'))
'LATIN SMALL LETTER A WITH ACUTE'

Now, we can see that both x and y are equal:

py> x == y
True

x is an accented a, y is the same thing encoded using the default
encoding, both are equal. Fine. Now create a dictionary:

py> d = {}
py> d[x] = '*'
py> d[x]
'*'
py> x in d
True
py> y in d
False # ???
py> d[y] = 2
py> d
{u'\xe1': '*', '\xc3\xa1': 2} # ????

Since x==y, one should expect a single entry in the dictionary - but we
got two. That's because:

py> x == y
True
py> hash(x) == hash(y)
False

and this must *not* happen according to
http://docs.python.org/reference/datamodel.html#object.__hash__
"The only required property is that objects which compare equal have the
same hash value"
Considering that dictionaries in Python are used almost everywhere,
breaking this basic assumption is a really bad problem.
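
The same failure mode can be reproduced without touching the default
encoding. The sketch below uses a hypothetical class, LooseKey, whose __eq__
and __hash__ disagree in the same way unicode and byte strings do after
setdefaultencoding('utf-8'); the class and its fields are invented for
illustration only:

```python
class LooseKey(object):
    """Hypothetical key type that breaks the equal-implies-same-hash rule."""

    def __init__(self, text, salt):
        self.text = text
        self.salt = salt

    def __eq__(self, other):
        # Equality looks only at the text...
        return isinstance(other, LooseKey) and self.text == other.text

    def __hash__(self):
        # ...but the hash depends on something else entirely.
        return self.salt


x = LooseKey("a", 1)
y = LooseKey("a", 2)

d = {x: "*"}
found_before = y in d   # lookup goes by hash first, so y is not found
d[y] = 2                # and a second, "equal" key is stored
entries = len(d)
```

Here x == y is true, yet the dictionary ends up holding both keys, which is
the behaviour shown in the session above.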

Of course, all of this applies to Python 2.x; in Python 3.0 the problem
was solved differently; strings are unicode by default, and the default
encoding IS utf-8.

> As far as I've seen
> from the aforementioned mailing list threads and blog posts and so on,
> the worst thing that has ever happened as a result of this technique
> is that something works for you but fails for someone else who doesn't
> have this stanza
> (http://tarekziade.wordpress.com/2008/01/08/syssetdefaultencoding-is-evil/).
> That's bad, but probably just
> including this stanza at the top of the file that you are sharing with
> that other person instead of doing it in a sitecustomize.py file will
> avoid that problem.

And then you break all other libraries that the program is using,
including the Python standard library, because the default encoding is a
global setting. What if another library decides to use latin-1 as the
default encoding, using the same trick? Latest one wins...

You said "the practical benefits are substantial" but I, for myself,
cannot see any benefit. Perhaps if you post your real problems, someone
can find the solution.
The right way is to fix your program to do the right thing, not to hide
the bugs under the rug.

--
Gabriel Genellina
