print() and unicode strings (python 3.1)

7stud

Aug 24, 2009, 11:29:39 AM

======python 2.6 ======
import sys

print sys.getdefaultencoding()

s = u"\u20ac"
print s.encode("utf-8")

$ python2.6 1test.py
ascii
€

=====python 3.1 =======
import sys

print(sys.getdefaultencoding())

s = "€"
print(s.encode("utf-8"))
print(s)

$ python3.1 1test.py
utf-8
b'\xe2\x82\xac'

Traceback (most recent call last):
  File "1test.py", line 7, in <module>
    print(s)
UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in
position 0: ordinal not in range(128)

I don't understand why I'm getting an encode error in python 3.1.

"Martin v. Löwis"

Aug 24, 2009, 11:56:43 AM

> I don't understand why I'm getting an encode error in python 3.1.

The default encoding is not relevant here at all. Look at
sys.stdout.encoding.

Regards,
Martin
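
A quick way to see the distinction being made here is to print both values side by side. This is only a sketch: the helper name describe_encodings is made up for illustration, and whatever it reports for sys.stdout depends entirely on the environment the script is started from.

```python
import sys

def describe_encodings():
    """Return the two encodings that are easy to confuse."""
    return {
        # Default for str.encode() / bytes.decode() when no codec is named;
        # this is 'utf-8' on Python 3 regardless of the locale.
        "default": sys.getdefaultencoding(),
        # Codec print() uses when writing str to stdout; chosen from the
        # environment. May be None if stdout was replaced by another object.
        "stdout": getattr(sys.stdout, "encoding", None),
    }

for name, value in describe_encodings().items():
    print(name, "=", value)
```

If the second value comes back as US-ASCII (or similar), print() of a non-ASCII str can be expected to fail no matter what the first value says.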

7stud

Aug 24, 2009, 12:54:03 PM

Hi,

Thanks for the response. I get US-ASCII for both 2.6 and 3.1:

===python 3.1======
import sys

print(sys.stdout.encoding)

$ python3.1 1test.py
US-ASCII

I can't figure out a way to programmatically set the encoding for
sys.stdout. So where does that leave me? python 3.1 won't let me
explicitly encode my unicode string, and python 3.1 implicitly does
the encoding with the wrong codec. And why would any programmer rely
on python 3.1's implicit encoding of unicode strings anyway?
Presumably, different systems will have different encodings for
sys.stdout, and some of those encodings might cause encode errors.

Stefan Behnel

Aug 24, 2009, 2:19:28 PM

7stud wrote:
> python 3.1 won't let me
> explicitly encode my unicode string

Sure it does. But encoding a non-ASCII string to ASCII will necessarily fail.

> and python 3.1 implicitly does
> the encoding with the wrong codec.

That's not a Python problem, though. Your terminal is configured for
US-ASCII, so you can't output anything but US-ASCII characters.

Change your terminal setup to e.g. UTF-8 and see how things start working.

Stefan

7stud

Aug 24, 2009, 2:45:17 PM

On Aug 24, 12:19 pm, Stefan Behnel <stefan...@behnel.de> wrote:
> 7stud wrote:
> > python 3.1 won't let me
> > explicitly encode my unicode string
>
> Sure it does. But encoding a non-ASCII string to ASCII will necessarily fail.
>

As you should be able to see in the python 3.1 example I posted, I did
not encode the string using the ascii codec. I encoded it with the
utf-8 codec, and unfortunately in python 3.1 that creates a "bytes
string", and print()'ing a bytes string does not produce human
readable text.

> > and python 3.1 implicitly does
> > the encoding with the wrong codec.
>
> That's not a Python problem, though. Your terminal is configured for
> US-ASCII, so you can't output anything but US-ASCII characters.
>

My terminal is configured for utf-8, and from the output of the python
2.6 example I posted, it should be apparent that my terminal is
capable of rendering the euro character.
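
For reference, the b'...' output is what print() shows for any bytes object: it prints the repr of the bytes rather than decoding them. A minimal sketch of that behaviour, using only the euro sign from the examples above (the variable names are illustrative):

```python
euro = "\u20ac"                 # the euro sign as a str
encoded = euro.encode("utf-8")  # explicit encoding yields a bytes object

# print() on bytes displays the repr, not the decoded character.
shown = repr(encoded)
print(shown)

# Decoding with the same codec round-trips back to the original text.
assert encoded.decode("utf-8") == euro
```

So the explicit encode() did work; the bytes are simply not decoded again on the way to the screen.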

"Martin v. Löwis"

Aug 24, 2009, 4:41:37 PM

> I can't figure out a way to programmatically set the encoding for
> sys.stdout. So where does that leave me?

You should be setting the terminal encoding administratively, not
programmatically.

Regards,
Martin

7stud

Aug 24, 2009, 10:42:36 PM

The terminal encoding has always been utf-8. It was not set
programmatically.

It seems to me that python 3.1's string handling is broken.
Apparently, in python 3.1 I am unable to explicitly set the encoding
of a string and print() it out with the result being human readable
text. On the other hand, if I let python do the encoding implicitly,
python uses a codec I don't want it to.

Ned Deily

Aug 25, 2009, 12:09:53 AM

to pytho...@python.org

In article
<e5e2ec2e-2b4a-4ca8...@v23g2000pro.googlegroups.com>,
7stud <bbxx78...@yahoo.com> wrote:

If you are running on a Unix-y system, check your locale settings (LANG,
LC_*, et al). I think you'll likely find that your locale is really not
UTF-8. The following was on Python 3.1 on OS X 10.5, with similar results
on Debian Linux:

$ cat t3.py
import sys
print(sys.stdout.encoding)

s = "€"
print(s.encode("utf-8"))
print(s)

$ export LANG=en_US.UTF-8
$ python3.1 t3.py
UTF-8
b'\xe2\x82\xac'
€

$ export LANG=C
$ python3.1 t3.py
US-ASCII
b'\xe2\x82\xac'
Traceback (most recent call last):
  File "t3.py", line 7, in <module>
    print(s)
UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in
position 0: ordinal not in range(128)

--
Ned Deily,
n...@acm.org
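
The same effect can be reproduced without touching LANG by writing through a text stream whose encoding is chosen by hand. This is a sketch under the assumption that an in-memory io.TextIOWrapper behaves like sys.stdout for this purpose; the helper write_text is hypothetical.

```python
import io

def write_text(text, encoding):
    """Write text through a text stream using `encoding`; return the bytes produced."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    # Encoding happens while writing/flushing; a character the codec
    # cannot represent raises UnicodeEncodeError here.
    stream.write(text)
    stream.flush()
    return raw.getvalue()

# A UTF-8 stream accepts the euro sign.
print(write_text("\u20ac", "utf-8"))

# An ASCII stream rejects it, just like stdout under LANG=C.
try:
    write_text("\u20ac", "ascii")
except UnicodeEncodeError as exc:
    print("ascii stream failed:", exc.reason)
```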

7stud

Aug 25, 2009, 6:41:54 AM

On Aug 24, 10:09 pm, Ned Deily <n...@acm.org> wrote:
> In article
> <e5e2ec2e-2b4a-4ca8-8c0f-109e5f4eb...@v23g2000pro.googlegroups.com>,
> 7stud <bbxx789_0...@yahoo.com> wrote:

Hi,

Thanks for the response. My OS is mac osx 10.4.11. I'm not really
sure how to check my locale settings. Here is some stuff I tried:

$ echo $LANG

$ echo $LC_ALL

$ echo $LC_CTYPE

$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C"

$ man locale
...
...
...

ENVIRONMENT:
LANG
Used as a substitute for any unset LC_* variable. If LANG is unset it
will act as if set to "C". If any of LANG or LC_* are set to invalid
values locale acts as if they are all unset.

===========

As in your last example, my 'C' settings mean that an ascii codec is
used somewhere to encode() the unicode string.

--
The locale C or POSIX is a portable locale; its LC_CTYPE part
corresponds to the 7-bit ASCII character set.

http://linux.about.com/library/cmd/blcmdl3_setlocale.htm
--

Is this the way it works:

1) python sets the codec for sys.stdout to the LANG environment
variable.
2) It doesn't matter that my terminal's encoding is set to utf-8
because output has to pass through sys.stdout first.

So:

a) My terminal's environment is telling python (and all other programs
running in the terminal) that output sent to sys.stdout must be
encoded in ascii.
b) The solution is to set a LANG environment variable.

Why does echoing $LC_ALL or $LC_CTYPE just give me a blank string?

Previously, I've set environment variables that I want to be
permanent, e.g. PATH, in ~/.bash_profile, so I did this:

~/.bash_profile:
--------------
...
...
LANG="en_US.UTF-8"
export LANG

and now python 3.1 acts like I expect it to:

-------
import locale
import sys

print(locale.getlocale(locale.LC_CTYPE))
print(sys.stdout.encoding)

s = "€"
print(s)

print(s.encode("utf-8"))

--output:--
('en_US', 'UTF8')
UTF-8
€
b'\xe2\x82\xac'

----------

In conclusion, as far as I can tell, if your python 3.1 program tries
to output a unicode string, and the unicode string cannot be encoded
by the codec specified in the user's LANG environment variable**, then
the user will get an encode error. Just because the programmer's
system can handle the output doesn't mean that another user's system
can. I guess that's the way it goes: if a user's environment is
telling all programs that it only wants ascii output to go to the
screen (sys.stdout), you can't (or shouldn't) do anything about it.

**Or if the LANG environment variable is not present, then the codec
corresponding to the locale settings ('C' corresponds to ascii).

some good locale info:
http://www.chemie.fu-berlin.de/chemnet/use/info/libc/libc_19.html

Nobody

Aug 25, 2009, 8:34:33 AM

On Tue, 25 Aug 2009 03:41:54 -0700, 7stud wrote:

> Why does echoing $LC_ALL or $LC_CTYPE just give me a blank string?

Because the variables aren't set.

The default locale for a particular category (e.g. LC_CTYPE) is taken from
$LC_ALL if that is set, otherwise $LC_CTYPE, otherwise $LANG, otherwise
"C" is used.

Normally, you would either set LANG (and possibly some individual LC_*
variables), or LC_ALL. There's no point in setting all of them.
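
The lookup order just described can be written out as a small function. This is a sketch of the precedence rule only, not of how the C library or Python actually implement it; the name effective_ctype is made up for illustration, and empty values are treated as unset.

```python
import os

def effective_ctype(env=None):
    """Return the locale name that would apply to LC_CTYPE for `env`."""
    env = os.environ if env is None else env
    # Highest priority first: LC_ALL, then the category itself, then LANG.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # unset and empty both fall through to the next name
            return value
    # Nothing set at all: the portable "C" locale applies.
    return "C"

print(effective_ctype({}))
print(effective_ctype({"LANG": "en_US.UTF-8"}))
print(effective_ctype({"LANG": "en_US.UTF-8", "LC_ALL": "C"}))
```

Calling effective_ctype() with no argument inspects the current process environment, which is a quick check of what a script started from the same shell would see.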

> In conclusion, as far as I can tell, if your python 3.1 program tries
> to output a unicode string, and the unicode string cannot be encoded
> by the codec specified in the user's LANG environment variable**, then
> the user will get an encode error. Just because the programmer's
> system can handle the output doesn't mean that another user's system
> can. I guess that's the way it goes: if a user's environment is
> telling all programs that it only wants ascii output to go to the
> screen (sys.stdout), you can't (or shouldn't) do anything about it.
>
> **Or if the LANG environment variable is not present, then the codec
> corresponding to the locale settings ('C' corresponds to ascii).

The underlying OS primitive can only handle bytes. If you read or write a
(unicode) string, Python needs to know which encoding is used. For Python
file objects created by the user (via open() etc), you can specify the
encoding; for those created by the runtime (e.g. sys.stdin), Python uses
the locale's LC_CTYPE category to select an encoding.

Data written to or read from text streams is encoded or decoded using the
stream's encoding. Filenames are encoded and decoded using the
filesystem encoding (sys.getfilesystemencoding()). Anything else uses the
default encoding (sys.getdefaultencoding()).

In Python 3, text streams are handled using io.TextIOWrapper:

http://docs.python.org/3.1/library/io.html#text-i-o

This implements a stream which can read and/or write text data on top of
one which can read and/or write binary data. The sys.std{in,out,err}
streams are instances of TextIOWrapper. You can get the underlying
binary stream from the "buffer" attribute, e.g.:

sys.stdout.buffer.write(b'hello world\n')

If you need to force a specific encoding (e.g. if the user has specified
an encoding via a command-line option), you can detach the existing
wrapper and create a new one, e.g.:

sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding = new_encoding)
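
A runnable variant of the detach-and-rewrap idea, tried on an in-memory stream so that the real sys.stdout is left alone. The helper rewrap and the ASCII starting point are assumptions made for the demonstration; applying the same call to sys.stdout should work the same way, but that is not exercised here.

```python
import io

def rewrap(stream, encoding, errors="strict"):
    """Replace a text stream's wrapper with one using a different encoding."""
    stream.flush()
    # detach() hands back the underlying binary buffer; the old wrapper
    # must not be used afterwards.
    return io.TextIOWrapper(stream.detach(), encoding=encoding, errors=errors)

# Start with a stream that can only encode ASCII, as under LANG=C.
old = io.TextIOWrapper(io.BytesIO(), encoding="ascii")

# Swap in a UTF-8 wrapper over the same binary buffer.
new = rewrap(old, "utf-8")
new.write("\u20ac")
new.flush()
print(new.encoding, new.buffer.getvalue())
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding=...) is a simpler route to the same result; that method did not exist in 3.1.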

Piet van Oostrum

Aug 26, 2009, 4:41:56 AM

>>>>> 7stud <bbxx78...@yahoo.com> (7) wrote:

>7> Thanks for the response. My OS is mac osx 10.4.11. I'm not really
>7> sure how to check my locale settings. Here is some stuff I tried:

>7> $ echo $LANG

>7> $ echo $LC_ALL

>7> $ echo $LC_CTYPE

>7> $ locale
>7> LANG=
>7> LC_COLLATE="C"
>7> LC_CTYPE="C"
>7> LC_MESSAGES="C"
>7> LC_MONETARY="C"
>7> LC_NUMERIC="C"
>7> LC_TIME="C"
>7> LC_ALL="C"

IIRC, Mac OS X 10.4 does not set LANG or LC_* automatically. In 10.5
Terminal has an option in the preferences to set LANG according to the
encoding chosen (and presumably the language of the user).

--
Piet van Oostrum <pi...@cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: pi...@vanoostrum.org

7stud

Aug 26, 2009, 9:31:48 AM

On Aug 25, 6:34 am, Nobody <nob...@nowhere.com> wrote:
> The underlying OS primitive can only handle bytes. If you read or write a
> (unicode) string, Python needs to know which encoding is used. For Python
> file objects created by the user (via open() etc), you can specify the
> encoding; for those created by the runtime (e.g. sys.stdin), Python uses
> the locale's LC_CTYPE category to select an encoding.
>
> Data written to or read from text streams is encoded or decoded using the
> stream's encoding. Filenames are encoded and decoded using the
> filesystem encoding (sys.getfilesystemencoding()). Anything else uses the
> default encoding (sys.getdefaultencoding()).
>
> In Python 3, text streams are handled using io.TextIOWrapper:
>
> http://docs.python.org/3.1/library/io.html#text-i-o
>
> This implements a stream which can read and/or write text data on top of
> one which can read and/or write binary data. The sys.std{in,out,err}
> streams are instances of TextIOWrapper. You can get the underlying
> binary stream from the "buffer" attribute, e.g.:
>
> sys.stdout.buffer.write(b'hello world\n')
>
> If you need to force a specific encoding (e.g. if the user has specified
> an encoding via a command-line option), you can detach the existing
> wrapper and create a new one, e.g.:
>
> sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding = new_encoding)

Thanks for the details.

Anonymous

Aug 26, 2009, 4:56:16 PM

Have you considered including an encoding line at the top of your file, as described in PEP 0263:
http://www.python.org/dev/peps/pep-0263/

I just ran into a similar error, but it went away when I included
# coding: utf-8
as the first line in my file.
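
For what it is worth, the PEP 263 declaration tells Python how to decode the bytes of the source file. It does not change the codec print() uses for sys.stdout, so it would not be expected to cure the locale problem discussed in this thread (and in Python 3 UTF-8 is already the default source encoding). A small sketch of what the declaration does affect, with a made-up helper load_constant:

```python
def load_constant(source_bytes):
    """Compile source given as bytes and return the value it binds to `s`."""
    namespace = {}
    exec(compile(source_bytes, "<demo>", "exec"), namespace)
    return namespace["s"]

# The same character (e with acute accent), stored in two different encodings.
latin1_src = b"# coding: latin-1\ns = '\xe9'\n"
utf8_src = b"# coding: utf-8\ns = '\xc3\xa9'\n"

# Each declaration tells the compiler how to decode its own bytes,
# so both sources produce the same one-character str.
print(load_constant(latin1_src) == load_constant(utf8_src))
```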