the stupid encoding problem to stdout

Sérgio Monteiro Basto

Jun 8, 2011, 10:18:37 PM

hi,
cat test.py
#!/usr/bin/env python
#-*- coding: utf-8 -*-
u = u'moçambique'
print u.encode("utf-8")
print u

chmod +x test.py
./test.py
moçambique
moçambique

./test.py > output.txt
Traceback (most recent call last):
  File "./test.py", line 5, in <module>
    print u
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 2: ordinal not in range(128)

This is Python 2.7.
How do I tell Python to send the same thing to stdout and
to the file output.txt?

It doesn't seem logical that the behaviour changes when the
output is sent to a file.

Thanks,
Sérgio M. B.

Ben Finney

Jun 8, 2011, 10:39:44 PM

Sérgio Monteiro Basto <serg...@sapo.pt> writes:

> ./test.py
> moçambique
> moçambique

In this case your terminal is reporting its encoding to Python, and it's
capable of taking the UTF-8 data that you send to it in both cases.

> ./test.py > output.txt
> Traceback (most recent call last):
>   File "./test.py", line 5, in <module>
>     print u
> UnicodeEncodeError: 'ascii' codec can't encode character
> u'\xe7' in position 2: ordinal not in range(128)

In this case your shell has no preference for the encoding (since you're
redirecting output to a file).

In the first print statement you specify the encoding UTF-8, which is
capable of encoding the characters.

In the second print statement you haven't specified any encoding, so the
default ASCII encoding is used.

Moral of the tale: Make sure an encoding is specified whenever data
steps between bytes and characters.
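
The difference between the two print statements can be reproduced without any redirection. This is a minimal sketch, written so that it should run on Python 3 as well as 2.7; `TEXT` and the other variable names are illustrative only.

```python
TEXT = u'mo\xe7ambique'  # \xe7 is U+00E7, the c-cedilla from the original script

# Explicit encoding: UTF-8 can represent every character in TEXT.
utf8_bytes = TEXT.encode('utf-8')

# What the redirected "print u" effectively attempted: ASCII cannot
# represent U+00E7, so the encode step raises UnicodeEncodeError.
try:
    TEXT.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    failing_position = exc.start  # index of the offending character
```

The failing index matches the "position 2" reported in the traceback.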

> It doesn't seem logical that the behaviour changes when the output is sent to a file.

They're different files, which have been opened with different
encodings. If you want a different encoding, you need to specify that.

--
\ “There's no excuse to be bored. Sad, yes. Angry, yes. |
`\ Depressed, yes. Crazy, yes. But there's no excuse for boredom, |
_o__) ever.” —Viggo Mortensen |
Ben Finney

Benjamin Kaplan

Jun 8, 2011, 11:00:38 PM

to pytho...@python.org

2011/6/8 Sérgio Monteiro Basto <serg...@sapo.pt>:

That's not a terminal vs file thing. It's a "file that declares its
encoding" vs a "file that doesn't declare its encoding" thing. Your
terminal declares that it is UTF-8. So when you print a Unicode string
to your terminal, Python knows that it's supposed to turn it into
UTF-8. When you pipe the output to a file, that file doesn't declare
an encoding. So rather than guess which encoding you want, Python
defaults to the lowest common denominator: ASCII. If you want
something to be a particular encoding, you have to encode it yourself.

You have a couple of choices on how to make it work:
1) Play dumb and always encode as UTF-8. This would look really weird
if someone tried running your program in a terminal with a CP437
encoding (like cmd.exe on at least the US version of Windows), but it
would never crash.
2) Check sys.stdout.encoding. If it's ascii, then encode your unicode
string with the unicode-escape codec, which substitutes an escape
sequence for every non-ASCII character.
3) Check to see if sys.stdout.isatty() and have different behavior for
terminals vs files. If you're on a terminal that doesn't declare its
encoding, encoding it as UTF-8 probably won't help. If you're writing
to a file, that might be what you want to do.
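
Options 2 and 3 can be combined in one helper. The following is only a sketch under stated assumptions: `encode_for_stream` and `FakeTerminal` are hypothetical names, and the `backslashreplace` error handler is used here in place of an escape codec so that ASCII characters stay readable.

```python
import io


def encode_for_stream(text, stream):
    """Pick bytes for `text` based on what `stream` reports about itself."""
    is_tty = hasattr(stream, 'isatty') and stream.isatty()
    if not is_tty:
        # Option 3: a file or pipe declares nothing, so choose UTF-8 ourselves.
        return text.encode('utf-8')
    declared = getattr(stream, 'encoding', None)
    if not declared or declared.lower() in ('ascii', 'us-ascii', 'ansi_x3.4-1968'):
        # Option 2: an ASCII-only terminal gets escape sequences, not an error.
        return text.encode('ascii', 'backslashreplace')
    # The terminal declared an encoding; honour it.
    return text.encode(declared)


class FakeTerminal(object):
    """Hypothetical stand-in for a terminal, used only to exercise the helper."""
    def __init__(self, encoding):
        self.encoding = encoding

    def isatty(self):
        return True


to_file = encode_for_stream(u'mo\xe7ambique', io.BytesIO())
to_ascii_tty = encode_for_stream(u'mo\xe7ambique', FakeTerminal('ascii'))
to_latin1_tty = encode_for_stream(u'mo\xe7ambique', FakeTerminal('latin-1'))
```

The last branch can still raise if the declared terminal encoding cannot represent a character; whether to add an error handler there is a design choice.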

Sérgio Monteiro Basto

Jun 9, 2011, 5:16:25 PM

Ben Finney wrote:

> Sérgio Monteiro Basto <serg...@sapo.pt> writes:
>
>> ./test.py
>> moçambique
>> moçambique
>
> In this case your terminal is reporting its encoding to Python, and it's
> capable of taking the UTF-8 data that you send to it in both cases.
>
>> ./test.py > output.txt
>> Traceback (most recent call last):
>>   File "./test.py", line 5, in <module>
>>     print u
>> UnicodeEncodeError: 'ascii' codec can't encode character
>> u'\xe7' in position 2: ordinal not in range(128)
>
> In this case your shell has no preference for the encoding (since you're
> redirecting output to a file).
>

How do I tell Python that I want it to write UTF-8 to files?

Sérgio Monteiro Basto

Jun 9, 2011, 5:14:17 PM

Benjamin Kaplan wrote:

Exactly the opposite: if Python doesn't know the encoding, it should not try
to decode to ASCII.

>
> You have a couple of choices on how to make it work:
> 1) Play dumb and always encode as UTF-8. This would look really weird
> if someone tried running your program in a terminal with a CP437
> encoding (like cmd.exe on at least the US version of Windows), but it
> would never crash.

I want Python not to care about the terminal's encoding and to send the
characters as they are, whether to a terminal or to a file.

> 2) Check sys.stdout.encoding. If it's ascii, then encode your unicode
> string with the unicode-escape codec, which substitutes an escape
> sequence for every non-ASCII character.

How do I make sys.stdout.encoding always UTF-8? I would at least like to have
a consistent sys.stdout.encoding.

> 3) Check to see if sys.stdout.isatty() and have different behavior for
> terminals vs files. If you're on a terminal that doesn't declare its
> encoding, encoding it as UTF-8 probably won't help. If you're writing
> to a file, that might be what you want to do.

Thanks,

Nobody

Jun 9, 2011, 5:46:34 PM

On Thu, 09 Jun 2011 22:14:17 +0100, Sérgio Monteiro Basto wrote:

> Exactly the opposite: if Python doesn't know the encoding, it should not try
> to decode to ASCII.

What should it decode to, then?

You can't write characters to a stream, only bytes.

> I want Python not to care about the terminal's encoding and to send the
> characters as they are, whether to a terminal or to a file.

You can't write characters to a stream, only bytes.
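
To see why "send the characters as they are" has no single meaning, compare the bytes that different encodings produce for the same ten characters. A small sketch, with illustrative variable names:

```python
TEXT = u'mo\xe7ambique'  # ten characters, one of them non-ASCII

as_utf8 = TEXT.encode('utf-8')        # c-cedilla becomes two bytes: C3 A7
as_latin1 = TEXT.encode('latin-1')    # c-cedilla becomes one byte:  E7
as_utf16 = TEXT.encode('utf-16-le')   # every character becomes two bytes

lengths = (len(TEXT), len(as_utf8), len(as_latin1), len(as_utf16))
```

Three encodings give three different byte strings, so something has to choose one before anything reaches the stream.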

Ben Finney

Jun 9, 2011, 7:19:52 PM

Sérgio Monteiro Basto <serg...@sapo.pt> writes:

> Ben Finney wrote:
>
> > In this case your shell has no preference for the encoding (since
> > you're redirecting output to a file).
>
> How do I tell Python that I want it to write UTF-8 to files?

You already did:

> > In the first print statement you specify the encoding UTF-8, which
> > is capable of encoding the characters.

If you want UTF-8 on the byte stream for a file, specify it when opening
the file, or when reading or writing the file.
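
One way to specify the encoding when opening the file is `io.open`, which accepts an `encoding` argument on both Python 2.7 and Python 3. A minimal sketch; the temporary directory and the file name are placeholders:

```python
import io
import os
import tempfile

TEXT = u'mo\xe7ambique'
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

# The file object encodes text to UTF-8 on every write.
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(TEXT)

# Reading the file back in binary mode shows the bytes that were stored.
with io.open(path, 'rb') as handle:
    raw = handle.read()
```

With the encoding fixed at open time, the individual write calls no longer need an explicit `.encode('utf-8')`.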

--
\ “But Marge, what if we chose the wrong religion? Each week we |
`\ just make God madder and madder.” —Homer, _The Simpsons_ |
_o__) |
Ben Finney

Terry Reedy

Jun 9, 2011, 8:14:26 PM

to pytho...@python.org

Character representations are for people; byte representations are for
computers.

--
Terry Jan Reedy

Mark Tolonen

Jun 9, 2011, 8:57:03 PM

to pytho...@python.org

"Sérgio Monteiro Basto" <serg...@sapo.pt> wrote in message
news:4df137a7$0$30580$a729...@news.telepac.pt...

> How do I make sys.stdout.encoding always UTF-8? I would at least like to have
> a consistent sys.stdout.encoding.

There is an environment variable that can force Python I/O to use a specific
encoding:

PYTHONIOENCODING=utf-8
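
From a shell this would typically be used as `PYTHONIOENCODING=utf-8 ./test.py > output.txt`. The effect can also be checked from Python itself. The sketch below starts a child interpreter whose stdout is a pipe (so, like the redirected case, not a terminal) and captures the bytes it writes; `child_code` is just an illustrative one-liner.

```python
import os
import subprocess
import sys

# The child prints a unicode string without encoding it explicitly.
child_code = "print(u'mo\\xe7ambique')"

# Copy the current environment and add the variable for the child only.
env = dict(os.environ, PYTHONIOENCODING='utf-8')

raw = subprocess.check_output([sys.executable, '-c', child_code], env=env)
```

Setting the variable only for the child process avoids changing the behaviour of every other Python program in the session.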

-Mark

Sérgio Monteiro Basto

Jun 9, 2011, 9:11:10 PM

Nobody wrote:

>> Exactly the opposite: if Python doesn't know the encoding, it should not try
>> to decode to ASCII.
>
> What should it decode to, then?

UTF-8, as with a tty. How do I change this default?

> You can't write characters to a stream, only bytes.
>

OK, got the point.
Thanks,

Sérgio Monteiro Basto

Jun 9, 2011, 9:17:16 PM

Mark Tolonen wrote:

Excellent, thanks, double thanks.

BTW: this should be set by default on UTF-8 systems like Fedora, Ubuntu, Debian,
Red Hat, and all Linux distributions. I will certainly put this in the startup
of my systems.

> -Mark
--
Sérgio M. B.

Ben Finney

Jun 9, 2011, 9:45:51 PM

Sérgio Monteiro Basto <serg...@sapo.pt> writes:

> Nobody wrote:
>
> >> Exactly the opposite: if Python doesn't know the encoding, it should not
> >> try to decode to ASCII.

Are you advocating that Python should refuse to write characters unless
the encoding is specified? I could sympathise with that, but currently
that's not what Python does; instead it defaults to the ASCII codec.

> > What should it decode to, then?
>
> UTF-8, as in tty

But when you explicitly redirect to a file, it's *not* going to a TTY.
It's going to a file whose encoding isn't known unless you specify it.

--
\ “Reality must take precedence over public relations, for nature |
`\ cannot be fooled.” —Richard P. Feynman |
_o__) |
Ben Finney

Sérgio Monteiro Basto

Jun 9, 2011, 9:59:52 PM

Ben Finney wrote:

>> >> Exactly the opposite: if Python doesn't know the encoding, it should not
>> >> try to decode to ASCII.
>
> Are you advocating that Python should refuse to write characters unless
> the encoding is specified? I could sympathise with that, but currently
> that's not what Python does; instead it defaults to the ASCII codec.

That could be a solution ;) or a smarter default based on LANG, for example (as
many GNU tools do).
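
A locale-based default of that kind can be approximated in user code with `locale.getpreferredencoding()`. This is only a sketch of the idea, not what Python 2.7 does; `guess_output_encoding` and its `fallback` parameter are hypothetical names.

```python
import io
import locale


def guess_output_encoding(stream, fallback='utf-8'):
    """Prefer the stream's declared encoding, then the locale, then `fallback`."""
    declared = getattr(stream, 'encoding', None)
    if declared:
        return declared
    try:
        # Derived from the user's locale settings (LANG / LC_* on POSIX).
        from_locale = locale.getpreferredencoding()
    except locale.Error:
        from_locale = None
    return from_locale or fallback


# A stream that declares an encoding keeps it.
declared_case = guess_output_encoding(
    io.TextIOWrapper(io.BytesIO(), encoding='latin-1'))

# A raw byte stream declares nothing, so the locale decides.
undeclared_case = guess_output_encoding(io.BytesIO())
```

The result for the undeclared case depends on the machine's locale, which is exactly the trade-off being discussed in this thread.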

--
Sérgio M. B.

Laurent Claessens

Jun 10, 2011, 1:47:34 AM

to Sérgio Monteiro Basto

On 09/06/2011 04:18, Sérgio Monteiro Basto wrote:
> hi,
> cat test.py
> #!/usr/bin/env python
> #-*- coding: utf-8 -*-
> u = u'moçambique'
> print u.encode("utf-8")
> print u
>
> chmod +x test.py

> ./test.py
> moçambique
> moçambique

The following tries to encode before printing. If you pass an object that is
already UTF-8 encoded, it just prints it; if not, it encodes it. All the
"print" statements go through MyPrint.write.

#!/usr/bin/env python
#-*- coding: utf-8 -*-

import sys

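class MyPrint(object):
    """Replace sys.stdout with a wrapper that always writes UTF-8 bytes."""
    def __init__(self):
        self.old_stdout = sys.stdout
        sys.stdout = self

    def write(self, text):
        try:
            encoded = text.encode("utf8")
        except UnicodeDecodeError:
            # Python 2: text was a byte string that is already encoded
            encoded = text
        self.old_stdout.write(encoded)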

MyPrint()

u = u'moçambique'
print u.encode("utf-8")
print u

TEST :

$ ./test.py
moçambique
moçambique

$ ./test.py > test.txt
$ cat test.txt
moçambique
moçambique

By the way, my code will not help with error messages. I think errors are
printed through sys.stderr.write, so if you want to do
raise "moçambique"
you should think about handling sys.stderr in the MyPrint class as well.
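
A standard-library alternative to a hand-written wrapper is `codecs.getwriter`, which works for any byte stream and so could cover stderr as well. The sketch below uses an in-memory buffer as a stand-in for the real stream; `utf8_writer` is a hypothetical helper name.

```python
import codecs
import io


def utf8_writer(byte_stream):
    """Wrap a byte stream so that writing text encodes it as UTF-8."""
    return codecs.getwriter('utf-8')(byte_stream)


buffer = io.BytesIO()           # stand-in for the real output stream
writer = utf8_writer(buffer)
writer.write(u'mo\xe7ambique\n')
written = buffer.getvalue()
```

On Python 2.7 the usual form would be `sys.stdout = utf8_writer(sys.stdout)`. Unlike MyPrint, this wrapper expects text: on Python 2 it raises if it is handed a byte string that already contains non-ASCII bytes.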

If you know French, I strongly recommend "Comprendre les erreurs
unicode" by Victor Stinner :
http://dl.afpy.org/pycon-fr-09/Comprendre_les_erreurs_unicode.pdf

Have a nice day
Laurent

Sérgio Monteiro Basto

Jun 10, 2011, 11:11:10 AM

Ben Finney wrote:

>> > What should it decode to, then?
>>
>> UTF-8, as in tty
>

> But when you explicitly redirect to a file, it's not going to a TTY.

> It's going to a file whose encoding isn't known unless you specify it.

OK, after thinking about this: the problem exists because Python wants to be
smart with ttys, which from my point of view is wrong. It should not encode to
UTF-8 just because the tty is in UTF-8. Python should always encode to the same
thing. If the default is ASCII, it should always encode to ASCII.
Yes, it should send ASCII to the tty too. If I send my code to someone on
Windows who uses a tty with cp1000whatever, it shouldn't give decoding errors;
it should send ASCII.
We could change the default to whatever we want, but without such a "default
change" Python should not change its behavior depending on the output.
Yes, I prefer strange output on a different platform to decode errors.
And I have /usr/bin/iconv.

Thanks for your attention, and sorry about my very limited English.
--
Sérgio M. B.

Ian Kelly

Jun 10, 2011, 12:58:56 PM

to Python

2011/6/10 Sérgio Monteiro Basto <serg...@sapo.pt>:

> OK, after thinking about this: the problem exists because Python wants to be
> smart with ttys, which from my point of view is wrong. It should not encode to
> UTF-8 just because the tty is in UTF-8. Python should always encode to the same
> thing. If the default is ASCII, it should always encode to ASCII.
> Yes, it should send ASCII to the tty too. If I send my code to someone on
> Windows who uses a tty with cp1000whatever, it shouldn't give decoding errors;
> it should send ASCII.

You can't have your cake and eat it too. If Python needs to output a
string in ascii, and that string can't be represented in ascii, then
raising an exception is the only reasonable thing to do. You seem to
be suggesting that Python should do an implicit output.encode('ascii',
'replace') on all Unicode output, which might be okay for a TTY, but
you wouldn't want that for file output; it would allow Python to
silently create garbage data.

And what if you send your code to somebody with a UTF-16 terminal?
You try to output ASCII to that, and you're just going to get complete
garbage.

If you want your output to behave that way, then all you have to do is
specify that with an explicit encode step.
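
For reference, this is what such an explicit encode step produces with two of the standard error handlers. A small sketch with illustrative variable names:

```python
TEXT = u'mo\xe7ambique'

# 'replace' swaps each unencodable character for '?'; the original
# character cannot be recovered from the output.
replaced = TEXT.encode('ascii', 'replace')

# 'backslashreplace' keeps the information as a visible escape sequence.
escaped = TEXT.encode('ascii', 'backslashreplace')

# Decoding the 'replace' output does not give the original text back.
lossy_round_trip = replaced.decode('ascii')
```

The `replace` result is the "garbage data" case: nothing fails, but the c-cedilla is gone for good.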

> We could change the default to whatever we want, but without such a "default
> change" Python should not change its behavior depending on the output.
> Yes, I prefer strange output on a different platform to decode errors.

Sorry, I disagree. If your program is going to fail, it's better that
it fail noisily (with an error) than silently (with no notice that
anything is wrong).

Chris Angelico

Jun 10, 2011, 6:07:04 PM

to pytho...@python.org

2011/6/11 Sérgio Monteiro Basto <serg...@sapo.pt>:

> OK, after thinking about this: the problem exists because Python wants to be
> smart with ttys

The *anomaly* (not problem) exists because Python has a way of being
told a target encoding. If two parties agree on an encoding, they can
send characters to each other. I had this discussion at work a while
ago; my boss was talking about being "binary-safe" (which really meant
"8-bit safe"), while I was saying that we should support, verify, and
demand properly-formed UTF-8. The main significance is that agreeing
on an encoding means we can change the encoding any time it's
convenient, without having to document that we've changed the data -
because we haven't. I can take the number "twelve thousand three
hundred and forty-five" and render that as a string of decimal digits
as "12345", or as hexadecimal digits as "3039", but I haven't changed
the number. If you know that I'm giving you a string of decimal
digits, and I give you "12345", you will get the same number at the
far side.
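
The number analogy can be checked directly. A tiny sketch with illustrative variable names:

```python
number = 12345

decimal_digits = str(number)       # the number written in base 10
hex_digits = format(number, 'x')   # the same number written in base 16

# Each side only has to know which base the digits are in.
same_number = int(decimal_digits, 10) == int(hex_digits, 16) == number
```

The digit strings differ, yet the value they stand for is identical as long as both sides agree on the base.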

Python has agreed with stdout that it will send it characters encoded
in UTF-8. Having made that agreement, Python and stdout can happily
communicate in characters, not bytes. You don't need to explicitly
encode your characters into bytes - and in fact, this would be a very
bad thing to do, because you don't know _what_ encoding stdout is
using. If it's expecting UTF-16, you'll get a whole lot of rubbish if
you send it UTF-8 - but it'll look fine if you send it Unicode.
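
The effect of agreeing, or failing to agree, on an encoding can be sketched as follows. The UTF-16 receiver is simulated with a plain decode call, and the variable names are illustrative.

```python
TEXT = u'mo\xe7ambique'

# Both sides agree on UTF-16: the characters survive the trip.
agreed = TEXT.encode('utf-16-le').decode('utf-16-le')

# The sender writes UTF-8 but the receiver assumes UTF-16: the bytes are
# paired up into unrelated code units, so the result is rubbish.
mismatched = TEXT.encode('utf-8').decode('utf-16-le', 'replace')
```

Nothing in the second case raises an error here; the text is just silently wrong.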

Chris Angelico

Sérgio Monteiro Basto

Jun 13, 2011, 10:15:25 AM

Ian Kelly wrote:

> If you want your output to behave that way, then all you have to do is
> specify that with an explicit encode step.

ok

>> We could change the default to whatever we want, but without such a
>> "default change" Python should not change its behavior depending on the
>> output. Yes, I prefer strange output on a different platform to decode
>> errors.
>
> Sorry, I disagree. If your program is going to fail, it's better that
> it fail noisily (with an error) than silently (with no notice that
> anything is wrong).

Hi,
OK, a short summary: I got the solution, which is setting
PYTHONIOENCODING=utf-8 in the environment. If that were the default on modern
GNU/Linux, it would have saved me a lot of time.
My practical problem is simple: I write a script that runs in a shell for
testing and logs to a file when used with a configuration.
Everything runs well in a shell and sometimes (later) fails when logging to a
file, with a "UnicodeEncodeError: 'ascii' codec can't encode character
u'\xe7' in position".
So, to make it work in both cases (tty and files), I filled all my code with
string.encode('utf-8') as a workaround, when what I always wanted was
PYTHONIOENCODING=utf-8. Everything I have is in UTF-8: the database is in
UTF-8, I code in UTF-8, my OS is in UTF-8. Over about 3 years of learning
Python I have lost many, many hours trying to understand this problem.
And see, I can send ASCII and UTF-8 to UTF-8 output and never have problems,
but if I send ASCII and UTF-8 to ASCII files I sometimes get encode errors.
So please consider, at least on Linux, making UTF-8 the default encoding
(because it causes fewer problems), or making it clearer that piping to a file
is different from writing to a tty and that the problem is that files default
to ASCII. Or base the default I/O encoding on the LANG environment variable.

Anyway, many thanks for your time and for helping me out.
I don't know how things work in Python 3. Are the defaults UTF-8 in Python 3?

Thanks,
--
Sérgio M. B.

Chris Angelico

Jun 13, 2011, 10:49:06 AM

to pytho...@python.org

2011/6/14 Sérgio Monteiro Basto <serg...@sapo.pt>:

> And see, I can send ASCII and UTF-8 to UTF-8 output and never have problems,
> but if I send ASCII and UTF-8 to ASCII files I sometimes get encode errors.
>

If something fits inside 7-bit ASCII, it is by definition valid UTF-8.
This is not a coincidence.
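
A quick check of that property, as a sketch with illustrative names:

```python
ascii_only = u'mocambique'  # no characters outside 7-bit ASCII

# For pure ASCII text the two encodings produce identical bytes.
same_bytes = ascii_only.encode('ascii') == ascii_only.encode('utf-8')

# Consequently any ASCII byte string can be decoded as UTF-8.
decoded = b'plain ascii'.decode('utf-8')
```

This is why ASCII-only output never triggers the error, and why the failure shows up only once a character such as the c-cedilla appears.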

Those hours you've spent grokking this are not wasted, if you now have
a comprehension of characters vs encodings. More people in the world
need to understand that difference! :)

Chris Angelico