the official way of printing unicode strings

Piotr Sobolewski

Dec 14, 2008, 12:48:19 AM

Hello,

in Python (contrary to Perl, for instance) there is one way to do common
tasks. Could somebody explain to me what the official Python way of
printing unicode strings is?

First I tried it this way:
s = u"Stanisław Lem"
print s.encode('utf-8')
This works, but is very cumbersome.

Then I tried it this way:
s = u"Stanisław Lem"
print s
This breaks when I redirect the output of my program to a file, like
this:
$ example.py > log

Then I tried it this way:
import codecs, sys
sys.stdout = codecs.getwriter("utf-8")(sys.__stdout__)
s = u"Stanisław Lem"
print s
This works but is even more cumbersome.

So, my question is: what is the official, recommended Python way?

Marc 'BlackJack' Rintsch

Dec 14, 2008, 4:25:27 AM

On Sun, 14 Dec 2008 06:48:19 +0100, Piotr Sobolewski wrote:

> Then I tried it this way:
> sys.stdout = codecs.getwriter("utf-8")(sys.__stdout__)
> s = u"Stanisław Lem"
> print s
> This works but is even more cumbersome.
>
> So, my question is: what is the official, recommended Python way?

I'd make that first line:

sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

Why is it even more cumbersome to execute that line *once* instead of
encoding at every ``print`` statement?

Ciao,
Marc 'BlackJack' Rintsch

Piotr Sobolewski

Dec 14, 2008, 5:16:56 AM

Marc 'BlackJack' Rintsch wrote:

> I'd make that first line:
> sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
>
> Why is it even more cumbersome to execute that line *once* instead of
> encoding at every ``print`` statement?

Oh, maybe it's not cumbersome, but a little bit strange - but sure, I can
get used to it.

My main problem is that when I use a language I want to use it the way it
is supposed to be used. Doing so usually avoids many problems,
especially in Python, where there is one official way to do any elementary
task. I just want to know what the normal, official way of printing
unicode strings is. I mean, the question is not "how can I print a unicode
string" but "how do the creators of the language intend me to print a
unicode string". I couldn't find an answer to this question in the docs, so I
hope somebody here knows it.

So, is this _the_ Python way of printing unicode?

J. Clifford Dyer

Dec 14, 2008, 4:55:07 PM

to Marc 'BlackJack' Rintsch, pytho...@python.org

On Sun, 2008-12-14 at 11:16 +0100, Piotr Sobolewski wrote:

> Marc 'BlackJack' Rintsch wrote:
>
> > I'd make that first line:
> > sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
> >
> > Why is it even more cumbersome to execute that line *once* instead of
> > encoding at every ``print`` statement?
>

> Oh, maybe it's not cumbersome, but a little bit strange - but sure, I can
> get used to it.
>
> My main problem is that when I use a language I want to use it the way it
> is supposed to be used. Doing so usually avoids many problems,
> especially in Python, where there is one official way to do any elementary
> task. I just want to know what the normal, official way of printing
> unicode strings is. I mean, the question is not "how can I print a unicode
> string" but "how do the creators of the language intend me to print a
> unicode string". I couldn't find an answer to this question in the docs, so I
> hope somebody here knows it.
>
> So, is this _the_ Python way of printing unicode?
>

The "right way" to print a unicode string is to encode it in the
encoding that is appropriate for your needs (which may or may not be
UTF-8), and then to print it. What this means in terms of your three
examples is that the first and third are correct, and the second is
incorrect. The second one breaks when writing to a file, so don't use
it. Both the first and third behave in the way that I suggest. The
first (print u'foo'.encode('utf-8')) is less cumbersome if you do it
once, but the third method (rebinding sys.stdout using codecs.getwriter)
is less cumbersome if you'll be doing a lot of printing on stdout.

In the end, they are the same method, but one of them introduces another
layer of abstraction. If you'll be using more than two print statements
that need to be bound to a non-ascii encoding, I'd recommend the third,
as it rapidly becomes less cumbersome, the more you print.
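
To illustrate that the two approaches end up producing the same bytes, here
is a minimal sketch. It writes to an in-memory io.BytesIO buffer instead of
the real sys.stdout so the result can be inspected; the buffer and the names
text, buf and writer are assumptions made for the illustration, not part of
the original examples.

```python
import codecs
import io

# Escaped form of u"Stanisław Lem", so the source file stays pure ASCII.
text = u"Stanis\u0142aw Lem"

# First method: encode explicitly at the point of output.
explicit_bytes = text.encode("utf-8")

# Third method: wrap a byte stream once; the writer encodes on every write.
# An in-memory buffer stands in for sys.stdout here.
buf = io.BytesIO()
writer = codecs.getwriter("utf-8")(buf)
writer.write(text)
wrapped_bytes = buf.getvalue()

# Both routes should yield identical bytes.
same = (explicit_bytes == wrapped_bytes)
```

The wrapper only moves the encode call out of sight; the encoding decision
is still made explicitly, just once.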

That said, you should also consider whether you want to rebind
sys.stdout or not. It makes your print statements less verbose, but it
also loses your reference to the basic stdout. What if you want to
print using UTF-8 for a while, but then you need to switch to another
encoding later? If you've used a new name, you can still refer back to
the original sys.stdout.

Right:

import codecs, sys
my_out = codecs.getwriter('utf-8')(sys.stdout)
print >> my_out, u"Stuff"
# 'cp500' is one of Python's EBCDIC codecs; there is no codec named 'ebcdic'
my_out = codecs.getwriter('cp500')(sys.stdout)
print >> my_out, u"Stuff"

Wrong:

sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
print u"Stuff"
sys.stdout = codecs.getwriter('cp500')(sys.stdout)
# Now sys.stdout is getting encoded twice, and you'll probably
# get garbage out. :(
print u"Stuff"

Though I guess this is why the OP is doing:

sys.stdout = codecs.getwriter('utf-8')(sys.__stdout__)

That avoids the problem by not rebinding the original file object.
sys.__stdout__ is still in its original state.

Carry on, then.

Cheers,
Cliff

Ben Finney

Dec 14, 2008, 4:55:44 PM

Piotr Sobolewski <NIE_D...@gazeta.pl> writes:

> in Python (contrary to Perl, for instance) there is one way to do
> common tasks.

More accurately: the ideal is that there should be only one *obvious*
way to do things. Other ways may also exist.

> Could somebody explain to me what the official Python way of
> printing unicode strings is?

Try these:

<URL:http://effbot.org/zone/unicode-objects.htm>
<URL:http://www.reportlab.com/i18n/python_unicode_tutorial.html>
<URL:http://www.amk.ca/python/howto/unicode>

If you want something more official, try the PEP that introduced
Unicode objects, PEP 100:

<URL:http://www.python.org/dev/peps/pep-0100/>.

> First I tried it this way:
> s = u"Stanisław Lem"
> print s.encode('utf-8')
> This works, but is very cumbersome.

Nevertheless, that says everything that needs to be said: You've
defined a Unicode text object, and you've printed it specifying which
character encoding to use.

When dealing with text, the reality is that there is *always* an
encoding at the point where program objects must interface to or from
a device, such as a file, a keyboard, or a display. There is *no*
sensible default encoding, except for the increasingly-inadequate
7-bit ASCII.

<URL:http://www.joelonsoftware.com/articles/Unicode.html>

Since there is no sensible default, Python needs to be explicitly told
at some point which encoding to use.
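
As a rough illustration of why the choice matters, the sketch below encodes
the same text three ways. The choice of ISO-8859-2 as the second encoding is
only an example picked because it can represent Polish letters; nothing in
the original post says which encoding the poster's system uses.

```python
# Escaped form of u"Stanisław Lem", so the source file stays pure ASCII.
text = u"Stanis\u0142aw Lem"

utf8_bytes = text.encode("utf-8")         # the l-with-stroke takes two bytes
latin2_bytes = text.encode("iso-8859-2")  # the same character takes one byte

# 7-bit ASCII cannot represent the character at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The same text object gives different byte sequences, or no byte sequence at
all, depending on the encoding, so something has to pick one.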

> Then I tried it this way:
> s = u"Stanisław Lem"
> print s
> This breaks when I redirect the output of my program to a file,
> like this:
> $ example.py > log

How does it “break”? What behaviour did you expect, and what
behaviour did you get instead?

--
\ “I hope that after I die, people will say of me: ‘That guy sure |
`\ owed me a lot of money’.” —Jack Handey |
_o__) |
Ben Finney

"Martin v. Löwis"

Dec 14, 2008, 6:25:23 PM

> My main problem is that when I use a language I want to use it the way it
> is supposed to be used. Doing so usually avoids many problems,
> especially in Python, where there is one official way to do any elementary
> task. I just want to know what the normal, official way of printing
> unicode strings is. I mean, the question is not "how can I print a unicode
> string" but "how do the creators of the language intend me to print a
> unicode string". I couldn't find an answer to this question in the docs, so I
> hope somebody here knows it.
>
> So, is this _the_ Python way of printing unicode?

The official way to write Unicode strings into a file is not to do that.
Explicit is better than implicit - always explicitly pick an encoding,
and encode the Unicode string to that encoding. Doing so is possible in
any of the forms that you have shown.
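
For the file case, one possible sketch is to open the file with an explicit
encoding, so the encoding is stated once where the file is opened. This uses
io.open (available since Python 2.6) rather than any of the three forms from
the original post, and the temporary file is only a stand-in for a real log
file.

```python
import io
import os
import tempfile

# Escaped form of u"Stanisław Lem", so the source file stays pure ASCII.
text = u"Stanis\u0142aw Lem"

# A throwaway file stands in for the real log file.
fd, path = tempfile.mkstemp(suffix=".log")
os.close(fd)

# io.open encodes the text to the named encoding as it is written.
with io.open(path, "w", encoding="utf-8") as out:
    out.write(text)

# Reading the raw bytes back shows what actually landed on disk.
with open(path, "rb") as raw:
    data = raw.read()
os.remove(path)
```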

Now, Python does not mandate any choice of encoding. The right way to
encode data is in the encoding that readers of your data expect it in.

For printing to the terminal, it is clear what the encoding needs to
be (namely, the one that is used by the terminal), hence Python chooses
that one when printing to the terminal. When printing to a file, the
application needs to make a choice.

If you have no idea what encoding to use, your best choice is the one
returned by locale.getpreferredencoding(). This is the encoding that
the user is most likely to expect.
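
A small sketch of that fallback, assuming a helper named pick_encoding (a
hypothetical name, not a standard library function): use the encoding the
stream reports, and fall back to locale.getpreferredencoding() when it
reports none.

```python
import io
import locale
import sys

def pick_encoding(stream):
    """Return the stream's declared encoding, or the locale's preferred one.

    pick_encoding is a hypothetical helper written for this illustration.
    """
    # Streams without an encoding attribute, or with encoding set to None
    # (as a redirected sys.stdout has in Python 2), fall back to the locale.
    return getattr(stream, "encoding", None) or locale.getpreferredencoding()

# A raw byte buffer declares no encoding, so the locale decides.
fallback_encoding = pick_encoding(io.BytesIO())

# For the real stdout the answer depends on how the program is run.
stdout_encoding = pick_encoding(sys.stdout)
```

The result can then be passed to codecs.getwriter or to an explicit
encode call.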

Regards,
Martin