preferred way to set encoding for print

_wolf

Sep 15, 2009, 9:28:06 AM

hi folks,

i am doing my first steps in the wonderful world of python 3.

some things are good.
some things have to be relearned.
some things drive me crazy.

sadly, i'm working on a windows box. which, in germany, entails that
python thinks it to be a good idea to take cp1252 as the default
encoding.

so just coz i got my box in germany means i can never print out a
chinese character? say what?

i have no troubles with people configuring their python installation
to use any encoding in the world, but wouldn't it have been less of a
surprise to just assume utf-8 for any file in/output? after all, it is
already the default for python source files as far as i understand.
someone might think they're clever to sniff into the system and make
the somewhat educated guess that this dude's using cp1252 for his
files. but they would be wrong.

so: how can i tell python, in a configuration or using a setting in
sitecustomize.py, or similar, to use utf-8 as a default encoding?
there used to be a trick to say
`reload(sys);sys.setdefaultencoding('utf-8')`, but that has no effect
in py3.0.1. also, i cannot set `sys.stdout.encoding`; is there a way
to re-open that stream with a different encoding?

in all, i believe it is quite unsettling to me to see that, on my py3
installation,

sys.getdefaultencoding() == 'utf-8'
sys.stdout.encoding == 'cp1252'
locale.getlocale() == (None, None)
locale.getdefaultlocale() == ('de_DE', 'cp1252')

which to me makes as much sense as a blackcurrant tart thrown into
space. worse,

locale.setlocale( locale.LC_ALL, locale.getdefaultlocale() )

results in

locale.Error: unsupported locale setting

this bloody thing doesn't accept its *own* output. attempts to feed
that locale beast with anything but the empty string or 'C' were all
doomed. it would take a very patient and eloquent person to explain
that in a credible fashion to me. my word for this is, 'broken'.

i would very much like to rid myself of these considerations. just say
it's all utf-8, wash'n'go.

my attempts at changing python's mind using the locale module have
failed so far. otherwise, i for one don't want to touch that locale
thing with a very long pole. as far as i can see, it does not work as
documented. the platform dependencies are also a clear OFF LIMITS sign
to me.

any suggestions?

cheers,

~flow

Mark Tolonen

Sep 16, 2009, 1:16:07 AM

to pytho...@python.org

"_wolf" <wolfga...@gmail.com> wrote in message
news:22991c72-d00f-45cd...@k26g2000vbp.googlegroups.com...

What specifically do you want to do? I work with Chinese all the time on a
U.S. Windows system. Do you want to print Chinese characters in a console
window? In a Python IDE? FYI, I don't use the locale module for much at
all.

I can't type or print Chinese to a console window unless I change Control
Panel, Regional and Language Options, Advanced Tab, Language for non-Unicode
Programs to a Chinese selection (and reboot). Then the default
sys.stdout.encoding is something like cp936.

The Pythonwin IDE in the latest version of pywin32, however, supports UTF-8
in its interactive window and displays Chinese fine.

Setting PYTHONIOENCODING overrides the encoding used for stdin/stdout/stderr
(See the Python help for details), but if your terminal doesn't support the
encoding that won't help.
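
A minimal sketch of the PYTHONIOENCODING effect, assuming a Python 3.7+ interpreter. The helper name `run_child` is made up for this illustration; it starts a child interpreter with the variable set and returns whatever the child wrote to its (piped) stdout. The optional `:errorhandler` suffix is the form described in the Python documentation for this variable.

```python
import os
import subprocess
import sys


def run_child(io_encoding, code):
    """Run `code` in a child interpreter with PYTHONIOENCODING set.

    `run_child` is a hypothetical helper for this sketch, not a library API.
    """
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        check=True,
    )
    # The outputs in this sketch are pure ASCII, so decoding is unambiguous.
    return result.stdout.decode("ascii").strip()


# The child reports the encoding that was forced on its stdout.
print(run_child("utf-8", "import sys; print(sys.stdout.encoding)"))

# With an error handler, an unencodable character no longer raises.
print(run_child("cp437:xmlcharrefreplace", "print('Hello \\u5000')"))
```

This only changes what bytes Python emits; whether the terminal can render them is a separate question, as noted above.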

Let me know what you're trying to do.

-Mark

~flow

Sep 16, 2009, 3:39:55 PM

On Sep 16, 7:16 am, "Mark Tolonen" <metolone+gm...@gmail.com> wrote:
> Setting PYTHONIOENCODING overrides the encoding used for stdin/stdout/stderr
> (See the Python help for details), but if your terminal doesn't support the
> encoding that won't help.

thx for these two tips. of course, it was a bit misleading of me to
complain that a cp850 terminal can't display chinese characters from
python: it cannot do it at all, of course.

i've gone on to experiment. what i do not want is python to stop
execution when an encoding error occurs on printing and perhaps
logging. so far, i used to do this by convincing python to use utf-8
in any and all cases, and then live with the amount of garbage that
appears on screen when using cp850 and cp1252 terminals.

what has changed in python is that they now somehow find out about the
terminal's encoding, and then put that encoding into place and defend
it with teeth and claws. it is simply not easy to take control of that
setting.

this is in itself unfortunate; i believe that users should have a
right to determine what to do in case of stdout encoding problems.
these are a little different from i-wrote-to-that-file-and-boom
experiences. *there* the encoding exception is fully warranted, and
could be easily fixed by allowing a less-than-strict encoding mode.
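
For illustration, the "less-than-strict" modes mentioned here are the standard codec error handlers. A small sketch (the sample string is made up; cp1252 is the encoding from this thread) comparing what each handler produces for a character the target encoding lacks:

```python
# \u00e9 (e-acute) exists in cp1252; \u5000 (a CJK character) does not.
text = "caf\u00e9 \u5000"

# Each lenient handler substitutes something for the unencodable character.
for errors in ("replace", "backslashreplace", "xmlcharrefreplace", "ignore"):
    print(errors, text.encode("cp1252", errors))

# The default handler is "strict", which raises instead of substituting.
try:
    text.encode("cp1252")
except UnicodeEncodeError as exc:
    print("strict raised:", exc.reason)
```

The same handler names can be passed as the `errors` argument of `open()`, so a file write can be made lenient in the same way.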

but print is different, and of all situations where encoding errors
can occur, this is the hardest to take hold of. and much more so in
python3 it seems than in python2.

printing to the screen is often purely meta-informative in nature, a
side-effect e.g. of a webserver really doing web pages. i don't want
to bring my entire system down just because some output into some
terminal in the back orifice produced some amount of garbage. maybe
only a single chinese character amongst thousands of done this done
that red tape.

i think web browsers are a good example here. i don't know whether it
was a good idea to let clients reassemble broken web pages in an order
as they see fit, but the policy to just output broken encoding
character instances instead of terminating the browser process with a
lengthy stacktrace was probably somehow good for the popularity of
the web as we know it.

my current patch looks like this:

import sys

class Stdout_writer_with_ncrs( object ):

    def write( self, p ):
        """See to it that all write encodings are done using numerical
        character references (NCRs), which circumvents Python’s default
        behavior of raising an exception whenever it encounters an
        unrepresentable character while printing."""
        enc = sys.__stdout__.encoding
        # make sure we have text, not some other object
        p = p if isinstance( p, str ) else str( p )
        # replace unencodable characters by NCRs, then turn back into text
        p = p.encode( enc, 'xmlcharrefreplace' ).decode( enc )
        sys.__stdout__.write( p )

sys.stdout = Stdout_writer_with_ncrs()

this method picks up anything to be printed, makes sure it is a text,
and then encodes it to the terminal encoding using numerical character
references (NCRs), then decodes it again since the underlying wrapper
class wants to do encodings itself and refuses bytes in place of
strings to be sent (again, this is not nice: an array of byte values
sent to the print method is a clear request to send exactly those
bytes, verbatim, one by one, to the terminal. no mucking around with
my bytes, pls! maybe i can implement that in the code above, too.)

of course, this simplistic scaffold will break if anyone uses
sys.stdout for anything other than calling sys.stdout.write(), but so
far it has worked fine despite being a defective, tiny shim. maybe
inheriting from sys.stdout.__class__ would help.
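
One way to make such a shim less brittle, sketched here as an assumption rather than as what the poster ended up using, is to forward every attribute the wrapper does not define to the wrapped stream, so that calls such as `flush()` or `isatty()` keep working. The class name `NcrStdout` is invented for this example, and it is demonstrated on an in-memory cp1252 stream instead of the real console.

```python
import io


class NcrStdout:
    """Hypothetical wrapper: writes text with unencodable characters
    replaced by NCRs and forwards everything else to the wrapped stream."""

    def __init__(self, stream):
        self._stream = stream

    def write(self, text):
        enc = self._stream.encoding or "ascii"
        # Encode leniently, then decode again because the stream wants str.
        safe = text.encode(enc, "xmlcharrefreplace").decode(enc)
        return self._stream.write(safe)

    def __getattr__(self, name):
        # Only reached for attributes not defined on the wrapper itself
        # (flush, encoding, isatty, ...), which go to the real stream.
        return getattr(self._stream, name)


# Demonstration on an in-memory stream; newline="\n" disables newline
# translation so the captured bytes are the same on every platform.
raw = io.BytesIO()
stream = NcrStdout(io.TextIOWrapper(raw, encoding="cp1252", newline="\n"))
print("Hello \u5000", file=stream)
stream.flush()
print(raw.getvalue())
```

To use it for real output one would assign `sys.stdout = NcrStdout(sys.__stdout__)`, under the same caveats as the original shim.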

Terry Reedy

Sep 16, 2009, 6:04:06 PM

to pytho...@python.org

Mark Tolonen wrote:

>> ('utf-8')`, but that has no effect in py3.0.1. also, i cannot set

Even if not relevant to your immediate problem, if you can, upgrade to
3.1, with its many important bug fixes.

Mark Tolonen

Sep 17, 2009, 12:50:03 AM

to pytho...@python.org

"~flow" <wolfga...@gmail.com> wrote in message
news:643ca91c-b81c-483c...@r33g2000vbp.googlegroups.com...

>On Sep 16, 7:16 am, "Mark Tolonen" <metolone+gm...@gmail.com> wrote:
>> Setting PYTHONIOENCODING overrides the encoding used for stdin/stdout/stderr
>> (See the Python help for details), but if your terminal doesn't support the
>> encoding that won't help.

[snip]

>what has changed in python is that they now somehow find out about the
>terminal's encoding, and then put that encoding into place and defend
>it with teeth and claws. it is simply not easy to take control of that
>setting.

A couple more tips. PYTHONIOENCODING takes an optional error handler:

C:\>set PYTHONIOENCODING=cp437:xmlcharrefreplace
C:\>python
Python 3.1.1 (r311:74483, Aug 17 2009, 17:02:12) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print('Hello \u5000\u5001')
Hello &#20480;&#20481;

You can also write directly to stdout with byte strings (Note: my terminal
doesn't support UTF-8, but no error):

>>> import sys
>>> sys.stdout.buffer.write('\u5000'.encode('utf8'))
σÇÇ3

-Mark

Miles Kaufmann

Sep 17, 2009, 1:19:57 AM

to pytho...@python.org

On Sep 16, 2009, at 12:39 PM, ~flow wrote:
>>> so: how can i tell python, in a configuration or using a setting in
>>> sitecustomize.py, or similar, to use utf-8 as a default encoding?
>>
>

> [snip Stdout_writer_with_ncrs solution]

This should work:

import io
import sys

sys.stdout = io.TextIOWrapper(sys.stdout.buffer,
                              encoding=sys.stdout.encoding,
                              errors='xmlcharrefreplace')

http://mail.python.org/pipermail/python-list/2009-August/725100.html
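
On Python 3.7 and later (released well after this thread), `io.TextIOWrapper` also has a `reconfigure()` method, so the error handler can be changed in place without building a new wrapper. A minimal sketch, shown on an in-memory cp1252 stream so that it behaves the same regardless of the console; on a real console the equivalent call would be `sys.stdout.reconfigure(errors='xmlcharrefreplace')`, assuming `sys.stdout` is still a `TextIOWrapper`:

```python
import io

# Stand-in for a console stream with a limited encoding.
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="cp1252", errors="strict")

# Switch the error handler in place; the encoding stays cp1252.
stream.reconfigure(errors="xmlcharrefreplace")

stream.write("Hello \u5000")
stream.flush()
print(raw.getvalue())
```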

-Miles