# As expected
Python 3.0 (r30:67507, Dec 3 2008, 20:14:27) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x = '\u9876'
>>> x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python30\lib\io.py", line 1491, in write
    b = encoder.encode(s)
  File "C:\python30\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u9876' in position 1: character maps to <undefined>
# *NOT* as expected (by me, that is)
Is this the intended outcome?
I also found this a bit surprising, but it seems to be the intended behaviour (on a non-Unicode console):
http://docs.python.org/3.0/whatsnew/3.0.html
"PEP 3138: The repr() of a string no longer escapes non-ASCII
characters. It still escapes control characters and code points with
non-printable status in the Unicode standard, however."
I get the same error in the Windows cmd console (IDLE prints the respective glyph correctly).
To get the old behaviour of repr, one can use ascii, I suppose.
Python 3.0 (r30:67507, Dec 3 2008, 20:14:27) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> repr('\u9876')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python30\lib\io.py", line 1491, in write
    b = encoder.encode(s)
  File "C:\Python30\lib\encodings\cp852.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u9876' in position 2: character maps to <undefined>
>>> '\u9876'.encode("unicode-escape")
b'\\u9876'
>>> ascii('\u9876')
"'\\u9876'"
>>>
When Python tries to display the character, it must first encode it
because IO is done in bytes, not Unicode codepoints. When it tries to
encode it in CP850 (apparently your system's default encoding judging
by the traceback), it unsurprisingly fails (CP850 is an old Western European codec, which obviously can't encode an Asian character like the
one in question). To signal that failure, it raises an exception, thus
the error you see.
This is intended behavior. Either change your default system/terminal
encoding to one that can handle such characters or explicitly encode
the string and use one of the provided options for dealing with
unencodable characters.
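For the second option, something like this works (just a sketch; cp850 is assumed here to match the traceback above, and 'backslashreplace' is one of the error handlers available when encoding):

>>> x = '\u9876'
>>> x.encode('cp850', errors='backslashreplace')
b'\\u9876'

For the first option, setting the PYTHONIOENCODING environment variable to an encoding the terminal can actually display, before starting Python, changes the encoding that sys.stdout uses.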
Also, please don't call it a "crash" as that's very misleading. The Python interpreter didn't dump core; an exception was merely thrown. There's a world of difference.
Cheers,
Chris
--
Follow the path of the Iguana...
http://rebertia.com
I see. That means that the behaviour in Python 1.6 to 2.6 (i.e. encoding the text using the repr() function (as then defined)) was not intended behaviour?
> Either change your default system/terminal
> encoding to one that can handle such characters or explicitly encode
> the string and use one of the provided options for dealing with
> unencodable characters.
You are missing the point. I don't care about the visual
representation. What I care about is an unambiguous representation
that can be used when communicating about problems across cultures/
networks/mail-clients/news-readers ... the sort of problems that are
initially reported as "I got this UnicodeEncodeError" and accompanied
by no data or garbled data.
> Also, please don't call it a "crash" as that's very misleading. The
> Python interpreter didn't dump core; an exception was merely thrown.
"spew nonsense on the screen and then stop" is about as useful and as
astonishing as "dump core".
core? You mean like ferrite doughnuts on a wire trellis? I thought
that went out of fashion before cp850 was invented :-)
Sure. This behavior has not changed. It still uses repr().
Of course, the string type has changed in 3.0, and now uses a different
definition of repr.
Regards,
Martin
"Sure" as in "sure, it was not intended behaviour"?
> This behavior has not changed. It still uses repr().
>
> Of course, the string type has changed in 3.0, and now uses a different
> definition of repr.
So was the above-reported non-crash consequence of the change of
definition of repr intended?
Python defaults to strict encoding, which means it raises errors on unencodable characters, but this is NOT the only behavior; you can change the behavior to "replace using a placeholder character" or "ignore any errors and discard unencodable characters":
| errors can be 'strict', 'replace' or 'ignore' and defaults
| to 'strict'.
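For illustration, here are the three modes on the same character (cp850 assumed, as in the traceback at the start of the thread):

>>> x = '\u9876'
>>> x.encode('cp850', errors='replace')
b'?'
>>> x.encode('cp850', errors='ignore')
b''
>>> x.encode('cp850', errors='strict')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'charmap' codec can't encode character '\u9876' in position 0: character maps to <undefined>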
If you don't like the default behavior or you want another kind of
behavior, you're welcome to file a bug report at http://bugs.python.org
>> Also, please don't call it a "crash" as that's very misleading. The
>> Python interpreter didn't dump core; an exception was merely thrown.
>
> "spew nonsense on the screen and then stop" is about as useful and as
> astonishing as "dump core".
That's an interesting definition of crash. That's just like saying: "C has crashed because I made a bug in my program". In this context, it is your program that crashes, not Python nor C; it is misleading to say so.
It would be Python's crash if:
1. Python segfaulted
2. the Python interpreter exited before there was an instruction to exit (either implicit, e.g. falling off the last line of the script, or explicit, e.g. sys.exit() or raise SystemExit)
3. Python dumped core
4. Python did something that is not documented
It is intended. I ran into the same issue and voiced a similar complaint, but Martin pointed out that users of other character sets wanted to see the characters they were using. If you were in China, would you rather see:
IDLE 2.6.1
>>> x=u'\u9876'
>>> x
u'\u9876'
>>> x=u'顶'
>>> x
u'\u9876'
or:
IDLE 3.0
>>> x='\u9876'
>>> x
'顶'
>>> x='顶'
>>> x
'顶'
On Asian consoles that support the required characters, 3.0 makes much more
sense. Your cp850 console or my cp437 console can't support the characters,
so we get the encoding error. I'm sure our Asian colleagues love it, but
our encoding-challenged consoles now need:
>>> x='\u9876'
>>> print(ascii(x))
'\u9876'
It's not very convenient, and I've found it is easier to use IDLE (or any
other IDE supporting UTF-8) rather than the console when dealing with
characters outside what the console supports.
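One possible convenience is a small wrapper that falls back to ascii() only when the console's codec can't cope. Just a sketch, and 'show' is a made-up name, not anything in the stdlib:

--------------------
import sys

def show(obj):
    # Print repr(obj), falling back to ascii(obj) when the console
    # encoding cannot represent it.
    text = repr(obj)
    try:
        text.encode(sys.stdout.encoding or 'ascii')
    except UnicodeEncodeError:
        text = ascii(obj)
    print(text)

show('\u9876')   # '\u9876' on cp437/cp850, the glyph on a UTF-8 console
--------------------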
-Mark
It was intended behavior, and still is in 3.0.
>> This behavior has not changed. It still uses repr().
>>
>> Of course, the string type has changed in 3.0, and now uses a different
>> definition of repr.
>
> So was the above-reported non-crash consequence of the change of
> definition of repr intended?
Yes. If you want a display that is guaranteed to work on your terminal,
use the ascii() builtin function.
py> x = '\u9876'
py> ascii(x)
"'\\u9876'"
py> print(ascii(x))
'\u9876'
Regards,
Martin
But shouldn't the production of an object's representation via repr be
a "safe" operation? That is, the operation should always produce a
result, regardless of environmental factors like the locale or
terminal's encoding support. If John were printing the object, it
would be a different matter, but he apparently just wants to see a
sequence of characters which represents the object.
Paul
It's a trade-off. It should also be legible.
Regards,
Martin
It seems to me to be generally accepted usage, when an application stops due to an unhandled error, to say that it crashed.
Michael Foord
http://www.ironpythoninaction.com/
it == application
Yes.
--------------------
#!/usr/bin/env python

from traceback import format_exc

def foo():
    print "Hello World!"

def main():
    try:
        foo()
    except Exception, error:
        print "ERROR: %s" % error
        print format_exc()

if __name__ == "__main__":
    main()
--------------------
--JamesMills
Right. I can understand that unlike Python 2.x, a representation of a
string in Python 3.x (whose equivalent in Python 2.x would be a
Unicode object) must also be a string (as opposed to a byte string in
Python 2.x), and that no decision can be taken to choose "safe"
representations for characters which cannot be displayed in a
terminal. As examples, for Python 2.x...
>>> u"æøå"
u'\xe6\xf8\xe5'
>>> repr(u"æøå")
"u'\\xe6\\xf8\\xe5'"
...and for Python 3.x...
>>> "æøå"
'æøå'
>>> repr("æøå")
"'æøå'"
...with an ISO-8859-15 terminal. Python 2.x could conceivably be
smarter about encoding representations, but chooses not to be since
the smarter behaviour would need to involve knowing that an "output
situation" was imminent. Python 3.x, on the other hand, leaves issues
of encoding to the generic I/O pipeline, causing the described
problem.
Of course, repr will always work if its output does not get sent to
sys.stdout or an insufficiently capable output stream, but I suppose
usage of repr for debugging purposes, where one may wish to inspect
character values, must be superseded by usage of the ascii function,
as you point out. It's unfortunate that the default behaviour isn't
optimal at the interactive prompt for some configurations, though.
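For what it's worth, a quick check of that distinction on an encoding-challenged console like the cp850 one at the start of the thread (building the representation never fails; only writing it out does):

>>> x = '\u9876'
>>> r = repr(x)        # always succeeds; nothing is encoded yet
>>> print(ascii(x))    # safe on any console encoding
'\u9876'
>>> print(r)           # only this step has to encode for the console
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'charmap' codec can't encode character '\u9876' in position 1: character maps to <undefined>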
Paul
As I said, it's a trade-off. The alternative, if it was the default,
wouldn't be optimal at the interactive prompt for some other
configurations.
In particular, users of non-Latin scripts have been complaining that
they can't read their strings - hence the change, which now actually
allows these users to read the text that is stored in the strings.
The question really is why John Machin has a string that contains
'\u9876' (which is a Chinese character), yet his terminal is incapable
of displaying that character. More likely, people will typically
encounter only characters in their data that their terminals are
also capable of displaying (or else the terminal would be pretty
useless).
In the long run, it might be useful to have an error handler on
sys.stdout in interactive mode, which escapes characters that
cannot be encoded (perhaps in a different color, if the terminal
supports colors, to make it clear that it is an escape sequence).
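A rough sketch of that idea (minus the colour), using a custom error handler registered under a made-up name and a rewrapped sys.stdout; this only illustrates the suggestion, it is not something Python does by default:

--------------------
import codecs
import io
import sys

def escape_unencodable(exc):
    # Replace each character the console codec cannot handle with its
    # \uXXXX escape and continue encoding after the offending span.
    if isinstance(exc, UnicodeEncodeError):
        escaped = ''.join('\\u%04x' % ord(ch)
                          for ch in exc.object[exc.start:exc.end])
        return (escaped, exc.end)
    raise exc

codecs.register_error('escape-unencodable', escape_unencodable)

# Rewrap stdout so unencodable characters are escaped instead of
# aborting the write with a UnicodeEncodeError.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,
                              encoding=sys.stdout.encoding,
                              errors='escape-unencodable',
                              line_buffering=True)

print('\u9876')   # writes \u9876 on a cp850 console instead of failing
--------------------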
Regards,
Martin