Unicode characters in byte-strings

Steven D'Aprano

Mar 12, 2010, 7:35:57 AM

I know this is wrong, but I'm not sure just how wrong it is, or why.
Using Python 2.x:

>>> s = "éâÄ"
>>> print s
éâÄ
>>> len(s)
6
>>> list(s)
['\xc3', '\xa9', '\xc3', '\xa2', '\xc3', '\x84']

Can somebody explain what happens when I put non-ASCII characters into a
non-unicode string? My guess is that the result will depend on the
current encoding of my terminal.

In this case, my terminal is set to UTF-8. If I change it to ISO 8859-1,
and repeat the above, I get this:

>>> list("éâÄ")
['\xe9', '\xe2', '\xc4']

If I do this:

>>> s = u"éâÄ"
>>> s.encode('utf-8')
'\xc3\xa9\xc3\xa2\xc3\x84'
>>> s.encode('iso8859-1')
'\xe9\xe2\xc4'

which at least explains why the bytes have the values which they do.

Thank you,

--
Steven

Robert Kern

Mar 12, 2010, 10:37:36 AM

to pytho...@python.org

On 2010-03-12 06:35 AM, Steven D'Aprano wrote:
> I know this is wrong, but I'm not sure just how wrong it is, or why.
> Using Python 2.x:
>
> >>> s = "éâÄ"
> >>> print s
> éâÄ
> >>> len(s)
> 6
> >>> list(s)
> ['\xc3', '\xa9', '\xc3', '\xa2', '\xc3', '\x84']
>
> Can somebody explain what happens when I put non-ASCII characters into a
> non-unicode string? My guess is that the result will depend on the
> current encoding of my terminal.

Exactly right.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Martin v. Loewis

Mar 12, 2010, 3:56:42 PM

>> Can somebody explain what happens when I put non-ASCII characters into a
>> non-unicode string? My guess is that the result will depend on the
>> current encoding of my terminal.
>
> Exactly right.

To elaborate on the "what happens" part: the string that gets entered is
typically passed as a byte sequence, from the terminal (application) to
the OS kernel, from the OS kernel to Python's stdin, and from there to
the parser. Python recognizes the string delimiters, but (practically)
leaves the bytes between the delimiters as-is (*), creating a byte
string object with the very same bytes.
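
A minimal sketch of the pass-through described above. It is written for Python 3, where bytes and text are separate types, so it only approximates the 2.x behaviour under discussion. The helper name bytes_from_terminal is hypothetical; it merely simulates a terminal turning typed characters into bytes.

```python
# Python 3 sketch (assumption: the thread itself concerns Python 2.x, where a
# plain "..." literal would simply hold these bytes).
TEXT = "\u00e9\u00e2\u00c4"  # the characters é, â, Ä


def bytes_from_terminal(text, terminal_encoding):
    """Simulate the bytes a terminal sends for `text` (hypothetical helper)."""
    return text.encode(terminal_encoding)


utf8_bytes = bytes_from_terminal(TEXT, "utf-8")
latin1_bytes = bytes_from_terminal(TEXT, "iso8859-1")

print(len(utf8_bytes), utf8_bytes)      # 6 b'\xc3\xa9\xc3\xa2\xc3\x84'
print(len(latin1_bytes), latin1_bytes)  # 3 b'\xe9\xe2\xc4'
```

The same three characters give six bytes or three depending on the terminal setting, which is the difference between the two list() results in the original post.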

The more interesting question is what happens when you do

py> s = u"éâÄ"

Here, Python needs to decode the bytes, according to some encoding.
Usually, it would want to use the source encoding (as given through
Emacs-style -*- coding: ... -*- markers), but at the interactive prompt
there are none. Various Python versions then try different things; what
they should do is determine the terminal encoding and decode the bytes
according to that one.
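
A rough Python 3 illustration of why the choice of encoding matters when decoding. The helper guess_terminal_encoding is hypothetical, and falling back to UTF-8 is an assumption of this sketch, not something the interpreter is documented to do.

```python
import sys


def guess_terminal_encoding(default="utf-8"):
    """Return the encoding reported for stdin, or a default (hypothetical helper)."""
    return getattr(sys.stdin, "encoding", None) or default


raw = b"\xc3\xa9\xc3\xa2\xc3\x84"  # bytes a UTF-8 terminal sends for é, â, Ä

right = raw.decode("utf-8")      # matches the terminal: three characters
wrong = raw.decode("iso8859-1")  # mismatched guess: six unrelated characters

print(len(right))  # 3
print(len(wrong))  # 6
print(guess_terminal_encoding())
```

Decoding with the wrong encoding does not raise an error here, because every byte is valid in ISO 8859-1; it silently produces different text.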

Regards,
Martin

(*) If a source encoding was given, the source is actually recoded to
UTF-8, parsed, and then re-encoded back into the original encoding.
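
The footnote's recode, parse, re-encode sequence can be sketched roughly as follows, in Python 3. The function recode_round_trip is a hypothetical stand-in and not CPython's actual implementation; the parsing step is left as a comment.

```python
def recode_round_trip(source_bytes, source_encoding):
    """Mimic recode -> parse -> re-encode (hypothetical, not CPython's code)."""
    # Step 1: recode the source bytes to UTF-8.
    as_utf8 = source_bytes.decode(source_encoding).encode("utf-8")
    # Step 2: the parser would work on the UTF-8 form here (omitted).
    # Step 3: re-encode the literal's contents back into the source encoding.
    return as_utf8.decode("utf-8").encode(source_encoding)


literal_bytes = b"\xe9\xe2\xc4"  # é, â, Ä as ISO 8859-1 bytes
assert recode_round_trip(literal_bytes, "iso8859-1") == literal_bytes
```

The round trip returns the original bytes, which is why a regular string literal ends up holding the same bytes that were in the source file.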

Michael Rudolf

Mar 12, 2010, 5:55:49 PM

Am 12.03.2010 21:56, schrieb Martin v. Loewis:
> (*) If a source encoding was given, the source is actually recoded to
> UTF-8, parsed, and then re-encoded back into the original encoding.

Why is that? So "unicode"-strings (as in u"string") are not really
unicode-, but utf8-strings?

Need citation plz.

Thx,
Michael

John Bokma

Mar 12, 2010, 6:37:54 PM

Michael Rudolf <spamf...@ch3ka.de> writes:

> Am 12.03.2010 21:56, schrieb Martin v. Loewis:
>> (*) If a source encoding was given, the source is actually recoded to
>> UTF-8, parsed, and then re-encoded back into the original encoding.
>
> Why is that? So "unicode"-strings (as in u"string") are not really
> unicode-, but utf8-strings?

utf8 is a Unicode *encoding*.

--
John Bokma j3b

Hacking & Hiking in Mexico - http://johnbokma.com/
http://castleamber.com/ - Perl & Python Development

Martin v. Loewis

Mar 12, 2010, 6:51:42 PM

to Michael Rudolf

Michael Rudolf wrote:
> Am 12.03.2010 21:56, schrieb Martin v. Loewis:
>> (*) If a source encoding was given, the source is actually recoded to
>> UTF-8, parsed, and then re-encoded back into the original encoding.
>
> Why is that?

Why is what? That string literals get reencoded into the source encoding?

> So "unicode"-strings (as in u"string") are not really
> unicode-, but utf8-strings?

No. Regular string literals in 2.x are written without u"", and are
stored in the source encoding. The procedure above applies to those
regular strings (see where the "*" goes in my original article).

> Need citation plz.

You really want a link to the source code implementing that?

Regards,
Martin