Differences between Python 2.2 and Python 2.3

Enrique

Dec 3, 2003, 12:15:15 PM
to Python-List (E-mail)
Hi all.

Running a script that works fine in Python 2.2, in Python 2.3 I find something
like:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position
37: ordinal not in range(128)

Usually major versions of Python were courteous to the previous versions...

What must I do to run under Python 2.3 what I'm running under 2.2?

I write you from Spain.

Thanks.

Enrique

**CONFIDENTIALITY NOTICE**
The information contained in this message and any attached files is private and confidential and is intended solely for the addressee. If you have received this information in error, please destroy it immediately. Any opinion or point of view contained in this message is the sender's and does not necessarily represent the views of GRUPO XEROX.

Fredrik Lundh

Dec 3, 2003, 1:35:52 PM
to pytho...@python.org
Enrique wrote:

> Running a script that works fine in Python 2.2, in Python 2.3 I find something
> like:
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position
> 37: ordinal not in range(128)
>
> Usually major versions of Python were courteous to the previous versions...

0xED has never been a valid 7-bit ASCII character.

you've probably used a modified 2.2 interpreter; most likely, someone
has hacked the site.py or sitecustomize.py files to make Python use a
non-standard default encoding.

you should either fix your program, or figure out how 2.2 was modified,
and modify 2.3 in the same way.
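(a quick way to see whether a non-standard default encoding is in play:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

a stock install reports 'ascii'; anything else means site.py or
sitecustomize.py changed it.)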

</F>


Fredrik Lundh

Dec 3, 2003, 2:46:17 PM
to pytho...@python.org
Mike C. Fletcher wrote:

> >0xED has never been a valid 7-bit ASCII character.
>

> Sure, but Python used to accept 8-bit characters in the platform's
> default encoding as part of string characters...

all 2.3 installs I have give me a DeprecationWarning when I do that,
not a UnicodeDecodeError.
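for reference, the 2.3 warning looks roughly like this (wording approximate,
file name made up), for a source file containing a raw non-ASCII byte and no
coding declaration:

DeprecationWarning: Non-ASCII character '\xed' in file test.py on line 1,
but no encoding declared; see
http://www.python.org/peps/pep-0263.html for details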

what version are you using?

</F>


Mike C. Fletcher

Dec 3, 2003, 2:02:42 PM
to pytho...@python.org
Fredrik Lundh wrote:

>Enrique wrote:
>
>
>
>>Running a script that works fine in Python 2.2, in Python 2.3 I find something
>>like:
>>
>>UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position
>>37: ordinal not in range(128)
>>
>>Usually major versions of Python were courteous to the previous versions...
>>
>>
>

>0xED has never been a valid 7-bit ASCII character.
>
>
Sure, but Python used to accept 8-bit characters in the platform's
default encoding as part of string characters...

Most likely Enrique has a \xED somewhere in a string literal in his code
that is intended to be an í (i-acute). That would have worked fine in
all versions of Python before 2.3, but started failing in 2.3 due to the
decision that all string literals would be converted to unicode and back
and that the default encoding for such conversions would be ASCII
(whereas previously it would most closely have been approximated by
"platform's local 256-char encoding").

PythonWin 2.2.3 (#42, May 30 2003, 18:12:08) [MSC 32 bit (Intel)] on win32.
Portions Copyright 1994-2001 Mark Hammond (mham...@skippinet.com.au) -
see 'Help/About PythonWin' for further copyright information.
>>> print '23\xED'
23í

So, Enrique, what you're probably looking for is this:
# -*- coding: ISO-8859-1 -*-

for latin-1, or

# -*- coding: cp1252 -*-

for Windows code-page.

You add these "magic" comments to the top of your Python source files to
tell the interpreter that you're using a particular encoding for your
Python string literals. Even if you're just using string literals to
store binary data, you'll still need to use a dummy encoding, such as
latin-1.
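For instance, a made-up module saved as Latin-1:

# before -- no declaration; 2.3 complains about the raw í (0xED) byte
saludo = 'Buenos días'

# after -- the same file with the declaration added as its first line
# -*- coding: ISO-8859-1 -*-
saludo = 'Buenos días'   # accepted, and still an ordinary 8-bit string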

Yes it's a bit of a pain, but the decision was made, so we have to deal
with it :) . I'm assuming that somewhere in the "new in 2.3" pages is a
huge warning to the effect that this breaks lots of old code, but
Enrique can be forgiven for missing it, as I think I managed to miss it
too, all I found was this:

*Encoding declarations* - you can put a comment of the form "# -*-
coding: <encodingname> -*-" in the first or second line of a Python
source file to indicate the encoding (e.g. utf-8). (PEP 263
<http://www.python.org/peps/pep-0263.html> phase 1)

Which doesn't actually mention the breakage of code that results. True,
theoretically the code was never valid, but *lots* of people used 8-bit
encodings quite happily with earlier versions and do find their code
breaking in 2.3 because of this.

Have fun,
Mike

_______________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://members.rogers.com/mcfletch/


Peter Hansen

Dec 3, 2003, 3:16:46 PM
"Mike C. Fletcher" wrote:
>
> Fredrik Lundh wrote:
>
> >Enrique wrote:
> >>Usually major versions of Python were courteous to the previous versions...
> >
> >0xED has never been a valid 7-bit ASCII character.
> >
> Sure, but Python used to accept 8-bit characters in the platform's
> default encoding as part of string characters...
>
> Most likely Enrique has a \xED somewhere in a string literal in his code
> that is intended to be an í (i-acute). That would have worked fine in
> all versions of Python before 2.3, but started failing in 2.3 due to the
> decision that all string literals would be converted to unicode and back
> and that the default encoding for such conversions would be ASCII
> (whereas previously it would most closely have been approximated by
> "platform's local 256-char encoding").
[snip]

> Which doesn't actually mention the breakage of code that results. True,
> theoretically the code was never valid, but *lots* of people used 8-bit
> encodings quite happily with earlier versions and do find their code
> breaking in 2.3 because of this.

Wait a sec... are you telling me that my code, which has strings containing
binary data (which I believe has *always* been permitted), and which from
time to time might, say, produce an error traceback containing the content
from one such string and write it to a log file, then continue processing
safely, will now fail with an ugly crash because I haven't changed it to
specify a default encoding? (!!!)

I've been watching this "value XX with high bit set was never a valid 7-bit
ASCII" discussion with only one eye for quite some time now, somewhat curious
why so many people were having troubles. I assume it was merely that they
were using *names* that contained non-ASCII characters, in their source code.

Are you saying that this change is actually breaking code that happens to
have these perfectly valid binary strings stored in string constants?

I'm very unimpressed with this decision if that's the case.

-Peter

Mike C. Fletcher

Dec 3, 2003, 3:34:02 PM
Peter Hansen wrote:

>"Mike C. Fletcher" wrote:
>
>
...

>>Which doesn't actually mention the breakage of code that results. True,
>>theoretically the code was never valid, but *lots* of people used 8-bit
>>encodings quite happily with earlier versions and do find their code
>>breaking in 2.3 because of this.
>>
>>
>
>Wait a sec... are you telling me that my code, which has strings containing
>binary data (which I believe has *always* been permitted), and which from
>time to time might, say, produce an error traceback containing the content
>from one such string and write it to a log file, then continue processing
>safely, will now fail with an ugly crash because I haven't changed it to
>specify a default encoding? (!!!)
>
>

Yes, though as Fredrik points out, not just yet. For now you'll get a
DeprecationWarning with 2.3.

>I've been watching this "value XX with high bit set was never a valid 7-bit
>ASCII" discussion with only one eye for quite some time now, somewhat curious
>why so many people were having troubles. I assume it was merely that they
>were using *names* that contained non-ASCII characters, in their source code.
>
>Are you saying that this change is actually breaking code that happens to
>have these perfectly valid binary strings stored in string constants?
>
>

AFAIK, that's the plan. IIRC, rationale was that there would be some
other type for 8-bit data, while all "normal" strings would become
Unicode strings. Of course, I've been known to catch fragments of
conversations and read too much into them. For all I know, it may only
be unicode literals that will be affected. Though I know some of my
users were reporting DeprecationWarnings from resourcepackage, which
doesn't do any Unicode at all (just stuffs binary data into string
constants).

>I'm very unimpressed with this decision if that's the case.
>
>

Doesn't make me ecstatic, either, as I like the simple 8-bit-clean
string type. But maybe we'll luck out and it will turn out that I'm all
wet on this one :) .

Enjoy,

Mike C. Fletcher

Dec 3, 2003, 3:22:07 PM
to pytho...@python.org
Fredrik Lundh wrote:

>Mike C. Fletcher wrote:
>
>
>
>>>0xED has never been a valid 7-bit ASCII character.
>>>
>>>
>>Sure, but Python used to accept 8-bit characters in the platform's
>>default encoding as part of string characters...
>>
>>
>

>all 2.3 installs I have give me a DeprecationWarning when I do that,
>not a UnicodeDecodeError.
>
>what version are you using?
>
>

Hmm, obviously none of 2.3. Oops! Sorry about that. I had to change
resourcepackage to support this change (user error reports) and hadn't
cottoned on to the fact that it's just a warning.

I haven't upgraded to 2.3 for my general development, so I didn't
realise this wasn't yet a hard error. I had thought there was a way to
get an ASCII decoding error from the above... hmm. Guess not...
strange... I've *seen* those errors show up when testing code for a new
version (and I'd thought it was a new version of Python).

Enrique, by any chance are you working with a library (such as wxPython)
where the version under 2.2 is a non-unicode build and the version under
2.3 is a unicode build?

Fredrik Lundh

Dec 3, 2003, 4:36:47 PM
to pytho...@python.org
Peter Hansen wrote:

> Wait a sec... are you telling me that my code, which has strings containing
> binary data (which I believe has *always* been permitted), and which from
> time to time might, say, produce an error traceback containing the content
> from one such string and write it to a log file, then continue processing
> safely, will now fail with an ugly crash because I haven't changed it to
> specify a default encoding? (!!!)

8-bit strings can still contain 8-bit data.


The only way you'll get a warning is if you use non-ASCII characters in
source code, without specifying a source code encoding. The warning
is issued by the compiler, not the runtime.

(if you've used non-ASCII characters to embed *binary* string literals
in your program, you deserve to be punished).


The only way you'll get the "crash" the original poster had, is if you're
trying to print Unicode strings containing non-ASCII data to an ASCII-
only output stream.

(if you print 8-bit strings to an ASCII stream, Python assumes you know
what you're doing).
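a small illustration of that last case (interactive session, output
stream with no declared encoding):

>>> u = u'23\xed'
>>> print u
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 2: ordinal not in range(128)
>>> print u.encode('latin-1')      # pick an encoding explicitly and it works
23í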

</F>


Peter Hansen

Dec 3, 2003, 10:46:14 PM
Fredrik Lundh wrote:
>
> Peter Hansen wrote:
>
> > Wait a sec... are you telling me that my code, which has strings containing
> > binary data (which I believe has *always* been permitted), and which from
> > time to time might, say, produce an error traceback containing the content
> > from one such string and write it to a log file, then continue processing
> > safely, will now fail with an ugly crash because I haven't changed it to
> > specify a default encoding? (!!!)
>
> 8-bit strings can still contain 8-bit data.
>
> The only way you'll get a warning is if you use non-ASCII characters in
> source code, without specifying a source code encoding. The warning
> is issued by the compiler, not the runtime.
>
> (if you've used non-ASCII characters to embed *binary* string literals
> in your program, you deserve to be punished).

Ah, so using the proper escapes (\xnn and \nnn) will of course still work.

> The only way you'll get the "crash" the original poster had, is if you're
> trying to print Unicode strings containing non-ASCII data to an ASCII-
> only output stream.
>
> (if you print 8-bit strings to an ASCII stream, Python assumes you know
> what you're doing).

Meaning I won't necessarily see anything meaningful or printable, but
since I knew that and wanted that behaviour, I'll still get it. I can
send 8-bit binary strings to a file or stdout, but not *Unicode* strings
unless they're properly encoded.

In that case, I did misunderstand and I don't see a problem. Thanks, /F

-Peter

Martin v. Löwis

Dec 4, 2003, 2:23:54 AM
Peter Hansen <pe...@engcorp.com> writes:

> Wait a sec... are you telling me that my code, which has strings containing
> binary data (which I believe has *always* been permitted)

From a language specification point of view, it was never permitted.
You should use escape codes for binary data in source code.
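For example, binary data spelled with escapes needs no declaration at all,
because the source file itself stays pure ASCII:

header = '\x89PNG\r\n\x1a\n'   # the 8-byte PNG file signature, as a byte string
print len(header)              # 8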

Regards,
Martin

Martin v. Löwis

Dec 4, 2003, 2:28:16 AM
"Mike C. Fletcher" <mcfl...@rogers.com> writes:

> AFAIK, that's the plan. IIRC, rationale was that there would be some
> other type for 8-bit data, while all "normal" strings would become
> Unicode strings.

No. <type 'str'> will remain a byte string type for any foreseeable
future. The only change that is likely to happen is this: To denote
bytes > 128 in source code, you will need to use escape codes.

A change that might happen in the future is this: A string literal
does not create an instance of <type 'str'>, but an instance of <type
'unicode'>. However, IMO, this should only happen after a syntax for
byte string literals has been introduced.

> >I'm very unimpressed with this decision if that's the case.
> >
> Doesn't make me ecstatic, either, as I like the simple 8-bit-clean
> string type. But maybe we'll luck out and it will turn out that I'm
> all wet on this one :) .

The byte string type is not going away. It is a useful type, e.g. when
reading or writing to or from a byte stream.

Regards,
Martin

Mike C. Fletcher

Dec 4, 2003, 3:07:46 AM
to pytho...@python.org
Martin v. Löwis wrote:

>"Mike C. Fletcher" <mcfl...@rogers.com> writes:
>
>
>>AFAIK, that's the plan. IIRC, rationale was that there would be some
>>other type for 8-bit data, while all "normal" strings would become
>>Unicode strings.
>>
>>
>
>No. <type 'str'> will remain a byte string type for any foreseeable
>future. The only change that is likely to happen is this: To denote
>bytes > 128 in source code, you will need to use escape codes.
>
>

Sigh, yes. Bulks up resourcepackage files, but oh well.

>A change that might happen in the future is this: A string literal
>does not create an instance of <type 'str'>, but an instance of <type
>'unicode'>. However, IMO, this should only happen after a syntax for
>byte string literals has been introduced.
>
>

Sorry, yes, that was my understanding, I should have specified a "string
literal", rather than just saying "strings" would produce unicode. And
yes, I'd definitely suggest not getting rid of the current semantics
until there's a way to do byte-strings.

Peace all,

Bengt Richter

Dec 4, 2003, 5:33:33 AM
On 04 Dec 2003 08:28:16 +0100, mar...@v.loewis.de (Martin v. Löwis) wrote:

>"Mike C. Fletcher" <mcfl...@rogers.com> writes:
>
>> AFAIK, that's the plan. IIRC, rationale was that there would be some
>> other type for 8-bit data, while all "normal" strings would become
>> Unicode strings.
>
>No. <type 'str'> will remain a byte string type for any foreseeable
>future. The only change that is likely to happen is this: To denote
>bytes > 128 in source code, you will need to use escape codes.
>

Anyone considered extending the hex escape with delimiters to make
long runs more dense? E.g.,

'ab\x00\x01\x02\x03cd'

being spellable as

'ab\<00010203>cd'

or
'ab' x'00010203' 'cd'

or
x'6162000102036364'
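(Something close to the last spelling already works today via the hex codec,
for what it's worth, though it is runtime conversion rather than literal syntax:

>>> import binascii
>>> 'ab' + binascii.unhexlify('00010203') + 'cd'
'ab\x00\x01\x02\x03cd'
>>> 'ab' + '00010203'.decode('hex') + 'cd'   # same thing
'ab\x00\x01\x02\x03cd'
)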

>A change that might happen in the future is this: A string literal
>does not create an instance of <type 'str'>, but an instance of <type
>'unicode'>. However, IMO, this should only happen after a syntax for
>byte string literals has been introduced.
>

Still, the actual characters used in the _source_ representation will have to
be whatever the -*- xxx -*- thing says, right? -- including the characters
in the source representation of a string that might wind up utf-8 internally?
(so you could have several modules whose sources are encoded differently and
have the run time see a single unified internal representation of utf-8?
Or wchar/utf-16le?)

>> >I'm very unimpressed with this decision if that's the case.
>> >
>> Doesn't make me ecstatic, either, as I like the simple 8-bit-clean
>> string type. But maybe we'll luck out and it will turn out that I'm
>> all wet on this one :) .
>
>The byte string type is not going away. It is a useful type, e.g. when
>reading or writing to or from a byte stream.
>

Is this moving towards a single 8-bit str base type with various
encoding-specifying subtypes?

Regards,
Bengt Richter

Logan

Dec 4, 2003, 7:44:14 AM
On Wed, 03 Dec 2003 18:15:15 +0100, Enrique wrote:

> Running a script that works fine in Python 2.2, in Python 2.3 I find something
> like:
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position
> 37: ordinal not in range(128)

(You should post the code snippet which causes the error.)

Just one possibility: maybe you used sitecustomize.py in Python 2.2
to set your default character set to something other than ASCII (e.g.
ISO-8859-1, a.k.a. Latin-1), and now - with Python 2.3 - you no longer
use this option!?
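For the record, such a sitecustomize.py typically contained something like
this (just a sketch; the encoding is whatever was chosen at the time):

# sitecustomize.py -- imported by site.py at interpreter startup;
# sys.setdefaultencoding() is removed again once site.py has finished,
# so this is about the only place it can legitimately be called.
import sys
sys.setdefaultencoding('iso-8859-1')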

HTH, L.

--
mailto: logan@phreaker(NoSpam).net

Peter Hansen

Dec 4, 2003, 8:07:32 AM

Yes, the misunderstanding was that you are talking about putting
binary in the source, while I was just interested in strings with binary
data... using escapes. No problemo. :-)

-Peter

Skip Montanaro

Dec 4, 2003, 9:33:51 AM
to Martin v. Löwis, pytho...@python.org

Martin> A change that might happen in the future is this: A string
Martin> literal does not create an instance of <type 'str'>, but an
Martin> instance of <type 'unicode'>. However, IMO, this should only
Martin> happen after a syntax for byte string literals has been
Martin> introduced.

b"..." anyone?

Skip

Martin v. Löwis

Dec 4, 2003, 2:22:52 PM
bo...@oz.net (Bengt Richter) writes:

> Still, the actual characters used in the _source_ representation will have to
> be whatever the -*- xxx -*- thing says, right? -- including the characters
> in the source representation of a string that might wind up utf-8 internally?

Yes, and no. Yes, characters in the source code have to follow the
source representation. No, they will not wind up utf-8
internally. Instead, (byte) string objects have the same byte
representation that they originally had in the source code.

The source declaration only matters in the following respects:
- the source may be erroneous, if the bytes form illegal encodings
in the declared source encoding.
- a unicode object will be created based upon the source encoding,
by decoding the bytes in the unicode literal.
- the meaning of certain bytes might not be what it would be in
ASCII. In particular, byte 92 does not always denote a
backslash (\), in all encodings. As a result, if byte 92 appears
in a string literal, the end of the string literal might depend
on the encoding.
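A small illustration of the second point, for a module saved as Latin-1:

# -*- coding: iso-8859-1 -*-
b = 'Löwis'     # byte string: keeps the five Latin-1 bytes from the file
u = u'Löwis'    # unicode literal: decoded from Latin-1 at compile time
print repr(b)   # 'L\xf6wis'
print repr(u)   # u'L\xf6wis'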

> >The byte string type is not going away. It is a useful type, e.g. when
> >reading or writing to or from a byte stream.
> >
> Is this moving towards a single 8-bit str base type with various
> encoding-specifying subtypes?

I don't think so. If byte strings were tagged with encoding, you
have to answer many difficult questions, like "what is the result
of adding strings with different encodings?", or "what encoding tag
has a string returned from a socket read", etc.

Instead, applications should apply encodings whereever needed, using
Unicode strings for character data, and byte strings for binary data.

Regards,
Martin

Martin v. Löwis

Dec 4, 2003, 2:23:17 PM
Skip Montanaro <sk...@pobox.com> writes:

> b"..." anyone?

The PEP suggesting this was withdrawn.

Regards,
Martin

Bengt Richter

Dec 4, 2003, 3:53:16 PM
On 04 Dec 2003 20:22:52 +0100, mar...@v.loewis.de (Martin v. Löwis) wrote:

>bo...@oz.net (Bengt Richter) writes:
>
>> Still, the actual characters used in the _source_ representation will have to
>> be whatever the -*- xxx -*- thing says, right? -- including the characters
>> in the source representation of a string that might wind up utf-8 internally?
>
>Yes, and no. Yes, characters in the source code have to follow the
>source representation. No, they will not wind up utf-8
>internally. Instead, (byte) string objects have the same byte
>representation that they originally had in the source code.

Then they must have encoding info attached?

>
>The source declaration only matters in the following respects:
>- the source may be erroneous, if the bytes form illegal encodings
> in the declared source encoding.
>- a unicode object will be created based upon the source encoding,
> by decoding the bytes in the unicode literal.
>- the meaning of certain bytes might not be what it would be in
> ASCII. In particular, byte 92 does not always denote a
> backslash (\), in all encodings. As a result, if byte 92 appears
> in a string literal, the end of the string literal might depend
> on the encoding.
>
>> >The byte string type is not going away. It is a useful type, e.g. when
>> >reading or writing to or from a byte stream.
>> >
>> Is this moving towards a single 8-bit str base type with various
>> encoding-specifying subtypes?
>
>I don't think so. If byte strings where tagged with encoding, you
>have to answer many difficult questions, like "what is the result
>of adding strings with different encodings?", or "what encoding tag

Isn't that similar to promotion in 123 + 4.56 ? We already do that to some extent:
>>> 'abc' + u'def'
u'abcdef'

IOW, behind the concrete character representations there are abstract entities
(which the unicode charts systematically match up with other abstract entities
from the integer domain), so in the abstract we are representing the concatenation
of abstract character entities of the same universal type (i.e., belonging
to the set of possible characters). The question becomes what encoding is adequate
to represent the result without information loss.

There could even be analogies to roundoff in e.g. dropping accent marks during
some conversion.

But there is another question, and that is whether a concrete encoding of characters
really just represents characters, or whether the intent is actually to represent
a concrete encoding as such (including the info as to which encoding it is). In the
latter case one couldn't convert to a universal character type without loss of information.

IOW, ISTM for literals one would need a way to say:

1. This is a pure character sequence, use the source representation only to determine
what the abstract character entities (ACEs) are, and represent them as necessary to preserve
their unified identities.
2. This is a quote-delimited substring of the source text, use the source encoding cookie
or other governing assumption to determine what the ACEs are, then as in 1.
3. This is an encoding-restricted string literal (though necessarily represented in the concrete
character encoding of the module source, with escapes as necessary). Determine what the ACEs are,
using the encoding information to transform as necessary, but store encoding information along
with with ACE representation, because the programming intent is to represent encoding information
as well as ACE sequence.

3a. Alternatively, store the original _source_ as an ACE sequence with associated _source_ encoding
AND encoding called for by the literal. This is tricky to think about, because there are >= three
encodings to consider -- the source, what's called for by the literal, and possible internal
representations.

>has a string returned from a socket read", etc.

8-bit byte encoding by default, I would think, but if you expand on the idea of cooked
text input, I guess you could specify an encoding much as you specify 'r' vs 'rb' vs 'rU' etc.

BTW, for convenience, will 8-bit byte encoded strings be repr'd as latin-1 + escapes?

>
>Instead, applications should apply encodings whereever needed, using
>Unicode strings for character data, and byte strings for binary data.
>

Still, they have to express that in the encoding(s) of the program sources,
so what will '...' mean? Must it not be normalized to a common internal representation?

BTW, does import see encoding cookies and do the right thing when there are differing ones?

Regards,
Bengt Richter

Martin v. Löwis

Dec 4, 2003, 4:10:50 PM
bo...@oz.net (Bengt Richter) writes:

> >Yes, and no. Yes, characters in the source code have to follow the
> >source representation. No, they will not wind up utf-8
> >internally. Instead, (byte) string objects have the same byte
> >representation that they originally had in the source code.
> Then they must have encoding info attached?

I don't understand the question. Strings don't have an encoding
information attached. Why do they have to?

> Isn't that similar to promotion in 123 + 4.56 ?

It is similar, but not the same. The answer is easy for 123+4.56.

The answer would be more difficult for (4/5)+4.56 if 4/5 was a
rational number; for 1 < 0.5+0.5j, Python decides that it just cannot
find a result in a reasonable way. For strings-with-attached encoding,
the answer would always be difficult.

In the face of ambiguity, refuse the temptation to guess.

> We already do that to some extent:
> >>> 'abc' + u'def'
> u'abcdef'

Yes, and that is only possible because the system encoding is
ASCII. So regardless of what the actual encoding of the string is,
assuming it is ASCII will give the expected result, as ASCII is a
universal subset of (nearly) all encodings.
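Concretely, the same coercion fails as soon as a non-ASCII byte shows up:

>>> 'abc' + u'def'
u'abcdef'
>>> 'caf\xe9' + u'def'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)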

> But there is another question, and that is whether a concrete
> encoding of characters really just represents characters, or whether
> the intent is actually to represent a concrete encoding as such
> (including the info as to which encoding it is).

More interestingly: Do the strings represent characters AT ALL? Some
strings don't represent characters, but bytes.

What is the advantage of having an encoding associated with byte
strings?

> 1. This is a pure character sequence, use the source representation
> only to determine what the abstract character entities (ACEs) are,
> and represent them as necessary to preserve their unified
> identities.

In that case, you should use Unicode literals: They do precisely that.

> BTW, for convenience, will 8-bit byte encoded strings be repr'd as
> latin-1 + escapes?

Currently, they are represented as ASCII+escapes. I see no reason to
change that.

> Still, they have to express that in the encoding(s) of the program
> sources, so what will '...' mean? Must it not be normalized to a
> common internal representation?

At some point in time, '...' will mean the same as u'...'. A Unicode
object *is* a normalized representation of a character string.

There should be one-- and preferably only one --obvious way to do it.

... and Unicode strings are the one obvious way to do a normalized
representation. You should use Unicode literals today whereever
possible.

> BTW, does import see encoding cookies and do the right thing when
> there are differing ones?

In a single file? It is an error to have multiple encoding cookies in
a single file.

In multiple files? Of course, that is the entire purpose: Allow
different encodings in different modules. If only a single encoding
was used, there would be no need to declare that.

Regards,
Martin

Bengt Richter

Dec 5, 2003, 1:25:05 AM
On 04 Dec 2003 22:10:50 +0100, mar...@v.loewis.de (Martin v. Löwis) wrote:

>bo...@oz.net (Bengt Richter) writes:
>
>> >Yes, and no. Yes, characters in the source code have to follow the
>> >source representation. No, they will not wind up utf-8
>> >internally. Instead, (byte) string objects have the same byte
>> >representation that they originally had in the source code.
>> Then they must have encoding info attached?
>
>I don't understand the question. Strings don't have an encoding
>information attached. Why do they have to?

Depends on what you mean by "strings" ;-) One way to look at it would be
because they originate through choosing a sequence of glyphs on key caps,
and there has to be an encoding process between that and a str's internal
representation, something like key->scan_code->[code page]->char_code_of_particular_encoding.

If you put a sequence of those in a "string," ISTM the string should be thought of as
having the same encoding as the characters whose ord() codes are stored.

If the second line of a Python source is
# -*- coding: latin-1 -*-

Then a following line

name = 'Martin Löwis'

would presumably bind name to an internally represented string. I guess right now
it is an ascii string of type str, and if the source encoding was ascii, you would
have to write that statement as

name = 'Martin L\xf6wis'

to get the same internal representation.

But either way, what you wanted to specify was the latin-1 glyph sequence associated
with the number sequence

>>> map(ord, 'Martin L\xf6wis')
[77, 97, 114, 116, 105, 110, 32, 76, 246, 119, 105, 115]

through latin-1 character interpretation. You (probably, never say never ;-) didn't
want just to specify a sequence of bytes. You put them there to be interpreted as
latin-1 at some point.

>
>> Isn't that similar to promotion in 123 + 4.56 ?
>
>It is similar, but not the same. The answer is easy for 123+4.56.
>
>The answer would be more difficult for (4/5)+4.56 if 4/5 was a
>rational number; for 1 < 0.5+0.5j, Python decides that it just cannot
>find a result in a reasonable way. For strings-with-attached encoding,
>the answer would always be difficult.

Why, when unicode includes all?

>
>In the face of ambiguity, refuse the temptation to guess.
>
>> We already do that to some extent:
>> >>> 'abc' + u'def'
>> u'abcdef'
>
>Yes, and that is only possible because the system encoding is
>ASCII. So regardless of what the actual encoding of the string is,

Um, didn't you say, "Strings don't have an encoding information attached.
Why do they have to?" ? What's this about ASCII? ;-)

>assuming it is ASCII will give the expected result, as ASCII is a

^^^^^^^^ oh, ok, it's just an assumption.

>universal subset of (nearly) all encodings.
>
>> But there is another question, and that is whether a concrete
>> encoding of characters really just represents characters, or whether
>> the intent is actually to represent a concrete encoding as such
>> (including the info as to which encoding it is).
>
>More interestingly: Do the strings represent characters AT ALL? Some
>strings don't represent characters, but bytes.

Again it comes down to defining terms ISTM.

>
>What is the advantage of having an encoding associated with byte
>strings?

If e.g. name had latin-1 encoding associated with it by virtue of source like
...
# -*- coding: latin-1 -*-
name = 'Martin Löwis'

then on my cp437 console window, I might be able to expect to see the umlaut
just by writing

print name # implicit conversion from associated encoding to output device encoding

instead of having to write

print name.decode('latin-1').encode('cp437')

or something different on idle, etc.

Instead, there is a you-know-what-you're-doing implicit reinterpret-cast of
the byte string bound to name to whatever-type-the-output-device-currently-attached-is.

Definitely that is necessary functionality, but explicit might be better than implicit.
E.g., one might spell it

print name.bytes() # meaning expose binary data byte sequence for name's encoding.
# the repr format would be like current str ascii-with-escapes

and

bytes = name.bytes()

would result in pure 8-bit data bytes with no implied 'ascii' association whatever. (The
7-bit-ascii-with-escapes repr would only be a data print format, with no other implication).

This could be followed by

s = bytes.associate('latin-1')

to reconstitute the string-of-bytes-with-associated-encoding

>
>> 1. This is a pure character sequence, use the source representation
>> only to determine what the abstract character entities (ACEs) are,
>> and represent them as necessary to preserve their unified
>> identities.
>
>In that case, you should use Unicode literals: They do precisely that.

Why should I have to do that if I have written # -*- coding: latin-1 -*-
in the second line? Why shouldn't s='blah blah' result in s being internally
stored as a latin-1 glyph sequence instead of an 8-bit code sequence that will
trip up ascii assumptions annoyingly ;-)

>
>> BTW, for convenience, will 8-bit byte encoded strings be repr'd as
>> latin-1 + escapes?
>
>Currently, they are represented as ASCII+escapes. I see no reason to
>change that.

Ok, that's no biggie, but even with your name? ;-)

>
>> Still, they have to express that in the encoding(s) of the program
>> sources, so what will '...' mean? Must it not be normalized to a
>> common internal representation?
>
>At some point in time, '...' will mean the same as u'...'. A Unicode

interesting. Will u'...' mean Unicode in the abstract, reserving the
choice of utf-16(le|be)/wchar or utf-8 to the implementation?


>object *is* a normalized representation of a character string.

Sure. But it will have different possible encodings when you want to
send it to another system or store it etc.

>
>There should be one-- and preferably only one --obvious way to do it.
>
>... and Unicode strings are the one obvious way to do a normalized
>representation. You should use Unicode literals today whereever
>possible.
>
>> BTW, does import see encoding cookies and do the right thing when
>> there are differing ones?
>
>In a single file? It is an error to have multiple encoding cookies in
>a single file.

I didn't mean that ;-)


>
>In multiple files? Of course, that is the entire purpose: Allow
>different encodings in different modules. If only a single encoding
>was used, there would be no need to declare that.

Yes that seems obvious, but I had some inkling that if two modules
m1 and m2 had different source encodings, different codes would be
allowed in '...' literals in each, and e.g.,

import m1,m2
print 'm1: %r, m2: %r' % (m1.s1, m2.s2)

might have ill-defined meaning, which perhaps could be resolved by strings carrying
encoding info along. Of course, if all '...' wind up equivalent to u'...' then that
pretty much goes away (though I suppose %r might not be a good short cut for getting
a plain quoted string into the output any more).

But if s = '...' becomes effectively s = u'...' will type('...') => <type 'unicode'> ?

What will become of str? Will that still be the default pseudo-ascii-but-really-byte-string
general data container that is is now?

Regards,
Bengt Richter

Martin v. Löwis

Dec 5, 2003, 1:18:50 PM
bo...@oz.net (Bengt Richter) writes:

> If you put a sequence of those in a "string," ISTM the string should
> be thought of as having the same encoding as the characters whose
> ord() codes are stored.

So this is a matter of "conceptual correctness". I could not care
less: I thought you bring forward real problems that would be solved
if strings had an encoding attached.

> But either way, what you wanted to specify was the latin-1 glyph
> sequence associated with the number sequence

I would use a Unicode object to represent these characters.

> >The answer would be more difficult for (4/5)+4.56 if 4/5 was a
> >rational number; for 1 < 0.5+0.5j, Python decides that it just cannot
> >find a result in a reasonable way. For strings-with-attached encoding,
> >the answer would always be difficult.
> Why, when unicode includes all?

Because at the end, you would produce a byte string. Then the question
is what type the byte string should have.

> >assuming it is ASCII will give the expected result, as ASCII is a
> ^^^^^^^^ oh, ok, it's just an assumption.

Yes. I advocate you should never make use of this assumption, but I
also believe it is a reasonable one - because it would still hold if
the string was Latin-1, KOI-8R, UTF-8, Mac-Roman, ...

> >What is the advantage of having an encoding associated with byte
> >strings?
> If e.g. name had latin-1 encoding associated with it by virtue of source like
> ...
> # -*- coding: latin-1 -*-
> name = 'Martin Löwis'
>
> then on my cp437 console window, I might be able to expect to see the umlaut
> just by writing
>
> print name

I see. To achieve this effect, do

# -*- coding: latin-1 -*-

name = u'Martin Löwis'
print name


> Why should I have to do that if I have written # -*- coding: latin-1 -*-
> in the second line? Why shouldn't s='blah blah' result in s being internally
> stored as a latin-1 glyph sequence instead of an 8-bit code sequence that will
> trip up ascii assumptions annoyingly ;-)

Because adding encoding to strings raises difficult questions, which,
when answered, will result in non-intuitive behaviour.

> >Currently, they are represented as ASCII+escapes. I see no reason to
> >change that.
> Ok, that's no biggie, but even with your name? ;-)

I use Unicode literals in source code. They can represent my name just
fine.

> interesting. Will u'...' mean Unicode in the abstract, reserving the
> choice of utf-16(le|be)/wchar or utf-8 to the implementation?

You seem to be missing an important point. u'...' is available today.

The choice of representation is currently between UCS-2/UTF-16 and
UCS-4, with UTF-8 being an unlikely candidate for implementation
choice.

> Yes that seems obvious, but I had some inkling that if two modules
> m1 and m2 had different source encodings, different codes would be
> allowed in '...' literals in each, and e.g.,
>
> import m1,m2
> print 'm1: %r, m2: %r' % (m1.s1, m2.s2)
>
> might have ill-defined meaning

That is just one of the problems you run into when associating
encodings with strings. Fortunately, there is no encoding associated
with a byte string.

> But if s = '...' becomes effectively s = u'...' will type('...') =>
> <type 'unicode'> ?

Of course!

> What will become of str? Will that still be the default
> pseudo-ascii-but-really-byte-string general data container that is
> is now?

Well, <type 'str'> will continue to be the byte string type, and
conversion to str() will continue to produce byte strings. It might be
reasonable to add a string() built-in some day, which is a synonym for
unicode().

Regards,
Martin

Bengt Richter

Dec 5, 2003, 6:16:41 PM
On 05 Dec 2003 19:18:50 +0100, mar...@v.loewis.de (Martin v. Löwis) wrote:

>bo...@oz.net (Bengt Richter) writes:
>
>> If you put a sequence of those in a "string," ISTM the string should
>> be thought of as having the same encoding as the characters whose
>> ord() codes are stored.
>
>So this is a matter of "conceptual correctness". I could not care
>less: I thought you bring forward real problems that would be solved
>if strings had an encoding attached.

I thought I did, but the "problem" is not achieving end effects (everyone
appreciates your expert advice on that). The "problem" is a bit of
(UIAM -- and to explore the issue collaboratively is why I post) unnecessary
explicitness required to achieve an end that could possibly happen automatically
(according to a "conceptually correct[-me-if-I'm-wrong]" model ;-)

>
>> But either way, what you wanted to specify was the latin-1 glyph
>> sequence associated with the number sequence
>
>I would use a Unicode object to represent these characters.

Yes, that is an effective explicitness, but not what I was trying to get at.


>
>> >The answer would be more difficult for (4/5)+4.56 if 4/5 was a
>> >rational number; for 1 < 0.5+0.5j, Python decides that it just cannot
>> >find a result in a reasonable way. For strings-with-attached encoding,
>> >the answer would always be difficult.
>> Why, when unicode includes all?
>
>Because at the end, you would produce a byte string. Then the question
>is what type the byte string should have.

Unicode, of course, unless that coercion was not necessary, as in ascii+ascii
or latin-1 + latin-1, etc., where the result could retain the more specific
encoding attribute.

>
>> >assuming it is ASCII will give the expected result, as ASCII is a
>> ^^^^^^^^ oh, ok, it's just an assumption.
>
>Yes. I advocate you should never make use of this assumption, but I
>also believe it is a reasonable one - because it would still hold if
>the string was Latin-1, KOI-8R, UTF-8, Mac-Roman, ...

Why not assume latin-1, if it's just a convenience assumption for certain
contexts? I suspect it would be right more often than not, given that for
other cases explicit unicode or decode/encode calls would probably be used.

>
>> >What is the advantage of having an encoding associated with byte
>> >strings?
>> If e.g. name had latin-1 encoding associated with it by virtue of source like
>> ...
>> # -*- coding: latin-1 -*-
>> name = 'Martin Löwis'
>>
>> then on my cp437 console window, I might be able to expect to see the umlaut
>> just by writing
>>
>> print name
>
>I see. To achieve this effect, do
>
># -*- coding: latin-1 -*-
>name = u'Martin Löwis'
>print name

Right, but that is a workaround w.r.t the possibility I am trying to discuss.

>
>
>> Why should I have to do that if I have written # -*- coding: latin-1 -*-
>> in the second line? Why shouldn't s='blah blah' result in s being internally
>> stored as a latin-1 glyph sequence instead of an 8-bit code sequence that will
>> trip up ascii assumptions annoyingly ;-)
>
>Because adding encoding to strings raises difficult questions, which,
>when answered, will result in non-intuitive behaviour.

Care to elaborate? I don't know what difficult questions nor non-intuitive behavior
you have in mind, but I am probably not the only one who is curious ;-)

>
>> >Currently, they are represented as ASCII+escapes. I see no reason to
>> >change that.
>> Ok, that's no biggie, but even with your name? ;-)
>
>I use Unicode literals in source code. They can represent my name just
>fine.

Ok, ok ;-)

>
>> interesting. Will u'...' mean Unicode in the abstract, reserving the
>> choice of utf-16(le|be)/wchar or utf-8 to the implementation?
>
>You seem to be missing an important point. u'...' is available today.

No, I know that ;-) But I don't know how you are going to migrate towards
a more pervasive use of unicode in all the '...' contexts. Whether at
some point unicode will be built into cpython as the C representation
of all internal strings, or it will use unicode through unicode objects
and their interfaces, which I imagine would be the way it started.
Memory-limited implementations might want to make different choices IWG,
so the cleaner the python-unicode relationship the freer those choices
are likely to be IWT. I was just speculating on these things.

>
>The choice of representation is currently between UCS-2/UTF-16 and
>UCS-4, with UTF-8 being an unlikely candidate for implementation
>choice.
>
>> Yes that seems obvious, but I had some inkling that if two modules
>> m1 and m2 had different source encodings, different codes would be
>> allowed in '...' literals in each, and e.g.,
>>
>> import m1,m2
>> print 'm1: %r, m2: %r' % (m1.s1, m2.s2)
>>
>> might have ill-defined meaning
>
>That is just one of the problems you run into when associating

^--not ;-)


>encodings with strings. Fortunately, there is no encoding associated
>with a byte string.

So assume ascii, after having stripped away better knowledge?

It's fine to have a byte type with no encoding associated. But unfortunately
ISTM str instances seem to be playing a dual role as ascii-encoded strings
and byte strings. More below.

>
>> But if s = '...' becomes effectively s = u'...' will type('...') =>
>> <type 'unicode'> ?
>
>Of course!

Just checking ;-)

>
>> What will become of str? Will that still be the default
>> pseudo-ascii-but-really-byte-string general data container that is
>> is now?
>
>Well, <type 'str'> will continue to be the byte string type, and
>conversion to str() will continue to produce byte strings. It might be
>reasonable to add a string() built-in some day, which is a synonym for
>unicode().

How will the following look when s == '...' becomes effectively s = u'...' per above?

>>> str('L\xf6wis')
'L\xf6wis'
>>> str(u'L\xf6wis')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in
range(128)

Will there be an ascii codec involved in that first str('...') ?

Hm. For a pure-bytes string type, you could define an assumed 8-bit encoding in place of ascii,
so that you could always get a unicode translation 1:1. You could do it by using a private
code area of unicode, so e.g. '\x00' to '\xff' becomes u'\ue000' to u'\ue0ff' and then e.g.,
unicode.__str__(u'\ue0ab') could render back '\xab' as the str value instead of raising
UnicodeEncodeError saying it's not in ascii range. Also, u'\ue0ab'.encode('bytes') would presumably
return the byte string '\xab'.

To get the e000-e0ff unicode, you'd do some_byte_string.decode('bytes') analogous to the apparent
some_ordinary_str.decode('ascii') that seems to be attempted in some contexts now.
BTW, is that really some_ordinary_str.decode(sys.getdefaultencoding()) ?
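A purely illustrative sketch of that byte <-> private-use-area mapping (not
an existing codec; the names are made up):

>>> def bytes_to_pua(s):
...     return u''.join([unichr(0xE000 + ord(c)) for c in s])
...
>>> def pua_to_bytes(u):
...     return ''.join([chr(ord(c) - 0xE000) for c in u])
...
>>> bytes_to_pua('\xab')
u'\ue0ab'
>>> pua_to_bytes(u'\ue0ab')
'\xab'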

Another thing: what encoding should an ordinary user object return from __str__ ?
(I still think str instances with explicit optional mutable 'encoding' attribute slots
could be useful ;-) Or should __str__ not return type str any more, but unicode??
Or optionally either?

BTW, is '...' =(effectively)= u'...' slated for a particular future python version?

Regards,
Bengt Richter

Martin v. Löwis

Dec 6, 2003, 12:20:57 PM
bo...@oz.net (Bengt Richter) writes:

> >> Why, when unicode includes all?
> >
> >Because at the end, you would produce a byte string. Then the question
> >is what type the byte string should have.
> Unicode, of course, unless that coercion was not necessary, as in ascii+ascii
> or latin-1 + latin-1, etc., where the result could retain the more specific
> encoding attribute.

I meant to write "what *encoding* the byte string should have". Unicode
is not an encoding.

> Why not assume latin-1, if it's just a convenience assumption for certain
> contexts? I suspect it would be right more often than not, given that for
> other cases explicit unicode or decode/encode calls would probably be used.

This was by BDFL pronouncement, and I agree with that decision. I
personally would have favoured UTF-8 as system encoding in Python, as
it would support all languages, and would allow for as few mistakes
as ASCII (e.g. you can't mistake a Latin-1 or KOI-8R string for UTF-8).
I would consider choosing Latin-1 as euro-centric, and it would
silently do the wrong thing if the actual encoding was something else.

Errors should never pass silently.
Unless explicitly silenced.

> >name = u'Martin Löwis'
> >print name
> Right, but that is a workaround w.r.t the possibility I am trying to
> discuss.

The problem is that the possibility is not a possibility. What you
propose just cannot be implemented in a meaningful way. If you don't
believe me, please try implementing it yourself, and I'll show you the
problems of your implementation.

Using Unicode objects to represent characters is not a work-around, it
is the solution.

> Care to elaborate? I don't know what difficult questions nor
> non-intuitive behavior you have in mind, but I am probably not the
> only one who is curious ;-)

As I said: What would be the meaning of concatenating strings, if both
strings have different encodings?

I see three possible answers to this question, all non-intuitive:
1. Choose one of the encodings, and convert the other string to
that encoding. This has these problems:
a) neither encoding might be capable of representing all characters
of the result string. There are several ways to deal with this
case; finding them is left as an exercise to the reader.
b) it would be incompatible with prior versions, as it would
not be a plain byte concatenation.
2. Convert the result string to UTF-8. This is incompatible with
earlier Python versions.
3. Consider the result as having "no encoding". This would render
the entire feature useless, as string data would degrade to
"no encoding" very quickly. This, in turn, would leave to "strange"
errors, as sometimes, printing a string works fine, but seemingly
randomly, it fails.

Also, what would be the encoding of strings returned from file.read(),
socket.read(), etc.?

Also, what would be the encoding of strings created as a result of
splice operations? What if the splice hits the middle of a multi-byte
encoding?

> No, I know that ;-) But I don't know how you are going to migrate towards
> a more pervasive use of unicode in all the '...' contexts. Whether at
> some point unicode will be built into cpython as the C representation
> of all internal strings

Unicode is not a representation of byte strings, so this cannot
happen.

> or it will use unicode through unicode objects
> and their interfaces, which I imagine would be the way it started.

Yes, all library functions that expect strings should support Unicode
objects. Ideally, all library functions that return strings should
return Unicode objects, but this raises backwards compatibility
issues. For the APIs where this matters much, transition mechanisms
are in progress.

> Memory-limited implementations might want to make different choices IWG,
> so the cleaner the python-unicode relationship the freer those choices
> are likely to be IWT.

I'm not too concerned with memory-limited implementations. It would be
feasible to re-implement the Unicode type to use UTF-8 as its internal
representation, but that would be tedious to do on the C level, and it
would lead to really bad performance, given that slicing and indexing
become inefficient.

> >> import m1,m2
> >> print 'm1: %r, m2: %r' % (m1.s1, m2.s2)
> >>
> >> might have ill-defined meaning
> >
> >That is just one of the problems you run into when associating
> ^--not ;-)
> >encodings with strings. Fortunately, there is no encoding associated
> >with a byte string.
> So assume ascii, after having stripped away better knowledge?

No, in current Python, there is no doubt about the semantics: We
assume *nothing* about the encoding. Instead, if s1 and s2 are <type
'str'>, we treat them as byte strings. This means that bytes 0..31 and
128..255 are escaped, with special escapes applying to 10, 13, ...,
and bytes 34 and 39.
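For example:

>>> print repr('caf\xe9\n\x00')
'caf\xe9\n\x00'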

> It's fine to have a byte type with no encoding associated. But
> unfortunately ISTM str instances seem to be playing a dual role as
> ascii-encoded strings and byte strings. More below.

No. They actually play a dual role as byte strings and somehow-encoded
strings, depending on the application. In many applications, that
encoding is the locale's encoding, but in internet applications, you
often have to handle multiple encodings in a single run of the
program.

> How will the following look when s == '...' becomes effectively s =
> u'...' per above?

I don't know. Because this question is difficult to answer, that
change cannot be made in the near future. It might be reasonable to
have str() return Unicode objects - with another builtin to generate
byte strings.

> BTW, is '...' =(effectively)= u'...' slated for a particular future
> python version?

No. Try running your favourite application with -U, and see what
happens. For Python 2.3, I managed to get python -U at least to enter
interactive mode - in 2.2, importing site.py will fail, trying
to put Unicode objects on sys.path.

Regards,
Martin

Bengt Richter

Dec 6, 2003, 10:14:25 PM
On 06 Dec 2003 18:20:57 +0100, mar...@v.loewis.de (Martin v. Löwis) wrote:

>bo...@oz.net (Bengt Richter) writes:
>
>> >> Why, when unicode includes all?
>> >
>> >Because at the end, you would produce a byte string. Then the question
>> >is what type the byte string should have.
>> Unicode, of course, unless that coercion was not necessary, as in ascii+ascii
>> or latin-1 + latin-1, etc., where the result could retain the more specific
>> encoding attribute.
>
>I meant to write "what *encoding* the byte string should have". Unicode
>is not an encoding.

True (conceptually correct ;-). I was being sloppy and using "unicode" as
a metonym for "any unicode encoding" (the choice would be an implementation/optimization
concern. The point being to preserve character identity information from the original
string encodings in a combined new encoding).


>
>> Why not assume latin-1, if it's just a convenience assumption for certain
>> contexts? I suspect it would be right more often than not, given that for
>> other cases explicit unicode or decode/encode calls would probably be used.
>
>This was by BDFL pronouncement, and I agree with that decision. I
>personally would have favoured UTF-8 as system encoding in Python, as
>it would support all languages, and would allow for as few mistakes
>as ASCII (e.g. you can't mistake a Latin-1 or KOI-8R string for UTF-8).
>I would consider choosing Latin-1 as euro-centric, and it would
>silently do the wrong thing if the actual encoding was something else.

We still can get silent wrongdoing I think, but ok, never mind latin-1.
Probably not a good idea. It's a red herring for now anyway.


>
>Errors should never pass silently.
>Unless explicitly silenced.

Ok, I'm happy with that. But let's see where the errors come from.
By definition it's from associating the wrong encoding assumption
with a pure byte sequence. So how can that happen?

1. a pure byte sequence is used in a context that requires
interpretation as a character sequence, and a wrong default is assumed

2. the program code has a bug and passes explicitly wrong information

We'll ignore #2, but how can a pure byte sequence get to situation #1 ?

1a. Available unambiguous encoding information not matching the
default assumption was dropped. This is IMO the most likely.
1b. The byte sequence came from an unspecified source and never got explicit encoding info associated.
This is probably a bug or application design flaw, not a python problem.

IMO a large part of the answer will be not to drop available encoding info.
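
As a minimal illustration of 1a: the encoding is known where the bytes are
produced, gets dropped, and the implicit 'ascii' default then blows up later:

>>> data = u'L\xf6wis'.encode('latin-1')   # encoding known here...
>>> u'%s' % data                           # ...but dropped by the time it's used
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 ...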

>
>> >name = u'Martin Löwis'
>> >print name
>> Right, but that is a workaround w.r.t the possibility I am trying to
>> discuss.
>
>The problem is that the possibility is not a possibility. What you
>propose just cannot be implemented in a meaningful way. If you don't
>believe me, please try implementing it yourself, and I'll show you the
>problems of your implementation.

I hope an outline of what I am thinking is becoming visible.

>
>Using Unicode objects to represent characters is not a work-around, it
>is the solution.
>
>> Care to elaborate? I don't know what difficult questions nor
>> non-intuitive behavior you have in mind, but I am probably not the
>> only one who is curious ;-)
>
>As I said: What would be the meaning of concatenating strings, if both
>strings have different encodings?

If the strings have encodings, the semantics are the semantics of character
sequences with possibly heterogenous representations. The simplest thing would probably
be to choose utf-16le like windows wchar UIAM and normalize all strings that have
encodings to that, but you could postpone that and get more efficient memory use
by introducing a mutable coding property for str instances, so that e.g.,

u'abc'.encode('latin-1').coding => 'latin-1'

i.e., when you encode a character sequence into a byte sequence according to a certain
encoding, you get a str instance with a .coding attribute that says what the encoding is.
The bytes of the str are just like now. A plain byte str would have .coding == None.
plain string syntax in a program source would work like

'abc'.coding => (whatever source encoding is) # not necessarily 'ascii'

This leaves the case where you explicitly want an actual pure byte string, with
no encoding. IMO '...' should _not_ produce that, because by far the most common use
is to encode characters for docstrings and printing and displaying of readable characters.
I.e., for character sequences, not data byte sequences. I don't think we should have to convert
all those string literals to u'...' for the sake of the few actual data-byte-string literals.
Instead, the latter could become explicit, e.g., by a string prefix. E.g.,

a'...'

meaning a byte string represented by ascii+escapes syntax like current practice (whatever the program
source encoding. I.e., latin-1 non-ascii characters would not be allowed in the literal _source_
representation even if the program source were encoded in latin-1. (of course escapes would be allowed)).
This would make such literals fairly portable, IWT. Also, a'foo'.coding => None, to indicate
a pure byte string.

By contrast, a plainly quoted string would get its .coding attribute value from the source encoding,
and any characters permissible in the program source would be carried through, and the string
.coding attribute would be the same as for the program source.

IWT .coding attributes/properties would permit combining character strings with different
encodings by promoting to an encoding that includes all without information loss. User custom
objects presenting string behaviour could also have a .coding attribute to feed into
the system seamlessly.

Of course you cannot arbitrarily combine byte strings b (b.coding==None)
with character strings s (s.coding!=None).

If "strings" don't have encodings, they are not character sequences, they
are byte vectors. It should probably be a TypeError to pass a byte vector
without associated explicit encoding info where a character sequence is expected,
though for backward compatibility the coding='ascii' assumption will probably have
to be made.

With .coding attributes, print could do what it does with unicode strings now, and encode to
the current output device's coding.
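
A rough sketch of how such a type might behave (just a toy str subclass for
discussion -- the name CodedStr and the utf-8 promotion rule are only
illustrative, not a worked-out design):

class CodedStr(str):
    """str carrying an optional .coding attribute (None means plain bytes)."""
    def __new__(cls, data, coding=None):
        self = str.__new__(cls, data)
        self.coding = coding
        return self
    def __add__(self, other):
        mine = self.coding or 'ascii'
        theirs = getattr(other, 'coding', None) or 'ascii'
        if mine == theirs:
            return CodedStr(str.__add__(self, other), self.coding)
        # different encodings: decode both and promote to a common encoding
        merged = self.decode(mine) + str(other).decode(theirs)
        return CodedStr(merged.encode('utf-8'), 'utf-8')

>>> s1 = CodedStr(u'\xf6'.encode('latin-1'), 'latin-1')
>>> s2 = CodedStr(u'\u0416'.encode('koi8_r'), 'koi8_r')
>>> (s1 + s2).coding
'utf-8'
>>> s1 + s1                    # same encoding: plain byte concatenation
'\xf6\xf6'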

>
>I see three possible answers to this question, all non-intuitive:
>1. Choose one of the encodings, and convert the other string to
> that encoding. This has these problems:
> a) neither encoding might be capable of representing all characters
> of the result string. There are several ways to deal with this
> case; finding them is left as an exercise to the reader.

Fairly obviously IWT, convert to a unicode encoding.


> b) it would be incompatible with prior versions, as it would
> not be a plain byte concatenation.

IWT it would be plain byte concatenation when encodings were compatible.
Otherwise plain concatenation would be an error anyway.


>2. Convert the result string to UTF-8. This is incompatible with
> earlier Python versions.

Or utf-16xx. I wonder how many mixed-encoding situations there are in earlier code.
Single-encoding should not require change of encoding, so it should look like plain
concatenation as far as the byte sequence part is concerned. It might be mostly transparent.

>3. Consider the result as having "no encoding". This would render
> the entire feature useless, as string data would degrade to
> "no encoding" very quickly. This, in turn, would leave to "strange"
> errors, as sometimes, printing a string works fine, but seemingly
> randomly, it fails.

I agree that this would be useless ;-)


>
>Also, what would be the encoding of strings returned from file.read(),
>socket.read(), etc.?

socket_or_file.read().coding => None

unless some encoding was specified in the opening operation.
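
Something close to that already exists via the codecs module, for what it's
worth (the file name below is made up):

import codecs
f = codecs.open('latin1.log', 'r', 'latin-1')    # encoding given at open time
text = f.read()                  # -> unicode; decoding happened at the boundary
raw = open('latin1.log', 'rb').read()            # -> plain str, no encoding info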


>
>Also, what would be the encoding of strings created as a result of
>splice operations? What if the splice hits the middle of a multi-byte
>encoding?

That can't happen (unless you convert an encoded string to its unencoded byte sequence,
and somehow get a splice operation to work at that spot, in which case you deserve the result ;-)
Remember, if there is encoding, we are semantically dealing with character sequences,
so splicing has to be implemented in terms of characters, however represented.
It means that e.g., u'%s'% slatin_1 would require promotion of latin-1 encoding to whatever
unicode encoding you are using for u'...'.

But note that if slatin_1 bytes were 'L\xf6wis' and slatin_1.coding were set to 'latin-1' one
way or another, then there wouldn't have to be an implicit 'ascii' assumption to correct by writing

u'%s' % slatin_1.decode('latin-1')
instead of
u'%s' % slatin_1

>
>> No, I know that ;-) But I don't know how you are going to migrate towards
>> a more pervasive use of unicode in all the '...' contexts. Whether at
>> some point unicode will be built into cpython as the C representation
>> of all internal strings
>
>Unicode is not a representation of byte strings, so this cannot
>happen.

Sorry, for internal stuff I meant strings as in character strings, not byte strings ;-)

>No, in current Python, there is no doubt about the semantics: We
>assume *nothing* about the encoding. Instead, if s1 and s2 are <type
 ^^^^^^^^^^^^^^^^-- If that is so, why does str have an encode method?
That's supposed to go from character entities to bytes, I thought ;-)

>'str'>, we treat them as byte strings. This means that bytes 0..31 and
>128..256 are escaped, with special escapes applying to 10, 13, ...,
>and bytes 34 and 39.
>
>> It's fine to have a byte type with no encoding associated. But
>> unfortunately ISTM str instances seem to be playing a dual role as
>> ascii-encoded strings and byte strings. More below.
>
>No. They actually play a dual role as byte strings and somehow-encoded
>strings, depending on the application. In many applications, that
>encoding is the locale's encoding, but in internet applications, you
>often have to handle multiple encodings in a single run of the
>program.

Which is why I thought some_string.coding attributes to carry that
information explicitly would be a good idea.


>
>> How will the following look when s == '...' becomes effectively s =
>> u'...' per above?
>
>I don't know. Because this question is difficult to answer, that
>change cannot be made in the near future. It might be reasonable to
>have str() return Unicode objects - with another builtin to generate
>byte strings.

I'm not sure what that other builtin would do. Call on objects' __bytes__
methods? IWT most of the current str machinery could still operate on
byte strings, just by astring.coding=None to differentiate from actual
character-oriented processing.

>
>> BTW, is '...' =(effectively)= u'...' slated for a particular future
>> python version?
>
>No. Try running your favourite application with -U, and see what
>happens. For Python 2.3, I managed to get python -U to at least enter
>interactive mode - in 2.2, importing site.py will fail, trying
>to put Unicode objects on sys.path.

Haven't tried that yet...

Regards,
Bengt Richter

Martin v. Löwis

Dec 7, 2003, 4:08:37 AM12/7/03
to
bo...@oz.net (Bengt Richter) writes:

> Ok, I'm happy with that. But let's see where the errors come from.
> By definition it's from associating the wrong encoding assumption
> with a pure byte sequence.

Wrong. Errors may also happen when performing unexpected conversions
from one encoding to a different one.

> 1a. Available unambiguous encoding information not matching the
> default assumption was dropped. This is IMO the most likely.
> 1b. The byte sequence came from an unspecified source and never got explicit encoding info associated.
> This is probably a bug or application design flaw, not a python problem.

1b. is the most likely case. Any byte stream read operation (file,
socket, zipfile) will return byte streams of unspecified encoding.

> IMO a large part of the answer will be not to drop available
> encoding info.

Right. And this is very difficult, making the entire approach
unimplementable.

> I hope an outline of what I am thinking is becoming visible.

Unfortunately, not. You seem to assume that nearly all strings have
encoding information attached, but you don't explain where you expect
this information to come from.

> >As I said: What would be the meaning of concatenating strings, if both
> >strings have different encodings?
> If the strings have encodings, the semantics are the semantics of character
> sequences with possibly heterogenous representations.

??? What is a "possibly heterogenous representation", how do I
implement it, and how do I use it?

Are you suggesting that different bytes in a single string should use
different encodings? If not, how does suggesting a heterogenous
implementation answer the question of how concatenation of strings is
implemented?

> The simplest thing would probably be to choose utf-16le like windows
> wchar UIAM and normalize all strings that have encodings to that

Again: How does that answer the question what concatenation of strings
means?

Also, if you use utf-16le as the internal encoding of byte strings,
what is the meaning of indexing? I.e. given a string s='Hallo',
what is len(s), s[0], s[1]?

> Instead, the latter could become explicit, e.g., by a string prefix. E.g.,
>
> a'...'
>
> meaning a byte string represented by ascii+escapes syntax like
> current practice (whatever the program source encoding. I.e.,
> latin-1 non-ascii characters would not be allowed in the literal
> _source_ representation even if the program source were encoded in
> latin-1. (of course escapes would be allowed)).

Hmm. This still doesn't answer my question, but now you are extending
the syntax already.

> IWT .coding attributes/properties would permit combining character
> strings with different encodings by promoting to an encoding that
> includes all without information loss.

No, it would not - at least not unless you specify further details. If
I have a latin-1 string ('\xf6'), and a koi-8r string ('\xf6'), and
concatenate them, what do I get?

> Of course you cannot arbitrarily combine byte strings b (b.coding==None)
> with character strings s (s.coding!=None).

So what happens if you try to combine them?

> >2. Convert the result string to UTF-8. This is incompatible with
> > earlier Python versions.
> Or utf-16xx. I wonder how many mixed-encoding situations there
> are in earlier code. Single-encoding should not require change
> of encoding, so it should look like plain concatenation as far
> as the byte sequence part is concerned. It might be mostly
> transparent.

This approach is incompatible with earlier Python versions even for a
single encoding. If I have a KOI-8R s='\xf6' (which is the same as
U+0416), and UTF-16 is the internal representation, and I do s[0], what
do I get, and what algorithm is used to compute that result?

> socket_or_file.read().coding => None
>
> unless some encoding was specified in the opening operation.

So *all* existing socket code would get byte strings, and so would all
existing file I/O. You will break a lot of code.

> Remember, if there is encoding, we are semantically dealing with
> character sequences, so splicing has to be implemented in terms of
> characters, however represented.

You never mentioned that you expect indexing to operate on characters,
not bytes. That would be incompatible with current Python, so I was
assuming that you could not possibly suggest that approach.

If I summarize your approach:
- conversion to an internal representation based on UTF-16
- indexing based on characters, not bytes

I arrive at the current Unicode type. So what you want is already
implemented, except for the meaningless 'coding' attribute (it is
meaningless, as it does not describe a property of the string object).

> >No, in current Python, there is no doubt about the semantics: We
> >assume *nothing* about the encoding. Instead, if s1 and s2 are <type
> ^^^^^^^^^^^^^^^^-- If that is so, why does str have an encode method?

By mistake, IMO. Marc-Andre Lemburg suggested this as a generalization
of Unicode encodings, allowing arbitrary objects to be encoded - he
would have considered (3).encode('decimal') a good idea. With the
current encode method on string objects, you can do things like
s.encode('base64').

> That's supposed to go from character entities to bytes, I thought ;-)

In a specific case of character codecs, yes. However, this has
(unfortunately) been generalized to arbitrary two-way conversion
between arbitrary things.

> Which is why I thought some_string.coding attributes to carry that
> information explicitly would be a good idea.

Yes, it sounds like a good idea. Unfortunately, it is not
implementable in a meaningful way.

Regards,
Martin

Neil Hodgson

Dec 7, 2003, 7:49:02 AM12/7/03
to
Ruby 2.0 will probably implement strings with encoding attributes. The
developers want to avoid changing encoding whenever possible. One posting I
saw had joining strings with different encodings as an error although I
don't know if this has been finalised. I doubt whether it will work well, so
I think the best approach for Python is to watch Ruby and if it is
successful then copy. "m17n" is the magic googling term but most of the
available information is encoded in Japanese (EUC).

Neil


Fredrik Lundh

Dec 7, 2003, 8:02:30 AM12/7/03
to pytho...@python.org
Martin v. Löwis wrote:

> This was by BDFL pronouncement, and I agree with that decision. I
> personally would have favoured UTF-8 as system encoding in Python, as
> it would support all languages, and would allow for as little mistakes
> as ASCII (e.g. you can't mistake a Latin-1 or KOI-8R string as UTF-8).
> I would consider choosing Latin-1 as euro-centric

otoh, it would make sense to use 8-bit strings to store Unicode strings
that happen to contain only Unicode code points in the full 8-bit range
(0..255).

(but that would make it almost-exactly-but-not-quite the same thing
as a Latin-1 string, which we all know is a euro-centric thingy... and
the "almost" part would give people even more reasons to complain
about how "rude" I am when I take them to task for flaming others ;-)

> > or it will use unicode through unicode objects
> > and their interfaces, which I imagine would be the way it started.
>
> Yes, all library functions that expect strings should support Unicode
> objects.

I assume you meant:

Yes, all library functions that expect *text* strings should support
Unicode objects.

but maybe that was obvious from the thread context.

> I'm not too concerned with memory-limited implementations. It would be
> feasible to re-implement the Unicode type to use UTF-8 as its internal
> representation, but that would be tedious to do on the C level, and it
> would lead to really bad performance, given that slicing and indexing
> become inefficient.

and

and it *may* lead to really bad performance, given that slicing and
indexing *might* become inefficient.

having written Python's Unicode string type, I'm now thinking that it might
have been better to use a polymorphic "text" type with either UTF-8 or
encoded char or wchar buffers, and do dynamic translation based on usage
patterns. I've been playing with this idea in Pytte, but as usual, there's so
much code, and so little time...

</F>


Michael Hudson

Dec 7, 2003, 10:54:50 AM12/7/03
to
"Neil Hodgson" <nhod...@bigpond.net.au> writes:

I fail to see what Python does so badly now that we might want to
consider changing (of course, I haven't read Bengt's recent posts
especially closely, but I found Martin's replies made rather more
sense...).

Cheers,
mwh

--
"Sturgeon's Law (90% of everything is crap) applies to Usenet."
"Nothing guarantees that the 10% isn't crap, too."
-- Gene Spafford's Axiom #2 of Usenet, and a corollary

Martin v. Löwis

Dec 7, 2003, 12:31:49 PM12/7/03
to
"Fredrik Lundh" <fre...@pythonware.com> writes:

> otoh, it would make sense to use 8-bit strings to store Unicode strings
> that happen to contain only Unicode code points in the full 8-bit range
> (0..255).

I'm not sure about the advantages. It would give a more efficient
representation, yes, but at the cost of a slower implementation. Codecs
often cannot know in advance whether a string will contain only
latin-1 (unless they are the latin-1 or the ascii codec), so they
would need to scan over the input first.

In addition, operations like PyUnicode_AsUnicode would be very
difficult to implement (unless you have *two* representation pointers
in the Unicode object - at which time the memory savings are
questionable).

> I assume you meant:
>
> Yes, all library functions that expect *text* strings should support
> Unicode objects.

Correct.

> having written Python's Unicode string type, I'm now thinking that
> it might have been better to use a polymorphic "text" type with
> either UTF-8 or encoded char or wchar buffers, and do dynamic
> translation based on usage patterns. I've been playing with this
> idea in Pytte, but as usual, there's so much code, and so little
> time...

"Better" in what sense? Would it even be better if you had to preserve
all the C-level API that we currently have?

Regards,
Martin

Martin v. Löwis

Dec 7, 2003, 12:37:33 PM12/7/03
to
"Neil Hodgson" <nhod...@bigpond.net.au> writes:

> Ruby 2.0 will probably implement strings with encoding attributes. The
> developers want to avoid changing encoding whenever possible.

Yes, but Ruby was very bad at Unicode all the time. Also "Ruby will"
is a weak statement - AFAICT Ruby 2 is about the same mythical
implementation that P3k is.

> One posting I saw had joining strings with different encodings as an
> error although I don't know if this has been finalised. I doubt
> whether it will work well, so I think the best approach for Python
> is to watch Ruby and if it is successful then copy. "m17n" is the
> magic googling term

Well, yes. I also doubt it will work; Ruby has traditionally supported
only a single encoding at a time well, and focused on the Japanese
ones primarily.

Regards,
Martin

Bengt Richter

Dec 7, 2003, 1:34:32 PM12/7/03
to
On 07 Dec 2003 10:08:37 +0100, mar...@v.loewis.de (Martin v. =?iso-8859-15?q?L=F6wis?=) wrote:

>bo...@oz.net (Bengt Richter) writes:
>
>> Ok, I'm happy with that. But let's see where the errors come from.
>> By definition it's from associating the wrong encoding assumption
>> with a pure byte sequence.
>
>Wrong. Errors may also happen when performing unexpected conversions
>from one encoding to a different one.

ISTM that could only happen e.g. if you explicitly called codecs to
convert between incompatible encodings. That is normal, just like

>>> float(10**309)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
OverflowError: long int too large to convert to float

is a normal result. Otherwise "unexpected conversions" must happen
because of some string expression involving strings of different encodings,
in which case it is like

>>> 10**308 * 10.0
1.#INF

(Which BTW could be argued (but I won't ;-) should also be the result of float(10**309)).
So here is a case where information was lost, because there was no information-preserving
representation available. If we used an exact numeric form, e.g., an exact decimal I was
experimenting with, we can do

>>> from ut.exactdec import ED
>>> ED(10**308) * ED(10.0, 'all')
ED('1.0e309')

(The 'all' is a rounding parameter indication to capture all available accuracy bits
from a floating point arg to the constructor and call the result exact. Integers or longs
are naturally exact, so don't require that parameter).

Anyway, that's analogous to an expression involving, e.g.,

s1 = u'abc'.encode('utf-8')
and
s2 = u'def'.encode('latin-1')

In my scenario, you would have

assert s1.coding == 'utf-8'
assert s1.bytes() == a'abc' # bytes() gets the encoded byte sequence as pure str bytes
assert s2.coding == 'latin-1'
assert s2.bytes() == a'def'

so
s3 = s1 + s2

would imply

s3 = (s1.bytes().decode(s1.coding) + s2.bytes().decode(s2.coding)).encode(cenc(s1.coding, s2.coding))

where cenc is a function something like (sketch)

def cenc(enc1, enc2):
    """return common encoding"""
    if enc1==enc2: return enc1 # this makes latin-1 + latin-1 => latin-1, etc.
    if enc1 is None: enc1 = 'ascii' # notorious assumption ;-)
    if enc2 is None: enc2 = 'ascii' # ditto
    if enc1[:3] == 'utf': return enc1 # preserve unicode encoding format of s1 preferentially
    return 'utf' # generic system utf encoding

which in the above example would get you

assert s3.coding == 'utf-8'
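
For contrast, what today's str does with two differently-encoded byte strings
is plain byte concatenation, with no reconciliation at all:

>>> s1 = u'abc'.encode('utf-8')
>>> s2 = u'd\xe9f'.encode('latin-1')
>>> s1 + s2
'abcd\xe9f'
>>> (s1 + s2).decode('utf-8')    # the result is no longer valid utf-8
Traceback (most recent call last):
  ...
UnicodeDecodeError: ...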

>
>> 1a. Available unambiguous encoding information not matching the
>> default assumption was dropped. This is IMO the most likely.
>> 1b. The byte sequence came from an unspecified source and never got explicit encoding info associated.
>> This is probably a bug or application design flaw, not a python problem.
>
>1b. is the most likely case. Any byte stream read operation (file,
>socket, zipfile) will return byte streams of unspecified encoding.

But this is not an error. An error would only arise if one tried to use
the bytes as characters without specifying a decoding.

>
>> IMO a large part of the answer will be not to drop available
>> encoding info.
>
>Right. And this is very difficult, making the entire approach
>unimplementable.

ISTM it doesn't have to be all or nothing.

>
>> I hope an outline of what I am thinking is becoming visible.
>
>Unfortunately, not. You seem to assume that nearly all strings have
>encoding information attached, but you don't explain where you expect
>this information to come from.

Strings appearing as literals in program sources will be assumed to have
the same encoding as is assumed or explicitly specified for the source text.
IMO that will cover a lot of strings not now covered, and will be an improvement
even if it doesn't cover everything.

>
>> >As I said: What would be the meaning of concatenating strings, if both
>> >strings have different encodings?
>> If the strings have encodings, the semantics are the semantics of character
>> sequences with possibly heterogenous representations.
>
>??? What is a "possibly heterogenous representation", how do I
>implement it, and how do I use it?

See example s3 = s1 + s2 above.

>
>Are you suggesting that different bytes in a single string should use
>different encodings? If not, how does suggesting a heterogenous
>implementation answer the question of how concatenation of strings is
>implemented?

See as before.

>
>> The simplest thing would probably be to choose utf-16le like windows
>> wchar UIAM and normalize all strings that have encodings to that
>
>Again: How does that answer the question what concatenation of strings
>means?

See as before.

>
>Also, if you use utf-16le as the internal encoding of byte strings,
>what is the meaning of indexing? I.e. given a string s='Hallo',
>what is len(s), s[0], s[1]?

If s.coding is None, it's the same as now. Otherwise

len(s) <-> len(s.decode(s.coding))

e.g.

>>> s = 'L\xf6wis'
>>> s8 = s.decode('latin-1').encode('utf-8')
>>> s8
'L\xc3\xb6wis'
>>> len(s8)
6
>>> len(s)
5
>>> len(s8.decode('utf-8'))
5

s[0] and s[1] create new encoded strings if they are indexing encoded strings,
and preserve the .coding info. So e.g., in general, when .coding is not None,

s[i] <-> s.decode(s.coding)[i].encode(s.coding)

(This is semantics, let's not prematurely talk about optimization ;-)

>>> s8 # current display of the bytes of utf-8 encoding in s8
'L\xc3\xb6wis'
>>> s8[0] # wrong when .coding is not None
'L'
>>> s8_0 = s8.decode('utf-8')[0].encode('utf-8')
>>> s8_1 = s8.decode('utf-8')[1].encode('utf-8')
>>> s8
'L\xc3\xb6wis'
>>> s8_0
'L'
>>> s8_1
'\xc3\xb6'

and assert s8_0.coding == s8_1.coding == 'utf-8' would hold for results.
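
The same rule wrapped up as a throwaway helper, purely for illustration:

def coded_getitem(s, i, coding):
    """index by character, re-encode the result, keep the same coding"""
    return s.decode(coding)[i].encode(coding)

>>> coded_getitem('L\xc3\xb6wis', 1, 'utf-8')
'\xc3\xb6'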

>
>> Instead, the latter could become explicit, e.g., by a string prefix. E.g.,
>>
>> a'...'
>>
>> meaning a byte string represented by ascii+escapes syntax like
>> current practice (whatever the program source encoding. I.e.,
>> latin-1 non-ascii characters would not be allowed in the literal
>> _source_ representation even if the program source were encoded in
>> latin-1. (of course escapes would be allowed)).
>
>Hmm. This still doesn't answer my question, but now you are extending
>the syntax already.
>
>> IWT .coding attributes/properties would permit combining character
>> strings with different encodings by promoting to an encoding that
>> includes all without information loss.
>
>No, it would not - atleast not unless you specify further details. If
>I have a latin-1 string ('\xf6'), and a koi-8r string ('\xf6'), and
>concatenate them, what do get?

A sequence of bytes that is an encoding of the _character_ sequence, such that
the encoding is adequate to represent all the characters, e.g.,

>>> latkoi = ('\xf6'.decode('latin-1') + '\xf6'.decode('koi8_r')).encode('utf')
>>> latkoi
'\xc3\xb6\xd0\x96'

with assert latkoi.coding == 'utf-8' since 'utf' seems to be an alias for 'utf-8'.
And since .coding is not None, the resulting string length is

>>> len(latkoi.decode('utf-8'))
2

>
>> Of course you cannot arbitrarily combine byte strings b (b.coding==None)
>> with character strings s (s.coding!=None).
>
>So what happens if you try to combine them?
>
>> >2. Convert the result string to UTF-8. This is incompatible with
>> > earlier Python versions.
>> Or utf-16xx. I wonder how many mixed-encoding situations there
>> are in earlier code. Single-encoding should not require change
>> of encoding, so it should look like plain concatenation as far
>> as the byte sequence part is concerned. It might be mostly
>> transparent.
>
>This approach is incompatible with earlier Python versions even for a
>single encoding. If I have a KOI-8R s='\xf6' (which is the same as
>U+0416), and UTF-16 is the internal representation, and I do s[0], what
>do I get, and what algorithm is used to compute that result?

if you have s='\xf6' representing a KOI-8R character, you have two pieces of info.
The bare bytes would have s.coding == None. (I suggested a literal format a'\xf6' for pure bytes,
so let's say
s = a'\xf6'
but you want that interpreted as KOI-8R, so we have to decode it according to that,
and then re-encode to get a byte string with .coding set:

s = s.bytes().decode('koi8_r').encode('koi8_r')

e.g.
>>> '\xf6'.decode('koi8_r').encode('koi8_r')
'\xf6'

(which seems like it could be optimized ;-)

But you mention an "internal representation" of UTF-16. I'm not sure what you mean,
(though I assume unicode is internally handled for u'...' strings in utf-16le/wchar_t
format in most PCs) except you could certainly have a string with that explicit .coding
format, e.g.,

>>> s16 = '\xf6'.decode('koi8_r').encode('utf-16')
>>> s16
'\xff\xfe\x16\x04'

and then assert s16.coding == 'utf-16' would pass ok.
Note the BOM. Still the length (since s16.coding is not None) is

>>> len(s16.decode('utf-16'))
1

If s.coding is None, you could say length was len(s.bytes().decode('bytes')) and say
it was optimized away, I suppose.

In other words, when s.coding is not None, you can think of all the possibilities
as alternative representations of s.bytes().decode(s.coding) where .bytes() is a method
to get the raw str bytes of the particular encoding, and even if s.coding is None, you
could use the virtual 'bytes' character set default assumption, so that all strings
have a character interpretation if needed.

>
>> socket_or_file.read().coding => None
>>
>> unless some encoding was specified in the opening operation.
>
>So *all* existing socket code would get byte strings, and so would all
>existing file I/O. You will break a lot of code.

Why?


>
>> Remember, if there is encoding, we are semantically dealing with
>> character sequences, so splicing has to be implemented in terms of
>> characters, however represented.
>
>You never mentioned that you expect indexing to operate on characters,
>not bytes. That would be incompatible with current Python, so I was
>assuming that you could not possibly suggest that approach.

It would normally be transparent, I think, and mostly optimize away,
except where you have programs with mixed encodings.

>
>If I summarize your approach:
>- conversion to an internal representation based on UTF-16

Only as needed. Encodings would not arbitrarily be changed.

>- indexing based on characters, not bytes

if the bytes represent an encoded character sequence as indicated by s.coding, yes,
but the result is another (sub)string with the same s.coding, and might be encoded
as a single byte or several (as sometimes in utf-8). I.e., from above, semantically:

s[i] <-> s.decode(s.coding)[i].encode(s.coding) # with result having same .coding value

>
>I arrive at the current Unicode type. So what you want is already
>implemented, except for the meaningless 'coding' attribute (it is
>meaningless, as it does not describe a property of the string object).
>
>> >No, in current Python, there is no doubt about the semantics: We
>> >assume *nothing* about the encoding. Instead, if s1 and s2 are <type
>> ^^^^^^^^^^^^^^^^-- If that is so, why does str have an encode method?
>
>By mistake, IMO. Marc-Andre Lemburg suggested this as a generalization
>of Unicode encodings, allowing arbitrary objects to be encoded - he
>would have considered (3).encode('decimal') a good idea. With the
>current encode method on string objects, you can do things like
>s.encode('base64').

I can see the usefulness of that. I guess you have to envisage a
hidden transparent s.decode('bytes') before the .encode('base64')
so we have a logical round trip. s.decode('bytes') could be conceptualized
as producing unicode in a private range U+E000 .. U+E0ff and encoding that
range of unicode "characters" could do the right thing. Re-encoding as 'bytes'
would restore the original byte sequence, just like any other 1:1 encoding
transformation. You could even design a font, like little boxes with the
hex values in them ;-)
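
A toy round-trip along those lines (no real codec registered -- just to show
the byte <-> private-use mapping is lossless):

def bytes_to_private(s):
    # each byte becomes one private-use code point in U+E000..U+E0FF
    return u''.join([unichr(0xE000 + ord(c)) for c in s])

def private_to_bytes(u):
    return ''.join([chr(ord(c) - 0xE000) for c in u])

>>> raw = '\x00\x7f\xff'
>>> private_to_bytes(bytes_to_private(raw)) == raw
True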


>
>> That's supposed to go from character entities to bytes, I thought ;-)
>
>In a specific case of character codecs, yes. However, this has
>(unfortunately) been generalized to arbitrary two-way conversion
>between arbitrary things.

Well, maybe it can be rationalized via the virtual 'bytes' character encoding
(which in some contexts might make a better default assumption than 'ascii').

>
>> Which is why I thought some_string.coding attributes to carry that
>> information explicitly would be a good idea.
>
>Yes, it sounds like a good idea. Unfortunately, it is not
>implementable in a meaningful way.

I'm still hoping for something meaningful. See above ;-)

I'm tempted to subclass str, to play with it, but not right now ;-)

Regards,
Bengt Richter

Bengt Richter

Dec 7, 2003, 2:39:44 PM12/7/03
to
On 07 Dec 2003 18:31:49 +0100, mar...@v.loewis.de (Martin v. =?iso-8859-15?q?L=F6wis?=) wrote:

>"Fredrik Lundh" <fre...@pythonware.com> writes:
>
>> otoh, it would make sense to use 8-bit strings to store Unicode strings
>> that happen to contain only Unicode code points in the full 8-bit range
>> (0..255).
>
>I'm not sure about the advantages. It would give a more efficient
>representation, yes, but at the cost a slower implementation. Codecs
>often cannot know in advance whether a string will contain only
>latin-1 (unless they are the latin-1 or the ascii codec), so they
>would need to scan over the input first.

But if all strings that represented characters had .coding attributes, you
would just use that and know without looking.

>
>In addition, operations like PyUnicode_AsUnicode would be very
>difficult to implement (unless you have *two* representation pointers
>in the Unicode object - at which time the memory savings are
>questionable).

Maybe if unicode objects used strings with a standard .coding attribute
of 'unicode' for normalized representation, meaning the system's standard unicode
encoding (probably utf-16le byte strings), then any string with a .coding attribute
could be instantly captured and plugged into the unicode object by reference as an
alternative unicode data representation, and conversions of actual
representation format could be lazy, normalizing when sensible, but also possibly
doing multi-string operations in their native encodings when compatible.

IOW, u'abc' + u'def' might capture 'abc' from the source code text with # -*- coding: latin-1 -*-
and lazily use an internal pointer to a 'abc' string with .coding='latin-1'. Ditto with
the u'def', so when they are added, you get u'abcdef' but internally it was adding two
latin-1 representations of the unicode characters, so it could produce the concatenation
without changing encoding. As far as the unicode object interface was concerned, nothing
would change (except maybe some debugging/inspecting things to get at internal details),
but representation could vary privately.

(BTW all the standard encoding names could be interned so is-comparisons could be used
to check them).
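
intern() is already there for that; once interned, the names compare by identity:

>>> enc1 = intern('latin-1')
>>> enc2 = intern('latin-' + '1')
>>> enc1 is enc2
True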

>
>> I assume you meant:
>>
>> Yes, all library functions that expect *text* strings should support
>> Unicode objects.
>
>Correct.
>
>> having written Python's Unicode string type, I'm now thinking that
>> it might have been better to use a polymorphic "text" type with
>> either UTF-8 or encoded char or wchar buffers, and do dynamic
>> translation based on usage patterns. I've been playing with this
>> idea in Pytte, but as usual, there's so much code, and so little
>> time...

This sounds very similar to what I have been trying to say.

>
>"Better" in what sense? Would it even be better if you had to preserve
>all the C-level API that we currently have?
>

Not sure what that entails.
Regards,
Bengt Richter

"Martin v. Löwis"

Dec 7, 2003, 3:39:03 PM12/7/03
to
Bengt Richter wrote:

> This sounds very similar to what I have been trying to say.

I would really suggest that you implement your ideas. You
*will* find that they are unimplementable. After adjusting
the ideas to constraints of reality, you *will* find that
you break backwards compatibility. After fixing the backwards
compatibility problems, you *will* find that your implementation
has very bad performance characteristics, compared to the
existing string types.

Unfortunately, it is very difficult to nail down the problems
you will encounter, as you refuse to provide a complete
specification of the interface and implementation that you
propose. Originally, I thought you are proposing modifications
to <type 'str'>, but now it appears that you are proposing
a new data type, which has large similarities with <type
'unicode'>. If so, I fail to understand why you don't want
to use the existing Unicode type.

Notice that /F has something completely different in mind:
He is still talking about the Python Unicode type, and just
suggesting that a different internal representation should
be used. Speculating about the motivation, I would think he
has efficiency in the face of round-trip conversions in mind,
but not a change in visible behaviour.

Regards,
Martin

Serge Orlov

Dec 7, 2003, 7:01:09 PM12/7/03
to
Bengt,

don't take it personally but this is what happens <wink> when you use unicode
unaware software:
Quote from your message:


> mar...@v.loewis.de (Martin v. =?iso-8859-15?q?L=F6wis?=) wrote:

Your software also doesn't specify message encoding.

The real issue is to convince developers that there are many encodings in
this world. Python should offer only one way to deal with multiple encodings.
Your 8-bit strings with attached coding attribute is duplicating what unicode
strings offer.

> >> If e.g. name had latin-1 encoding associated with it by virtue of source like
> >> ...
> >> # -*- coding: latin-1 -*-

> >> name = 'Martin Lowis'


> >>
> >> then on my cp437 console window, I might be able to expect to see the umlaut
> >> just by writing
> >>
> >> print name
> >
> >I see. To achieve this effect, do
> >
> ># -*- coding: latin-1 -*-

> >name = u'Martin Lowis'


> >print name
> Right, but that is a workaround w.r.t the possibility I am trying to discuss.

It's not a workaround, it's a solution. What you propose is a lot of effort for
a little gain wrt handling multiple encodings. It's already possible to handle
multiple encodings. The time is better spent converting everything that still
deals with 8-bit text to handle unicode.

-- Serge.


Bengt Richter

Dec 7, 2003, 7:49:59 PM12/7/03
to
On Sun, 07 Dec 2003 21:39:03 +0100, =?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?= <mar...@v.loewis.de> wrote:

>Bengt Richter wrote:
>
>> This sounds very similar to what I have been trying to say.
>
>I would really suggest that you implement your ideas. You
>*will* find that they are unimplementable. After adjusting
>the ideas to constraints of reality, you *will* find that
>you break backwards compatibility. After fixing the backwards
>compatibility problems, you *will* find that your implementation
>has very bad performance characteristics, compared to the
>existing string types.
>
>Unfortunately, it is very difficult to nail down the problems
>you will encounter, as you refuse to provide a complete

I may be failing to communicate, but I am not refusing ;-)

>specification of the interface and implementation that you
>propose. Originally, I thought you are proposing modifications
>to <type 'str'>, but now it appears that you are proposing

I was, and am (though I confess that discussion is making it
a somewhat moving target, as more ideas pop up ;-)


>a new data type, which has large similarities with <type
>'unicode'>. If so, I fail to understand why you don't want
>to use the existing Unicode type.

It's not that I don't want to use it. I think it's great. I am just
trying to sort out where the existing str is really being used as a character
sequence type and where it is really a byte buffer type that has no character
significance until a decoding interpretation is imposed.

As I've said, ISTM most string literals in program sources are really character
strings, and could well become unicode right in the tokenizer (that part is a new
statement ;-) But there are some string literals that really do represent bytes,
not characters, and there is a legacy of byte-producing object interfaces that
claim to be producing the same type thing (str) as string literals. This is problematic IMO.

ISTM there really are two different types that need to be disentangled (charstring vs bytestring,
or chars vs bytes for short).

I was trying to imagine tweaks to str that could make it play both roles more explicitly,
but thinking more, it's probably a wrong approach. Bite the bullet and separate them ;-)

OTTOMH I don't think charstring should be unicode, because that excludes custom character sets that
may not exist in unicode, and/or will be a pain to create a private unicode map for. That's
why I think charstring must effectively be something like a (bytestring, codec) pair.
Undoubtedly it will have an intimate optimized relationship to unicode. But a charstring type
would have the option of delegating efficiently in single-encoding environments, IWT, where
things would work much as they do now.

How to bring in the bytestring type is "interesting". I guess one approach would be to let it
be str minus its current charstring uses. But there are so many charstring uses that maybe it makes
more sense to let str be the charstring. But there are so many legacy interfaces that produce
bytestrings as str ;-/ Round and round. So I am pulled back to the idea of .coding-attributed str's,
even though it's not too clean. Maybe charstrings as a str subclass? Ugh. I think
conceptual correctness (or lack thereof) has bitten ;-/

>
>Notice that /F has something completely different in mind:
>He is still talking about the Python Unicode type, and just
>suggesting that a different internal representation should
>be used. Speculating about the motivation, I would think he
>has efficiency in the face of round-trip conversions in mind,
>but not a change in visible behaviour.

That's not completely different ;-) I was trying to tweak str to play
its dual roles of charstring and bytestring explicitly, and I think the
charstring side is almost exactly what /F was talking about. A reasonable
implementation would be to have charstring be /F's unicode exactly, though
I'm always wanting to be "conceptually correct" so the concept of charstring
would have to include totally arbitrary character sets, not just those found
in unicode ;-)

Regards,
Bengt Richter

Bengt Richter

Dec 7, 2003, 9:38:58 PM12/7/03
to
On Mon, 8 Dec 2003 03:01:09 +0300, "Serge Orlov" <sombD...@pobox.ru> wrote:

>Bengt,
>
>don't take it personally but this is what happens <wink> when you use unicode
>unaware software:
>Quote from your message:
>> mar...@v.loewis.de (Martin v. =?iso-8859-15?q?L=F6wis?=) wrote:
>
>Your software also doesn't specify message encoding.
>
>The real issue is to convince developers that there are many encodings in
>this world. Python should offer only one way to deal with multiple encodings.
>Your 8-bit strings with attached coding attribute is duplicating what unicode
>strings offer.

In one sense yes (being able to represent all the same character sequences one way
or another) but in other senses no. Part of the concern was matters that could well be
totally hidden behind a unicode interface (internal representation/optimization issues).
Normalizing to one methodology usually saves time and space, but if there are use cases
where the normalization is a useless 1:1 transformation, then not. But then it becomes
a matter of how much overhead there is in deciding which situation applies. It may or
may not be worth it. Anyway, that's one path of discussion. Another is how to deal with
the fact that you wouldn't want to convert all current str data into unicode (e.g. data
that is really pure bytestrings and has no character interpretation).

If we were starting from scratch, it would be a lot easier to disentangle bytestrings
and charstrings. That's really the problem that led to most of my ramblings. My tacking
on .coding attributes is really a kludge to create a hybrid charstring/bytestring.
I don't like kludges, but otherwise I don't currently see a way short of major breakage,
and Martin predicts I'll run into major problems any way I might want to try going.
His opinion is nothing to sneeze at ;-)


>
>> >> If e.g. name had latin-1 encoding associated with it by virtue of source like
>> >> ...
>> >> # -*- coding: latin-1 -*-
>> >> name = 'Martin Lowis'
>> >>
>> >> then on my cp437 console window, I might be able to expect to see the umlaut
>> >> just by writing
>> >>
>> >> print name
>> >
>> >I see. To achieve this effect, do
>> >
>> ># -*- coding: latin-1 -*-
>> >name = u'Martin Lowis'
>> >print name
>> Right, but that is a workaround w.r.t the possibility I am trying to discuss.
>
>It's not a workaround it's a solution. What you propose is a lot of effort for

It's one solution to a particular problem. That solution was proposed as a way
to get an end effect. I wasn't asking for help on how to get the end effect.
I think I can do that ;-) I was discussing another problem, namely that a bare quoted
string is not sufficient to cause the proper encoding conversions for output even though
the encoding could be determined unambiguously from the source file encoding.

>a little gain wrt handling multiple encoding. It's already possible to handle

I don't consider being able to eliminate unnecessary source text cruft little gain.

>multiple encoding. The time is better spent converting everything that still
>deals with 8-bit text to handle unicode.

It was always possible to handle multiple encodings, if you wanted to go to the trouble
of writing your own solution. If you are happy with the u'...' prefix as the final answer
to handling multiple character sets, and consider python's evolution on that issue finished,
that's fine. But the problem is also disentangling text uses from byte uses of str, and migrating
towards transparent use of unicode or the equivalent. This gets into language design issues that
are more interesting than just how to make a multi-char-set app work using python as it is currently.

I doubt if wholesale conversion to unicode is a good idea. E.g., you wouldn't want to read a latin-1
log file of hundreds of megabytes and get automatic conversion to 16-bit unicode for no reason,
I wouldn't think. You could have a unicode interface that hid internal
bytestrings-with-various-encodings-attached. And then you are getting into that part of
what I was discussing. The other part is how to disentangle charstrings from bytestrings
without rewriting the world. Hence proposing, for discussion, a kludge that might let str act as both.
But you didn't see me mention the PEP word anywhere yet, did you ;-)

Martin says the problem is hard. I believe him. I still like to bat ideas around with intelligent
people (of which there are a fair number posting to this group), even at the risk of striking out
some of the time. Sometimes something worthwhile emerges that perhaps no one participating would
have thought of without the interchange to trigger a key thought.

I'll be off line for a few days (pre-apologizing in case I appear to be ignoring anyone ;-)

Regards,
Bengt Richter

Martin v. Löwis

Dec 8, 2003, 12:18:49 PM12/8/03
to
bo...@oz.net (Bengt Richter) writes:

> >bo...@oz.net (Bengt Richter) writes:
> >
> >> Ok, I'm happy with that. But let's see where the errors come from.
> >> By definition it's from associating the wrong encoding assumption
> >> with a pure byte sequence.
> >
> >Wrong. Errors may also happen when performing unexpected conversions
> >from one encoding to a different one.
> ISTM that could only happen e.g. if you explicitly called codecs to
> convert between incompatible encodings.

No. It could also happen when you concatenate strings with
incompatible encodings.

> s3 = (s1.bytes().decode(s1.coding) + s2.bytes().decode(s2.coding)).encode(cenc(s1.coding, s2.coding))

So what happens if either s1.coding or s2.coding is None?


> def cenc(enc1, enc2):
>     """return common encoding"""
>     if enc1==enc2: return enc1 # this makes latin-1 + latin-1 => latin-1, etc.
>     if enc1 is None: enc1 = 'ascii' # notorious assumption ;-)
>     if enc2 is None: enc2 = 'ascii' # ditto
>     if enc1[:3] == 'utf': return enc1 # preserve unicode encoding format of s1 preferentially
>     return 'utf' # generic system utf encoding

It would be better to call that utf-8, as utf is an unfortunate
alias...

So concatenating latin-1 and koi-8r strings would give an utf-8
string, as would concatenating an ascii string and a latin-1 string.

> But this is not an error. An error would only arise if one tried to use
> the bytes as characters without specifying a decoding.

So

print open("/etc/passwd").read()

would raise an exception???

> >Unfortunately, not. You seem to assume that nearly all strings have
> >encoding information attached, but you don't explain where you expect
> >this information to come from.

> Strings appearing as literals in program sources will be assumed to
> have the same encoding as is assumed or explicitly specified for the
> source text. IMO that will cover a lot of strings not now covered,
> and will be an improvement even if it doesn't cover everything.

I doubt that. Operations will decay to "no encoding" very quickly, or
give exceptions - depending on your (yet unclear) specification.

> >??? What is a "possibly heterogenous representation", how do I
> >implement it, and how do I use it?
> See example s3 = s1 + s2 above.

In what sense is the resulting representation heterogenous? ISTM that
the result uses cenc(s1.encoding, s2.encoding) as its representation.

> s[0] and s[1] create new encoded strings if they are indexing
> encoded strings, and preserve the .coding info. So e.g., in general,
> when .coding is not None,
>
> s[i] <-> s.decode(s.coding)[i].encode(s.coding)

So if s.coding doesn't round-trip, s[i].bytes() would not be a
substring of s.bytes(), right?

> In other words, when s.coding is not None, you can think of all the
> possibilities as alternative representations of
> s.bytes().decode(s.coding) where .bytes() is a method to get the raw
> str bytes of the particular encoding, and even if s.coding is None,
> you could use the virtual 'bytes' character set default assumption,
> so that all strings have a character interpretation if needed.

So what is the difference between this type, and the unicode type? It
appears that indexing works all the same in your string, type, and in
the Unicode type, and instead of saying .bytes, you say .encode(encname).

> >So *all* existing socket code would get byte strings, and so would all
> >existing file I/O. You will break a lot of code.
> Why?

Because people try to combine such strings with strings with encoding
information.

You haven't specified yet what happens when you try to do this, but it
appears that you are proposing that one gets an exception.

Regards,
Martin

Martin v. Löwis

Dec 8, 2003, 12:27:50 PM12/8/03
to
bo...@oz.net (Bengt Richter) writes:

> As I've said, ISTM most string literals in program sources are
> really character strings, and could well become unicode right in the
> tokenizer (that part is a new statement ;-)

Yes, this is what the -U switch does, today, and it may become the
default in the future.

> But there are some string literals that really do represent bytes,
> not characters

Yes, and we may get byte string literals for this case, some day.

> and there is a legacy of byte-producing object interfaces that claim
> to be producing the same type thing (str) as string literals. This
> is problematic IMO.

Yes, it is.

> ISTM there really are two different types that need to be
> disentangled (charstring vs bytestring, or chars vs bytes for
> short).

Yes, and that's why the Unicode type was introduced. Think "Unicode -
characters", "str - bytes".


> OTTOMH I don't thing charstring should be unicode, because that
> excludes custom character sets that may not exist in unicode, and/or
> will be a pain to create a private unicode map for.

This is both academic, and uninteresting. It is academic because all
existing character sets have mappings to Unicode, and it is
uninteresting because you can easily map unsupported characters to
private-use characters. There are plenty of these.

> That's why I think charstring must be effectively be something like
> a (bytestring, codec) pair.

And this is the conclusion that is wrong, IMO. For all relevant
charsets, codecs are already available, and your specification would
fall apart if an encoding was used for which no codec is available,
so using this representation does not improve anything.

> bytestrings as str ;-/ Round and round. So I am pulled back the idea
> of .coding-attributed str's, even though it's not too clean.

Any approach where concatenating two instances of the byte string type
will change bytes (through recoding) breaks backwards compatibility.

> That's not completely different ;-) I was trying to tweak str to play
> its dual roles of charstring and bytestring explicitly, and I think the
> charstring side is almost exactly what /F was talking about.

Yes, but /F was thinking of Unicode objects as *the* character string
type, all the time. And not surprisingly so: he (co-)invented the thing.

> A reasonable implementation would be to have charstring be /F's
> unicode exactly, though I'm always wanting to be "conceptually
> correct" so the concept of charstring would have to include totally
> arbitrary character sets, not just those found in unicode ;-)

You are talking about the empty set, here.

Regards,
Martin

Fredrik Lundh

Dec 8, 2003, 12:55:51 PM12/8/03
to pytho...@python.org
Martin v. Löwis wrote:

> > having written Python's Unicode string type, I'm now thinking that
> > it might have been better to use a polymorphic "text" type with
> > either UTF-8 or encoded char or wchar buffers, and do dynamic
> > translation based on usage patterns. I've been playing with this
> > idea in Pytte, but as usual, there's so much code, and so little
> > time...
>
> "Better" in what sense?

less code, better performance. the usual stuff.

> Would it even be better if you had to preserve all the C-level API
> that we currently have?

I don't think it can be done without changing the API. maybe in
Python 3000?

</F>

