
break unichr instead of fix ord?


ru...@yahoo.com

Aug 25, 2009, 3:45:49 PM
In Python 2.5 on Windows I could do [*1]:

# Create a unicode character outside of the BMP.
>>> a = u'\U00010040'

# On Windows it is represented as a surrogate pair.
>>> len(a)
2
>>> a[0],a[1]
(u'\ud800', u'\udc40')

# Create the same character with the unichr() function.
>>> a = unichr (65600)
>>> a[0],a[1]
(u'\ud800', u'\udc40')

# Although the unichr() function works fine, its
# inverse, ord(), doesn't.
>>> ord (a)
TypeError: ord() expected a character, but string of length 2 found
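
# (For reference, the surrogate math behind the pair above is
# plain UTF-16 arithmetic, nothing Python-specific:)
>>> cp = 0x10040
>>> v = cp - 0x10000
>>> hex(0xD800 + (v >> 10)), hex(0xDC00 + (v & 0x3FF))
('0xd800', '0xdc40')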

On Python 2.6, unichr() was "fixed" (using the word
loosely) so that it too now fails with characters outside
the BMP.

>>> a = unichr (65600)
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Why was this done rather than changing ord() to accept a
surrogate pair?

Doesn't this effectively make unichr() and ord() useless
on Windows for all but a subset of unicode characters?

Jan Kaliszewski

Aug 25, 2009, 9:02:11 PM
to ru...@yahoo.com, pytho...@python.org
On 25-08-2009 at 21:45:49, <ru...@yahoo.com> wrote:

> In Python 2.5 on Windows I could do [*1]:
>
> # Create a unicode character outside of the BMP.
> >>> a = u'\U00010040'
>
> # On Windows it is represented as a surrogate pair.

[snip]


> On Python 2.6, unichr() was "fixed" (using the word
> loosely) so that it too now fails with characters outside
> the BMP.

[snip]


> Doesn't this effectively make unichr() and ord() useless
> on Windows for all but a subset of unicode characters?

Are you sure you couldn't have a UCS-4-compiled Python distro
for Windows?? :-O

*j

--
Jan Kaliszewski (zuo) <z...@chopin.edu.pl>

Christian Heimes

Aug 25, 2009, 9:49:42 PM
to pytho...@python.org
Jan Kaliszewski wrote:
> Are you sure, you couldn't have UCS-4-compiled Python distro
> for Windows?? :-O

Nope, Windows requires UCS-2 builds.

Christian

Mark Tolonen

Aug 25, 2009, 11:53:52 PM
to pytho...@python.org

<ru...@yahoo.com> wrote in message
news:2ad21a79-4a6c-42a7...@v20g2000yqm.googlegroups.com...

Switch to Python 3?

>>> x='\U00010040'
>>> import unicodedata
>>> unicodedata.name(x)
'LINEAR B SYLLABLE B025 A2'
>>> ord(x)
65600
>>> hex(ord(x))
'0x10040'
>>> unicodedata.name(chr(0x10040))
'LINEAR B SYLLABLE B025 A2'
>>> ord(chr(0x10040))
65600
>>> print(ascii(chr(0x10040)))
'\ud800\udc40'

-Mark


Vlastimil Brom

Aug 26, 2009, 4:05:17 AM
to ru...@yahoo.com, pytho...@python.org
2009/8/25 <ru...@yahoo.com>:

Hi,
I'm not sure about the exact reasons for this behaviour on narrow
builds either (maybe the consistency of input/output being exactly
one character?).

However, if I need these functions for higher unicode planes, the
following rather hackish replacements seem to work. I presume, there
might be smarter ways of dealing with this, but anyway...

hth,
vbr

#### not (systematically) tested #####################################

import sys

def wide_ord(char):
    try:
        return ord(char)
    except TypeError:
        if (len(char) == 2 and 0xD800 <= ord(char[0]) <= 0xDBFF
                and 0xDC00 <= ord(char[1]) <= 0xDFFF):
            return ((ord(char[0]) - 0xD800) * 0x400
                    + (ord(char[1]) - 0xDC00) + 0x10000)
        else:
            raise TypeError("invalid character input")


def wide_unichr(i):
    if i <= sys.maxunicode:
        return unichr(i)
    else:
        return ("\U" + str(hex(i))[2:].zfill(8)).decode("unicode-escape")

"Martin v. Löwis"

Aug 26, 2009, 5:10:41 PM
to ru...@yahoo.com
> In Python 2.5 on Windows I could do [*1]:
>
> >>> a = unichr (65600)
> >>> a[0],a[1]
> (u'\ud800', u'\udc40')

I can't reproduce that. My copy of Python on Windows gives

Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    unichr(65600)
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

This is

Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit
(Intel)] on win32

Regards,
Martin

ru...@yahoo.com

Aug 26, 2009, 7:27:33 PM

My apologies for the red herring. I was working from
a comment in my replacement ord() function. I dug up
an old copy of Python 2.4.3 and could not reproduce it
there either so I have no explanation for the comment
(which I wrote). Python 2.3 maybe?

But regardless, the significant question is, what is
the reason for having ord() (and unichr) not work for
surrogate pairs and thus not usable with a large number
of unicode characters that Python otherwise supports?

ru...@yahoo.com

Aug 26, 2009, 7:29:34 PM

I am still a long way away from moving to Python 3
but I am looking forward to hopefully more rational
unicode handling there. Thanks for the info.

ru...@yahoo.com

Aug 26, 2009, 7:35:51 PM
On Aug 26, 2:05 am, Vlastimil Brom <vlastimil.b...@gmail.com> wrote:
>[...]

> Hi,
> I'm not sure about the exact reasons for this behaviour on narrow
> builds either (maybe the consistency of the input/ output data to
> exactly one character?).
>
> However, if I need these functions for higher unicode planes, the
> following rather hackish replacements seem to work. I presume, there
> might be smarter ways of dealing with this, but anyway...
>
> hth,
>    vbr
>
>[...code snipped...]

Thanks, I wrote a replacement ord function nearly identical
to yours but will steal your unichr function if that's ok. :-)

But I still wonder why all this is necessary.

Vlastimil Brom

Aug 26, 2009, 8:26:28 PM
to ru...@yahoo.com, pytho...@python.org
2009/8/27 <ru...@yahoo.com>:

> On Aug 26, 2:05 am, Vlastimil Brom <vlastimil.b...@gmail.com> wrote:
>>[...]
>>...

>> However, if I need these functions for higher unicode planes, the
>> following rather hackish replacements seem to work. I presume, there
>> might be smarter ways of dealing with this, but anyway...
>>
>> hth,
>>    vbr
>>
>>[...code snipped...]
>
> Thanks, I wrote a replacement ord function nearly identical
> to yours but will steal your unichr function if that's ok. :-)
>
> But I still wonder why all this is necessary.
You are welcome :-),
but make sure to test it, if you are going to use it for something
more complex; as I said, it's only vaguely tested ...

vbr

Steven D'Aprano

Aug 26, 2009, 10:52:41 PM
On Wed, 26 Aug 2009 16:27:33 -0700, rurpy wrote:

> But regardless, the significant question is, what is the reason for
> having ord() (and unichr) not work for surrogate pairs and thus not
> usable with a large number of unicode characters that Python otherwise
> supports?


I'm no expert on Unicode, but my guess is that the reason is out of a
desire for simplicity: unichr() should always return a single char, not a
pair of chars, and similarly ord() should take as input a single char,
not two, and return a single number.

Otherwise it would be ambiguous whether ord(surrogate_pair) should return
a pair of ints representing the codes for each item in the pair, or a
single int representing the code point for the whole pair.

E.g. given your earlier example:

>>> a = u'\U00010040'
>>> len(a)
2
>>> a[0]
u'\ud800'
>>> a[1]
u'\udc40'

would you expect ord(a) to return (0xd800, 0xdc40) or 0x10040? If the
latter, what about ord(u'ab')?

Remember that a unicode string can contain code points that aren't valid
characters:

>>> ord(u'\ud800') # reserved for surrogates, not a character
55296

so if ord() sees a surrogate pair, it can't assume it's meant to be
treated as a surrogate pair rather than a pair of code points that just
happens to match a surrogate pair.

None of this means you can't deal with surrogate pairs, it just means you
can't deal with them using ord() and unichr().

The above is just my guess, I'd be interested to hear what others say.


--
Steven

ru...@yahoo.com

Aug 27, 2009, 1:36:12 AM
On 08/26/2009 08:52 PM, Steven D'Aprano wrote:
> On Wed, 26 Aug 2009 16:27:33 -0700, rurpy wrote:
>
>> But regardless, the significant question is, what is the reason for
>> having ord() (and unichr) not work for surrogate pairs and thus not
>> usable with a large number of unicode characters that Python otherwise
>> supports?
>
>
> I'm no expert on Unicode, but my guess is that the reason is out of a
> desire for simplicity: unichr() should always return a single char, not a
> pair of chars, and similarly ord() should take as input a single char,
> not two, and return a single number.
>
> Otherwise it would be ambiguous whether ord(surrogate_pair) should return
> a pair of ints representing the codes for each item in the pair, or a
> single int representing the code point for the whole pair.
>
> E.g. given your earlier example:
>
>>>> a = u'\U00010040'
>>>> len(a)
> 2
>>>> a[0]
> u'\ud800'
>>>> a[1]
> u'\udc40'
>
> would you expect ord(a) to return (0xd800, 0xdc40) or 0x10040?

The latter.

> If the
> latter, what about ord(u'ab')?

I would expect a TypeError* (as ord() currently raises) because
the string length is not 1 and 'ab' is not a surrogate pair.

*Actually I would have expected ValueError but I'm not going
to lose sleep over it.

> Remember that a unicode string can contain code points that aren't valid
> characters:
>
>>>> ord(u'\ud800') # reserved for surrogates, not a character
> 55296
>
> so if ord() sees a surrogate pair, it can't assume it's meant to be
> treated as a surrogate pair rather than a pair of code points that just
> happens to match a surrogate pair.

Well, actually, yes it can. :-)

Python has already made a strong statement that such a pair
is the representation of a character:

>>> a = ''.join([u'\ud800',u'\udc40'])
>>> a
u'\U00010040'

That is, Python prints, and treats in nearly all other contexts,
that combination as a character.

This is related to the practicality argument: what is the ratio
of the need to treat a surrogate pair as a character, consistent
with the rest of Python, vs the need to treat it as a string
of two separate (and, in the unicode sense, invalid?) characters?

And if you want to treat each half of the pair separately
it's not exactly hard: ord(a[0]), ord(a[1]).

> None of this means you can't deal with surrogate pairs, it just means you
> can't deal with them using ord() and unichr().

Kind of like saying, it doesn't mean you can't deal
with integers larger than 2**32, you just can't multiply
and divide them.

"Martin v. Löwis"

Aug 27, 2009, 1:51:29 AM
to ru...@yahoo.com
> My apologies for the red herring. I was working from
> a comment in my replacement ord() function. I dug up
> an old copy of Python 2.4.3 and could not reproduce it
> there either so I have no explanation for the comment
> (which I wrote). Python 2.3 maybe?

No. The behavior you observed would only happen on
a wide Unicode build (e.g. on Unix).

> But regardless, the significant question is, what is
> the reason for having ord() (and unichr) not work for
> surrogate pairs and thus not usable with a large number
> of unicode characters that Python otherwise supports?

See PEP 261, http://www.python.org/dev/peps/pep-0261/
It specifies all this.

Regards,
Martin

ru...@yahoo.com

Aug 27, 2009, 8:49:26 PM
On 08/26/2009 11:51 PM, "Martin v. Löwis" wrote:
>[...]

>> But regardless, the significant question is, what is
>> the reason for having ord() (and unichr) not work for
>> surrogate pairs and thus not usable with a large number
>> of unicode characters that Python otherwise supports?
>
> See PEP 261, http://www.python.org/dev/peps/pep-0261/
> It specifies all this.

The PEP (AFAICT) says only what we already know... that
on narrow builds unichr() will raise an exception with
an argument >= 0x10000, and ord() is unichr()'s inverse.

I have read the PEP twice now and still see no justification
for that decision; it appears to have been made by fiat.[*1]

Could you or someone please point me to specific justification
for having unichr and ord work only for a subset of unicode
characters on narrow builds, as opposed to the more general
and IMO useful behavior proposed earlier in this thread?

----------------------------------------------------------
[*1]
The PEP says:
* unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
length-one string.

* unichr(i) for 2**16 <= i <= TOPCHAR will return a
length-one string on wide Python builds. On narrow
builds it will raise ValueError.
and
* ord() is always the inverse of unichr()

which of course we know; that is the current behavior. But
there is no reason given for that behavior.

Under the second *unicode bullet point, there are two issues
raised:
1) Should surrogate pairs be disallowed on narrow builds?
That appears to have been answered in the negative and is
not relevant to my question.
2) Should access to code points above TOPCHAR be allowed?
Not relevant to my question.

* every Python Unicode character represents exactly
one Unicode code point (i.e. Python Unicode
Character = Abstract Unicode character)

I'm not sure what this means (what's an abstract unicode
character?). If it mandates that u'\ud800\udc40' be
treated as a len() 2 string, that is the current case
but does not say anything about how unichr and ord
should behave. If it mandates that that string must
always be treated as two separate code points then
Python itself violates it by printing that string as
u'\U00010040' rather than u'\ud800\udc40'.

Finally we read:

* There is a convention in the Unicode world for
encoding a 32-bit code point in terms of two
16-bit code points. These are known as
"surrogate pairs". Python's codecs will adopt
this convention.

Is a distinction made between Python and Python
codecs with only the latter having any knowledge of
surrogate pairs? I guess that would explain why
Python prints a surrogate pair as a single character.
But this seems arbitrary and counter-useful if
applied to ord() and unichr(). What possible
use-case is there for *not* recognizing surrogate
pairs in those two functions?

Nothing else in the PEP seems remotely relevant.

"Martin v. Löwis"

Aug 28, 2009, 4:12:34 AM
> The PEP says:
> * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
> length-one string.
>
> * unichr(i) for 2**16 <= i <= TOPCHAR will return a
> length-one string on wide Python builds. On narrow
> builds it will raise ValueError.
> and
> * ord() is always the inverse of unichr()
>
> which of course we know; that is the current behavior. But
> there is no reason given for that behavior.

Sure there is, right above the list:

"Most things will behave identically in the wide and narrow worlds."

That's the reason: scripts should work the same as much as possible
in wide and narrow builds.

What you propose would break the property "unichr(i) always returns
a string of length one, if it returns anything at all".

> 1) Should surrogate pairs be disallowed on narrow builds?
> That appears to have been answered in the negative and is
> not relevant to my question.

It is, as it does lead to inconsistencies between wide and narrow
builds. OTOH, it also allows the same source code to work on both
versions, so it also preserves the uniformity in a different way.

> * every Python Unicode character represents exactly
> one Unicode code point (i.e. Python Unicode
> Character = Abstract Unicode character)
>
> I'm not sure what this means (what's an abstract unicode
> character?).

I don't think this is actually the case, but I may be confusing
Unicode terminology here - "abstract character" is a term from
the Unicode standard.

> Finally we read:
>
> * There is a convention in the Unicode world for
> encoding a 32-bit code point in terms of two
> 16-bit code points. These are known as
> "surrogate pairs". Python's codecs will adopt
> this convention.
>
> Is a distinction made between Python and Python
> codecs with only the latter having any knowledge of
> surrogate pairs?

No. In the end, the Unicode type represents code units,
not code points, i.e. half surrogates are individually
addressable. Codecs need to adjust to that; in particular
the UTF-8 and the UTF-32 codec in narrow builds, and the
UTF-16 codec in wide builds (which didn't exist when the
PEP was written).
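
For instance, on a narrow build the UTF-8 codec already joins a
surrogate pair when encoding, so both spellings of the character
produce the same bytes (a quick check on a narrow 2.x build):

>>> u'\ud800\udc40'.encode('utf-8')
'\xf0\x90\x81\x80'
>>> u'\U00010040'.encode('utf-8') == u'\ud800\udc40'.encode('utf-8')
True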

> Nothing else in the PEP seems remotely relevant.

Except for the motivation, of course :-)

In addition: your original question was "why has this
been changed", to which the answer is "it hasn't".
Then, the next question is "why is it implemented that
way", to which the answer is "because the PEP says so".
Only *then* the question is "what is the rationale for
the PEP specifying things the way it does". The PEP is
relevant so that we can both agree that Python behaves
correctly (in the sense of behaving as specified).

Regards,
Martin

ru...@yahoo.com

Aug 29, 2009, 10:38:51 AM
On 08/28/2009 02:12 AM, "Martin v. Löwis" wrote:

[I reordered the quotes from your previous post to try
and get the responses in a more coherent order. No
intent to take anything out of context...]

>> Nothing else in the PEP seems remotely relevant.

[to providing justification for the behavior of
unichr/ord]


>
> Except for the motivation, of course :-)
>
> In addition: your original question was "why has this
> been changed", to which the answer is "it hasn't".

My original interest was two-fold: can unichr/ord be
changed to work in a more general and helpful way? That
seemed remotely possible until it was pointed out that
the two behave consistently, and that behavior is accurately
documented. Second, why would they work the way they do
when they could have been generalized to cover the full
unicode space? An inadequate answer to this would have
provided support for the first point but remains interesting
to me for the reason below.

> Then, the next question is "why is it implemented that
> way", to which the answer is "because the PEP says so".

Not at all a satisfying answer unless one believes
in PEPal infallibility. :-)

> Only *then* the question is "what is the rationale for
> the PEP specifying things the way it does". The PEP is
> relevant so that we can both agree that Python behaves
> correctly (in the sense of behaving as specified).

But my question had become: why that behavior, when a
slightly different behavior would be more general with
little apparent downside?

To clarify, my interest in the justification for the
current behavior is this:

I think the best feature of python is not, as commonly
stated, the clean syntax, but rather the pretty complete
and orthogonal libraries. I often find, after I have
written some code, that due to the right library functions
being available, it turns out much shorter and concise
than I expected.

Nevertheless, every now and then, perhaps more than in some
other languages (I'm not sure), I run into something that
requires what seems to be excessive coding -- I have to
do something it seems to me that a library function should
have done for me. Sometimes this is because I don't
understand the reason the library function needs to work the
way it does. Other times it is one of the countless
trade-offs made in the design of the language, which didn't
happen to go the way that would have been beneficial to me
in a particular coding situation.

But sometimes (and it feels too often) it seems as though,
zen notwithstanding, purity -- adherence to some
philosophic ideal -- beat practicality.
unichr/ord seems such a case to me, but I want to be
sure I am not missing something.

The reasons for the current behavior so far:

1.


> What you propose would break the property "unichr(i) always returns
> a string of length one, if it returns anything at all".

Yes. And I don't see the problem with that. Why is
that property more desirable than the non-existent
property that a Unicode literal always produces one
python character? It would only occur on a narrow
build with a unicode character outside of the bmp,
exactly the condition a unicode literal can "behave
differently" by producing two python characters.

2.
> > But there is no reason given [in the PEP] for that behavior.


> Sure there is, right above the list:
> "Most things will behave identically in the wide and narrow worlds."
> That's the reason: scripts should work the same as much as possible
> in wide and narrow builds.

So what else would work "differently"? My point was
that extending unichr/ord to work with all unicode
characters reduces differences far more often than
it increases them.

3.


>> * There is a convention in the Unicode world for
>> encoding a 32-bit code point in terms of two
>> 16-bit code points. These are known as
>> "surrogate pairs". Python's codecs will adopt
>> this convention.
>>
>> Is a distinction made between Python and Python
>> codecs with only the latter having any knowledge of
>> surrogate pairs?
>
> No. In the end, the Unicode type represents code units,
> not code points, i.e. half surrogates are individually
> addressable. Codecs need to adjust to that; in particular
> the UTF-8 and the UTF-32 codec in narrow builds, and the
> UTF-16 codec in wide builds (which didn't exist when the
> PEP was written).

OK, so that is not a reason either.

4.
I'll speculate a little.
If surrogate handling was added to ord/unichr, it would
be the top of a slippery slope leading to demands that
other string functions also handle surrogates.

But this is not true -- there is a strong distinction
between ord/unichr and other string methods. The latter
deal with strings of multiple characters. But the former
deals only with single characters (taking a surrogate
pair as a single unicode character.)

The behavior of ord/unichr is independent of the other
string methods -- if they were changed with regard to
surrogate handling they would all have to be changed to
maintain consistent behavior. Unichr/ord affect only
each other.

The functions of ord/unichr -- to map characters to
numbers -- are fundamental string operations, akin to
indexing or extracting a substring. So why would
one want to limit them to a subset of characters if
not absolutely necessary?

To reiterate, I am not advocating for any change. I
simply want to understand if there is a good reason
for limiting the use of unichr/ord on narrow builds to
a subset of the unicode characters that Python otherwise
supports. So far, it seems not and that unichr/ord
is a poster child for "purity beats practicality".

Steven D'Aprano

Aug 29, 2009, 2:06:34 PM
On Sat, 29 Aug 2009 07:38:51 -0700, rurpy wrote:

> > Then, the next question is "why is it implemented that way", to which
> > the answer is "because the PEP says so".
>
> Not at all a satisfying answer unless one believes in PEPal
> infallibility. :-)

Not at all. You don't have to believe that PEPs are infallible to accept
the answer, you just have to understand that major changes to Python
aren't made arbitrarily, they have to go through a PEP first. Even Guido
himself has to write a PEP before making any major changes to the
language. But PEPs aren't infallible, they can be challenged, rejected,
withdrawn or made obsolete by new PEPs.


> The reasons for the current behavior so far:
>
> 1.
>> What you propose would break the property "unichr(i) always returns a
>> string of length one, if it returns anything at all".
>
> Yes. And I don't see the problem with that. Why is that property more
> desirable than the non-existent property that a Unicode literal always
> produces one python character?

What do you mean? Unicode literals don't always produce one character,
e.g. u'abcd' is a Unicode literal with four characters.

I think it's fairly self-evident that a function called uniCHR [emphasis
added] should return a single character (technically a single code
point). But even if you can come up with a reason for unichr() to return
two or more characters, this would break code that relies on the
documented promise that the length of the output of unichr() is always
one.

> It would only occur on a narrow build
> with a unicode character outside of the bmp, exactly the condition a
> unicode literal can "behave differently" by producing two python
> characters.


> 2.
>> > But there is no reason given [in the PEP] for that behavior.
>> Sure there is, right above the list:
>> "Most things will behave identically in the wide and narrow worlds."
>> That's the reason: scripts should work the same as much as possible in
>> wide and narrow builds.
>
> So what else would work "differently"?

unichr(n) sometimes would return one character and sometimes two; ord(c)
would sometimes accept two characters and sometimes raise an exception.
That's a fairly major difference.


> My point was that extending
> unichr/ord to work with all unicode characters reduces differences far
> more often than it increase them.

I don't see that at all. What differences do you think it would reduce?


> 3.
>>> * There is a convention in the Unicode world for
>>> encoding a 32-bit code point in terms of two 16-bit code
>>> points. These are known as "surrogate pairs". Python's codecs
>>> will adopt this convention.
>>>
>>> Is a distinction made between Python and Python codecs with only the
>>> latter having any knowledge of surrogate pairs?
>>
>> No. In the end, the Unicode type represents code units, not code
>> points, i.e. half surrogates are individually addressable. Codecs need
>> to adjust to that; in particular the UTF-8 and the UTF-32 codec in
>> narrow builds, and the UTF-16 codec in wide builds (which didn't exist
>> when the PEP was written).
>
> OK, so that is not a reason either.

I think it is a very important reason. Python supports code points, so it
has to support surrogate codes individually. Python can't tell if the
pair of code points u'\ud800\udc40' represents the single character
\U00010040 or a pair of code points \ud800 and \udc40.


> 4.
> I'll speculate a little.
> If surrogate handling was added to ord/unichr, it would be the top of a
> slippery slope leading to demands that other string functions also
> handle surrogates.
>
> But this is not true -- there is a strong distinction between ord/unichr
> and other string methods. The latter deal with strings of multiple
> characters. But the former deals only with single characters (taking a
> surrogate pair as a single unicode character.)

Strictly speaking, unichr() deals with code points, not characters,
although the distinction is very fine.

>>> c = unichr(56384)
>>> len(c)
1
>>> import unicodedata
>>> unicodedata.category(c)
'Cs'

Cs is the general category for "Other, Surrogate", so \udc40 is not
strictly speaking a character. Nevertheless, Python treats it as one.


> To reiterate, I am not advocating for any change. I simply want to
> understand if there is a good reason for limiting the use of unichr/ord
> on narrow builds to a subset of the unicode characters that Python
> otherwise supports. So far, it seems not and that unichr/ord is a
> poster child for "purity beats practicality".

On the contrary, it seems pretty impractical to me for ord() to sometimes
successfully accept strings of length two and sometimes to raise an
exception. I would much rather see a pair of new functions, wideord() and
widechr() used for converting between surrogate pairs and numbers.
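
A minimal sketch of what I have in mind (using the hypothetical
names above; for a narrow build, and completely untested):

def widechr(i):
    # One code point in; a one- or two-code-unit string out.
    if i < 0x10000:
        return unichr(i)
    v = i - 0x10000
    return unichr(0xD800 + (v >> 10)) + unichr(0xDC00 + (v & 0x3FF))

def wideord(s):
    # Accept a single code unit, or exactly one surrogate pair.
    if len(s) == 1:
        return ord(s)
    if (len(s) == 2 and u'\ud800' <= s[0] <= u'\udbff'
            and u'\udc00' <= s[1] <= u'\udfff'):
        return 0x10000 + ((ord(s[0]) - 0xD800) << 10) + (ord(s[1]) - 0xDC00)
    raise TypeError("expected a character or a surrogate pair")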

--
Steven

Vlastimil Brom

Aug 29, 2009, 3:43:58 PM
to ru...@yahoo.com, pytho...@python.org
2009/8/29 <ru...@yahoo.com>:

> On 08/28/2009 02:12 AM, "Martin v. Löwis" wrote:
>
> So far, it seems not and that unichr/ord
> is a poster child for "purity beats practicality".

As Mark Tolonen pointed out earlier in this thread, in Python 3 the
practicality apparently beat purity in this aspect:

Python 3.1.1 (r311:74483, Aug 17 2009, 17:02:12) [MSC v.1500 32 bit
(Intel)] on win32
Type "copyright", "credits" or "license()" for more information.

>>> goth_urus_1 = '\U0001033f'
>>> list(goth_urus_1)
['\ud800', '\udf3f']
>>> len(goth_urus_1)
2
>>> ord(goth_urus_1)
66367
>>> goth_urus_2 = chr(66367)
>>> len(goth_urus_2)
2
>>> import unicodedata
>>> unicodedata.name(goth_urus_1)
'GOTHIC LETTER URUS'
>>> goth_urus_3 = unicodedata.lookup("GOTHIC LETTER URUS")
>>> goth_urus_4 = "\N{GOTHIC LETTER URUS}"
>>> goth_urus_1 == goth_urus_2 == goth_urus_3 == goth_urus_4
True
>>>

As for the behaviour in python 2.x, it's probably good enough
that the surrogates aren't prohibited and the eventually needed
behaviour can be easily added via custom functions.

vbr

ru...@yahoo.com

Aug 29, 2009, 8:12:24 PM
On 08/29/2009 12:06 PM, Steven D'Aprano wrote:
[...]

>> The reasons for the current behavior so far:
>>
>> 1.
>>> What you propose would break the property "unichr(i) always returns a
>>> string of length one, if it returns anything at all".
>>
>> Yes. And i don't see the problem with that. Why is that property more
>> desirable than the non-existent property that a Unicode literal always
>> produces one python character?
>
> What do you mean? Unicode literals don't always produce one character,
> e.g. u'abcd' is a Unicode literal with four characters.

I'm sorry, I should have been clearer. I meant the literal
representation of a *single* unicode character: u'\u4000',
which results in a string of length 1, vs u'\U00010040', which
results in a string of length 2. In both cases the literal
represents a single unicode code point.

> I think it's fairly self-evident that a function called uniCHR [emphasis
> added] should return a single character (technically a single code
> point).

There are two concepts of characters here: the 16-bit things
that encode a character in a Python unicode string (in a
narrow-build Python), and a character in the sense of one
of the ~2**20 unicode characters. Python has chosen to
represent the latter (when outside the BMP) as a pair of
surrogate characters from the former. I don't see why one
would assume that CHR would mean the python 16-bit
character concept rather than the full unicode character
concept. In fact, rather the opposite.

> But even if you can come up with a reason for unichr() to return
> two or more characters,

I've given a number of reasons why it should return a two
character representation of a non-BMP character, one of
which is that that is how Python has chosen to represent
such characters internally. I won't repeat the other
reasons again.

I'm not sure why you think more than two characters
would ever be possible.

> this would break code that relies on the
> documented promise that the length of the output of unichr() is always
> one.

Ah, OK. This is the good reason I was looking for.
I did not realize (until prompted by your remark
to go back and look at the early docs) that unichr
had been documented to return a single character
since 2.0 and that wide character support was added
in 2.2. Martin v. Loewis also implied that, I now
see, although the implication was too deep for me
to pick up.

So although it leads to a suboptimal situation, I
agree that maintaining the documented behavior was
necessary.

[...]


> I would much rather see a pair of new functions, wideord() and
> widechr() used for converting between surrogate pairs and numbers.

I guess if it were still 2001 and Python 2.2 was
coming out I would be in favor of this too. :-)

ru...@yahoo.com

Aug 29, 2009, 8:16:30 PM

Yes, that certainly seems like much more sensible behavior.

> > As for the behaviour in python 2.x, it's probably good enough
> > that the surrogates aren't prohibited and the eventually needed
> > behaviour can be easily added via custom functions.

Yes, I agree that given the current behavior is well documented
and further, is fixed in python 3, it can't be changed.

I would pick a nit though with "can be easily added via custom
functions." I don't think that is a good criterion for rejecting
functionality from the library because it is not sufficient; there
are many functions in the library that fail that test. I think the
criterion should be more like a ratio: (how often needed) / (ease
of writing). [Where "ease" is not just the line count but also the
obviousness to someone who is not a python expert yet.]

And I would also dispute that the generalized unichr/ord functions
are "easily" added. When I ran into the TypeError in ord(), I
thought "surrogate pairs" were something used in sex therapy. :-)
It took a lot of reading and research before I was able to write
a generalized ord() function.

Dieter Maurer

Aug 30, 2009, 12:54:21 AM
to pytho...@python.org
"Martin v. Löwis" <mar...@v.loewis.de> writes on Fri, 28 Aug 2009 10:12:34 +0200:

> > The PEP says:
> > * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
> > length-one string.
> >
> > * unichr(i) for 2**16 <= i <= TOPCHAR will return a
> > length-one string on wide Python builds. On narrow
> > builds it will raise ValueError.
> > and
> > * ord() is always the inverse of unichr()
> >
> > which of course we know; that is the current behavior. But
> > there is no reason given for that behavior.
>
> Sure there is, right above the list:
>
> "Most things will behave identically in the wide and narrow worlds."
>
> That's the reason: scripts should work the same as much as possible
> in wide and narrow builds.
>
> What you propose would break the property "unichr(i) always returns
> a string of length one, if it returns anything at all".

But getting a "ValueError" in some builds (and not in others)
is rather worse than getting unicode strings of different length....

> > 1) Should surrogate pairs be disallowed on narrow builds?
> > That appears to have been answered in the negative and is
> > not relevant to my question.
>
> It is, as it does lead to inconsistencies between wide and narrow
> builds. OTOH, it also allows the same source code to work on both
> versions, so it also preserves the uniformity in a different way.

Do you not have the inconsistencies in any case?
... "ValueError" in some builds and not in others ...

"Martin v. Löwis"

Aug 30, 2009, 1:58:39 AM
to ru...@yahoo.com
> To reiterate, I am not advocating for any change. I
> simply want to understand if there is a good reason
> for limiting the use of unchr/ord on narrow builds to
> a subset of the unicode characters that Python otherwise
> supports. So far, it seems not and that unichr/ord
> is a poster child for "purity beats practicality".

I think that's actually the case. I went back to the discussions,
and found that early 2.2 alpha releases did return two-character
strings from unichr, and that this was changed because Marc-Andre
Lemburg insisted. Here are a few relevant messages from the
archives (search for unichr)

http://mail.python.org/pipermail/python-dev/2001-June/015649.html
http://mail.python.org/pipermail/python-dev/2001-July/015662.html
http://mail.python.org/pipermail/python-dev/2001-July/016110.html
http://mail.python.org/pipermail/python-dev/2001-July/016153.html
http://mail.python.org/pipermail/python-dev/2001-July/016155.html
http://mail.python.org/pipermail/python-dev/2001-July/016186.html

Eventually, in r28142, MAL changed it to give it its current
state.

Regards,
Martin

Nobody

Aug 30, 2009, 12:42:40 PM
On Sun, 30 Aug 2009 06:54:21 +0200, Dieter Maurer wrote:

>> What you propose would break the property "unichr(i) always returns
>> a string of length one, if it returns anything at all".
>
> But getting a "ValueError" in some builds (and not in others)
> is rather worse than getting unicode strings of different length....

Not necessarily. If the code assumes that unichr() always returns a
single-character string, it will silently produce bogus results when
unichr() returns a pair of surrogates. An exception is usually preferable
to silently producing bad data.

If unichr() returns a surrogate pair, what is e.g. unichr(i).isalpha()
supposed to do?
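
(On a narrow build today the two-code-unit form already gives the
"wrong" answer for exactly that reason; e.g., from a narrow 2.x build:

>>> u'\U00010040'.isalpha()  # a letter, but stored as two surrogates
False

whereas a wide build says True.)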

Using surrogates is fine in an external representation (UTF-16), but it
doesn't make sense as an internal representation.

Think: why do people use wchar_t[] rather than a char[] encoded in UTF-8?
Because a wchar_t[] allows you to index *characters*, which you can't do
with a multi-byte encoding. You can't do it with a multi-*word* encoding
either.

UCS-2 and UTF-16 are superficially so similar that people forget that
they're completely different beasts. UCS-2 is fixed-length, UTF-16 is
variable-length. This makes UTF-16 semantically much closer to UTF-8 than
to UCS-2 or UCS-4.

If your wchar_t is 16 bits, the only sane solution is to forego support
for characters outside of the BMP.

The alternative is to process wide strings in exactly the same way that
you process narrow (mbcs) strings; e.g. extracting character N requires
iterating over the string from the beginning until you have counted N-1
characters. This provides no benefit over using narrow strings except for
a slight performance gain from halving the number of iterations. You still
end up with indexing being O(n) rather than O(1).
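
Concretely, "extract character N" from a narrow (UTF-16-style) string
means something like this rough, untested sketch:

def char_at(u, n):
    # Walk code units, counting each surrogate pair as one character.
    i = count = 0
    while i < len(u):
        if (u'\ud800' <= u[i] <= u'\udbff' and i + 1 < len(u)
                and u'\udc00' <= u[i + 1] <= u'\udfff'):
            j = i + 2   # a surrogate pair counts as one character
        else:
            j = i + 1
        if count == n:
            return u[i:j]
        count += 1
        i = j
    raise IndexError("string index out of range")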
