
PEP 393 vs UTF-8 Everywhere


Pete Forman

Jan 20, 2017, 5:35:24 PM
Can anyone point me at a rationale for PEP 393 being incorporated in
Python 3.3 over using UTF-8 as an internal string representation? I've
found good articles by Nick Coghlan, Armin Ronacher and others on the
matter. What I have not found is discussion of pros and cons of
alternatives to the old narrow or wide implementation of Unicode
strings.

ISTM that most operations on strings are via iterators and thus agnostic
to variable or fixed width encodings. How important is it to be able to
get to part of a string with a simple index? Just because old skool
strings could be treated as a sequence of characters, is that a reason
to shoehorn the subtleties of Unicode into that model?

--
Pete Forman

Chris Kaynor

Jan 20, 2017, 6:07:05 PM
On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman <petef4...@gmail.com> wrote:
> Can anyone point me at a rationale for PEP 393 being incorporated in
> Python 3.3 over using UTF-8 as an internal string representation? I've
> found good articles by Nick Coghlan, Armin Ronacher and others on the
> matter. What I have not found is discussion of pros and cons of
> alternatives to the old narrow or wide implementation of Unicode
> strings.

The PEP itself has the rationale for the problems with the narrow/wide
idea; to quote from https://www.python.org/dev/peps/pep-0393/:
There are two classes of complaints about the current implementation
of the unicode type: on systems only supporting UTF-16, users complain
that non-BMP characters are not properly supported. On systems using
UCS-4 internally (and also sometimes on systems using UCS-2), there is
a complaint that Unicode strings take up too much memory - especially
compared to Python 2.x, where the same code would often use ASCII
strings (i.e. ASCII-encoded byte strings). With the proposed approach,
ASCII-only Unicode strings will again use only one byte per character;
while still allowing efficient indexing of strings containing non-BMP
characters (as strings containing them will use 4 bytes per
character).

Basically, narrow builds had very odd behavior with non-BMP
characters, namely that indexing into the string could easily produce
mojibake. Wide builds used quite a bit more memory, which generally
translates to reduced performance.
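
As a rough illustration (exact byte counts vary by CPython version and
include a fixed per-object header), sys.getsizeof shows how PEP 393 picks
the per-character width from the widest code point in the string:

import sys

ascii_s  = "a" * 1000           # all code points < 256 -> 1 byte each
bmp_s    = "\u03b1" * 1000      # widest code point < 2**16 -> 2 bytes each
astral_s = "\U0001F600" * 1000  # contains non-BMP code points -> 4 bytes each

for s in (ascii_s, bmp_s, astral_s):
    # sizes grow as roughly 1, 2 or 4 bytes per character plus the header
    print(len(s), sys.getsizeof(s))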

> ISTM that most operations on strings are via iterators and thus agnostic
> to variable or fixed width encodings. How important is it to be able to
> get to part of a string with a simple index? Just because old skool
> strings could be treated as a sequence of characters, is that a reason
> to shoehorn the subtleties of Unicode into that model?

I think you are underestimating the indexing usages of strings. Every
operation on a string using UTF8 that contains larger characters must
be completed by starting at index 0 - you can never start anywhere
else safely. rfind/rsplit/rindex/rstrip and the other related reverse
functions would require walking the string from start to end, rather
than short-circuiting by reading from right to left. With indexing
becoming linear time, many simple algorithms need to be written with
that in mind, to avoid n*n time. Such performance regressions can
often go unnoticed by developers, who are likely to be testing with
small data, and thus may cause (accidental) DOS attacks when used on
real data. The exact same problems occur with the old narrow builds
(UTF-16; note that proper variable-width indexing was NOT implemented in
those builds, which is what caused the mojibake problems) - only a UTF-32
or PEP 393 implementation can avoid those problems.
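
To make the cost concrete, here is a toy sketch (the helper utf8_index is
hypothetical, not how CPython or MicroPython actually work) of what
indexing into a raw UTF-8 buffer involves - every lookup has to count code
points from byte 0, so indexing positions 0..n-1 in a loop is O(n^2):

def utf8_index(buf, n):
    """Return the n-th code point of a UTF-8 buffer by scanning from byte 0."""
    count = -1
    for i, b in enumerate(buf):
        if (b & 0xC0) != 0x80:          # a leading byte starts a new code point
            count += 1
            if count == n:
                j = i + 1
                while j < len(buf) and (buf[j] & 0xC0) == 0x80:
                    j += 1              # swallow the continuation bytes
                return buf[i:j].decode('utf-8')
    raise IndexError(n)

buf = "αβγдлфxx".encode('utf-8')
print(utf8_index(buf, 5))               # 'ф', found only after walking 10 bytes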

Note that from a user's perspective (including most developers, if not
almost all), PEP 393 strings can be treated as if they were UTF-32, but
with many of the benefits of UTF-8. As far as I'm aware, it is only
developers
writing extension modules that need to care - and only then if they
need maximum performance, and thus cannot convert every string they
access to UTF32 or UTF8.

--
Chris Kaynor

Thomas Nyberg

Jan 20, 2017, 6:14:50 PM
On 01/20/2017 03:06 PM, Chris Kaynor wrote:
>
> [...snip...]
>
> --
> Chris Kaynor
>

I was able to delete my response which was a wholly contained subset of
this one. :)


But I have one extra question. Is string indexing guaranteed to be
constant-time for python? I thought so, but I couldn't find it
documented anywhere. (Not that I think it practically matters, since it
couldn't really change if it weren't for all the reasons you mentioned.)
I found this, which at least details (if not explicitly "guarantees")
the complexity properties of other datatypes:

https://wiki.python.org/moin/TimeComplexity

Cheers,
Thomas

Chris Angelico

Jan 20, 2017, 6:36:23 PM
On Sat, Jan 21, 2017 at 10:15 AM, Thomas Nyberg <tomu...@gmx.com> wrote:
> But I have one extra question. Is string indexing guaranteed to be
> constant-time for python? I thought so, but I couldn't find it documented
> anywhere. (Not that I think it practically matters, since it couldn't really
> change if it weren't for all the reasons you mentioned.) I found this which
> at details (if not explicitly "guarantees") the complexity properties of
> other datatypes:
>

No, it isn't; this question came up in the context of MicroPython,
which chose to go UTF-8 internally instead of PEP 393. But the
considerations for uPy are different - it's not designed to handle
gobs of data, so constant-time vs linear isn't going to have as much
impact. But in normal work, it's important enough to have predictable
string performance. You can't afford to deploy a web application, test
it, and then have someone send a large amount of data at it, causing
massive O(n^2) blowouts.

ChrisA

Chris Kaynor

Jan 20, 2017, 6:38:23 PM
On Fri, Jan 20, 2017 at 3:15 PM, Thomas Nyberg <tomu...@gmx.com> wrote:
> On 01/20/2017 03:06 PM, Chris Kaynor wrote:
>>
>>
>> [...snip...]
>>
>> --
>> Chris Kaynor
>>
>
> I was able to delete my response which was a wholly contained subset of this
> one. :)
>
>
> But I have one extra question. Is string indexing guaranteed to be
> constant-time for python? I thought so, but I couldn't find it documented
> anywhere. (Not that I think it practically matters, since it couldn't really
> change if it weren't for all the reasons you mentioned.) I found this which
> at details (if not explicitly "guarantees") the complexity properties of
> other datatypes:
>
> https://wiki.python.org/moin/TimeComplexity

As far as I'm aware, the language does not guarantee it. In fact, I
believe it was decided that MicroPython could use UTF8 strings with
linear indexing while still calling itself Python. This was very
useful for MicroPython due to the platforms it supports (embedded),
and needing to keep the memory footprint very small.

I believe Guido (on Python-ideas) has stated that constant-time string
indexing is a guarantee of CPython, however.

The only reference I found in my (very quick) search is the Python-Dev
thread at https://groups.google.com/forum/#!msg/dev-python/3lfXwljNLj8/XxO2s0TGYrYJ

MRAB

Jan 20, 2017, 7:19:14 PM
On 2017-01-20 23:06, Chris Kaynor wrote:
> On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman <petef4...@gmail.com> wrote:
>> Can anyone point me at a rationale for PEP 393 being incorporated in
>> Python 3.3 over using UTF-8 as an internal string representation? I've
>> found good articles by Nick Coghlan, Armin Ronacher and others on the
>> matter. What I have not found is discussion of pros and cons of
>> alternatives to the old narrow or wide implementation of Unicode
>> strings.
>
> The PEP itself has the rational for the problems with the narrow/wide
> idea, the quote from https://www.python.org/dev/peps/pep-0393/:
> There are two classes of complaints about the current implementation
> of the unicode type:on systems only supporting UTF-16, users complain
> that non-BMP characters are not properly supported. On systems using
> UCS-4 internally (and also sometimes on systems using UCS-2), there is
> a complaint that Unicode strings take up too much memory - especially
> compared to Python 2.x, where the same code would often use ASCII
> strings (i.e. ASCII-encoded byte strings). With the proposed approach,
> ASCII-only Unicode strings will again use only one byte per character;
> while still allowing efficient indexing of strings containing non-BMP
> characters (as strings containing them will use 4 bytes per
> character).
>
> Basically, narrow builds had very odd behavior with non-BMP
> characters, namely that indexing into the string could easily produce
> mojibake. Wide builds used quite a bit more memory, which generally
> translates to reduced performance.
>
>> ISTM that most operations on strings are via iterators and thus agnostic
>> to variable or fixed width encodings. How important is it to be able to
>> get to part of a string with a simple index? Just because old skool
>> strings could be treated as a sequence of characters, is that a reason
>> to shoehorn the subtleties of Unicode into that model?
>
> I think you are underestimating the indexing usages of strings. Every
> operation on a string using UTF8 that contains larger characters must
> be completed by starting at index 0 - you can never start anywhere
> else safely. rfind/rsplit/rindex/rstrip and the other related reverse
> functions would require walking the string from start to end, rather
> than short-circuiting by reading from right to left. With indexing
> becoming linear time, many simple algorithms need to be written with
> that in mind, to avoid n*n time. Such performance regressions can
> often go unnoticed by developers, who are likely to be testing with
> small data, and thus may cause (accidental) DOS attacks when used on
> real data. The exact same problems occur with the old narrow builds
> (UTF16; note that this was NOT implemented in those builds, however,
> which caused the mojibake problems) as well - only a UTF32 or PEP393
> implementation can avoid those problems.
>
You could implement rsplit and rstrip easily enough, but rfind and
rindex return the index, so you'd need to scan the string to return that.

> Note that from a user (including most developers, if not almost all),
> PEP393 strings can be treated as if they were UTF32, but with many of
> the benefits of UTF8. As far as I'm aware, it is only developers
> writing extension modules that need to care - and only then if they
> need maximum performance, and thus cannot convert every string they
> access to UTF32 or UTF8.
>
As someone who has written an extension, I can tell you that I much
prefer dealing with a fixed number of bytes per codepoint than a
variable number of bytes per codepoint, especially as I'm also
supporting earlier versions of Python where that was the case.

Pete Forman

Jan 20, 2017, 7:30:26 PM
I'm taking as a given that the old way was often sub-optimal in many
scenarios. My questions were about the alternatives, and why PEP 393 was
chosen over other approaches.

>> ISTM that most operations on strings are via iterators and thus
>> agnostic to variable or fixed width encodings. How important is it to
>> be able to get to part of a string with a simple index? Just because
>> old skool strings could be treated as a sequence of characters, is
>> that a reason to shoehorn the subtleties of Unicode into that model?
>
> I think you are underestimating the indexing usages of strings. Every
> operation on a string using UTF8 that contains larger characters must
> be completed by starting at index 0 - you can never start anywhere
> else safely. rfind/rsplit/rindex/rstrip and the other related reverse
> functions would require walking the string from start to end, rather
> than short-circuiting by reading from right to left. With indexing
> becoming linear time, many simple algorithms need to be written with
> that in mind, to avoid n*n time. Such performance regressions can
> often go unnoticed by developers, who are likely to be testing with
> small data, and thus may cause (accidental) DOS attacks when used on
> real data. The exact same problems occur with the old narrow builds
> (UTF16; note that this was NOT implemented in those builds, however,
> which caused the mojibake problems) as well - only a UTF32 or PEP393
> implementation can avoid those problems.

I was asserting that most useful operations on strings start from index
0. The r* operations would not be slowed down that much, as UTF-8 has the
useful property that a byte which is not at the start of a sequence (a
code point sequence, that is, not a Python sequence) is recognizably a
continuation byte and so quick to skip over while working backwards from
the end.
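
To illustrate, here is a sketch of that backward step over a UTF-8 buffer
(prev_codepoint_start is a hypothetical helper, not anything in the
stdlib): every continuation byte has its top two bits set to 10, so you
skip back over those until you land on a leading byte:

def prev_codepoint_start(buf, pos):
    """Step back from a code point boundary to the start of the previous one."""
    pos -= 1
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:    # 0b10xxxxxx: continuation
        pos -= 1
    return pos

buf = "aф€b".encode('utf-8')    # 1-, 2-, 3- and 1-byte code points
pos = len(buf)
while pos > 0:
    pos = prev_codepoint_start(buf, pos)
    print(pos, buf[pos:].decode('utf-8')[0])        # 6 b, 3 €, 1 ф, 0 a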

The only significant use of an index dereference that I could come up
with was the result of a find() or index(). I put out this public
question so that I could be enclued as to other uses. My personal
experience is that in most cases where I might consider find(), I end up
using re instead and work with the match groups, which hold copies of
the (sub)strings that I want.

> Note that from a user (including most developers, if not almost all),
> PEP393 strings can be treated as if they were UTF32, but with many of
> the benefits of UTF8. As far as I'm aware, it is only developers
> writing extension modules that need to care - and only then if they
> need maximum performance, and thus cannot convert every string they
> access to UTF32 or UTF8.

PEP 393 already says that "the specification chooses UTF-8 as the
recommended way of exposing strings to C code".

--
Pete Forman

Pete Forman

Jan 20, 2017, 7:52:02 PM
MRAB <pyt...@mrabarnett.plus.com> writes:

> As someone who has written an extension, I can tell you that I much
> prefer dealing with a fixed number of bytes per codepoint than a
> variable number of bytes per codepoint, especially as I'm also
> supporting earlier versions of Python where that was the case.

At the risk of sounding harsh, if supporting variable bytes per
codepoint is a pain you should roll with it for the greater good of
supporting users.

PEP 393 / Python 3.3 required extension writers to revisit their access
to strings. My explicit question was about why PEP 393 was adopted to
replace the deficient old implementations rather than another approach.
The implicit question is whether a UTF-8 internal representation should
replace that of PEP 393.

--
Pete Forman

Chris Angelico

Jan 20, 2017, 7:58:19 PM
On Sat, Jan 21, 2017 at 11:30 AM, Pete Forman <petef4...@gmail.com> wrote:
> I was asserting that most useful operations on strings start from index
> 0. The r* operations would not be slowed down that much as UTF-8 has the
> useful property that attempting to interpret from a byte that is not at
> the start of a sequence (in the sense of a code point rather than
> Python) is invalid and so quick to move over while working backwards
> from the end.

Let's take one very common example: decoding JSON. A ton of web
servers out there will call json.loads() on user-supplied data. The
bulk of the work is in the scanner, which steps through the string and
does the actual parsing. That function is implemented in Python, so
it's a good example. (There is a C accelerator, but we can ignore that
and look at the pure Python one.)

So, how could you implement this function? The current implementation
maintains an index - an integer position through the string. It
repeatedly requests the next character as string[idx], and can also
slice the string (to check for keywords like "true") or use a regex
(to check for numbers). Everything's clean, but it's lots of indexing.
Alternatively, it could remove and discard characters as they're
consumed. It would maintain a string that consists of all the unparsed
characters. All indexing would be at or near zero, but after every
tiny piece of parsing, the string would get sliced.

With immutable UTF-8 strings, both of these would be O(n^2). Either
indexing is linear, so parsing the tail of the string means scanning
repeatedly; or slicing is linear, so parsing the head of the string
means slicing all the rest away.

The only way for it to be fast enough would be to have some sort of
retainable string iterator, which means exposing an opaque "position
marker" that serves no purpose other than parsing. Every string parse
operation would have to be reimplemented this way, lest it perform
abysmally on large strings. It'd mean some sort of magic "thing" that
probably has a reference to the original string, so you don't get the
progressive RAM refunds that slicing gives, and you'd still have to
deal with lots of the other consequences. It's probably doable, but it
would be a lot of pain.
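
For a feel of the shape of that code, here is a stripped-down sketch (not
the actual json.scanner source; parse_value and skip_ws are invented for
the example): an integer index is carried along, characters are fetched
with s[idx], and small slices are compared against keywords - all of which
assumes those operations are cheap:

WHITESPACE = " \t\n\r"

def skip_ws(s, idx):
    while idx < len(s) and s[idx] in WHITESPACE:
        idx += 1
    return idx

def parse_value(s, idx):
    idx = skip_ws(s, idx)
    if s[idx:idx + 4] == "true":             # keyword check via slicing
        return True, idx + 4
    if s[idx:idx + 5] == "false":
        return False, idx + 5
    if s[idx] == '"':                        # very naive string scan
        end = s.index('"', idx + 1)
        return s[idx + 1:end], end + 1
    raise ValueError("unsupported input at index %d" % idx)

print(parse_value('   "hello"', 0))          # ('hello', 10)
print(parse_value("true", 0))                # (True, 4)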

ChrisA

Chris Angelico

Jan 20, 2017, 8:01:16 PM
On Sat, Jan 21, 2017 at 11:51 AM, Pete Forman <petef4...@gmail.com> wrote:
> MRAB <pyt...@mrabarnett.plus.com> writes:
>
>> As someone who has written an extension, I can tell you that I much
>> prefer dealing with a fixed number of bytes per codepoint than a
>> variable number of bytes per codepoint, especially as I'm also
>> supporting earlier versions of Python where that was the case.
>
> At the risk of sounding harsh, if supporting variable bytes per
> codepoint is a pain you should roll with it for the greater good of
> supporting users.

That hasn't been demonstrated, though. There's plenty of evidence
regarding cache usage that shows that direct indexing is incredibly
beneficial on large strings. What are the benefits of variable-sized
encodings? AFAIK, the only real benefit is that you can use less
memory for strings that contain predominantly ASCII but a small number
of astral characters (plus *maybe* a faster encode-to-UTF-8; you
wouldn't get a faster decode-from-UTF-8, because you still need to
check that the byte sequence is valid). Can you show a use-case that
would be materially improved by UTF-8?

ChrisA

MRAB

Jan 20, 2017, 9:10:32 PM
On 2017-01-21 00:51, Pete Forman wrote:
> MRAB <pyt...@mrabarnett.plus.com> writes:
>
>> As someone who has written an extension, I can tell you that I much
>> prefer dealing with a fixed number of bytes per codepoint than a
>> variable number of bytes per codepoint, especially as I'm also
>> supporting earlier versions of Python where that was the case.
>
> At the risk of sounding harsh, if supporting variable bytes per
> codepoint is a pain you should roll with it for the greater good of
> supporting users.
>
Or I could decide not to bother and leave it to someone else to continue
the project. After all, it's not like I'm getting paid for the work;
it's purely voluntary.

> PEP 393 / Python 3.3 required extension writers to revisit their access
> to strings. My explicit question was about why PEP 393 was adopted to
> replace the deficient old implementations rather than another approach.
> The implicit question is whether a UTF-8 internal representation should
> replace that of PEP 393.
>
I already had to handle 1-byte bytestrings and 2/4-byte (narrow/wide)
Unicode strings, so switching to 1/2/4 strings wasn't too bad. Switching
to a completely different, variable-width system would've been a lot
more work.

Paul Rubin

Jan 21, 2017, 1:01:22 AM
Chris Angelico <ros...@gmail.com> writes:
> decoding JSON... the scanner, which steps through the string and
> does the actual parsing. ...
> The only way for it to be fast enough would be to have some sort of
> retainable string iterator, which means exposing an opaque "position
> marker" that serves no purpose other than parsing.

Python already has that type of iterator:
x = "foo"
for c in x: ....

> It'd mean some sort of magic "thing" that probably has a reference to
> the original string

It's a regular old string iterator unless I'm missing something. Of
course a json parser should use it, though who uses the non-C json
parser anyway these days?

[Chris Kaynor writes:]
> rfind/rsplit/rindex/rstrip and the other related reverse
> functions would require walking the string from start to end, rather
> than short-circuiting by reading from right to left.

UTF-8 can be read from right to left because you can recognize when a
codepoint begins by looking at the top 2 bits of each byte as you scan
backwards. A byte whose top two bits are 10 is always a continuation
byte; any other combination marks a leading byte. This "prefix property"
of UTF-8 is a design feature and not a trick someone noticed after the
fact.

Also if you really want O(1) random access, you could put an auxiliary
table into long strings, giving the byte offset of every 256th codepoint
or something like that. Then you'd go to the nearest table entry and
scan from there. This would usually be in-cache scanning so quite fast.
Or use the related representation of "ropes" which are also very easy to
concatenate if they can be nested. Erlang does something like that
with what it calls "binaries".
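
A minimal sketch of that auxiliary-table idea (hypothetical code, with a
checkpoint every 256 code points; build_index and char_at are names made
up for the example): lookups jump to the nearest checkpoint and scan only
the remainder, at the cost of one stored byte offset per 256 code points.

CHUNK = 256

def build_index(buf):
    """Byte offsets of code points 0, 256, 512, ... in a UTF-8 buffer."""
    offsets, count = [], 0
    for i, b in enumerate(buf):
        if (b & 0xC0) != 0x80:               # leading byte starts a code point
            if count % CHUNK == 0:
                offsets.append(i)
            count += 1
    return offsets

def char_at(buf, offsets, n):
    i = offsets[n // CHUNK]                  # jump to the nearest checkpoint
    remaining = n % CHUNK
    while remaining:
        i += 1
        if (buf[i] & 0xC0) != 0x80:          # crossed onto the next code point
            remaining -= 1
    j = i + 1
    while j < len(buf) and (buf[j] & 0xC0) == 0x80:
        j += 1
    return buf[i:j].decode('utf-8')

buf = ("αβγдлф" * 1000).encode('utf-8')
offsets = build_index(buf)                   # ~24 ints for 6000 code points
print(char_at(buf, offsets, 1001))           # 'ф' without scanning from byte 0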

Chris Angelico

Jan 21, 2017, 1:23:22 AM
On Sat, Jan 21, 2017 at 5:01 PM, Paul Rubin <no.e...@nospam.invalid> wrote:
> Chris Angelico <ros...@gmail.com> writes:
>> decoding JSON... the scanner, which steps through the string and
>> does the actual parsing. ...
>> The only way for it to be fast enough would be to have some sort of
>> retainable string iterator, which means exposing an opaque "position
>> marker" that serves no purpose other than parsing.
>
> Python already has that type of iterator:
> x = "foo"
> for c in x: ....
>
>> It'd mean some sort of magic "thing" that probably has a reference to
>> the original string
>
> It's a regular old string iterator unless I'm missing something. Of
> course a json parser should use it, though who uses the non-C json
> parser anyway these days?

You can't do a look-ahead with a vanilla string iterator. That's
necessary for a lot of parsers.

> Also if you really want O(1) random access, you could put an auxiliary
> table into long strings, giving the byte offset of every 256th codepoint
> or something like that. Then you'd go to the nearest table entry and
> scan from there. This would usually be in-cache scanning so quite fast.
> Or use the related representation of "ropes" which are also very easy to
> concatenate if they can be nested. Erlang does something like that
> with what it calls "binaries".

Yes, which gives a two-level indexing (first find the strand, then the
character), and that's going to play pretty badly with CPU caches. I'd
be curious to know how an alternate Python with that implementation
would actually perform.

ChrisA

Jussi Piitulainen

Jan 21, 2017, 1:38:50 AM
Chris Angelico writes:
Julia does this. It has immutable UTF-8 strings, and there is a JSON
parser. The "opaque position marker" is just the byte index. An attempt
to use an invalid index throws an error. A substring type points to an
underlying string. An iterator, called graphemes, even returns
substrings that correspond to what people might consider a character.

I offer Julia as evidence.

My impression is that Julia's UTF-8-based system works and is not a
pain. I wrote a toy function once to access the last line of a large
memory-mapped text file, so I have just this little bit of personal
experience of it, so far. Incidentally, can Python memory-map a UTF-8
file as a string?

http://docs.julialang.org/en/stable/manual/strings/
https://github.com/JuliaIO/JSON.jl

Paul Rubin

Jan 21, 2017, 4:14:37 AM
Chris Angelico <ros...@gmail.com> writes:
> You can't do a look-ahead with a vanilla string iterator. That's
> necessary for a lot of parsers.

For JSON? For other parsers you usually have a tokenizer that reads
characters with maybe 1 char of lookahead.

> Yes, which gives a two-level indexing (first find the strand, then the
> character), and that's going to play pretty badly with CPU caches.

If you're jumping around at random all over the string, you probably
really want a bytearray rather than a unicode string. If you're
scanning sequentially you won't have to look at the outer table very
often.

Tim Chase

Jan 21, 2017, 7:54:41 AM
On 2017-01-21 11:58, Chris Angelico wrote:
> So, how could you implement this function? The current
> implementation maintains an index - an integer position through the
> string. It repeatedly requests the next character as string[idx],
> and can also slice the string (to check for keywords like "true")
> or use a regex (to check for numbers). Everything's clean, but it's
> lots of indexing.

But in these parsing cases, the indexes all originate from stepping
through the string from the beginning and processing it
codepointwise. Even this is a bit of an oddity, especially once you
start taking combining characters into consideration and need to
process them with the preceding character(s). So while you may be
doing indexing, those indexes usually stem from having walked to that
point, not arbitrarily picking some offset.

You allude to it in your:

> The only way for it to be fast enough would be to have some sort of
> retainable string iterator, which means exposing an opaque "position
> marker" that serves no purpose other than parsing. Every string
> parse operation would have to be reimplemented this way, lest it
> perform abysmally on large strings. It'd mean some sort of magic
> "thing" that probably has a reference to the original string, so
> you don't get the progressive RAM refunds that slicing gives, and
> you'd still have to deal with lots of the other consequences. It's
> probably doable, but it would be a lot of pain.

but I'm hard-pressed to come up with any use case where direct
indexing into a (non-byte)string makes sense unless you've already
processed/searched up to that point and can use a recorded index
from that processing/search.

Can you provide real-world examples of "I need character 2832 from
this string of unicode text, but I never had to scan to that point
linearly from the beginning/end of the string"?

-tkc



Steve D'Aprano

Jan 21, 2017, 8:58:23 AM
On Sat, 21 Jan 2017 09:35 am, Pete Forman wrote:

> Can anyone point me at a rationale for PEP 393 being incorporated in
> Python 3.3 over using UTF-8 as an internal string representation?

I've read over the PEP, and the email discussion, and there is very little
mention of UTF-8, and as far as I can see no counter-proposal for using
UTF-8. However, there are a few mentions of UTF-8 that suggest that the
participants were aware of it as an alternative, and simply didn't think it
was worth considering. I don't know why.

You can read the PEP and the mailing list discussion here:

The PEP:

https://www.python.org/dev/peps/pep-0393/

Mailing list discussion starts here:

https://mail.python.org/pipermail/python-dev/2011-January/107641.html

Stefan Behnel (author of Cython) states that UTF-8 is much harder to use:

https://mail.python.org/pipermail/python-dev/2011-January/107739.html

I see nobody challenging that claim, so perhaps there was simply enough
broad agreement that UTF-8 would have been more work and so nobody wanted
to propose it. I'm just guessing though.

Perhaps it would have been too big a change to adapt the CPython internals
to variable-width UTF-8 from the existing fixed-width UTF-16 and UTF-32
implementations?

(I know that UTF-16 is actually variable-width, but Python prior to PEP 393
treated it as if it were fixed.)

There was a much earlier discussion about the internal implementation of
Unicode strings:

https://mail.python.org/pipermail/python-3000/2006-September/003795.html

including some discussion of UTF-8:

https://mail.python.org/pipermail/python-3000/2006-September/003816.html

It too proposed using a three-way internal implementation, and made it clear
that O(1) indexing was a requirement.

Here's a comment explicitly pointing out that constant-time indexing is
wanted, and that using UTF-8 with a two-level table destroys any space
advantage UTF-8 might have:

https://mail.python.org/pipermail/python-3000/2006-September/003822.html

Ironically, Martin v. Löwis, the author of PEP 393, originally started off
opposing a three-way internal representation, calling it "terrible":

https://mail.python.org/pipermail/python-3000/2006-September/003891.html

Another factor which I didn't see discussed anywhere is that Python strings
treat surrogates as normal code points. I believe that would be troublesome
for a UTF-8 implementation:

py> '\uDC37'.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc37' in
position 0: surrogates not allowed


but of course with a UCS-2 or UTF-32 implementation it is trivial: you just
treat the surrogate as another code point like any other.


[...]
> ISTM that most operations on strings are via iterators and thus agnostic
> to variable or fixed width encodings.

Slicing is not.

start = text.find(":")
end = text.rfind("!")
assert end > start
chunk = text[start:end]

But even with iteration, we still would expect that indexes be consecutive:

for i, c in enumerate(text):
    assert c == text[i]


The complexity of those functions will be greatly increased with UTF-8. Of
course you can make it work, and you can even hide the fact that UTF-8 has
variable-width code points. But you can't have all three of:

- simplicity;
- memory efficiency;
- O(1) operations

with UTF-8.

But of course, I'd be happy for a competing Python implementation to use
UTF-8 and prove me wrong!




--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

Steve D'Aprano

Jan 21, 2017, 9:44:43 AM
On Sat, 21 Jan 2017 11:45 pm, Tim Chase wrote:

> but I'm hard-pressed to come up with any use case where direct
> indexing into a (non-byte)string makes sense unless you've already
> processed/searched up to that point and can use a recorded index
> from that processing/search.


Let's take a simple example: you do a find to get an offset, and then slice
from that offset.

py> text = "αβγдлфxx"
py> offset = text.find("ф")
py> stuff = text[offset:]
py> assert stuff == "фxx"


That works fine whether indexing refers to code points or bytes.

py> "αβγдлфxx".find("ф")
5
py> "αβγдлфxx".encode('utf-8').find("ф".encode('utf-8'))
10

Either way, you get the expected result. However:

py> stuff = text[offset + 1:]
py> assert stuff == "xx"


That requires indexes to point to the beginning of *code points*, not bytes:
taking byte 11 of "αβγдлфxx".encode('utf-8') drops you into the middle of
the ф representation:

py> "αβγдлфxx".encode('utf-8')[11:]
b'\x84xx'

and it isn't a valid UTF-8 substring. Slicing would generate an exception
unless you happened to slice right at the start of a code point.

It's like seek() and tell() on text files: you cannot seek to arbitrary
positions, but only to the opaque positions returned by tell. That's
unacceptable for strings.

You could avoid that error by increasing the offset by the right amount:

stuff = text[offset + len("ф".encode('utf-8')):]

which is awful. I believe that's what Go and Julia expect you to do.

Another solution would be to have the string slicing method automatically
scan forward to the start of the next valid UTF-8 code point. That would be
the "Do What I Mean" solution.

The problem with the DWIM solution is that not only is it adding complexity,
but it's frankly *weird*. It would mean:


- if the character at position `offset` fits in 2 bytes:
text[offset+1:] == text[offset+2:]

- if it fits in 3 bytes:
text[offset+1:] == text[offset+2:] == text[offset+3:]

- and if it fits in 4 bytes:
text[offset+1:] == text[offset+2:] == text[offset+3:] == text[offset+4:]


Having the string slicing method Do The Right Thing would actually be The
Wrong Thing. It would make it awful to reason about slicing.

You can avoid this by having the interpreter treat the Python-level indexes
as opaque "code point offsets", and converting them to and from "byte
offsets" as needed. That's not even very hard. But it either turns every
indexing into O(N) (since you have to walk the string to count which byte
represents the nth code point), or you have to keep an auxiliary table with
every string, letting you convert from byte indexes to code point indexes
quickly, but that will significantly increase the memory size of every
string, blowing out the advantage of using UTF-8 in the first place.

Pete Forman

Jan 21, 2017, 10:50:50 AM
Steve D'Aprano <steve+...@pearwood.info> writes:

> [...]
> Another factor which I didn't see discussed anywhere is that Python
> strings treat surrogates as normal code points. I believe that would
> be troublesome for a UTF-8 implementation:
>
> py> '\uDC37'.encode('utf-8')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udc37' in
> position 0: surrogates not allowed
>
> but of course with a UCS-2 or UTF-32 implementation it is trivial: you
> just treat the surrogate as another code point like any other.

Thanks for a very thorough reply, most useful. I'm going to pick you up
on the above, though.

Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
and UTF-32. The rules for UTF-8 were tightened up in Unicode 4 and RFC
3629 (2003). There is CESU-8 if you really need a naive encoding of
UTF-16 to UTF-8-alike.

py> low = '\uDC37'

is only meaningful on narrow builds pre Python 3.3, where the user must
do extra work to correctly handle characters outside the BMP.

--
Pete Forman
https://payg-petef.rhcloud.com

Jussi Piitulainen

Jan 21, 2017, 10:56:51 AM
Steve D'Aprano writes:

[snip]

> You could avoid that error by increasing the offset by the right
> amount:
>
> stuff = text[offset + len("ф".encode('utf-8')):]
>
> which is awful. I believe that's what Go and Julia expect you to do.

Julia provides a method to get the next index.

let text = "ἐπὶ οἴνοπα πόντον", offset = 1
    while offset <= endof(text)
        print(text[offset], ".")
        offset = nextind(text, offset)
    end
    println()
end # prints: ἐ.π.ὶ. .ο.ἴ.ν.ο.π.α. .π.ό.ν.τ.ο.ν.

Chris Angelico

Jan 21, 2017, 11:18:19 AM
This implies that regular iteration isn't good enough, though.

Here's a function that creates a numbered list:

def print_list(items):
    width = len(str(len(items)))
    for idx, item in enumerate(items, 1):
        print("%*d: %s" % (width, idx, item))

In Python, this will happily accept anything that is iterable and has
a known length. Could be a list or tuple, obviously, but can also just
as easily be a dict view (keys or items), a range object, or.... a
string. It's perfectly acceptable to enumerate the characters of a
string. And enumerate() itself is implemented entirely generically. If
you have to call nextind() to get the next character, you've made it
impossible to do any kind of generic operation on the text. You can't
do a windowed view by slicing while iterating, you can't have a "lag"
or "lead" value, you can't do any of those kinds of simple and obvious
index-based operations.

Oh, and Python 3.3 wasn't the first programming language to use this
flexible string representation. Pike introduced an extremely similar
string representation back in 1998:

https://github.com/pikelang/Pike/commit/db4a4

So yes, UTF-8 has its advantages. But it also has its costs, and for a
text processing language like Pike or Python, they significantly
outweigh the benefits.

ChrisA

Jussi Piitulainen

Jan 21, 2017, 1:54:13 PM
Chris Angelico writes:

> On Sun, Jan 22, 2017 at 2:56 AM, Jussi Piitulainen wrote:
>> Steve D'Aprano writes:
>>
>> [snip]
>>
>>> You could avoid that error by increasing the offset by the right
>>> amount:
>>>
>>> stuff = text[offset + len("ф".encode('utf-8')):]
>>>
>>> which is awful. I believe that's what Go and Julia expect you to do.
>>
>> Julia provides a method to get the next index.
>>
>> let text = "ἐπὶ οἴνοπα πόντον", offset = 1
>> while offset <= endof(text)
>> print(text[offset], ".")
>> offset = nextind(text, offset)
>> end
>> println()
>> end # prints: ἐ.π.ὶ. .ο.ἴ.ν.ο.π.α. .π.ό.ν.τ.ο.ν.
>
> This implies that regular iteration isn't good enough, though.

It doesn't. Here's the straightforward iteration over the whole string:

let text = "ἐπὶ οἴνοπα πόντον"
    for c in text
        print(c, ".")
    end
    println()
end # prints: ἐ.π.ὶ. .ο.ἴ.ν.ο.π.α. .π.ό.ν.τ.ο.ν.

One can also join any iterable whose elements can be converted to
strings, and characters can:

let text = "ἐπὶ οἴνοπα πόντον"
    println(join(text, "."), ".")
end # prints: ἐ.π.ὶ. .ο.ἴ.ν.ο.π.α. .π.ό.ν.τ.ο.ν.

And strings, trivially, can:

let text = "ἐπὶ οἴνοπα πόντον"
    println(join(split(text), "."), ".")
end # prints: ἐπὶ.οἴνοπα.πόντον.

> Here's a function that creates a numbered list:
>
> def print_list(items):
> width = len(str(len(items)))
> for idx, item in enumerate(items, 1):
> print("%*d: %s" % (width, idx, item))
>
> In Python, this will happily accept anything that is iterable and has
> a known length. Could be a list or tuple, obviously, but can also just
> as easily be a dict view (keys or items), a range object, or.... a
> string. It's perfectly acceptable to enumerate the characters of a
> string. And enumerate() itself is implemented entirely generically.

I'll skip the formatting - I don't know off-hand how to do it - but keep
the width calculation, and I cut the character iterator short at 10
items to save some space. There, it's much the same in Julia:

let text = "ἐπὶ οἴνοπα πόντον"

    function print_list(items)
        width = endof(string(length(items)))
        println("width = ", width)
        for (idx, item) in enumerate(items)
            println(idx, '\t', item)
        end
    end

    print_list(take(text, 10))
    print_list([text, text, text])
    print_list(split(text))
end

That prints this:

width = 2
1 ἐ
2 π
3 ὶ
4
5 ο
6 ἴ
7 ν
8 ο
9 π
10 α
width = 1
1 ἐπὶ οἴνοπα πόντον
2 ἐπὶ οἴνοπα πόντον
3 ἐπὶ οἴνοπα πόντον
width = 1
1 ἐπὶ
2 οἴνοπα
3 πόντον

> If you have to call nextind() to get the next character, you've made
> it impossible to do any kind of generic operation on the text. You
> can't do a windowed view by slicing while iterating, you can't have a
> "lag" or "lead" value, you can't do any of those kinds of simple and
> obvious index-based operations.

Yet Julia does with ease many things that you seem to think it cannot
possibly do at all. The iteration system works on types that have
methods for certain generic functions. For strings, the default is to
iterate over something like its characters; I think another iterator
over valid indexes is available, or wouldn't be hard to write; it could
be forward or backward, and in Julia many of these things are often
peekable by default (because the iteration protocol itself does not have
state - see below at "more magic").

The usual things work fine:

let text = "ἐπὶ οἴνοπα πόντον"
    foreach(print, enumerate(zip(text, split(text))))
end # prints: (1,('ἐ',"ἐπὶ"))(2,('π',"οἴνοπα"))(3,('ὶ',"πόντον"))

How is that bad?

More magic:

let text = "ἐπὶ οἴνοπα πόντον"
    let ever = cycle(split(text))
        println(first(ever))
        println(first(ever))
        for n in 2:6
            println(join(take(ever, n), " "))
        end
    end
end

This prints the following. The cycle iterator, ever, produces an
endless repetition of the three words, but it doesn't have state like
Python iterators do, so it's possible to look at the first word twice
(and then five more times).

ἐπὶ
ἐπὶ
ἐπὶ οἴνοπα
ἐπὶ οἴνοπα πόντον
ἐπὶ οἴνοπα πόντον ἐπὶ
ἐπὶ οἴνοπα πόντον ἐπὶ οἴνοπα
ἐπὶ οἴνοπα πόντον ἐπὶ οἴνοπα πόντον

> Oh, and Python 3.3 wasn't the first programming language to use this
> flexible string representation. Pike introduced an extremely similar
> string representation back in 1998:
>
> https://github.com/pikelang/Pike/commit/db4a4

Ok. Is GitHub that old?

> So yes, UTF-8 has its advantages. But it also has its costs, and for a
> text processing language like Pike or Python, they significantly
> outweigh the benefits.

I process text in my work but I really don't use character indexes much
at all. Rather split, join, startswith, endswith, that kind of thing,
and whether a string contains some character or substring anywhere.

Marko Rauhamaa

Jan 21, 2017, 2:52:51 PM
Pete Forman <petef4...@gmail.com>:

> Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
> and UTF-32.

Also, they don't exist as Unicode code points. Python shouldn't allow
surrogate characters in strings.

Thus the range of code points that are available for use as
characters is U+0000–U+D7FF and U+E000–U+10FFFF (1,112,064 code
points).

<URL: https://en.wikipedia.org/wiki/Unicode>


The Unicode Character Database is basically a table of characters
indexed using integers called ’code points’. Valid code points are in
the ranges 0 to #xD7FF inclusive or #xE000 to #x10FFFF inclusive,
which is about 1.1 million code points.

<URL: https://www.gnu.org/software/guile/docs/master/guile.html/Characters.html>

Guile does the right thing:

scheme@(guile-user)> #\xd7ff
$1 = #\153777
scheme@(guile-user)> #\xe000
$2 = #\160000
scheme@(guile-user)> #\xd812
While reading expression:
ERROR: In procedure scm_lreadr: #<unknown port>:5:8: out-of-range hex character escape: xd812

> py> low = '\uDC37'

That should raise a SyntaxError exception.


Marko

Pete Forman

Jan 21, 2017, 3:21:30 PM
Marko Rauhamaa <ma...@pacujo.net> writes:

>> py> low = '\uDC37'
>
> That should raise a SyntaxError exception.

Quite. My point was that with older Python on a narrow build (Windows
and Mac) you need to understand that you are using UTF-16 rather than
Unicode. On a wide build or Python 3.3+ then all is rosy. (At this point
I'm tempted to put in a winky emoji but that might push the internal
representation into UCS-4.)

--
Pete Forman

eryk sun

Jan 21, 2017, 3:50:26 PM
CPython allows surrogate codes for use with the "surrogateescape" and
"surrogatepass" error handlers, which are used for POSIX and Windows
file-system encoding, respectively. Maybe MicroPython goes about the
file-system round-trip problem differently, or maybe it just requires
using bytes for file-system and environment-variable names on POSIX
and doesn't care about Windows.

"surrogateescape" allows 'decoding' arbitrary bytes:

>>> b'\x81'.decode('ascii', 'surrogateescape')
'\udc81'
>>> '\udc81'.encode('ascii', 'surrogateescape')
b'\x81'

This error handler is required by CPython on POSIX to handle arbitrary
bytes in file-system paths. For example, when running with LANG=C:

>>> sys.getfilesystemencoding()
'ascii'
>>> os.listdir(b'.')
[b'\x81']
>>> os.listdir('.')
['\udc81']

"surrogatepass" allows encoding surrogates:

>>> '\udc81'.encode('utf-8', 'surrogatepass')
b'\xed\xb2\x81'
>>> b'\xed\xb2\x81'.decode('utf-8', 'surrogatepass')
'\udc81'

This error handler is used by CPython 3.6+ to encode Windows UCS-2
file-system paths as WTF-8 (Wobbly). For example:

>>> os.listdir('.')
['\udc81']
>>> os.listdir(b'.')
[b'\xed\xb2\x81']

Matt Ruffalo

Jan 21, 2017, 8:29:54 PM
On 2017-01-21 10:50, Pete Forman wrote:
> Thanks for a very thorough reply, most useful. I'm going to pick you up
> on the above, though.
>
> Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
> and UTF-32. The rules for UTF-8 were tightened up in Unicode 4 and RFC
> 3629 (2003). There is CESU-8 if you really need a naive encoding of
> UTF-16 to UTF-8-alike.
>
> py> low = '\uDC37'
>
> is only meaningful on narrow builds pre Python 3.3 where the user must
> do extra to correctly handle characters outside the BMP.

Hi Pete-

Lone surrogate characters have a standardized use in Python, not just in
narrow builds of Python <= 3.2. Unpaired high surrogate characters are
used to store any bytes that couldn't be decoded with a given character
encoding scheme, for use in OS/filesystem interfaces that use arbitrary
byte strings:

"""
Python 3.6.0 (default, Dec 23 2016, 08:25:24)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> s = 'héllo'
>>> b = s.encode('latin-1')
>>> b
b'h\xe9llo'
>>> from os import fsdecode, fsencode
>>> decoded = fsdecode(b)
>>> decoded
'h\udce9llo'
>>> fsencode(decoded)
b'h\xe9llo'
"""

This provides a mechanism for lossless round-trip decoding and encoding
of arbitrary byte strings which aren't valid under the user's locale.
This is absolutely necessary in POSIX systems in which filenames can
contain any sequence of bytes despite the user's locale, and is even
necessary in Windows, where filenames are stored as opaque
not-quite-UCS2 strings:

"""
Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64
bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> from pathlib import Path
>>> import os
>>> os.chdir(Path('~/Desktop').expanduser())
>>> filename = '\udcf9'
>>> with open(filename, 'w'): pass

>>> os.listdir('.')
['desktop.ini', '\udcf9']
"""

MMR...

Tim Chase

Jan 21, 2017, 8:34:35 PM
On 2017-01-22 01:44, Steve D'Aprano wrote:
> On Sat, 21 Jan 2017 11:45 pm, Tim Chase wrote:
>
> > but I'm hard-pressed to come up with any use case where direct
> > indexing into a (non-byte)string makes sense unless you've already
> > processed/searched up to that point and can use a recorded index
> > from that processing/search.
>
>
> Let's take a simple example: you do a find to get an offset, and
> then slice from that offset.
>
> py> text = "αβγдлфxx"
> py> offset = text.find("ф")

Right, so here, you've done a (likely linear, but however you get
here) search, which then makes sense to use this opaque "offset"
token for slicing purposes:

> py> stuff = text[offset:]
> py> assert stuff == "фxx"

> That works fine whether indexing refers to code points or bytes.
>
> py> "αβγдлфxx".find("ф")
> 5
> py> "αβγдлфxx".encode('utf-8').find("ф".encode('utf-8'))
> 10
>
> Either way, you get the expected result. However:
>
> py> stuff = text[offset + 1:]
> py> assert stuff == "xx"
>
> That requires indexes to point to the beginning of *code points*,
> not bytes: taking byte 11 of "αβγдлфxx".encode('utf-8') drops you
> into the middle of the ф representation:
>
> py> "αβγдлфxx".encode('utf-8')[11:]
> b'\x84xx'
>
> and it isn't a valid UTF-8 substring. Slicing would generate an
> exception unless you happened to slice right at the start of a code
> point.

Right. It gets even weirder (edge-case'ier) when dealing with
combining characters:


>>> s = "man\N{COMBINING TILDE}ana"
>>> for i, c in enumerate(s): print("%i: %s" % (i, c))
...
0: m
1: a
2: n
3:˜
4: a
5: n
6: a
>>> ''.join(reversed(s))
'anãnam'

Offsetting s[3:] produces a (sub)string that begins with a combining
character that doesn't have anything preceding it to combine with.

> It's like seek() and tell() on text files: you cannot seek to
> arbitrary positions, but only to the opaque positions returned by
> tell. That's unacceptable for strings.

I'm still unclear on *why* this would be considered unacceptable for
strings. It makes sense when dealing with byte-strings, since they
contain binary data that may need to get sliced at arbitrary
offsets. But for strings, slicing only makes sense (for every
use-case I've been able to come up with) in the context of known
offsets like you describe with tell(). The cost of not using opaque
tell()-like offsets is, as you describe, slicing in the middle of
characters.

> You could avoid that error by increasing the offset by the right
> amount:
>
> stuff = text[offset + len("ф".encode('utf-8')):]
>
> which is awful. I believe that's what Go and Julia expect you to do.

It may be awful, but only because it hasn't been pythonified. If the
result from calling .find() on a string returned a "StringOffset"
object, then it would make sense that its __add__/__radd__ methods
would accept an integer and do such translation for you.
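
For instance, a purely hypothetical sketch of such an offset token over a
UTF-8 buffer (the StringOffset class below is invented for illustration,
not a proposal for an actual API); __add__ advances code-point-wise so the
byte offset always stays on a character boundary:

class StringOffset:
    """Hypothetical opaque offset token into a UTF-8 buffer."""
    def __init__(self, buf, byte_pos):
        self.buf, self.byte_pos = buf, byte_pos

    def __add__(self, ncodepoints):
        pos = self.byte_pos
        for _ in range(ncodepoints):
            pos += 1                         # step past the lead byte...
            while pos < len(self.buf) and (self.buf[pos] & 0xC0) == 0x80:
                pos += 1                     # ...and its continuation bytes
        return StringOffset(self.buf, pos)

buf = "αβγдлфxx".encode('utf-8')
off = StringOffset(buf, buf.find("ф".encode('utf-8')))   # as if find() returned a token
print(buf[(off + 1).byte_pos:].decode('utf-8'))           # 'xx'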

> You can avoid this by having the interpreter treat the Python-level
> indexes as opaque "code point offsets", and converting them to and
> from "byte offsets" as needed. That's not even very hard. But it
> either turns every indexing into O(N) (since you have to walk the
> string to count which byte represents the nth code point)

The O(N) cost has to be paid at some point, but I'd put forth that
other operations like .find() already pay that O(N) cost and can
return an opaque "offset token" that can be subsequently used for O(1)
indexing (multiple times if needed).

-tkc












Steve D'Aprano

Jan 21, 2017, 9:36:59 PM
On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:

> Pete Forman <petef4...@gmail.com>:
>
>> Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
>> and UTF-32.
>
> Also, they don't exist as Unicode code points. Python shouldn't allow
> surrogate characters in strings.

Not quite. This is where it gets a bit messy and confusing. The bottom line
is: surrogates *are* code points, but they aren't *characters*. Strings
which contain surrogates are strictly speaking illegal, although some
programming languages (including Python) allow them.

The Unicode standard defines surrogates as follows:

http://www.unicode.org/glossary/

- Surrogate Character. A misnomer. It would be an encoded character
having a surrogate code point, which is impossible. Do not use
this term.

- Surrogate Code Point. A Unicode code point in the range
U+D800..U+DFFF. Reserved for use by UTF-16, where a pair of
surrogate code units (a high surrogate followed by a low surrogate)
“stand in” for a supplementary code point.

- Surrogate Pair. A representation for a single abstract character
that consists of a sequence of two 16-bit code units, where the
first value of the pair is a high-surrogate code unit, and the
second is a low-surrogate code unit. (See definition D75 in
Section 3.8, Surrogates.)


http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf#G2630


So far so good, this is clear: surrogates are not characters. Surrogate
pairs are only used by UTF-16 (since that's the only UTF which uses 16-bit
code units).

Suppose you read two code units (four bytes) in UTF-16 (big endian):

b'\xd8\x02\xdd\x00'

That could be ambiguous, as it could mean:

- a single code point, U+10900 PHOENICIAN LETTER ALF, encoded as
the surrogate pair, U+D802 U+DD00;

- two surrogate code points, U+D802 followed by U+DD00.

UTF-16 definitely rejects the second alternative and categorically makes
only the first valid. To ensure that is the only valid interpretation, the
second is explicitly disallowed: conforming Unicode strings are not allowed
to include surrogate code points, regardless of whether you are intending
to encode them in UTF-32 (where there is no ambiguity) or UTF-16 (where
there is). Only UTF-16 encoded bytes are allowed to include surrogate code
*units*, and only in pairs.

However, Python (and other languages?) take a less strict approach. In
Python 3.3 and better, the code point U+10900 (Phoenician Alf) is encoded
using a single four-byte code unit: 0x00010900, so there's no ambiguity.
That allows Python to encode surrogates as double-byte code units with no
ambiguity: the interpreter can distinguish a single 32-bit number
0x00010900 from a pair of 16-bit numbers 0x0001 0x0900 and treat the first
as Alf and the second as two surrogates.
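
A small illustration with CPython's codecs (the 'surrogatepass' error
handler is needed to get the second reading into bytes at all):

py> data = b'\xd8\x02\xdd\x00'
py> len(data.decode('utf-16-be'))     # the UTF-16 decoder sees one code point
1
py> '\U00010900'.encode('utf-16-be')  # Alf encodes to those same four bytes
b'\xd8\x02\xdd\x00'
py> '\ud802\udd00'.encode('utf-16-be', 'surrogatepass')   # so do two lone surrogates
b'\xd8\x02\xdd\x00'
py> '\ud802\udd00'.encode('utf-16-be')                    # strict UTF-16 refuses them
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-be' codec can't encode character '\ud802' in
position 0: surrogates not allowed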

By the letter of the Unicode standard, it should not do this, but
nevertheless it does and it appears to do no real harm and have some
benefit.


The glossary also says:


- High-Surrogate Code Point. A Unicode code point in the range
U+D800 to U+DBFF. (See definition D71 in Section 3.8, Surrogates.)

- High-Surrogate Code Unit. A 16-bit code unit in the range D800_16
to DBFF_16, used in UTF-16 as the leading code unit of a surrogate
pair. Also known as a leading surrogate. (See definition D72 in
Section 3.8, Surrogates.)

- Low-Surrogate Code Point. A Unicode code point in the range
U+DC00 to U+DFFF. (See definition D73 in Section 3.8, Surrogates.)

- Low-Surrogate Code Unit. A 16-bit code unit in the range DC00_16
to DFFF_16, used in UTF-16 as the trailing code unit of a surrogate
pair. Also known as a trailing surrogate. (See definition D74 in
Section 3.8, Surrogates.)


So we can certainly talk about surrogates being code points: the code points
U+D800 through U+DFFF inclusive are surrogate code points, but not
characters.

They're not "non-characters" either. Unicode includes exactly 66 code points
formally defined as "non-characters":

- Noncharacter. A code point that is permanently reserved for internal
use. Noncharacters consist of the values U+nFFFE and U+nFFFF (where
n ranges from 0 to 10 hexadecimal), and the values U+FDD0..U+FDEF. See
the FAQ on Private-Use Characters, Noncharacters and Sentinels.

http://www.unicode.org/faq/private_use.html#noncharacters

So even though noncharacters (with or without the hyphen) are code points
reserved for internal use, and surrogates are code points reserved for the
internal use of the UTF-16 encoder, and surrogates are not characters,
surrogates are not noncharacters.

Naming things is hard.


> Thus the range of code points that are available for use as
> characters is U+0000–U+D7FF and U+E000–U+10FFFF (1,112,064 code
> points).
>
> <URL: https://en.wikipedia.org/wiki/Unicode>

That is correct for a strictly conforming Unicode implementation.


>> py> low = '\uDC37'
>
> That should raise a SyntaxError exception.

If Python was strictly conforming, that is correct, but it turns out there
are some useful things you can do with strings if you allow surrogates.

Steve D'Aprano

Jan 21, 2017, 9:43:11 PM
On Sun, 22 Jan 2017 07:21 am, Pete Forman wrote:

> Marko Rauhamaa <ma...@pacujo.net> writes:
>
>>> py> low = '\uDC37'
>>
>> That should raise a SyntaxError exception.
>
> Quite. My point was that with older Python on a narrow build (Windows
> and Mac) you need to understand that you are using UTF-16 rather than
> Unicode.

But you're *not* using UTF-16, at least not proper UTF-16, in older narrow
builds. If you were, then Unicode strings u'...' containing surrogate pairs
would be treated as single supplementary code points, but they aren't.

unichr() doesn't support supplementary code points in narrow builds:

[steve@ando ~]$ python2.7 -c "print len(unichr(0x10900))"
Traceback (most recent call last):
File "<string>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)


and even if you sneak a supplementary code point in, it is treated wrongly:

[steve@ando ~]$ python2.7 -c "print len(u'\U00010900')"
2


So Python narrow builds are more like a bastard hybrid of UCS-2 and UTF-16.

Steven D'Aprano

Jan 22, 2017, 1:47:49 AM
On Sunday 22 January 2017 06:58, Tim Chase wrote:

> Right. It gets even weirder (edge-case'ier) when dealing with
> combining characters:
>
>
>>>> s = "man\N{COMBINING TILDE}ana"
>>>> for i, c in enumerate(s): print("%i: %s" % (i, c))
> ...
> 0: m
> 1: a
> 2: n
> 3:˜
> 4: a
> 5: n
> 6: a
>>>> ''.join(reversed(s))
> 'anãnam'
>
> Offsetting s[3:] produces a (sub)string that begins with a combining
> character that doesn't have anything preceding it to combine with.

That doesn't matter. Unicode is a universal character set, not a universal
*grapheme* set. But even speaking about characters is misleading: Unicode's
"characters" (note the scare quotes) are abstract code points which can
represent at least:

- letters of alphabets
- digits
- punctuation marks
- ideographs
- line drawing symbols
- emoji
- noncharacters

Since it doesn't promise to only provide graphemes (I can write "$\N{COMBINING
TILDE}" which is not a valid grapheme in any human language) it doesn't matter
if you end up with lone combining characters. Or rather, it does matter, but
fixing that is not Unicode's responsibility. That should become a layer built
on top of Unicode.


>> It's like seek() and tell() on text files: you cannot seek to
>> arbitrary positions, but only to the opaque positions returned by
>> tell. That's unacceptable for strings.
>
> I'm still unclear on *why* this would be considered unacceptable for
> strings.

Sometimes you want to slice at a particular index which is *not* an opaque
position returned by find().

text[offset + 1:]

Or for that matter:

middle_character = text[len(text)//2]


Forbidding those sorts of operations is simply too big a break with previous
versions.


> It makes sense when dealing with byte-strings, since they
> contain binary data that may need to get sliced at arbitrary
> offsets. But for strings, slicing only makes sense (for every
> use-case I've been able to come up with) in the context of known
> offsets like you describe with tell().

I'm sorry, I find it hard to believe that you've never needed to add or
subtract 1 from a given offset returned by find() or equivalent.


> The cost of not using opaque
> tell()-like offsets is, as you describe, slicing in the middle of
> characters.

>> You could avoid that error by increasing the offset by the right
>> amount:
>>
>> stuff = text[offset + len("ф".encode('utf-8')):]
>>
>> which is awful. I believe that's what Go and Julia expect you to do.
>
> It may be awful, but only because it hasn't been pythonified.

No, it's awful no matter what. It makes it painful to reason about which code
points will be picked up by a slice. What's the length of...?

text[offset:offset+5]

In current Python, that's got to be five code points (excluding the edge cases
of slicing past the end of the string). But with opaque indexes, that could be
anything from 1 to 5 code points.
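To make that concrete, compare code point slicing with naive byte
slicing of the UTF-8 encoding (the Greek string is just an arbitrary
example):

py> text = 'αβγδεζη'
py> text[1:1 + 5]        # always exactly five code points
'βγδεζ'
py> data = text.encode('utf-8')
py> data[2:2 + 5]        # five *bytes*: two characters and half a third
b'\xce\xb2\xce\xb3\xce'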


> If the
> result from calling .find() on a string returns a "StringOffset"
> object, then it would make sense that its __add__/__radd__ methods
> would accept an integer and do such translation for you.

At cost of predictability.


>> You can avoid this by having the interpreter treat the Python-level
>> indexes as opaque "code point offsets", and converting them to and
>> from "byte offsets" as needed. That's not even very hard. But it
>> either turns every indexing into O(N) (since you have to walk the
>> string to count which byte represents the nth code point)
>
> The O(N) cost has to be paid at some point, but I'd put forth that
> other operations like .find() already pay that O(N) cost and can
> return an opaque "offset token" that can be subsequently used for O(1)
> indexing (multiple times if needed).

Sure -- but only at the cost of blowing out the complexity and memory
requirements of the string, which completely negates the point in using UTF-8
in the first place.
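To see why, here is a deliberately naive sketch of such an "offset
token" scheme (the class is hypothetical, not anything CPython actually
does): the side table needs one integer per code point, which can easily
cost more than the UTF-8 encoding saved in the first place.

class Utf8String:
    """Hypothetical UTF-8 string with O(1) code point indexing.
    Negative indices and slices are omitted for brevity."""

    def __init__(self, text):
        self._data = text.encode('utf-8')
        # One entry per code point: the byte offset where it starts.
        # This table alone can outweigh the memory that UTF-8 saved.
        self._offsets = []
        pos = 0
        for ch in text:
            self._offsets.append(pos)
            pos += len(ch.encode('utf-8'))
        self._offsets.append(pos)   # sentinel: end of the last character

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, index):
        start = self._offsets[index]
        end = self._offsets[index + 1]
        return self._data[start:end].decode('utf-8')

py> s = Utf8String('mañana')
py> len(s), s[2]
(6, 'ñ')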



--
Steven
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." - Jon Ronson

Marko Rauhamaa

unread,
Jan 22, 2017, 3:13:32 AM1/22/17
to
eryk sun <ery...@gmail.com>:

> On Sat, Jan 21, 2017 at 8:21 PM, Pete Forman <petef4...@gmail.com> wrote:
>> Marko Rauhamaa <ma...@pacujo.net> writes:
>>
>>>> py> low = '\uDC37'
>>>
>>> That should raise a SyntaxError exception.
>>
>> Quite. [...]
>
> CPython allows surrogate codes for use with the "surrogateescape" and
> "surrogatepass" error handlers, which are used for POSIX and Windows
> file-system encoding, respectively.

Yes, but at the cost of violating Unicode, leading to unprintable
strings etc. In my opinion, Python should have "stayed pure" instead of
playing cheap tricks with surrogates.

(Of course, Unicode itself is a mess, but that's another story.)
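To illustrate the "unprintable strings": a string holding a lone
surrogate cannot be encoded (and therefore not printed to a UTF-8
terminal) under the strict default error handler:

py> low = '\uDC37'
py> low.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc37' in
position 0: surrogates not allowed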


Marko

Marko Rauhamaa

unread,
Jan 22, 2017, 3:34:17 AM1/22/17
to
Steve D'Aprano <steve+...@pearwood.info>:

> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>> Also, [surrogates] don't exist as Unicode code points. Python
>> shouldn't allow surrogate characters in strings.
>
> Not quite. This is where it gets a bit messy and confusing. The bottom
> line is: surrogates *are* code points, but they aren't *characters*.

All animals are equal, but some animals are more equal than others.

> Strings which contain surrogates are strictly speaking illegal,
> although some programming languages (including Python) allow them.

Python shouldn't allow them.

> The Unicode standard defines surrogates as follows:
> [...]
>
> - Surrogate Code Point. A Unicode code point in the range
> U+D800..U+DFFF. Reserved for use by UTF-16,

The writer of the standard is playing word games, maybe to offer a fig
leaf to Windows, Java et al.

> By the letter of the Unicode standard, [Python] should not do this,
> but nevertheless it does and it appears to do no real harm and have
> some benefit.

I'm afraid Python's choice may lead to exploitable security holes in
Python programs.

>>> py> low = '\uDC37'
>>
>> That should raise a SyntaxError exception.
>
> If Python was strictly conforming, that is correct, but it turns out
> there are some useful things you can do with strings if you allow
> surrogates.

Conceptual confusion is a high price to pay for such tricks.


Marko

wxjm...@gmail.com

unread,
Jan 22, 2017, 4:17:47 AM1/22/17
to
Le dimanche 22 janvier 2017 09:34:17 UTC+1, Marko Rauhamaa a écrit :
> >
> > If Python was strictly conforming, that is correct, but it turns out
> > there are some useful things you can do with strings if you allow
> > surrogates.
>
> Conceptual confusion is a high price to pay for such tricks.
>
>

Exactly, and it leads to buggy code.
When these talented devs are "repairing" something on one side,
it fails on the other side.
What is wrong by design will always stay wrong by design.
Release after release (> py32) I can systematically produce such
failures.
A very interesting case study.

Steve D'Aprano

unread,
Jan 22, 2017, 9:01:44 AM1/22/17
to
On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote:

> Steve D'Aprano <steve+...@pearwood.info>:
>
>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>>> Also, [surrogates] don't exist as Unicode code points. Python
>>> shouldn't allow surrogate characters in strings.
>>
>> Not quite. This is where it gets a bit messy and confusing. The bottom
>> line is: surrogates *are* code points, but they aren't *characters*.
>
> All animals are equal, but some animals are more equal than others.

Huh?


>> Strings which contain surrogates are strictly speaking illegal,
>> although some programming languages (including Python) allow them.
>
> Python shouldn't allow them.

That's one opinion.


>> The Unicode standard defines surrogates as follows:
>> [...]
>>
>> - Surrogate Code Point. A Unicode code point in the range
>> U+D800..U+DFFF. Reserved for use by UTF-16,
>
> The writer of the standard is playing word games, maybe to offer a fig
> leaf to Windows, Java et al.

Seriously?


>> By the letter of the Unicode standard, [Python] should not do this,
>> but nevertheless it does and it appears to do no real harm and have
>> some benefit.
>
> I'm afraid Python's choice may lead to exploitable security holes in
> Python programs.

Feel free to back up that with an actual demonstration of an exploit, rather
than just FUD.


>>>> py> low = '\uDC37'
>>>
>>> That should raise a SyntaxError exception.
>>
>> If Python was strictly conforming, that is correct, but it turns out
>> there are some useful things you can do with strings if you allow
>> surrogates.
>
> Conceptual confusion is a high price to pay for such tricks.

There's a lot to comprehend about Unicode. I don't see that Python's
non-strict implementation is harder to understand than the strict version.

Marko Rauhamaa

unread,
Jan 22, 2017, 10:19:52 AM1/22/17
to
Steve D'Aprano <steve+...@pearwood.info>:

> On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote:
>
>> Steve D'Aprano <steve+...@pearwood.info>:
>>
>>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>>>> Also, [surrogates] don't exist as Unicode code points. Python
>>>> shouldn't allow surrogate characters in strings.
>>>
>>> Not quite. This is where it gets a bit messy and confusing. The
>>> bottom line is: surrogates *are* code points, but they aren't
>>> *characters*.
>>
>> All animals are equal, but some animals are more equal than others.
>
> Huh?

There is no difference between 0xD800 and 0xD8000000. They are both
numbers that don't--and won't--represent anything in Unicode. It's
pointless to call one a "code point" and not the other one. A code point
that isn't code for anything can barely be called a code point.

I'm guessing 0xD800 is called a code point because it was always called
that. It was dropped out when UTF-16 was invented but they didn't want
to "demote" the number retroactively, especially since Windows and Java
already were allowing them in strings.

>>> By the letter of the Unicode standard, [Python] should not do this,
>>> but nevertheless it does and it appears to do no real harm and have
>>> some benefit.
>>
>> I'm afraid Python's choice may lead to exploitable security holes in
>> Python programs.
>
> Feel free to back up that with an actual demonstration of an exploit,
> rather than just FUD.

It might come as a surprise to programmers that some pathnames cannot be
UTF-encoded or displayed. Also, those situations might not show up
during testing but only with appropriately crafted input.


Marko

Steve D'Aprano

unread,
Jan 22, 2017, 9:14:32 PM1/22/17
to
On Mon, 23 Jan 2017 02:19 am, Marko Rauhamaa wrote:

> Steve D'Aprano <steve+...@pearwood.info>:
>
>> On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote:
>>
>>> Steve D'Aprano <steve+...@pearwood.info>:
>>>
>>>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>>>>> Also, [surrogates] don't exist as Unicode code points. Python
>>>>> shouldn't allow surrogate characters in strings.
>>>>
>>>> Not quite. This is where it gets a bit messy and confusing. The
>>>> bottom line is: surrogates *are* code points, but they aren't
>>>> *characters*.
>>>
>>> All animals are equal, but some animals are more equal than others.
>>
>> Huh?
>
> There is no difference between 0xD800 and 0xD8000000.

Arithmetic disagrees:

py> 0xD800 == 0xD8000000
False


> They are both
> numbers that don't--and won't--represent anything in Unicode.

Your use of hex notation 0x... indicates that you're talking about code
units rather than U+... code points. The first one 0xD800 could be:

- a Little Endian double-byte code unit for 'Ø' in either UCS-2 or UTF-16;

- a Big Endian double-byte code unit that has no special meaning in UCS-2;

- one half of a surrogate pair (two double-byte code units) in Big Endian
UTF-16, encoding some unknown supplementary code point.

The second one 0xD8000000 could be:

- a C long (four-byte int) 3623878656, which is out of range for Big Endian
UCS-4 or UTF-32;

- the Little Endian four-byte code unit for 'Ø' in either UCS-4 or UTF-32.
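Those readings are easy to confirm at the interpreter (CPython 3, with
its strict default codecs):

py> b'\xd8\x00'.decode('utf-16-le')           # the two bytes of 0xD800
'Ø'
py> b'\xd8\x00\x00\x00'.decode('utf-32-le')   # the four bytes of 0xD8000000
'Ø'
# Decoding b'\xd8\x00' as 'utf-16-be' raises UnicodeDecodeError instead:
# it is the leading half of a surrogate pair with nothing following it.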


> It's pointless to call one a "code point" and not the other one.

Neither of them are code points. You're confusing the concrete
representation with the abstract character.

Perhaps you meant to compare the code point U+D800 to, well, there's no
comparison to be made, because "U+D8000000" is not valid and is completely
out of range. The largest code point is U+10FFFF.


> A code point
> that isn't code for anything can barely be called a code point.

It does have a purpose. Or even more than one.

- It ensures that there is a one-to-one mapping between code points and
code units in any specific encoding and byte-order.

- By reserving those code points, it ensures that they cannot be
accidentally used by the standard for something else.

- It makes it easier to talk about the entities: "U+D800 is a surrogate
code point reserved for UTF-16 surrogates", as opposed to "U+D800 isn't
anything, but if it was something, it would be a code point reserved
for UTF-16 surrogates".

- Or worse, forcing us to talk in terms of code units (implementation)
instead of abstract characters, which is painfully verbose:

"0xD800 in Big Endian UTF-16, or 0x00D8 in Little Endian UTF-16, or
0x0000D800 in Big Endian UTF-32, or 0x00D80000 in Little Endian
UTF-32, doesn't map to any code point but is reserved for UTF-16
surrogate pairs."


And, an entirely unforeseen purpose:

- It allows languages like Python to (ab)use surrogate code points for
round-tripping file names which aren't valid Unicode.
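A minimal sketch of that round trip, using a made-up file name (this is
essentially what os.fsdecode() and os.fsencode() do on POSIX systems):

py> name = b'report\xff.txt'.decode('utf-8', 'surrogateescape')
py> name
'report\udcff.txt'
py> name.encode('utf-8', 'surrogateescape')
b'report\xff.txt'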


[...]
>>> I'm afraid Python's choice may lead to exploitable security holes in
>>> Python programs.
>>
>> Feel free to back up that with an actual demonstration of an exploit,
>> rather than just FUD.
>
> It might come as a surprise to programmers that pathnames cannot be
> UTF-encoded or displayed.

Many things come as surprises to programmers, and many pathnames cannot be
UTF-encoded.

To be precise, Mac OS requires pathnames to be both valid and normalised
UTF-8, and it would be nice if that practice spread. But Windows only
requires pathnames to consist of UCS-2 code points, and Linux pathnames are
arbitrary bytes that may include characters which are illegal on Windows.
So you don't need to involve surrogates to have undecodable pathnames.


> Also, those situations might not show up
> during testing but only with appropriately crafted input.

I'm not seeing a security exploit here.