
PEP 393 vs UTF-8 Everywhere


Pete Forman

Jan 20, 2017, 5:35:24 PM
Can anyone point me at a rationale for PEP 393 being incorporated in
Python 3.3 over using UTF-8 as an internal string representation? I've
found good articles by Nick Coghlan, Armin Ronacher and others on the
matter. What I have not found is discussion of pros and cons of
alternatives to the old narrow or wide implementation of Unicode
strings.

ISTM that most operations on strings are via iterators and thus agnostic
to variable or fixed width encodings. How important is it to be able to
get to part of a string with a simple index? Just because old skool
strings could be treated as a sequence of characters, is that a reason
to shoehorn the subtleties of Unicode into that model?

--
Pete Forman

Chris Kaynor

Jan 20, 2017, 6:07:05 PM
On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman <petef4...@gmail.com> wrote:
> Can anyone point me at a rationale for PEP 393 being incorporated in
> Python 3.3 over using UTF-8 as an internal string representation? I've
> found good articles by Nick Coghlan, Armin Ronacher and others on the
> matter. What I have not found is discussion of pros and cons of
> alternatives to the old narrow or wide implementation of Unicode
> strings.

The PEP itself has the rationale for the problems with the narrow/wide
idea; to quote from https://www.python.org/dev/peps/pep-0393/:
There are two classes of complaints about the current implementation
of the unicode type: on systems only supporting UTF-16, users complain
that non-BMP characters are not properly supported. On systems using
UCS-4 internally (and also sometimes on systems using UCS-2), there is
a complaint that Unicode strings take up too much memory - especially
compared to Python 2.x, where the same code would often use ASCII
strings (i.e. ASCII-encoded byte strings). With the proposed approach,
ASCII-only Unicode strings will again use only one byte per character;
while still allowing efficient indexing of strings containing non-BMP
characters (as strings containing them will use 4 bytes per
character).

Basically, narrow builds had very odd behavior with non-BMP
characters, namely that indexing into the string could easily produce
mojibake. Wide builds used quite a bit more memory, which generally
translates to reduced performance.
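
As a rough illustration (exact byte counts vary by CPython version and
include a fixed per-object header), sys.getsizeof shows how PEP 393 picks
the per-character width from the widest code point in the string:

import sys

ascii_s  = "a" * 1000           # all code points < 256 -> 1 byte each
bmp_s    = "\u03b1" * 1000      # widest code point < 2**16 -> 2 bytes each
astral_s = "\U0001F600" * 1000  # contains non-BMP code points -> 4 bytes each

for s in (ascii_s, bmp_s, astral_s):
    # sizes grow as roughly 1, 2 or 4 bytes per character plus the header
    print(len(s), sys.getsizeof(s))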

> ISTM that most operations on strings are via iterators and thus agnostic
> to variable or fixed width encodings. How important is it to be able to
> get to part of a string with a simple index? Just because old skool
> strings could be treated as a sequence of characters, is that a reason
> to shoehorn the subtleties of Unicode into that model?

I think you are underestimating the indexing usages of strings. Every
operation on a string using UTF8 that contains larger characters must
be completed by starting at index 0 - you can never start anywhere
else safely. rfind/rsplit/rindex/rstrip and the other related reverse
functions would require walking the string from start to end, rather
than short-circuiting by reading from right to left. With indexing
becoming linear time, many simple algorithms need to be written with
that in mind, to avoid n*n time. Such performance regressions can
often go unnoticed by developers, who are likely to be testing with
small data, and thus may cause (accidental) DOS attacks when used on
real data. The exact same problems occur with the old narrow builds
(UTF-16; note that proper variable-width indexing was NOT implemented in
those builds, which is what caused the mojibake problems) - only a UTF-32
or PEP 393 implementation can avoid those problems.
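
To make the cost concrete, here is a toy sketch (the helper utf8_index is
hypothetical, not how CPython or MicroPython actually work) of what
indexing into a raw UTF-8 buffer involves - every lookup has to count code
points from byte 0, so indexing positions 0..n-1 in a loop is O(n^2):

def utf8_index(buf, n):
    """Return the n-th code point of a UTF-8 buffer by scanning from byte 0."""
    count = -1
    for i, b in enumerate(buf):
        if (b & 0xC0) != 0x80:          # a leading byte starts a new code point
            count += 1
            if count == n:
                j = i + 1
                while j < len(buf) and (buf[j] & 0xC0) == 0x80:
                    j += 1              # swallow the continuation bytes
                return buf[i:j].decode('utf-8')
    raise IndexError(n)

buf = "αβγдлфxx".encode('utf-8')
print(utf8_index(buf, 5))               # 'ф', found only after walking 10 bytes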

Note that from a user's perspective (including most developers, if not
almost all), PEP 393 strings can be treated as if they were UTF-32, but
with many of the benefits of UTF-8. As far as I'm aware, it is only
developers
writing extension modules that need to care - and only then if they
need maximum performance, and thus cannot convert every string they
access to UTF32 or UTF8.

--
Chris Kaynor

Thomas Nyberg

Jan 20, 2017, 6:14:50 PM
On 01/20/2017 03:06 PM, Chris Kaynor wrote:
>
> [...snip...]
>
> --
> Chris Kaynor
>

I was able to delete my response which was a wholly contained subset of
this one. :)


But I have one extra question. Is string indexing guaranteed to be
constant-time for python? I thought so, but I couldn't find it
documented anywhere. (Not that I think it practically matters, since it
couldn't really change if it weren't for all the reasons you mentioned.)
I found this, which at least details (if not explicitly "guarantees")
the complexity properties of other datatypes:

https://wiki.python.org/moin/TimeComplexity

Cheers,
Thomas

Chris Angelico

Jan 20, 2017, 6:36:23 PM
On Sat, Jan 21, 2017 at 10:15 AM, Thomas Nyberg <tomu...@gmx.com> wrote:
> But I have one extra question. Is string indexing guaranteed to be
> constant-time for python? I thought so, but I couldn't find it documented
> anywhere. (Not that I think it practically matters, since it couldn't really
> change if it weren't for all the reasons you mentioned.) I found this which
> at details (if not explicitly "guarantees") the complexity properties of
> other datatypes:
>

No, it isn't; this question came up in the context of MicroPython,
which chose to go UTF-8 internally instead of PEP 393. But the
considerations for uPy are different - it's not designed to handle
gobs of data, so constant-time vs linear isn't going to have as much
impact. But in normal work, it's important enough to have predictable
string performance. You can't afford to deploy a web application, test
it, and then have someone send a large amount of data at it, causing
massive O(n^2) blowouts.

ChrisA

Chris Kaynor

Jan 20, 2017, 6:38:23 PM
On Fri, Jan 20, 2017 at 3:15 PM, Thomas Nyberg <tomu...@gmx.com> wrote:
> On 01/20/2017 03:06 PM, Chris Kaynor wrote:
>>
>>
>> [...snip...]
>>
>> --
>> Chris Kaynor
>>
>
> I was able to delete my response which was a wholly contained subset of this
> one. :)
>
>
> But I have one extra question. Is string indexing guaranteed to be
> constant-time for python? I thought so, but I couldn't find it documented
> anywhere. (Not that I think it practically matters, since it couldn't really
> change if it weren't for all the reasons you mentioned.) I found this which
> at details (if not explicitly "guarantees") the complexity properties of
> other datatypes:
>
> https://wiki.python.org/moin/TimeComplexity

As far as I'm aware, the language does not guarantee it. In fact, I
believe it was decided that MicroPython could use UTF8 strings with
linear indexing while still calling itself Python. This was very
useful for MicroPython due to the platforms it supports (embedded),
and needing to keep the memory footprint very small.

I believe Guido (on Python-ideas) has stated that constant-time string
indexing is a guarantee of CPython, however.

The only reference I found in my (very quick) search is the Python-Dev
thread at https://groups.google.com/forum/#!msg/dev-python/3lfXwljNLj8/XxO2s0TGYrYJ

MRAB

Jan 20, 2017, 7:19:14 PM
On 2017-01-20 23:06, Chris Kaynor wrote:
> On Fri, Jan 20, 2017 at 2:35 PM, Pete Forman <petef4...@gmail.com> wrote:
>> Can anyone point me at a rationale for PEP 393 being incorporated in
>> Python 3.3 over using UTF-8 as an internal string representation? I've
>> found good articles by Nick Coghlan, Armin Ronacher and others on the
>> matter. What I have not found is discussion of pros and cons of
>> alternatives to the old narrow or wide implementation of Unicode
>> strings.
>
> The PEP itself has the rational for the problems with the narrow/wide
> idea, the quote from https://www.python.org/dev/peps/pep-0393/:
> There are two classes of complaints about the current implementation
> of the unicode type:on systems only supporting UTF-16, users complain
> that non-BMP characters are not properly supported. On systems using
> UCS-4 internally (and also sometimes on systems using UCS-2), there is
> a complaint that Unicode strings take up too much memory - especially
> compared to Python 2.x, where the same code would often use ASCII
> strings (i.e. ASCII-encoded byte strings). With the proposed approach,
> ASCII-only Unicode strings will again use only one byte per character;
> while still allowing efficient indexing of strings containing non-BMP
> characters (as strings containing them will use 4 bytes per
> character).
>
> Basically, narrow builds had very odd behavior with non-BMP
> characters, namely that indexing into the string could easily produce
> mojibake. Wide builds used quite a bit more memory, which generally
> translates to reduced performance.
>
>> ISTM that most operations on strings are via iterators and thus agnostic
>> to variable or fixed width encodings. How important is it to be able to
>> get to part of a string with a simple index? Just because old skool
>> strings could be treated as a sequence of characters, is that a reason
>> to shoehorn the subtleties of Unicode into that model?
>
> I think you are underestimating the indexing usages of strings. Every
> operation on a string using UTF8 that contains larger characters must
> be completed by starting at index 0 - you can never start anywhere
> else safely. rfind/rsplit/rindex/rstrip and the other related reverse
> functions would require walking the string from start to end, rather
> than short-circuiting by reading from right to left. With indexing
> becoming linear time, many simple algorithms need to be written with
> that in mind, to avoid n*n time. Such performance regressions can
> often go unnoticed by developers, who are likely to be testing with
> small data, and thus may cause (accidental) DOS attacks when used on
> real data. The exact same problems occur with the old narrow builds
> (UTF16; note that this was NOT implemented in those builds, however,
> which caused the mojibake problems) as well - only a UTF32 or PEP393
> implementation can avoid those problems.
>
You could implement rsplit and rstrip easily enough, but rfind and
rindex return the index, so you'd need to scan the string to return that.

> Note that from a user (including most developers, if not almost all),
> PEP393 strings can be treated as if they were UTF32, but with many of
> the benefits of UTF8. As far as I'm aware, it is only developers
> writing extension modules that need to care - and only then if they
> need maximum performance, and thus cannot convert every string they
> access to UTF32 or UTF8.
>
As someone who has written an extension, I can tell you that I much
prefer dealing with a fixed number of bytes per codepoint than a
variable number of bytes per codepoint, especially as I'm also
supporting earlier versions of Python where that was the case.

Pete Forman

Jan 20, 2017, 7:30:26 PM
I'm taking as a given that the old way was often sub-optimal in many
scenarios. My questions were about the alternatives, and why PEP 393 was
chosen over other approaches.

>> ISTM that most operations on strings are via iterators and thus
>> agnostic to variable or fixed width encodings. How important is it to
>> be able to get to part of a string with a simple index? Just because
>> old skool strings could be treated as a sequence of characters, is
>> that a reason to shoehorn the subtleties of Unicode into that model?
>
> I think you are underestimating the indexing usages of strings. Every
> operation on a string using UTF8 that contains larger characters must
> be completed by starting at index 0 - you can never start anywhere
> else safely. rfind/rsplit/rindex/rstrip and the other related reverse
> functions would require walking the string from start to end, rather
> than short-circuiting by reading from right to left. With indexing
> becoming linear time, many simple algorithms need to be written with
> that in mind, to avoid n*n time. Such performance regressions can
> often go unnoticed by developers, who are likely to be testing with
> small data, and thus may cause (accidental) DOS attacks when used on
> real data. The exact same problems occur with the old narrow builds
> (UTF16; note that this was NOT implemented in those builds, however,
> which caused the mojibake problems) as well - only a UTF32 or PEP393
> implementation can avoid those problems.

I was asserting that most useful operations on strings start from index
0. The r* operations would not be slowed down that much, as UTF-8 has the
useful property that a byte which is not at the start of a sequence (a
code point sequence, that is, not a Python sequence) is recognizably a
continuation byte and so quick to skip over while working backwards from
the end.
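
To illustrate, here is a sketch of that backward step over a UTF-8 buffer
(prev_codepoint_start is a hypothetical helper, not anything in the
stdlib): every continuation byte has its top two bits set to 10, so you
skip back over those until you land on a leading byte:

def prev_codepoint_start(buf, pos):
    """Step back from a code point boundary to the start of the previous one."""
    pos -= 1
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:    # 0b10xxxxxx: continuation
        pos -= 1
    return pos

buf = "aф€b".encode('utf-8')    # 1-, 2-, 3- and 1-byte code points
pos = len(buf)
while pos > 0:
    pos = prev_codepoint_start(buf, pos)
    print(pos, buf[pos:].decode('utf-8')[0])        # 6 b, 3 €, 1 ф, 0 a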

The only significant use of an index dereference that I could come up
with was the result of a find() or index(). I put out this public
question so that I could be enclued as to other uses. My personal
experience is that in most cases where I might consider find(), I end up
using re instead and work with the match groups, which hold copies of
the (sub)strings that I want.

> Note that from a user (including most developers, if not almost all),
> PEP393 strings can be treated as if they were UTF32, but with many of
> the benefits of UTF8. As far as I'm aware, it is only developers
> writing extension modules that need to care - and only then if they
> need maximum performance, and thus cannot convert every string they
> access to UTF32 or UTF8.

PEP 393 already says that "the specification chooses UTF-8 as the
recommended way of exposing strings to C code".

--
Pete Forman

Pete Forman

Jan 20, 2017, 7:52:02 PM
MRAB <pyt...@mrabarnett.plus.com> writes:

> As someone who has written an extension, I can tell you that I much
> prefer dealing with a fixed number of bytes per codepoint than a
> variable number of bytes per codepoint, especially as I'm also
> supporting earlier versions of Python where that was the case.

At the risk of sounding harsh, if supporting variable bytes per
codepoint is a pain you should roll with it for the greater good of
supporting users.

PEP 393 / Python 3.3 required extension writers to revisit their access
to strings. My explicit question was about why PEP 393 was adopted to
replace the deficient old implementations rather than another approach.
The implicit question is whether a UTF-8 internal representation should
replace that of PEP 393.

--
Pete Forman

Chris Angelico

Jan 20, 2017, 7:58:19 PM
On Sat, Jan 21, 2017 at 11:30 AM, Pete Forman <petef4...@gmail.com> wrote:
> I was asserting that most useful operations on strings start from index
> 0. The r* operations would not be slowed down that much as UTF-8 has the
> useful property that attempting to interpret from a byte that is not at
> the start of a sequence (in the sense of a code point rather than
> Python) is invalid and so quick to move over while working backwards
> from the end.

Let's take one very common example: decoding JSON. A ton of web
servers out there will call json.loads() on user-supplied data. The
bulk of the work is in the scanner, which steps through the string and
does the actual parsing. That function is implemented in Python, so
it's a good example. (There is a C accelerator, but we can ignore that
and look at the pure Python one.)

So, how could you implement this function? The current implementation
maintains an index - an integer position through the string. It
repeatedly requests the next character as string[idx], and can also
slice the string (to check for keywords like "true") or use a regex
(to check for numbers). Everything's clean, but it's lots of indexing.
Alternatively, it could remove and discard characters as they're
consumed. It would maintain a string that consists of all the unparsed
characters. All indexing would be at or near zero, but after every
tiny piece of parsing, the string would get sliced.

With immutable UTF-8 strings, both of these would be O(n^2). Either
indexing is linear, so parsing the tail of the string means scanning
repeatedly; or slicing is linear, so parsing the head of the string
means slicing all the rest away.

The only way for it to be fast enough would be to have some sort of
retainable string iterator, which means exposing an opaque "position
marker" that serves no purpose other than parsing. Every string parse
operation would have to be reimplemented this way, lest it perform
abysmally on large strings. It'd mean some sort of magic "thing" that
probably has a reference to the original string, so you don't get the
progressive RAM refunds that slicing gives, and you'd still have to
deal with lots of the other consequences. It's probably doable, but it
would be a lot of pain.
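
For a feel of the shape of that code, here is a stripped-down sketch (not
the actual json.scanner source; parse_value and skip_ws are invented for
the example): an integer index is carried along, characters are fetched
with s[idx], and small slices are compared against keywords - all of which
assumes those operations are cheap:

WHITESPACE = " \t\n\r"

def skip_ws(s, idx):
    while idx < len(s) and s[idx] in WHITESPACE:
        idx += 1
    return idx

def parse_value(s, idx):
    idx = skip_ws(s, idx)
    if s[idx:idx + 4] == "true":             # keyword check via slicing
        return True, idx + 4
    if s[idx:idx + 5] == "false":
        return False, idx + 5
    if s[idx] == '"':                        # very naive string scan
        end = s.index('"', idx + 1)
        return s[idx + 1:end], end + 1
    raise ValueError("unsupported input at index %d" % idx)

print(parse_value('   "hello"', 0))          # ('hello', 10)
print(parse_value("true", 0))                # (True, 4)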

ChrisA

Chris Angelico

Jan 20, 2017, 8:01:16 PM
On Sat, Jan 21, 2017 at 11:51 AM, Pete Forman <petef4...@gmail.com> wrote:
> MRAB <pyt...@mrabarnett.plus.com> writes:
>
>> As someone who has written an extension, I can tell you that I much
>> prefer dealing with a fixed number of bytes per codepoint than a
>> variable number of bytes per codepoint, especially as I'm also
>> supporting earlier versions of Python where that was the case.
>
> At the risk of sounding harsh, if supporting variable bytes per
> codepoint is a pain you should roll with it for the greater good of
> supporting users.

That hasn't been demonstrated, though. There's plenty of evidence
regarding cache usage that shows that direct indexing is incredibly
beneficial on large strings. What are the benefits of variable-sized
encodings? AFAIK, the only real benefit is that you can use less
memory for strings that contain predominantly ASCII but a small number
of astral characters (plus *maybe* a faster encode-to-UTF-8; you
wouldn't get a faster decode-from-UTF-8, because you still need to
check that the byte sequence is valid). Can you show a use-case that
would be materially improved by UTF-8?

ChrisA

MRAB

Jan 20, 2017, 9:10:32 PM
On 2017-01-21 00:51, Pete Forman wrote:
> MRAB <pyt...@mrabarnett.plus.com> writes:
>
>> As someone who has written an extension, I can tell you that I much
>> prefer dealing with a fixed number of bytes per codepoint than a
>> variable number of bytes per codepoint, especially as I'm also
>> supporting earlier versions of Python where that was the case.
>
> At the risk of sounding harsh, if supporting variable bytes per
> codepoint is a pain you should roll with it for the greater good of
> supporting users.
>
Or I could decide not to bother and leave it to someone else to continue
the project. After all, it's not like I'm getting paid for the work;
it's purely voluntary.

> PEP 393 / Python 3.3 required extension writers to revisit their access
> to strings. My explicit question was about why PEP 393 was adopted to
> replace the deficient old implementations rather than another approach.
> The implicit question is whether a UTF-8 internal representation should
> replace that of PEP 393.
>
I already had to handle 1-byte bytestrings and 2/4-byte (narrow/wide)
Unicode strings, so switching to 1/2/4 strings wasn't too bad. Switching
to a completely different, variable-width system would've been a lot
more work.

Paul Rubin

Jan 21, 2017, 1:01:22 AM
Chris Angelico <ros...@gmail.com> writes:
> decoding JSON... the scanner, which steps through the string and
> does the actual parsing. ...
> The only way for it to be fast enough would be to have some sort of
> retainable string iterator, which means exposing an opaque "position
> marker" that serves no purpose other than parsing.

Python already has that type of iterator:
x = "foo"
for c in x: ....

> It'd mean some sort of magic "thing" that probably has a reference to
> the original string

It's a regular old string iterator unless I'm missing something. Of
course a json parser should use it, though who uses the non-C json
parser anyway these days?

[Chris Kaynor writes:]
> rfind/rsplit/rindex/rstrip and the other related reverse
> functions would require walking the string from start to end, rather
> than short-circuiting by reading from right to left.

UTF-8 can be read from right to left because you can recognize when a
codepoint begins by looking at the top 2 bits of each byte as you scan
backwards. A byte whose top two bits are 10 is always a continuation
byte; any other combination marks a leading byte. This "prefix property"
of UTF-8 is a design feature and not a trick someone noticed after the
fact.

Also if you really want O(1) random access, you could put an auxiliary
table into long strings, giving the byte offset of every 256th codepoint
or something like that. Then you'd go to the nearest table entry and
scan from there. This would usually be in-cache scanning so quite fast.
Or use the related representation of "ropes" which are also very easy to
concatenate if they can be nested. Erlang does something like that
with what it calls "binaries".
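
A minimal sketch of that auxiliary-table idea (hypothetical code, with a
checkpoint every 256 code points; build_index and char_at are names made
up for the example): lookups jump to the nearest checkpoint and scan only
the remainder, at the cost of one stored byte offset per 256 code points.

CHUNK = 256

def build_index(buf):
    """Byte offsets of code points 0, 256, 512, ... in a UTF-8 buffer."""
    offsets, count = [], 0
    for i, b in enumerate(buf):
        if (b & 0xC0) != 0x80:               # leading byte starts a code point
            if count % CHUNK == 0:
                offsets.append(i)
            count += 1
    return offsets

def char_at(buf, offsets, n):
    i = offsets[n // CHUNK]                  # jump to the nearest checkpoint
    remaining = n % CHUNK
    while remaining:
        i += 1
        if (buf[i] & 0xC0) != 0x80:          # crossed onto the next code point
            remaining -= 1
    j = i + 1
    while j < len(buf) and (buf[j] & 0xC0) == 0x80:
        j += 1
    return buf[i:j].decode('utf-8')

buf = ("αβγдлф" * 1000).encode('utf-8')
offsets = build_index(buf)                   # ~24 ints for 6000 code points
print(char_at(buf, offsets, 1001))           # 'ф' without scanning from byte 0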

Chris Angelico

Jan 21, 2017, 1:23:22 AM
On Sat, Jan 21, 2017 at 5:01 PM, Paul Rubin <no.e...@nospam.invalid> wrote:
> Chris Angelico <ros...@gmail.com> writes:
>> decoding JSON... the scanner, which steps through the string and
>> does the actual parsing. ...
>> The only way for it to be fast enough would be to have some sort of
>> retainable string iterator, which means exposing an opaque "position
>> marker" that serves no purpose other than parsing.
>
> Python already has that type of iterator:
> x = "foo"
> for c in x: ....
>
>> It'd mean some sort of magic "thing" that probably has a reference to
>> the original string
>
> It's a regular old string iterator unless I'm missing something. Of
> course a json parser should use it, though who uses the non-C json
> parser anyway these days?

You can't do a look-ahead with a vanilla string iterator. That's
necessary for a lot of parsers.

> Also if you really want O(1) random access, you could put an auxiliary
> table into long strings, giving the byte offset of every 256th codepoint
> or something like that. Then you'd go to the nearest table entry and
> scan from there. This would usually be in-cache scanning so quite fast.
> Or use the related representation of "ropes" which are also very easy to
> concatenate if they can be nested. Erlang does something like that
> with what it calls "binaries".

Yes, which gives a two-level indexing (first find the strand, then the
character), and that's going to play pretty badly with CPU caches. I'd
be curious to know how an alternate Python with that implementation
would actually perform.

ChrisA

Jussi Piitulainen

Jan 21, 2017, 1:38:50 AM
Chris Angelico writes:
Julia does this. It has immutable UTF-8 strings, and there is a JSON
parser. The "opaque position marker" is just the byte index. An attempt
to use an invalid index throws an error. A substring type points to an
underlying string. An iterator, called graphemes, even returns
substrings that correspond to what people might consider a character.

I offer Julia as evidence.

My impression is that Julia's UTF-8-based system works and is not a
pain. I wrote a toy function once to access the last line of a large
memory-mapped text file, so I have just this little bit of personal
experience of it, so far. Incidentally, can Python memory-map a UTF-8
file as a string?

http://docs.julialang.org/en/stable/manual/strings/
https://github.com/JuliaIO/JSON.jl

Paul Rubin

Jan 21, 2017, 4:14:37 AM
Chris Angelico <ros...@gmail.com> writes:
> You can't do a look-ahead with a vanilla string iterator. That's
> necessary for a lot of parsers.

For JSON? For other parsers you usually have a tokenizer that reads
characters with maybe 1 char of lookahead.

> Yes, which gives a two-level indexing (first find the strand, then the
> character), and that's going to play pretty badly with CPU caches.

If you're jumping around at random all over the string, you probably
really want a bytearray rather than a unicode string. If you're
scanning sequentially you won't have to look at the outer table very
often.

Tim Chase

Jan 21, 2017, 7:54:41 AM
On 2017-01-21 11:58, Chris Angelico wrote:
> So, how could you implement this function? The current
> implementation maintains an index - an integer position through the
> string. It repeatedly requests the next character as string[idx],
> and can also slice the string (to check for keywords like "true")
> or use a regex (to check for numbers). Everything's clean, but it's
> lots of indexing.

But in these parsing cases, the indexes all originate from stepping
through the string from the beginning and processing it
codepointwise. Even this is a bit of an oddity, especially once you
start taking combining characters into consideration and need to
process them with the preceding character(s). So while you may be
doing indexing, those indexes usually stem from having walked to that
point, not arbitrarily picking some offset.

You allude to it in your:

> The only way for it to be fast enough would be to have some sort of
> retainable string iterator, which means exposing an opaque "position
> marker" that serves no purpose other than parsing. Every string
> parse operation would have to be reimplemented this way, lest it
> perform abysmally on large strings. It'd mean some sort of magic
> "thing" that probably has a reference to the original string, so
> you don't get the progressive RAM refunds that slicing gives, and
> you'd still have to deal with lots of the other consequences. It's
> probably doable, but it would be a lot of pain.

but I'm hard-pressed to come up with any use case where direct
indexing into a (non-byte)string makes sense unless you've already
processed/searched up to that point and can use a recorded index
from that processing/search.

Can you provide real-world examples of "I need character 2832 from
this string of unicode text, but I never had to scan to that point
linearly from the beginning/end of the string"?

-tkc



Steve D'Aprano

Jan 21, 2017, 8:58:23 AM
On Sat, 21 Jan 2017 09:35 am, Pete Forman wrote:

> Can anyone point me at a rationale for PEP 393 being incorporated in
> Python 3.3 over using UTF-8 as an internal string representation?

I've read over the PEP, and the email discussion, and there is very little
mention of UTF-8, and as far as I can see no counter-proposal for using
UTF-8. However, there are a few mentions of UTF-8 that suggest that the
participants were aware of it as an alternative, and simply didn't think it
was worth considering. I don't know why.

You can read the PEP and the mailing list discussion here:

The PEP:

https://www.python.org/dev/peps/pep-0393/

Mailing list discussion starts here:

https://mail.python.org/pipermail/python-dev/2011-January/107641.html

Stefan Behnel (author of Cython) states that UTF-8 is much harder to use:

https://mail.python.org/pipermail/python-dev/2011-January/107739.html

I see nobody challenging that claim, so perhaps there was simply enough
broad agreement that UTF-8 would have been more work and so nobody wanted
to propose it. I'm just guessing though.

Perhaps it would have been too big a change to adapt the CPython internals
to variable-width UTF-8 from the existing fixed-width UTF-16 and UTF-32
implementations?

(I know that UTF-16 is actually variable-width, but Python prior to PEP 393
treated it as if it were fixed.)

There was a much earlier discussion about the internal implementation of
Unicode strings:

https://mail.python.org/pipermail/python-3000/2006-September/003795.html

including some discussion of UTF-8:

https://mail.python.org/pipermail/python-3000/2006-September/003816.html

It too proposed using a three-way internal implementation, and made it clear
that O(1) indexing was a requirement.

Here's a comment explicitly pointing out that constant-time indexing is
wanted, and that using UTF-8 with a two-level table destroys any space
advantage UTF-8 might have:

https://mail.python.org/pipermail/python-3000/2006-September/003822.html

Ironically, Martin v. Löwis, the author of PEP 393, originally started off
opposing a three-way internal representation, calling it "terrible":

https://mail.python.org/pipermail/python-3000/2006-September/003891.html

Another factor which I didn't see discussed anywhere is that Python strings
treat surrogates as normal code points. I believe that would be troublesome
for a UTF-8 implementation:

py> '\uDC37'.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc37' in
position 0: surrogates not allowed


but of course with a UCS-2 or UTF-32 implementation it is trivial: you just
treat the surrogate as another code point like any other.


[...]
> ISTM that most operations on strings are via iterators and thus agnostic
> to variable or fixed width encodings.

Slicing is not.

start = text.find(":")
end = text.rfind("!")
assert end > start
chunk = text[start:end]

But even with iteration, we still would expect that indexes be consecutive:

for i, c in enumerate(text):
    assert c == text[i]


The complexity of those functions will be greatly increased with UTF-8. Of
course you can make it work, and you can even hide the fact that UTF-8 has
variable-width code points. But you can't have all three of:

- simplicity;
- memory efficiency;
- O(1) operations

with UTF-8.

But of course, I'd be happy for a competing Python implementation to use
UTF-8 and prove me wrong!




--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

Steve D'Aprano

Jan 21, 2017, 9:44:43 AM
On Sat, 21 Jan 2017 11:45 pm, Tim Chase wrote:

> but I'm hard-pressed to come up with any use case where direct
> indexing into a (non-byte)string makes sense unless you've already
> processed/searched up to that point and can use a recorded index
> from that processing/search.


Let's take a simple example: you do a find to get an offset, and then slice
from that offset.

py> text = "αβγдлфxx"
py> offset = text.find("ф")
py> stuff = text[offset:]
py> assert stuff == "фxx"


That works fine whether indexing refers to code points or bytes.

py> "αβγдлфxx".find("ф")
5
py> "αβγдлфxx".encode('utf-8').find("ф".encode('utf-8'))
10

Either way, you get the expected result. However:

py> stuff = text[offset + 1:]
py> assert stuff == "xx"


That requires indexes to point to the beginning of *code points*, not bytes:
taking byte 11 of "αβγдлфxx".encode('utf-8') drops you into the middle of
the ф representation:

py> "αβγдлфxx".encode('utf-8')[11:]
b'\x84xx'

and it isn't a valid UTF-8 substring. Slicing would generate an exception
unless you happened to slice right at the start of a code point.

It's like seek() and tell() on text files: you cannot seek to arbitrary
positions, but only to the opaque positions returned by tell. That's
unacceptable for strings.

You could avoid that error by increasing the offset by the right amount:

stuff = text[offset + len("ф".encode('utf-8')):]

which is awful. I believe that's what Go and Julia expect you to do.

Another solution would be to have the string slicing method automatically
scan forward to the start of the next valid UTF-8 code point. That would be
the "Do What I Mean" solution.

The problem with the DWIM solution is that not only is it adding complexity,
but it's frankly *weird*. It would mean:


- if the character at position `offset` fits in 2 bytes:
text[offset+1:] == text[offset+2:]

- if it fits in 3 bytes:
text[offset+1:] == text[offset+2:] == text[offset+3:]

- and if it fits in 4 bytes:
text[offset+1:] == text[offset+2:] == text[offset+3:] == text[offset+4:]


Having the string slicing method Do The Right Thing would actually be The
Wrong Thing. It would make it awful to reason about slicing.

You can avoid this by having the interpreter treat the Python-level indexes
as opaque "code point offsets", and converting them to and from "byte
offsets" as needed. That's not even very hard. But it either turns every
indexing into O(N) (since you have to walk the string to count which byte
represents the nth code point), or you have to keep an auxiliary table with
every string, letting you convert from byte indexes to code point indexes
quickly, but that will significantly increase the memory size of every
string, blowing out the advantage of using UTF-8 in the first place.

Pete Forman

Jan 21, 2017, 10:50:50 AM
Steve D'Aprano <steve+...@pearwood.info> writes:

> [...]
> Another factor which I didn't see discussed anywhere is that Python
> strings treat surrogates as normal code points. I believe that would
> be troublesome for a UTF-8 implementation:
>
> py> '\uDC37'.encode('utf-8')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udc37' in
> position 0: surrogates not allowed
>
> but of course with a UCS-2 or UTF-32 implementation it is trivial: you
> just treat the surrogate as another code point like any other.

Thanks for a very thorough reply, most useful. I'm going to pick you up
on the above, though.

Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
and UTF-32. The rules for UTF-8 were tightened up in Unicode 4 and RFC
3629 (2003). There is CESU-8 if you really need a naive encoding of
UTF-16 to UTF-8-alike.

py> low = '\uDC37'

is only meaningful on narrow builds pre Python 3.3, where the user must
do extra work to correctly handle characters outside the BMP.

--
Pete Forman
https://payg-petef.rhcloud.com

Jussi Piitulainen

Jan 21, 2017, 10:56:51 AM
Steve D'Aprano writes:

[snip]

> You could avoid that error by increasing the offset by the right
> amount:
>
> stuff = text[offset + len("ф".encode('utf-8')):]
>
> which is awful. I believe that's what Go and Julia expect you to do.

Julia provides a method to get the next index.

let text = "ἐπὶ οἴνοπα πόντον", offset = 1
    while offset <= endof(text)
        print(text[offset], ".")
        offset = nextind(text, offset)
    end
    println()
end # prints: ἐ.π.ὶ. .ο.ἴ.ν.ο.π.α. .π.ό.ν.τ.ο.ν.

Chris Angelico

Jan 21, 2017, 11:18:19 AM
This implies that regular iteration isn't good enough, though.

Here's a function that creates a numbered list:

def print_list(items):
    width = len(str(len(items)))
    for idx, item in enumerate(items, 1):
        print("%*d: %s" % (width, idx, item))

In Python, this will happily accept anything that is iterable and has
a known length. Could be a list or tuple, obviously, but can also just
as easily be a dict view (keys or items), a range object, or.... a
string. It's perfectly acceptable to enumerate the characters of a
string. And enumerate() itself is implemented entirely generically. If
you have to call nextind() to get the next character, you've made it
impossible to do any kind of generic operation on the text. You can't
do a windowed view by slicing while iterating, you can't have a "lag"
or "lead" value, you can't do any of those kinds of simple and obvious
index-based operations.

Oh, and Python 3.3 wasn't the first programming language to use this
flexible string representation. Pike introduced an extremely similar
string representation back in 1998:

https://github.com/pikelang/Pike/commit/db4a4

So yes, UTF-8 has its advantages. But it also has its costs, and for a
text processing language like Pike or Python, they significantly
outweigh the benefits.

ChrisA

Jussi Piitulainen

Jan 21, 2017, 1:54:13 PM
Chris Angelico writes:

> On Sun, Jan 22, 2017 at 2:56 AM, Jussi Piitulainen wrote:
>> Steve D'Aprano writes:
>>
>> [snip]
>>
>>> You could avoid that error by increasing the offset by the right
>>> amount:
>>>
>>> stuff = text[offset + len("ф".encode('utf-8')):]
>>>
>>> which is awful. I believe that's what Go and Julia expect you to do.
>>
>> Julia provides a method to get the next index.
>>
>> let text = "ἐπὶ οἴνοπα πόντον", offset = 1
>> while offset <= endof(text)
>> print(text[offset], ".")
>> offset = nextind(text, offset)
>> end
>> println()
>> end # prints: ἐ.π.ὶ. .ο.ἴ.ν.ο.π.α. .π.ό.ν.τ.ο.ν.
>
> This implies that regular iteration isn't good enough, though.

It doesn't. Here's the straightforward iteration over the whole string:

let text = "ἐπὶ οἴνοπα πόντον"
    for c in text
        print(c, ".")
    end
    println()
end # prints: ἐ.π.ὶ. .ο.ἴ.ν.ο.π.α. .π.ό.ν.τ.ο.ν.

One can also join any iterable whose elements can be converted to
strings, and characters can:

let text = "ἐπὶ οἴνοπα πόντον"
    println(join(text, "."), ".")
end # prints: ἐ.π.ὶ. .ο.ἴ.ν.ο.π.α. .π.ό.ν.τ.ο.ν.

And strings, trivially, can:

let text = "ἐπὶ οἴνοπα πόντον"
    println(join(split(text), "."), ".")
end # prints: ἐπὶ.οἴνοπα.πόντον.

> Here's a function that creates a numbered list:
>
> def print_list(items):
> width = len(str(len(items)))
> for idx, item in enumerate(items, 1):
> print("%*d: %s" % (width, idx, item))
>
> In Python, this will happily accept anything that is iterable and has
> a known length. Could be a list or tuple, obviously, but can also just
> as easily be a dict view (keys or items), a range object, or.... a
> string. It's perfectly acceptable to enumerate the characters of a
> string. And enumerate() itself is implemented entirely generically.

I'll skip the formatting - I don't know off-hand how to do it - but keep
the width calculation, and I cut the character iterator short at 10
items to save some space. There, it's much the same in Julia:

let text = "ἐπὶ οἴνοπα πόντον"

    function print_list(items)
        width = endof(string(length(items)))
        println("width = ", width)
        for (idx, item) in enumerate(items)
            println(idx, '\t', item)
        end
    end

    print_list(take(text, 10))
    print_list([text, text, text])
    print_list(split(text))
end

That prints this:

width = 2
1 ἐ
2 π
3 ὶ
4
5 ο
6 ἴ
7 ν
8 ο
9 π
10 α
width = 1
1 ἐπὶ οἴνοπα πόντον
2 ἐπὶ οἴνοπα πόντον
3 ἐπὶ οἴνοπα πόντον
width = 1
1 ἐπὶ
2 οἴνοπα
3 πόντον

> If you have to call nextind() to get the next character, you've made
> it impossible to do any kind of generic operation on the text. You
> can't do a windowed view by slicing while iterating, you can't have a
> "lag" or "lead" value, you can't do any of those kinds of simple and
> obvious index-based operations.

Yet Julia does with ease many things that you seem to think it cannot
possibly do at all. The iteration system works on types that have
methods for certain generic functions. For strings, the default is to
iterate over something like its characters; I think another iterator
over valid indexes is available, or wouldn't be hard to write; it could
be forward or backward, and in Julia many of these things are often
peekable by default (because the iteration protocol itself does not have
state - see below at "more magic").

The usual things work fine:

let text = "ἐπὶ οἴνοπα πόντον"
    foreach(print, enumerate(zip(text, split(text))))
end # prints: (1,('ἐ',"ἐπὶ"))(2,('π',"οἴνοπα"))(3,('ὶ',"πόντον"))

How is that bad?

More magic:

let text = "ἐπὶ οἴνοπα πόντον"
    let ever = cycle(split(text))
        println(first(ever))
        println(first(ever))
        for n in 2:6
            println(join(take(ever, n), " "))
        end
    end
end

This prints the following. The cycle iterator, ever, produces an
endless repetition of the three words, but it doesn't have state like
Python iterators do, so it's possible to look at the first word twice
(and then five more times).

ἐπὶ
ἐπὶ
ἐπὶ οἴνοπα
ἐπὶ οἴνοπα πόντον
ἐπὶ οἴνοπα πόντον ἐπὶ
ἐπὶ οἴνοπα πόντον ἐπὶ οἴνοπα
ἐπὶ οἴνοπα πόντον ἐπὶ οἴνοπα πόντον

> Oh, and Python 3.3 wasn't the first programming language to use this
> flexible string representation. Pike introduced an extremely similar
> string representation back in 1998:
>
> https://github.com/pikelang/Pike/commit/db4a4

Ok. Is GitHub that old?

> So yes, UTF-8 has its advantages. But it also has its costs, and for a
> text processing language like Pike or Python, they significantly
> outweigh the benefits.

I process text in my work but I really don't use character indexes much
at all. Rather split, join, startswith, endswith, that kind of thing,
and whether a string contains some character or substring anywhere.

Marko Rauhamaa

Jan 21, 2017, 2:52:51 PM
Pete Forman <petef4...@gmail.com>:

> Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
> and UTF-32.

Also, they don't exist as Unicode code points. Python shouldn't allow
surrogate characters in strings.

Thus the range of code points that are available for use as
characters is U+0000–U+D7FF and U+E000–U+10FFFF (1,112,064 code
points).

<URL: https://en.wikipedia.org/wiki/Unicode>


The Unicode Character Database is basically a table of characters
indexed using integers called ’code points’. Valid code points are in
the ranges 0 to #xD7FF inclusive or #xE000 to #x10FFFF inclusive,
which is about 1.1 million code points.

<URL: https://www.gnu.org/software/guile/docs/master/guile.html/Characters.html>

Guile does the right thing:

scheme@(guile-user)> #\xd7ff
$1 = #\153777
scheme@(guile-user)> #\xe000
$2 = #\160000
scheme@(guile-user)> #\xd812
While reading expression:
ERROR: In procedure scm_lreadr: #<unknown port>:5:8: out-of-range hex character escape: xd812

> py> low = '\uDC37'

That should raise a SyntaxError exception.


Marko

Pete Forman

Jan 21, 2017, 3:21:30 PM
Marko Rauhamaa <ma...@pacujo.net> writes:

>> py> low = '\uDC37'
>
> That should raise a SyntaxError exception.

Quite. My point was that with older Python on a narrow build (Windows
and Mac) you need to understand that you are using UTF-16 rather than
Unicode. On a wide build or Python 3.3+ then all is rosy. (At this point
I'm tempted to put in a winky emoji but that might push the internal
representation into UCS-4.)

--
Pete Forman

eryk sun

Jan 21, 2017, 3:50:26 PM
CPython allows surrogate codes for use with the "surrogateescape" and
"surrogatepass" error handlers, which are used for POSIX and Windows
file-system encoding, respectively. Maybe MicroPython goes about the
file-system round-trip problem differently, or maybe it just requires
using bytes for file-system and environment-variable names on POSIX
and doesn't care about Windows.

"surrogateescape" allows 'decoding' arbitrary bytes:

>>> b'\x81'.decode('ascii', 'surrogateescape')
'\udc81'
>>> '\udc81'.encode('ascii', 'surrogateescape')
b'\x81'

This error handler is required by CPython on POSIX to handle arbitrary
bytes in file-system paths. For example, when running with LANG=C:

>>> sys.getfilesystemencoding()
'ascii'
>>> os.listdir(b'.')
[b'\x81']
>>> os.listdir('.')
['\udc81']

"surrogatepass" allows encoding surrogates:

>>> '\udc81'.encode('utf-8', 'surrogatepass')
b'\xed\xb2\x81'
>>> b'\xed\xb2\x81'.decode('utf-8', 'surrogatepass')
'\udc81'

This error handler is used by CPython 3.6+ to encode Windows UCS-2
file-system paths as WTF-8 (Wobbly). For example:

>>> os.listdir('.')
['\udc81']
>>> os.listdir(b'.')
[b'\xed\xb2\x81']

Matt Ruffalo

Jan 21, 2017, 8:29:54 PM
On 2017-01-21 10:50, Pete Forman wrote:
> Thanks for a very thorough reply, most useful. I'm going to pick you up
> on the above, though.
>
> Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
> and UTF-32. The rules for UTF-8 were tightened up in Unicode 4 and RFC
> 3629 (2003). There is CESU-8 if you really need a naive encoding of
> UTF-16 to UTF-8-alike.
>
> py> low = '\uDC37'
>
> is only meaningful on narrow builds pre Python 3.3 where the user must
> do extra to correctly handle characters outside the BMP.

Hi Pete-

Lone surrogate characters have a standardized use in Python, not just in
narrow builds of Python <= 3.2. Unpaired high surrogate characters are
used to store any bytes that couldn't be decoded with a given character
encoding scheme, for use in OS/filesystem interfaces that use arbitrary
byte strings:

"""
Python 3.6.0 (default, Dec 23 2016, 08:25:24)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> s = 'héllo'
>>> b = s.encode('latin-1')
>>> b
b'h\xe9llo'
>>> from os import fsdecode, fsencode
>>> decoded = fsdecode(b)
>>> decoded
'h\udce9llo'
>>> fsencode(decoded)
b'h\xe9llo'
"""

This provides a mechanism for lossless round-trip decoding and encoding
of arbitrary byte strings which aren't valid under the user's locale.
This is absolutely necessary in POSIX systems in which filenames can
contain any sequence of bytes despite the user's locale, and is even
necessary in Windows, where filenames are stored as opaque
not-quite-UCS2 strings:

"""
Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64
bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> from pathlib import Path
>>> import os
>>> os.chdir(Path('~/Desktop').expanduser())
>>> filename = '\udcf9'
>>> with open(filename, 'w'): pass

>>> os.listdir('.')
['desktop.ini', '\udcf9']
"""

MMR...

Tim Chase

Jan 21, 2017, 8:34:35 PM
On 2017-01-22 01:44, Steve D'Aprano wrote:
> On Sat, 21 Jan 2017 11:45 pm, Tim Chase wrote:
>
> > but I'm hard-pressed to come up with any use case where direct
> > indexing into a (non-byte)string makes sense unless you've already
> > processed/searched up to that point and can use a recorded index
> > from that processing/search.
>
>
> Let's take a simple example: you do a find to get an offset, and
> then slice from that offset.
>
> py> text = "αβγдлфxx"
> py> offset = text.find("ф")

Right, so here, you've done a (likely linear, but however you get
here) search, which then makes sense to use this opaque "offset"
token for slicing purposes:

> py> stuff = text[offset:]
> py> assert stuff == "фxx"

> That works fine whether indexing refers to code points or bytes.
>
> py> "αβγдлфxx".find("ф")
> 5
> py> "αβγдлфxx".encode('utf-8').find("ф".encode('utf-8'))
> 10
>
> Either way, you get the expected result. However:
>
> py> stuff = text[offset + 1:]
> py> assert stuff == "xx"
>
> That requires indexes to point to the beginning of *code points*,
> not bytes: taking byte 11 of "αβγдлфxx".encode('utf-8') drops you
> into the middle of the ф representation:
>
> py> "αβγдлфxx".encode('utf-8')[11:]
> b'\x84xx'
>
> and it isn't a valid UTF-8 substring. Slicing would generate an
> exception unless you happened to slice right at the start of a code
> point.

Right. It gets even weirder (edge-case'ier) when dealing with
combining characters:


>>> s = "man\N{COMBINING TILDE}ana"
>>> for i, c in enumerate(s): print("%i: %s" % (i, c))
...
0: m
1: a
2: n
3:˜
4: a
5: n
6: a
>>> ''.join(reversed(s))
'anãnam'

Offsetting s[3:] produces a (sub)string that begins with a combining
character that doesn't have anything preceding it to combine with.

> It's like seek() and tell() on text files: you cannot seek to
> arbitrary positions, but only to the opaque positions returned by
> tell. That's unacceptable for strings.

I'm still unclear on *why* this would be considered unacceptable for
strings. It makes sense when dealing with byte-strings, since they
contain binary data that may need to get sliced at arbitrary
offsets. But for strings, slicing only makes sense (for every
use-case I've been able to come up with) in the context of known
offsets like you describe with tell(). The cost of not using opaque
tell()-like offsets is, as you describe, slicing in the middle of
characters.

> You could avoid that error by increasing the offset by the right
> amount:
>
> stuff = text[offset + len("ф".encode('utf-8')):]
>
> which is awful. I believe that's what Go and Julia expect you to do.

It may be awful, but only because it hasn't been pythonified. If the
result from calling .find() on a string returned a "StringOffset"
object, then it would make sense that its __add__/__radd__ methods
would accept an integer and do such translation for you.
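
For instance, a purely hypothetical sketch of such an offset token over a
UTF-8 buffer (the StringOffset class below is invented for illustration,
not a proposal for an actual API); __add__ advances code-point-wise so the
byte offset always stays on a character boundary:

class StringOffset:
    """Hypothetical opaque offset token into a UTF-8 buffer."""
    def __init__(self, buf, byte_pos):
        self.buf, self.byte_pos = buf, byte_pos

    def __add__(self, ncodepoints):
        pos = self.byte_pos
        for _ in range(ncodepoints):
            pos += 1                         # step past the lead byte...
            while pos < len(self.buf) and (self.buf[pos] & 0xC0) == 0x80:
                pos += 1                     # ...and its continuation bytes
        return StringOffset(self.buf, pos)

buf = "αβγдлфxx".encode('utf-8')
off = StringOffset(buf, buf.find("ф".encode('utf-8')))   # as if find() returned a token
print(buf[(off + 1).byte_pos:].decode('utf-8'))           # 'xx'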

> You can avoid this by having the interpreter treat the Python-level
> indexes as opaque "code point offsets", and converting them to and
> from "byte offsets" as needed. That's not even very hard. But it
> either turns every indexing into O(N) (since you have to walk the
> string to count which byte represents the nth code point)

The O(N) cost has to be paid at some point, but I'd put forth that
other operations like .find() already pay that O(N) cost and can
return an opaque "offset token" that can be subsequently used for O(1)
indexing (multiple times if needed).

-tkc












Steve D'Aprano

Jan 21, 2017, 9:36:59 PM
On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:

> Pete Forman <petef4...@gmail.com>:
>
>> Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
>> and UTF-32.
>
> Also, they don't exist as Unicode code points. Python shouldn't allow
> surrogate characters in strings.

Not quite. This is where it gets a bit messy and confusing. The bottom line
is: surrogates *are* code points, but they aren't *characters*. Strings
which contain surrogates are strictly speaking illegal, although some
programming languages (including Python) allow them.

The Unicode standard defines surrogates as follows:

http://www.unicode.org/glossary/

- Surrogate Character. A misnomer. It would be an encoded character
having a surrogate code point, which is impossible. Do not use
this term.

- Surrogate Code Point. A Unicode code point in the range
U+D800..U+DFFF. Reserved for use by UTF-16, where a pair of
surrogate code units (a high surrogate followed by a low surrogate)
“stand in” for a supplementary code point.

- Surrogate Pair. A representation for a single abstract character
that consists of a sequence of two 16-bit code units, where the
first value of the pair is a high-surrogate code unit, and the
second is a low-surrogate code unit. (See definition D75 in
Section 3.8, Surrogates.)


http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf#G2630


So far so good, this is clear: surrogates are not characters. Surrogate
pairs are only used by UTF-16 (since that's the only UTF which uses 16-bit
code units).

Suppose you read two code units (four bytes) in UTF-16 (big endian):

b'\xd8\x02\xdd\x00'

That could be ambiguous, as it could mean:

- a single code point, U+10900 PHOENICIAN LETTER ALF, encoded as
the surrogate pair, U+D802 U+DD00;

- two surrogate code points, U+D802 followed by U+DD00.

UTF-16 definitely rejects the second alternative and categorically makes
only the first valid. To ensure that is the only valid interpretation, the
second is explicitly disallowed: conforming Unicode strings are not allowed
to include surrogate code points, regardless of whether you are intending
to encode them in UTF-32 (where there is no ambiguity) or UTF-16 (where
there is). Only UTF-16 encoded bytes are allowed to include surrogate code
*units*, and only in pairs.

However, Python (and other languages?) take a less strict approach. In
Python 3.3 and better, the code point U+10900 (Phoenician Alf) is encoded
using a single four-byte code unit: 0x00010900, so there's no ambiguity.
That allows Python to encode surrogates as double-byte code units with no
ambiguity: the interpreter can distinguish a single 32-bit number
0x00010900 from a pair of 16-bit numbers 0x0001 0x0900 and treat the first
as Alf and the second as two surrogates.
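
A small illustration with CPython's codecs (the 'surrogatepass' error
handler is needed to get the second reading into bytes at all):

py> data = b'\xd8\x02\xdd\x00'
py> len(data.decode('utf-16-be'))     # the UTF-16 decoder sees one code point
1
py> '\U00010900'.encode('utf-16-be')  # Alf encodes to those same four bytes
b'\xd8\x02\xdd\x00'
py> '\ud802\udd00'.encode('utf-16-be', 'surrogatepass')   # so do two lone surrogates
b'\xd8\x02\xdd\x00'
py> '\ud802\udd00'.encode('utf-16-be')                    # strict UTF-16 refuses them
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-be' codec can't encode character '\ud802' in
position 0: surrogates not allowed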

By the letter of the Unicode standard, it should not do this, but
nevertheless it does and it appears to do no real harm and have some
benefit.


The glossary also says:


- High-Surrogate Code Point. A Unicode code point in the range
U+D800 to U+DBFF. (See definition D71 in Section 3.8, Surrogates.)

- High-Surrogate Code Unit. A 16-bit code unit in the range D800_16
to DBFF_16, used in UTF-16 as the leading code unit of a surrogate
pair. Also known as a leading surrogate. (See definition D72 in
Section 3.8, Surrogates.)

- Low-Surrogate Code Point. A Unicode code point in the range
U+DC00 to U+DFFF. (See definition D73 in Section 3.8, Surrogates.)

- Low-Surrogate Code Unit. A 16-bit code unit in the range DC00_16
to DFFF_16, used in UTF-16 as the trailing code unit of a surrogate
pair. Also known as a trailing surrogate. (See definition D74 in
Section 3.8, Surrogates.)


So we can certainly talk about surrogates being code points: the code points
U+D800 through U+DFFF inclusive are surrogate code points, but not
characters.

They're not "non-characters" either. Unicode includes exactly 66 code points
formally defined as "non-characters":

- Noncharacter. A code point that is permanently reserved for internal
use. Noncharacters consist of the values U+nFFFE and U+nFFFF (where
n ranges from 0 to 10 hexadecimal), and the values U+FDD0..U+FDEF. See
the FAQ on Private-Use Characters, Noncharacters and Sentinels.

http://www.unicode.org/faq/private_use.html#noncharacters

So even though noncharacters (with or without the hyphen) are code points
reserved for internal use, and surrogates are code points reserved for the
internal use of the UTF-16 encoder, and surrogates are not characters,
surrogates are not noncharacters.

Naming things is hard.


> Thus the range of code points that are available for use as
> characters is U+0000–U+D7FF and U+E000–U+10FFFF (1,112,064 code
> points).
>
> <URL: https://en.wikipedia.org/wiki/Unicode>

That is correct for a strictly conforming Unicode implementation.


>> py> low = '\uDC37'
>
> That should raise a SyntaxError exception.

If Python was strictly conforming, that is correct, but it turns out there
are some useful things you can do with strings if you allow surrogates.

Steve D'Aprano

Jan 21, 2017, 9:43:11 PM
On Sun, 22 Jan 2017 07:21 am, Pete Forman wrote:

> Marko Rauhamaa <ma...@pacujo.net> writes:
>
>>> py> low = '\uDC37'
>>
>> That should raise a SyntaxError exception.
>
> Quite. My point was that with older Python on a narrow build (Windows
> and Mac) you need to understand that you are using UTF-16 rather than
> Unicode.

But you're *not* using UTF-16, at least not proper UTF-16, in older narrow
builds. If you were, then Unicode strings u'...' containing surrogate pairs
would be treated as single supplementary code points, but they aren't.

unichr() doesn't support supplementary code points in narrow builds:

[steve@ando ~]$ python2.7 -c "print len(unichr(0x10900))"
Traceback (most recent call last):
File "<string>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)


and even if you sneak a supplementary code point in, it is treated wrongly:

[steve@ando ~]$ python2.7 -c "print len(u'\U00010900')"
2


So Python narrow builds are more like a bastard hybrid of UCS-2 and UTF-16.

Steven D'Aprano

Jan 22, 2017, 1:47:49 AM
On Sunday 22 January 2017 06:58, Tim Chase wrote:

> Right. It gets even weirder (edge-case'ier) when dealing with
> combining characters:
>
>
>>>> s = "man\N{COMBINING TILDE}ana"
>>>> for i, c in enumerate(s): print("%i: %s" % (i, c))
> ...
> 0: m
> 1: a
> 2: n
> 3:˜
> 4: a
> 5: n
> 6: a
>>>> ''.join(reversed(s))
> 'anãnam'
>
> Offsetting s[3:] produces a (sub)string that begins with a combining
> character that doesn't have anything preceding it to combine with.

That doesn't matter. Unicode is a universal character set, not a universal
*grapheme* set. But even speaking about characters is misleading: Unicode's
"characters" (note the scare quotes) are abstract code points which can
represent at least:

- letters of alphabets
- digits
- punctuation marks
- ideographs
- line drawing symbols
- emoji
- noncharacters

Since it doesn't promise to only provide graphemes (I can write "$\N{COMBINING
TILDE}" which is not a valid grapheme in any human language) it doesn't matter
if you end up with lone combining characters. Or rather, it does matter, but
fixing that is not Unicode's responsibility. That should become a layer built
on top of Unicode.


>> It's like seek() and tell() on text files: you cannot seek to
>> arbitrary positions, but only to the opaque positions returned by
>> tell. That's unacceptable for strings.
>
> I'm still unclear on *why* this would be considered unacceptable for
> strings.

Sometimes you want to slice at a particular index which is *not* an opaque
position returned by find().

text[offset + 1:]

Or for that matter:

middle_character = text[len(text)//2]


Forbidding those sorts of operations is simply too big a break with previous
versions.


> It makes sense when dealing with byte-strings, since they
> contain binary data that may need to get sliced at arbitrary
> offsets. But for strings, slicing only makes sense (for every
> use-case I've been able to come up with) in the context of known
> offsets like you describe with tell().

I'm sorry, I find it hard to believe that you've never needed to add or
subtract 1 from a given offset returned by find() or equivalent.


> The cost of not using opaque
> tell()-like offsets is, as you describe, slicing in the middle of
> characters.

>> You could avoid that error by increasing the offset by the right
>> amount:
>>
>> stuff = text[offset + len("ф".encode('utf-8')):]
>>
>> which is awful. I believe that's what Go and Julia expect you to do.
>
> It may be awful, but only because it hasn't been pythonified.

No, it's awful no matter what. It makes it painful to reason about which code
points will be picked up by a slice. What's the length of...?

text[offset:offset+5]

In current Python, that's got to be five code points (excluding the edge cases
of slicing past the end of the string). But with opaque indexes, that could be
anything from 1 to 5 code points.
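To make that concrete, compare code point slicing with naive byte
slicing of the UTF-8 encoding (the Greek string is just an arbitrary
example):

py> text = 'αβγδεζη'
py> text[1:1 + 5]        # always exactly five code points
'βγδεζ'
py> data = text.encode('utf-8')
py> data[2:2 + 5]        # five *bytes*: two characters and half a third
b'\xce\xb2\xce\xb3\xce'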


> If the
> result from calling .find() on a string returns a "StringOffset"
> object, then it would make sense that its __add__/__radd__ methods
> would accept an integer and do such translation for you.

At cost of predictability.


>> You can avoid this by having the interpreter treat the Python-level
>> indexes as opaque "code point offsets", and converting them to and
>> from "byte offsets" as needed. That's not even very hard. But it
>> either turns every indexing into O(N) (since you have to walk the
>> string to count which byte represents the nth code point)
>
> The O(N) cost has to be paid at some point, but I'd put forth that
> other operations like .find() already pay that O(N) cost and can
> return an opaque "offset token" that can be subsequently used for O(1)
> indexing (multiple times if needed).

Sure -- but only at the cost of blowing out the complexity and memory
requirements of the string, which completely negates the point in using UTF-8
in the first place.
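To see why, here is a deliberately naive sketch of such an "offset
token" scheme (the class is hypothetical, not anything CPython actually
does): the side table needs one integer per code point, which can easily
cost more than the UTF-8 encoding saved in the first place.

class Utf8String:
    """Hypothetical UTF-8 string with O(1) code point indexing.
    Negative indices and slices are omitted for brevity."""

    def __init__(self, text):
        self._data = text.encode('utf-8')
        # One entry per code point: the byte offset where it starts.
        # This table alone can outweigh the memory that UTF-8 saved.
        self._offsets = []
        pos = 0
        for ch in text:
            self._offsets.append(pos)
            pos += len(ch.encode('utf-8'))
        self._offsets.append(pos)   # sentinel: end of the last character

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, index):
        start = self._offsets[index]
        end = self._offsets[index + 1]
        return self._data[start:end].decode('utf-8')

py> s = Utf8String('mañana')
py> len(s), s[2]
(6, 'ñ')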



--
Steven
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." - Jon Ronson

Marko Rauhamaa

unread,
Jan 22, 2017, 3:13:32 AM1/22/17
to
eryk sun <ery...@gmail.com>:

> On Sat, Jan 21, 2017 at 8:21 PM, Pete Forman <petef4...@gmail.com> wrote:
>> Marko Rauhamaa <ma...@pacujo.net> writes:
>>
>>>> py> low = '\uDC37'
>>>
>>> That should raise a SyntaxError exception.
>>
>> Quite. [...]
>
> CPython allows surrogate codes for use with the "surrogateescape" and
> "surrogatepass" error handlers, which are used for POSIX and Windows
> file-system encoding, respectively.

Yes, but at the cost of violating Unicode, leading to unprintable
strings etc. In my opinion, Python should have "stayed pure" instead of
playing cheap tricks with surrogates.

(Of course, Unicode itself is a mess, but that's another story.)
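To illustrate the "unprintable strings": a string holding a lone
surrogate cannot be encoded (and therefore not printed to a UTF-8
terminal) under the strict default error handler:

py> low = '\uDC37'
py> low.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc37' in
position 0: surrogates not allowed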


Marko

Marko Rauhamaa

unread,
Jan 22, 2017, 3:34:17 AM1/22/17
to
Steve D'Aprano <steve+...@pearwood.info>:

> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>> Also, [surrogates] don't exist as Unicode code points. Python
>> shouldn't allow surrogate characters in strings.
>
> Not quite. This is where it gets a bit messy and confusing. The bottom
> line is: surrogates *are* code points, but they aren't *characters*.

All animals are equal, but some animals are more equal than others.

> Strings which contain surrogates are strictly speaking illegal,
> although some programming languages (including Python) allow them.

Python shouldn't allow them.

> The Unicode standard defines surrogates as follows:
> [...]
>
> - Surrogate Code Point. A Unicode code point in the range
> U+D800..U+DFFF. Reserved for use by UTF-16,

The writer of the standard is playing word games, maybe to offer a fig
leaf to Windows, Java et al.

> By the letter of the Unicode standard, [Python] should not do this,
> but nevertheless it does and it appears to do no real harm and have
> some benefit.

I'm afraid Python's choice may lead to exploitable security holes in
Python programs.

>>> py> low = '\uDC37'
>>
>> That should raise a SyntaxError exception.
>
> If Python was strictly conforming, that is correct, but it turns out
> there are some useful things you can do with strings if you allow
> surrogates.

Conceptual confusion is a high price to pay for such tricks.


Marko

wxjm...@gmail.com

unread,
Jan 22, 2017, 4:17:47 AM1/22/17
to
Le dimanche 22 janvier 2017 09:34:17 UTC+1, Marko Rauhamaa a écrit :
> >
> > If Python was strictly conforming, that is correct, but it turns out
> > there are some useful things you can do with strings if you allow
> > surrogates.
>
> Conceptual confusion is a high price to pay for such tricks.
>
>

Exactly, and it leads to buggy code.
When these talented devs are "repairing" something on one side,
it fails on the other side.
What is wrong by design will always stay wrong by design.
Release after release (> py32) I can systematically produce such
failures.
A very interesting case study.

Steve D'Aprano

unread,
Jan 22, 2017, 9:01:44 AM1/22/17
to
On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote:

> Steve D'Aprano <steve+...@pearwood.info>:
>
>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>>> Also, [surrogates] don't exist as Unicode code points. Python
>>> shouldn't allow surrogate characters in strings.
>>
>> Not quite. This is where it gets a bit messy and confusing. The bottom
>> line is: surrogates *are* code points, but they aren't *characters*.
>
> All animals are equal, but some animals are more equal than others.

Huh?


>> Strings which contain surrogates are strictly speaking illegal,
>> although some programming languages (including Python) allow them.
>
> Python shouldn't allow them.

That's one opinion.


>> The Unicode standard defines surrogates as follows:
>> [...]
>>
>> - Surrogate Code Point. A Unicode code point in the range
>> U+D800..U+DFFF. Reserved for use by UTF-16,
>
> The writer of the standard is playing word games, maybe to offer a fig
> leaf to Windows, Java et al.

Seriously?


>> By the letter of the Unicode standard, [Python] should not do this,
>> but nevertheless it does and it appears to do no real harm and have
>> some benefit.
>
> I'm afraid Python's choice may lead to exploitable security holes in
> Python programs.

Feel free to back up that with an actual demonstration of an exploit, rather
than just FUD.


>>>> py> low = '\uDC37'
>>>
>>> That should raise a SyntaxError exception.
>>
>> If Python was strictly conforming, that is correct, but it turns out
>> there are some useful things you can do with strings if you allow
>> surrogates.
>
> Conceptual confusion is a high price to pay for such tricks.

There's a lot to comprehend about Unicode. I don't see that Python's
non-strict implementation is harder to understand than the strict version.

Marko Rauhamaa

unread,
Jan 22, 2017, 10:19:52 AM1/22/17
to
Steve D'Aprano <steve+...@pearwood.info>:

> On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote:
>
>> Steve D'Aprano <steve+...@pearwood.info>:
>>
>>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>>>> Also, [surrogates] don't exist as Unicode code points. Python
>>>> shouldn't allow surrogate characters in strings.
>>>
>>> Not quite. This is where it gets a bit messy and confusing. The
>>> bottom line is: surrogates *are* code points, but they aren't
>>> *characters*.
>>
>> All animals are equal, but some animals are more equal than others.
>
> Huh?

There is no difference between 0xD800 and 0xD8000000. They are both
numbers that don't--and won't--represent anything in Unicode. It's
pointless to call one a "code point" and not the other one. A code point
that isn't code for anything can barely be called a code point.

I'm guessing 0xD800 is called a code point because it was always called
that. It was dropped out when UTF-16 was invented but they didn't want
to "demote" the number retroactively, especially since Windows and Java
already were allowing them in strings.

>>> By the letter of the Unicode standard, [Python] should not do this,
>>> but nevertheless it does and it appears to do no real harm and have
>>> some benefit.
>>
>> I'm afraid Python's choice may lead to exploitable security holes in
>> Python programs.
>
> Feel free to back up that with an actual demonstration of an exploit,
> rather than just FUD.

It might come as a surprise to programmers that some pathnames cannot be
UTF-encoded or displayed. Also, those situations might not show up
during testing but only with appropriately crafted input.


Marko

Steve D'Aprano

unread,
Jan 22, 2017, 9:14:32 PM1/22/17
to
On Mon, 23 Jan 2017 02:19 am, Marko Rauhamaa wrote:

> Steve D'Aprano <steve+...@pearwood.info>:
>
>> On Sun, 22 Jan 2017 07:34 pm, Marko Rauhamaa wrote:
>>
>>> Steve D'Aprano <steve+...@pearwood.info>:
>>>
>>>> On Sun, 22 Jan 2017 06:52 am, Marko Rauhamaa wrote:
>>>>> Also, [surrogates] don't exist as Unicode code points. Python
>>>>> shouldn't allow surrogate characters in strings.
>>>>
>>>> Not quite. This is where it gets a bit messy and confusing. The
>>>> bottom line is: surrogates *are* code points, but they aren't
>>>> *characters*.
>>>
>>> All animals are equal, but some animals are more equal than others.
>>
>> Huh?
>
> There is no difference between 0xD800 and 0xD8000000.

Arithmetic disagrees:

py> 0xD800 == 0xD8000000
False


> They are both
> numbers that don't--and won't--represent anything in Unicode.

Your use of hex notation 0x... indicates that you're talking about code
units rather than U+... code points. The first one 0xD800 could be:

- a Little Endian double-byte code unit for 'Ø' in either UCS-2 or UTF-16;

- a Big Endian double-byte code unit that has no special meaning in UCS-2;

- one half of a surrogate pair (two double-byte code units) in Big Endian
UTF-16, encoding some unknown supplementary code point.

The second one 0xD8000000 could be:

- a C long (four-byte int) 3623878656, which is out of range for Big Endian
UCS-4 or UTF-32;

- the Little Endian four-byte code unit for 'Ø' in either UCS-4 or UTF-32.
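Those readings are easy to confirm at the interpreter (CPython 3, with
its strict default codecs):

py> b'\xd8\x00'.decode('utf-16-le')           # the two bytes of 0xD800
'Ø'
py> b'\xd8\x00\x00\x00'.decode('utf-32-le')   # the four bytes of 0xD8000000
'Ø'
# Decoding b'\xd8\x00' as 'utf-16-be' raises UnicodeDecodeError instead:
# it is the leading half of a surrogate pair with nothing following it.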


> It's pointless to call one a "code point" and not the other one.

Neither of them are code points. You're confusing the concrete
representation with the abstract character.

Perhaps you meant to compare the code point U+D800 to, well, there's no
comparison to be made, because "U+D8000000" is not valid and is completely
out of range. The largest code point is U+10FFFF.


> A code point
> that isn't code for anything can barely be called a code point.

It does have a purpose. Or even more than one.

- It ensures that there is a one-to-one mapping between code points and
code units in any specific encoding and byte-order.

- By reserving those code points, it ensures that they cannot be
accidentally used by the standard for something else.

- It makes it easier to talk about the entities: "U+D800 is a surrogate
code point reserved for UTF-16 surrogates", as opposed to "U+D800 isn't
anything, but if it was something, it would be a code point reserved
for UTF-16 surrogates".

- Or worse, forcing us to talk in terms of code units (implementation)
instead of abstract characters, which is painfully verbose:

"0xD800 in Big Endian UTF-16, or 0x00D8 in Little Endian UTF-16, or
0x0000D800 in Big Endian UTF-32, or 0x00D80000 in Little Endian
UTF-32, doesn't map to any code point but is reserved for UTF-16
surrogate pairs."


And, an entirely unforeseen purpose:

- It allows languages like Python to (ab)use surrogate code points for
round-tripping file names which aren't valid Unicode.
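A minimal sketch of that round trip, using a made-up file name (this is
essentially what os.fsdecode() and os.fsencode() do on POSIX systems):

py> name = b'report\xff.txt'.decode('utf-8', 'surrogateescape')
py> name
'report\udcff.txt'
py> name.encode('utf-8', 'surrogateescape')
b'report\xff.txt'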


[...]
>>> I'm afraid Python's choice may lead to exploitable security holes in
>>> Python programs.
>>
>> Feel free to back up that with an actual demonstration of an exploit,
>> rather than just FUD.
>
> It might come as a surprise to programmers that pathnames cannot be
> UTF-encoded or displayed.

Many things come as surprises to programmers, and many pathnames cannot be
UTF-encoded.

To be precise, Mac OS requires pathnames to be both valid and normalised
UTF-8, and it would be nice if that practice spread. But Windows only
requires pathnames to consist of UCS-2 code points, and Linux pathnames are
arbitrary bytes that may include characters which are illegal on Windows.
So you don't need to involve surrogates to have undecodable pathnames.


> Also, those situations might not show up
> during testing but only with appropriately crafted input.

I'm not seeing a security exploit here.