
Py 3.3, unicode / upper()


wxjm...@gmail.com

Dec 19, 2012, 9:23:00 AM
to
I was using the German word "Straße" (Strasse) — German
translation from "street" — to illustrate the catastrophic and
completely wrong-by-design Unicode handling in Py3.3, this
time from a memory point of view (not speed):

>>> sys.getsizeof('Straße')
43
>>> sys.getsizeof('STRAẞE')
50

instead of a sane (Py3.2)

>>> sys.getsizeof('Straße')
42
>>> sys.getsizeof('STRAẞE')
42


But, this is not the problem.
I was surprised to discover this:

>>> 'Straße'.upper()
'STRASSE'

I really, really do not know what I should think about that.
(It is a complex subject.) And the real question is why?

jmf

Thomas Bach

Dec 19, 2012, 9:43:57 AM
to pytho...@python.org
On Wed, Dec 19, 2012 at 06:23:00AM -0800, wxjm...@gmail.com wrote:
> I was surprised to discover this:
>
> >>> 'Straße'.upper()
> 'STRASSE'
>
> I really, really do not know what I should think about that.
> (It is a complex subject.) And the real question is why?

Because there is no definition for upper-case 'ß'. 'SS' is used as the
common replacement in this case. I think it's pretty smart! :)

Regards,
Thomas.

Christian Heimes

Dec 19, 2012, 9:52:23 AM
to wxjm...@gmail.com, pytho...@python.org
On 19.12.2012 15:23, wxjm...@gmail.com wrote:
> But, this is not the problem.
> I was surprised to discover this:
>
>>>> 'Straße'.upper()
> 'STRASSE'
>
> I really, really do not know what I should think about that.
> (It is a complex subject.) And the real question is why?

It's correct. LATIN SMALL LETTER SHARP S doesn't have an upper case
form. However, the Unicode database specifies an upper case mapping from
ß to SS: http://codepoints.net/U+00DF
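
A quick interactive check (Python 3.3 or later, where str.casefold() also
exists) shows the upper-case mapping, plus the related case-folding:

>>> import unicodedata
>>> unicodedata.name('\u00df')
'LATIN SMALL LETTER SHARP S'
>>> '\u00df'.upper()      # full uppercase mapping: one character becomes two
'SS'
>>> '\u00df'.casefold()   # case-folding, intended for caseless matching
'ss'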

Christian

Stefan Krah

Dec 19, 2012, 10:01:40 AM
to pytho...@python.org
wxjm...@gmail.com <wxjm...@gmail.com> wrote:
> But, this is not the problem.
> I was surprised to discover this:
>
> >>> 'Straße'.upper()
> 'STRASSE'
>
> I really, really do not know what I should think about that.
> (It is a complex subject.) And the real question is why?

http://de.wikipedia.org/wiki/Gro%C3%9Fes_%C3%9F#Versalsatz_ohne_gro.C3.9Fes_.C3.9F

"Die gegenw�rtigen amtlichen Regeln[6] zur neuen deutschen Rechtschreibung
kennen keinen Gro�buchstaben zum �: Jeder Buchstabe existiert als
Kleinbuchstabe und als Gro�buchstabe (Ausnahme �). Im Versalsatz empfehlen
die Regeln, das � durch SS zu ersetzen: Bei Schreibung mit Gro�buchstaben
schreibt man SS, zum Beispiel: Stra�e -- STRASSE."


According to the new official spelling rules the uppercase ß does not exist.
The recommendation is to use "SS" when writing in all-caps.


As to why: It has always been acceptable to replace ß with "ss" when ß
wasn't part of a character set. In the new spelling rules, ß has been
officially replaced with "ss" in some cases:

http://en.wiktionary.org/wiki/da%C3%9F


The uppercase ß isn't really needed, since ß does not occur at the beginning
of a word. As far as I know, most Germans wouldn't even know that it has
existed at some point or how to write it.



Stefan Krah


Chris Angelico

Dec 19, 2012, 10:17:36 AM
to pytho...@python.org
On Thu, Dec 20, 2012 at 1:23 AM, <wxjm...@gmail.com> wrote:
> But, this is not the problem.
> I was surprised to discover this:
>
>>>> 'Straße'.upper()
> 'STRASSE'
>
> I really, really do not know what I should think about that.
> (It is a complex subject.) And the real question is why?

Not all strings can be uppercased and lowercased cleanly. Please stop
trotting out the old Box Hill-to-Camberwell arguments[1] yet again.

For comparison, try this string:

'𝐇𝐞𝐥𝐥𝐨, 𝐰𝐨𝐫𝐥𝐝!'.upper()

And while you're at it, check out sys.getsizeof() on that sort of
string, compare your beloved 3.2 on that. Oh, and also check out len()
on it.
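
Here's a sketch of those checks; the exact getsizeof() figure depends on
the build, so it is left out:

>>> import sys
>>> s = '𝐇𝐞𝐥𝐥𝐨, 𝐰𝐨𝐫𝐥𝐝!'
>>> s.upper() == s   # the mathematical bold letters have no case mappings
True
>>> len(s)           # 13 code points in 3.3; 23 code units on a 3.2 narrow build
13
>>> sys.getsizeof(s) # four bytes per code point under PEP 393, plus fixed overhead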

[1] Melbourne's current ticketing system is based on zones, and
Camberwell is in zone 1, and Box Hill in zone 2. Detractors of public
transport point out that it costs far more to take the train from Box
Hill to Camberwell than it does to drive a car the same distance. It's
the same contrived example that keeps on getting trotted out time and
time again.

ChrisA

Johannes Bauer

Dec 19, 2012, 10:18:38 AM
to
On 19.12.2012 15:23, wxjm...@gmail.com wrote:
> I was using the German word "Straße" (Strasse) — German
> translation from "street" — to illustrate the catastrophic and
> completely wrong-by-design Unicode handling in Py3.3, this
> time from a memory point of view (not speed):
>
>>>> sys.getsizeof('Straße')
> 43
>>>> sys.getsizeof('STRAẞE')
> 50
>
> instead of a sane (Py3.2)
>
>>>> sys.getsizeof('Straße')
> 42
>>>> sys.getsizeof('STRAẞE')
> 42

How do those arbitrary numbers prove anything at all? Why do you draw
the conclusion that it's broken by design? What do you expect? You're
very vague here. Just to show how ridiculously pointless your numbers
are, your example gives 84 on Python3.2 for any input of yours.

> But, this is not the problem.
> I was surprised to discover this:
>
>>>> 'Straße'.upper()
> 'STRASSE'
>
> I really, really do not know what I should think about that.
> (It is a complex subject.) And the real question is why?

Because in the German language the uppercase "ß" is virtually dead.

Regards,
Johannes

--
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
- Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1...@speranza.aioe.org>

Johannes Bauer

Dec 19, 2012, 10:22:22 AM
to
On 19.12.2012 16:18, Johannes Bauer wrote:

> How do those arbitrary numbers prove anything at all? Why do you draw
> the conclusion that it's broken by design? What do you expect? You're
> very vague here. Just to show how ridiculously pointless your numbers
> are, your example gives 84 on Python3.2 for any input of yours.

...on Python3.2 on MY system is what I meant to say (x86_64 Linux). Sorry.

Also, further reading:

http://de.wikipedia.org/wiki/Gro%C3%9Fes_%C3%9F
http://en.wikipedia.org/wiki/Capital_%E1%BA%9E

Christian Heimes

Dec 19, 2012, 10:33:50 AM
to Stefan Krah, pytho...@python.org
On 19.12.2012 16:01, Stefan Krah wrote:
> The uppercase ß isn't really needed, since ß does not occur at the beginning
> of a word. As far as I know, most Germans wouldn't even know that it has
> existed at some point or how to write it.

I think Python 3.3+ is using uppercase mapping (uc) instead of simple
upper case (suc).


Some background:

The old German Fractur has three variants of the letter S:

capital s: S
long s: ſ
round s: s.

ß is a ligature of ſs. ſ is usually used at the beginning or middle of a
syllable while s is used at the end of a syllable. Compare Wachſtube
(Wach-Stube == guard room) to Wachstube (Wachs-Tube == tube of wax). :)

Christian


Chris Angelico

Dec 19, 2012, 10:40:43 AM
to pytho...@python.org
On Thu, Dec 20, 2012 at 2:18 AM, Johannes Bauer <dfnson...@gmx.de> wrote:
> On 19.12.2012 15:23, wxjm...@gmail.com wrote:
>> I was using the German word "Straße" (Strasse) — German
>> translation from "street" — to illustrate the catastrophic and
>> completely wrong-by-design Unicode handling in Py3.3, this
>> time from a memory point of view (not speed):
>>
>>>>> sys.getsizeof('Straße')
>> 43
>>>>> sys.getsizeof('STRAẞE')
>> 50
>>
>> instead of a sane (Py3.2)
>>
>>>>> sys.getsizeof('Straße')
>> 42
>>>>> sys.getsizeof('STRAẞE')
>> 42
>
> How do those arbitrary numbers prove anything at all? Why do you draw
> the conclusion that it's broken by design? What do you expect? You're
> very vague here. Just to show how ridiculously pointless your numbers
> are, your example gives 84 on Python3.2 for any input of yours.

You may not be familiar with jmf. He's one of our resident trolls, and
he has a bee in his bonnet about PEP 393 strings, on the basis that
they take up more space in memory than a narrow build of Python 3.2
would, for a string with lots of BMP characters and one non-BMP. In
3.2 narrow builds, strings were stored in UTF-16, with *surrogate
pairs* for non-BMP characters. This means that len() counts them
twice, as does string indexing/slicing. That's a major bug, especially
as your Python code will do different things on different platforms -
most Linux builds of 3.2 are "wide" builds, storing characters in four
bytes each.

PEP 393 brings wide build semantics to all Pythons, while achieving
memory savings better than a narrow build can (with PEP 393 strings,
any all-ASCII or all-Latin-1 strings will be stored one byte per
character). Every now and then, though, jmf points out *yet again*
that his beloved and buggy narrow build consumes less memory and runs
faster than the oh so terrible 3.3 on some contrived example. It gets
rather tiresome.
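
The bug is easy to demonstrate with a single non-BMP character (U+1F600
here; the 3.2 output assumes a narrow build such as the standard Windows
installer):

# Python 3.2, narrow (UTF-16) build:
>>> len('\U0001F600')
2
>>> '\U0001F600'[0] == '\ud83d'    # indexing yields half of a surrogate pair
True

# Python 3.3, any platform (PEP 393):
>>> len('\U0001F600')
1
>>> '\U0001F600'[0] == '\U0001F600'
True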

Interestingly, IDLE on my Windows box can't handle the bolded
characters very well...

>>> s="\U0001d407\U0001d41e\U0001d425\U0001d425\U0001d428, \U0001d430\U0001d428\U0001d42b\U0001d425\U0001d41d!"
>>> print(s)
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
print(s)
UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001d407'
in position 0: Non-BMP character not supported in Tk

I think this is most likely a case of "yeah, Windows XP just sucks".
But I have no reason or inclination to get myself a newer Windows to
find out if it's any different.

ChrisA

Ian Kelly

Dec 19, 2012, 1:27:38 PM
to Python
On Wed, Dec 19, 2012 at 8:40 AM, Chris Angelico <ros...@gmail.com> wrote:
> You may not be familiar with jmf. He's one of our resident trolls, and
> he has a bee in his bonnet about PEP 393 strings, on the basis that
> they take up more space in memory than a narrow build of Python 3.2
> would, for a string with lots of BMP characters and one non-BMP. In
> 3.2 narrow builds, strings were stored in UTF-16, with *surrogate
> pairs* for non-BMP characters. This means that len() counts them
> twice, as does string indexing/slicing. That's a major bug, especially
> as your Python code will do different things on different platforms -
> most Linux builds of 3.2 are "wide" builds, storing characters in four
> bytes each.

From what I've been able to discern, his actual complaint about PEP
393 stems from misguided moral concerns. With PEP-393, strings that
can be fully represented in Latin-1 can be stored in half the space
(ignoring fixed overhead) compared to strings containing at least one
non-Latin-1 character. jmf thinks this optimization is unfair to
non-English users and immoral; he wants Latin-1 strings to be treated
exactly like non-Latin-1 strings (I don't think he actually cares
about non-BMP strings at all; if narrow-build Unicode is good enough
for him, then it must be good enough for everybody). Unfortunately
for him, the Latin-1 optimization is rather trivial in the wider
context of PEP-393, and simply removing that part alone clearly
wouldn't be doing anybody any favors. So for him to get what he
wants, the entire PEP has to go.

It's rather like trying to solve the problem of wealth disparity by
forcing everyone to dump their excess wealth into the ocean.

Benjamin Peterson

Dec 19, 2012, 3:25:48 PM
to pytho...@python.org
<wxjmfauth <at> gmail.com> writes:
> I really, really do not know what I should think about that.
> (It is a complex subject.) And the real question is why?

Because that's what the Unicode spec says to do.



wxjm...@gmail.com

Dec 19, 2012, 3:55:08 PM
to wxjm...@gmail.com, pytho...@python.org
-----

Yes, it is correct (or can be considered as correct).
I do not wish to discuss the typographical problematic
of "Das Grosse Eszett". The web is full of pages on the
subject. However, I never succeeded in finding an "official
position" from Unicode. The best information I found seems
to indicate (to converge): U+1E9E is now the "supported"
uppercase form of U+00DF (see DIN).

What is bothering me is more the implementation. The Unicode
documentation says roughly this: if something cannot be
honoured, there is no harm, but do not implement a workaround.
In that case, I'm not sure Python is doing the best.

If "wrong", this can be considered as programmatically correct
or logically acceptable (Py3.2)

>>> 'Straße'.upper().lower().capitalize() == 'Straße'
True

while this will *always* be problematic (Py3.3)

>>> 'Straße'.upper().lower().capitalize() == 'Straße'
False

jmf


wxjm...@gmail.com

Dec 19, 2012, 4:18:05 PM
to Python
----

latin-1 (iso-8859-1)? Are you sure?

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('ab')
27
>>> sys.getsizeof('aé')
39

Time to go to bed. More complete answer tomorrow.

jmf


Ian Kelly

Dec 19, 2012, 4:23:15 PM
to Python
On Wed, Dec 19, 2012 at 1:55 PM, <wxjm...@gmail.com> wrote:
> Yes, it is correct (or can be considered as correct).
> I do not wish to discuss the typographical problematic
> of "Das Grosse Eszett". The web is full of pages on the
> subject. However, I never succeeded in finding an "official
> position" from Unicode. The best information I found seems
> to indicate (to converge): U+1E9E is now the "supported"
> uppercase form of U+00DF (see DIN).

Is this link not official?

http://unicode.org/cldr/utility/character.jsp?a=00DF

That defines a full uppercase mapping to SS and a simple uppercase
mapping to U+00DF itself, not U+1E9E. My understanding of the simple
mapping is that it is not allowed to map to multiple characters,
whereas the full mapping is so allowed.
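
In other words, str.upper() applies the full mapping, which is why the
round trip through upper() and lower() is not the identity:

>>> '\u00df'.upper()           # full mapping: ß -> 'SS'
'SS'
>>> '\u00df'.upper().lower()   # so the round trip gives 'ss', not 'ß'
'ss'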

> What is bothering me is more the implementation. The Unicode
> documentation says roughly this: if something cannot be
> honoured, there is no harm, but do not implement a workaround.
> In that case, I'm not sure Python is doing the best.

But this behavior is per the specification, not a workaround. I think
the worst thing we could do in this regard would be to start diverging
from the specification because we think we know better than the
Unicode Consortium.


> If "wrong", this can be considered as programmatically correct
> or logically acceptable (Py3.2)
>
>>>> 'Straße'.upper().lower().capitalize() == 'Straße'
> True
>
> while this will *always* be problematic (Py3.3)
>
>>>> 'Straße'.upper().lower().capitalize() == 'Straße'
> False

On the other hand (Py3.2):

>>> 'Straße'.upper().isupper()
False

vs. Py3.3:

>>> 'Straße'.upper().isupper()
True

There is probably no one clearly correct way to handle the problem,
but personally this contradiction bothers me more than the example
that you posted.

Ian Kelly

Dec 19, 2012, 4:31:42 PM
to Python
On Wed, Dec 19, 2012 at 2:18 PM, <wxjm...@gmail.com> wrote:
> latin-1 (iso-8859-1) ? are you sure ?

Yes.

>>>> sys.getsizeof('a')
> 26
>>>> sys.getsizeof('ab')
> 27
>>>> sys.getsizeof('aé')
> 39

Compare to:

>>> sys.getsizeof('a\u0100')
42

The reason for the difference you posted is that pure ASCII strings
have a further optimization, which I glossed over and which is purely
a savings in overhead:

>>> sys.getsizeof('abcde') - sys.getsizeof('a')
4
>>> sys.getsizeof('ábçdê') - sys.getsizeof('á')
4

Terry Reedy

Dec 19, 2012, 7:39:22 PM
to pytho...@python.org
On 12/19/2012 10:40 AM, Chris Angelico wrote:

> Interestingly, IDLE on my Windows box can't handle the bolded
> characters very well...
>
>>>> s="\U0001d407\U0001d41e\U0001d425\U0001d425\U0001d428, \U0001d430\U0001d428\U0001d42b\U0001d425\U0001d41d!"
>>>> print(s)
> Traceback (most recent call last):
> File "<pyshell#2>", line 1, in <module>
> print(s)
> UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001d407'
> in position 0: Non-BMP character not supported in Tk

On 3.3.0 on Win7, the expressions 's', 'repr(s)', and 'str(s)' (without
the quotes) echo the input as entered (with \U escapes) while 'print(s)'
gets the same traceback you did.



--
Terry Jan Reedy

Chris Angelico

Dec 19, 2012, 9:01:33 PM
to pytho...@python.org
On Thu, Dec 20, 2012 at 8:23 AM, Ian Kelly <ian.g...@gmail.com> wrote:
> On Wed, Dec 19, 2012 at 1:55 PM, <wxjm...@gmail.com> wrote:
>> Yes, it is correct (or can be considered as correct).
>> I do not wish to discuss the typographical problematic
>> of "Das Grosse Eszett". The web is full of pages on the
>> subject. However, I never succeeded in finding an "official
>> position" from Unicode. The best information I found seems
>> to indicate (to converge): U+1E9E is now the "supported"
>> uppercase form of U+00DF (see DIN).
>
> Is this link not official?
>
> http://unicode.org/cldr/utility/character.jsp?a=00DF
>
> That defines a full uppercase mapping to SS and a simple uppercase
> mapping to U+00DF itself, not U+1E9E. My understanding of the simple
> mapping is that it is not allowed to map to multiple characters,
> whereas the full mapping is so allowed.

Ahh, thanks, that explains why the other Unicode-aware language I
tried behaved differently.

Pike v7.9 release 5 running Hilfe v3.5 (Incremental Pike Frontend)
> string s="Stra\u00dfe";
> upper_case(s);
(1) Result: "STRA\337E"
> lower_case(upper_case(s));
(2) Result: "stra\337e"
> String.capitalize(lower_case(s));
(3) Result: "Stra\337e"

The output is the equivalent of repr(), and it uses octal escapes
where possible (for brevity), so \337 is its representation of U+00DF
(decimal 223, octal 337). Upper-casing and lower-casing this character
result in the same thing.

> write("Original: %s\nLower: %s\nUpper: %s\n",s,lower_case(s),upper_case(s));
Original: Straße
Lower: straße
Upper: STRAßE

It's worth noting, incidentally, that the unusual upper-case form of
the letter (U+1E9E) does lower-case to U+00DF in both Python 3.3 and
Pike 7.9.5:

> lower_case("Stra\u1E9Ee");
(9) Result: "stra\337e"

>>> ord("\u1e9e".lower())
223

So both of them are behaving in a compliant manner, even though
they're not quite identical.

ChrisA

Chris Angelico

Dec 19, 2012, 9:03:30 PM
to pytho...@python.org
On Thu, Dec 20, 2012 at 5:27 AM, Ian Kelly <ian.g...@gmail.com> wrote:
> From what I've been able to discern, [jmf's] actual complaint about PEP
> 393 stems from misguided moral concerns. With PEP-393, strings that
> can be fully represented in Latin-1 can be stored in half the space
> (ignoring fixed overhead) compared to strings containing at least one
> non-Latin-1 character. jmf thinks this optimization is unfair to
> non-English users and immoral; he wants Latin-1 strings to be treated
> exactly like non-Latin-1 strings (I don't think he actually cares
> about non-BMP strings at all; if narrow-build Unicode is good enough
> for him, then it must be good enough for everybody).

Not entirely; most of his complaints are based on performance (speed
and/or memory) of 3.3 compared to a narrow build of 3.2, using silly
edge cases to prove how much worse 3.3 is, while utterly ignoring the
fact that, in those self-same edge cases, 3.2 is buggy.

ChrisA

Westley Martínez

Dec 19, 2012, 9:53:32 PM
to pytho...@python.org
On Wed, Dec 19, 2012 at 02:23:15PM -0700, Ian Kelly wrote:
> On Wed, Dec 19, 2012 at 1:55 PM, <wxjm...@gmail.com> wrote:
> > If "wrong", this can be considered as programmatically correct
> > or logically acceptable (Py3.2)
> >
> >>>> 'Straße'.upper().lower().capitalize() == 'Straße'
> > True
> >
> > while this will *always* be problematic (Py3.3)
> >
> >>>> 'Straße'.upper().lower().capitalize() == 'Straße'
> > False
>
> On the other hand (Py3.2):
>
> >>> 'Straße'.upper().isupper()
> False
>
> vs. Py3.3:
>
> >>> 'Straße'.upper().isupper()
> True
>
> There is probably no one clearly correct way to handle the problem,
> but personally this contradiction bothers me more than the example
> that you posted.

Why would it ever be wrong for 'Straße' to not equal 'Strasse'? Python
is not intended to do any sort of advanced linguistic processing. It is
comparing strings, not words. It is not problematic. It makes sense.
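
For what it's worth, anyone who does want a caseless comparison rather
than a character-by-character one can use str.casefold(), added in 3.3
for exactly that purpose:

>>> 'Straße'.casefold() == 'STRASSE'.casefold()
True
>>> 'Straße' == 'Strasse'
False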

Terry Reedy

Dec 19, 2012, 9:54:20 PM
to pytho...@python.org
And the fact that stringbench.py is overall about as fast with 3.3 as
with 3.2 *on the same Windows 7 machine* (which uses a narrow build in
3.2), and that unicode operations are not far from bytes operations when
the same thing can be done with both.

--
Terry Jan Reedy

Westley Martínez

Dec 19, 2012, 10:12:23 PM
to pytho...@python.org
Really, why should we be so obsessed with speed anyways? Isn't
improving the language and fixing bugs far more important?

Chris Angelico

Dec 19, 2012, 10:22:41 PM
to pytho...@python.org
On Thu, Dec 20, 2012 at 2:12 PM, Westley Martínez <anik...@gmail.com> wrote:
> Really, why should we be so obsessed with speed anyways? Isn't
> improving the language and fixing bugs far more important?

Because speed is very important in certain areas. Python can be used
in many ways:

* Command-line calculator with awesome precision and variable handling
* Proglets, written once and run once, doing one simple job and then moving on
* Applications that do heaps of work and are run multiple times a day
* Internet services (eg web server), contacted many times a second
* Etcetera
* Etcetera
* And quite a few other ways too

For the first two, performance isn't very important. No matter how
slow the language, it's still going to respond "3" instantly when you
enter "1+2", and unless you're writing something hopelessly
inefficient or brute-force, the time spent writing a proglet usually
dwarfs its execution time.

But performance is very important for something like Mercurial, which
is invoked many times and always with the user waiting for it. You
want to get back to work, not sit there for X seconds while your
source control engine fires up and does something. And with a web
server, language performance translates fairly directly into latency
AND potential requests per second on any given hardware.

To be sure, a lot of Python performance hits the level of "sufficient"
and doesn't need to go further, but it's still worth considering.

ChrisA

Terry Reedy

Dec 20, 2012, 12:32:42 AM
to pytho...@python.org
> Really, why should we be so obsessed with speed anyways? Isn't
> improving the language and fixing bugs far more important?

Being conservative, there are probably at least 10 enhancement patches
and 30 bug fix patches for every performance patch. Performance patches
are considered enhancements and only go in new versions with
enhancements, where they go through the extended alpha, beta, candidate
test and evaluation process.

In the unicode case, Jim discovered that find was several times slower
in 3.3 than 3.2 and claimed that that was a reason to not use 3.3. I ran
the complete stringbench.py and discovered that find (and consequently
find and replace) are the only operations with such a slowdown. I also
discovered that another at least as common operation, encoding strings
that only contain ascii characters to ascii bytes for transmission, is
several times as fast in 3.3. So I reported that unless one is only
finding substrings in long strings, there is no reason to not upgrade to
3.3.

--
Terry Jan Reedy


Steven D'Aprano

Dec 20, 2012, 12:51:48 AM
to
On Thu, 20 Dec 2012 00:32:42 -0500, Terry Reedy wrote:

> In the unicode case, Jim discovered that find was several times slower
> in 3.3 than 3.2 and claimed that that was a reason to not use 3.3. I ran
> the complete stringbench.py and discovered that find (and consequently
> find and replace) are the only operations with such a slowdown. I also
> discovered that another at least as common operation, encoding strings
> that only contain ascii characters to ascii bytes for transmission, is
> several times as fast in 3.3. So I reported that unless one is only
> finding substrings in long strings, there is no reason to not upgrade to
> 3.3.

Yes, and if you remember, Jim (jmf) based his complaints on very possibly
the worst edge-case for the new Unicode implementation:

- generate a large string of characters
- replace every character in that string with another character

By memory:

s = "a"*100000
s = s.replace("a", "b")

or equivalent. Hardly representative of normal string processing, and
likely to be the worst-performing operation on new Unicode strings. And
yet even so, many people reported either a mild slow down or, in a few
cases, a small speed up.
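
Anyone who wants to check that micro-benchmark on their own builds can do
so straight from the shell; no numbers implied here, they vary by platform:

$ python3.2 -m timeit -s "s = 'a' * 100000" "s.replace('a', 'b')"
$ python3.3 -m timeit -s "s = 'a' * 100000" "s.replace('a', 'b')"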



--
Steven

Johannes Bauer

Dec 20, 2012, 9:57:57 AM
to
On 19.12.2012 16:40, Chris Angelico wrote:

> You may not be familiar with jmf. He's one of our resident trolls, and
> he has a bee in his bonnet about PEP 393 strings, on the basis that
> they take up more space in memory than a narrow build of Python 3.2
> would, for a string with lots of BMP characters and one non-BMP.

I was not :-( Thanks for the heads up and the good summary on what the
issue was about.

Best regards,

wxjm...@gmail.com

Dec 20, 2012, 2:19:50 PM
to
Fact.
In order to work comfortably and efficiently with a "scheme for
the coding of the characters", be it Unicode or any other coding scheme,
one has to take into account two things: 1) work with a unique set
of characters and 2) work with a contiguous block of code points.

At this point, it should be noticed I did not even write about
the real coding, only about characters and code points.

Now, let's take a look at what happens when one breaks the rules
above and, precisely, if one attempts to work with multiple
character sets or if one divides - artificially - the whole range
of the unicode code points into chunks.

The first (and it should be quite obvious) consequence is that
you create bloated, unnecessary and useless code. I simplify
the flexible string representation (FSR) and will use an "ascii" /
"non-ascii" model/terminology.

If you are an "ascii" user, a FSR model has no sense. An
"ascii" user will use, per definition, only "ascii characters".

If you are a "non-ascii" user, the FSR model is also a non
sense, because you are per definition a n"on-ascii" user of
"non-ascii" character. Any optimisation for "ascii" user just
become irrelevant.

In one sense, to escape from this, you have to be at the same time
a non "ascii" user and a non "non-ascii" user. Impossible.
In both cases, a FSR model is useless and in both cases you are
forced to use bloated and unnecessary code.

The rule is to treat every character of a unique set of characters
of a coding scheme in, how to say, an "equal way". The problematic
can be seen the other way, every coding scheme has been built
to work with a unique set of characters, otherwise it is not
properly working!

The second negative aspect of this splitting is just the
splitting itself. One can optimize every subset of characters, but
one will always be impacted by the "switch" between the subsets.
One more reason to work with a unique set of characters, or this is
the reason why every coding scheme handles a unique set of
characters.

Up to now, I spoke only about the characters and the sets of
characters, not about the coding of the characters.
There is a point which is quite hard to understand and also hard
to explain. It becomes obvious with some experience.

When one works with a coding scheme, one always has to think
characters / code points. If one takes the perspective of encoded
code points, it simply does not work or may not work very well
(memory/speed). The whole problematic is that it is impossible to
work with characters, one is forced to manipulate encoded code
points as characters. Unicode is built and thought to work with
code points, not with encoded code points. The serialization,
transformation code point -> encoded code point, is "only" a
technical and secondary process. Surprise, all the unicode
coding schemes (utf-8, 16, 32) are working with the same
set of characters. They differ in the serialization, but
they are all working with a unique set of characters.
The utf-16 / ucs-2 is an interesting case. Their encoding mechanisms
are quasi the same, the difference lies in the sets of characters.

There is another way to empirically understand the problem.
The historical evolution of the coding of the characters. Practically,
all the coding schemes have been created to handle different sets of
characters or coding schemes have been created, because it is the
only way to work properly. If it had been possible to work
with multiple coding schemes, I'm pretty sure a solution would
have emerged. It never happened and it would not have been necessary
to create iso10646 or unicode. Neither it would have been necessary
to create all these codings iso-8859-***, cp***, mac** which are
all *based on set of characters*.

plan9 attempted to work with multiple character sets; it did not
work very well, the main issue being the switch between the codings.

A solution à la FSR cannot work, or at least not work in an optimized
way. It is not a coding scheme, it is a composite of coding schemes
handling several character sets. Hard to imagine something worse.

Contrary to what has been said, the bad cases I presented here are
not corner cases. There is practically and systematically a regression
in Py33 compared to Py32.
That's very easy to test. I did all my tests in the light of what
I explained above. It was not a surprise for me to see this expectedly
bad behaviour.

Python is not my tool. If I may allow myself to give a piece of advice: a
scientific approach.
I suggest the core devs first spend their time proving that
a FSR model can beat the existing models (purely at the C level).
Then, if they succeed, implement it.

My feeling is that most of the people are defending this FSR simply
because it exists, not because of its intrinsic quality.

Hint: I suggest the experts take a comprehensive look at the
cmap table of the OpenType fonts (pure unicode technology).
Those people know how to work.

I would be very happy to be wrong. Unfortunately, I'm afraid
it's not the case.

jmf

wxjm...@gmail.com

Dec 20, 2012, 2:40:21 PM
to Python
-----

I know all of this. And this is exactly what I explained.
I do not care about this optimization. I'm not an ascii user.
As a non ascii user, this optimization is just irrelevant.

What should a Python user think, if he sees his strings
are consuming more memory just because he uses non ascii
characters or he sees his strings are changing just because
he "uppercases" them.
Unicode is here to serve anybody.

jmf

wxjm...@gmail.com

Dec 20, 2012, 2:42:03 PM
to Python
----

At least, we agree on the problematic of this very special case.

jmf

wxjm...@gmail.com

Dec 20, 2012, 2:57:44 PM
to pytho...@python.org
--------

I showed a case where the Py33 works 10 times slower than Py32,
"replace". You, the devs, spend your time correcting that case.

Now, if I'm putting on the table an example working 20 times
slower, will you spend your time to optimize that?

I'm afraid it is the FSR which is problematic, not the
corner cases.

jmf


MRAB

Dec 20, 2012, 3:20:57 PM
to pytho...@python.org
[snip]
It's true that in an ideal world you would treat all codepoints the
same. However, this is a case where "practicality beats purity".

In order to accommodate every codepoint you need 3 bytes per codepoint
(although for pragmatic reasons it's 4 bytes per codepoint).

But not all codepoints are used equally. Those in the "astral plane",
for example, are used rarely, so the vast majority of the time you
would be using twice as much memory as strictly necessary. There are
also, in reality, many times in which strings contain only ASCII-range
codepoints, although they may not be visible to the average user, being
the names of functions and attributes in program code, or tags and
attributes in HTML and XML.

FSR is a pragmatic solution to dealing with limited resources.

Would you prefer there to be a switch that makes strings always use 4
bytes per codepoint for those users and systems where memory is no
object?
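
The per-character cost is easy to see by growing a string by one character
and comparing sizes; the deltas below are from a CPython 3.3 build (the
fixed overhead differs between representations, which is why only
differences are shown):

>>> import sys
>>> sys.getsizeof('aaaa') - sys.getsizeof('aaa')                  # ASCII: 1 byte/char
1
>>> sys.getsizeof('aaé') - sys.getsizeof('aé')                    # Latin-1: 1 byte/char
1
>>> sys.getsizeof('aa\u20ac') - sys.getsizeof('a\u20ac')          # BMP: 2 bytes/char
2
>>> sys.getsizeof('aa\U0001F600') - sys.getsizeof('a\U0001F600')  # astral: 4 bytes/char
4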


Chris Angelico

Dec 20, 2012, 4:19:11 PM
to pytho...@python.org
On Fri, Dec 21, 2012 at 7:20 AM, MRAB <pyt...@mrabarnett.plus.com> wrote:
> On 2012-12-20 19:19, wxjm...@gmail.com wrote:
>> The rule is to treat every character of a unique set of characters
>> of a coding scheme in, how to say, an "equal way". The problematic
>> can be seen the other way, every coding scheme has been built
>> to work with a unique set of characters, otherwise it is not
>> properly working!
>>
> It's true that in an ideal world you would treat all codepoints the
> same. However, this is a case where "practicality beats purity".

Actually no. Not all codepoints are the same. Ever heard of Huffman
coding? It's a broad technique used in everything from PK-ZIP/gzip
file compression to the Morse code ("here come dots!"). It exploits
and depends on a dramatically unequal usage distribution pattern, as
all text (he will ask "All?" You will respond "All!" He will
understand -- referring to Caesar) exhibits.

In the case of strings in a Python program, it's fairly obvious that
there will be *many* that are ASCII-only; and what's more, most of the
long strings will either be ASCII-only or have a large number of
non-ASCII characters. However, your microbenchmarks usually look at
two highly unusual cases: either a string with a huge number of ASCII
chars and one non-ASCII, or all the same non-ASCII (usually for your
replace() tests). I haven't seen strings like either of those come up.

Can you show us a performance regression in an *actual* *production*
*program*? And make sure you're comparing against a wide build, here.

ChrisA


Terry Reedy

Dec 20, 2012, 5:12:20 PM
to pytho...@python.org
On 12/20/2012 2:19 PM, wxjm...@gmail.com wrote:

>
> If you are an "ascii" user, a FSR model has no sense. An
> "ascii" user will use, per definition, only "ascii characters".
>
> If you are a "non-ascii" user, the FSR model is also a non
> sense, because you are per definition a n"on-ascii" user of
> "non-ascii" character. Any optimisation for "ascii" user just
> become irrelevant.

This is a false dichotomy. Conclusions based on falsity are false.

> In one sense, to escape from this, you have to be at the same time
> a non "ascii" user and a non "non-ascii" user. Impossible.

This is wrong. Every Python user is an ascii user. All names in the
stdlib are ascii-only. These names all become strings in code objects.
All docstrings (with a couple of rare exceptions) are ascii-only. They
also become strings. *Every Python user* benefits from the new system in
3.3.

Some Python users are also non-ascii users. This includes many English
speakers, as many English texts include non-ascii characters. (Just for
starters, the copyright and trademark symbols are not in the ascii set.)

> Contrary to what has been said, the bad cases I presented here are
> not corner cases. There is practically and systematically a regression
> in Py33 compared to Py32.

I posted evidence otherwise. Jim never responded to those posts. Instead
he repeats the falsehood refuted by evidence.

> That's very easy to test.

Yes. Run stringbench.py on the OS/machine on 3.2 and 3.3 as I did.
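
For anyone wanting to reproduce this: the script ships in the CPython
source tree (the Tools/stringbench path below is from a source checkout of
that era and may differ), so it is just

$ python3.2 Tools/stringbench/stringbench.py > bench-3.2.txt
$ python3.3 Tools/stringbench/stringbench.py > bench-3.3.txt

and then a diff of the two reports.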

--
Terry Jan Reedy

Terry Reedy

Dec 20, 2012, 5:30:24 PM
to pytho...@python.org
On 12/20/2012 2:57 PM, wxjm...@gmail.com wrote:

> I showed a case where the Py33 works 10 times slower than Py32,
> "replace". You, the devs, spend your time correcting that case.

I discovered that it is the 'find' part of find and replace that is
slower. The comparison is worse on Windows than on *nix. There is an
issue on the tracker so it may be improved someday. Most devs are not
especially bothered and would rather fix errors as part of their
volunteer work.

> Now, if I'm putting on the table an exemple working 20 times
> slower. Will you spend your time to optimize that?
>
> I'm affraid, this is the FSR which is problematic, not the
> corner cases.

I showed another case where 3.3 is a thousand, a million times faster
than 3.2. Does that make the old way 'problematic'?

Don't you think that the bugs (wrong answers) in narrow builds are
'problematic'? Do you really think that getting wrong answers faster is
better than getting right answers possibly slower?

The 'find' operation is just 1 of about 30 that are tested by
stringbench.py. Run that on 3.3 and 3.2, as I did, before talking about
FSR as 'problematic'.

--
Terry Jan Reedy

Terry Reedy

Dec 20, 2012, 5:48:04 PM
to pytho...@python.org
On 12/20/2012 2:40 PM, wxjm...@gmail.com wrote:

> What should a Python user think, if he sees his strings
> are consuming more memory just because he uses non ascii
> characters

What should a Python user think, if he (or she) sees his (or her)
strings sometimes or often consuming less memory than they did previously?

I think the person should be grateful that people volunteered to make
the improvement, rather than ungratefully bitch about it.

> or he sees his strings are changing just because
> he "uppercases" them.

Uppercasing strings is supposed to change strings.

> Unicode is here to serve anybody.

This we agree on. Python3.3 unicode serves everybody better than 3.2 does.

--
Terry Jan Reedy

Steven D'Aprano

Dec 20, 2012, 5:51:06 PM
to
On Thu, 20 Dec 2012 11:40:21 -0800, wxjmfauth wrote:

> I do not care
> about this optimization. I'm not an ascii user. As a non ascii user,
> this optimization is just irrelevant.

WRONG.

Every Python user is an ASCII user. Every Python program has hundreds or
thousands of ASCII strings.

# === example ===
import random


There's already one ASCII string in your code: the module name "random"
is ASCII. Let's look inside that module:

py> dir(random)
['BPF', 'LOG4', 'NV_MAGICCONST', 'RECIP_BPF', 'Random', 'SG_MAGICCONST',
'SystemRandom', 'TWOPI', '_BuiltinMethodType', '_MethodType',
'_Sequence', '_Set', '__all__', '__builtins__', '__cached__', '__doc__',
'__file__', '__initializing__', '__loader__', '__name__', '__package__',
'_acos', '_ceil', '_cos', '_e', '_exp', '_inst', '_log', '_pi',
'_random', '_sha512', '_sin', '_sqrt', '_test', '_test_generator',
'_urandom', '_warn', 'betavariate', 'choice', 'expovariate',
'gammavariate', 'gauss', 'getrandbits', 'getstate', 'lognormvariate',
'normalvariate', 'paretovariate', 'randint', 'random', 'randrange',
'sample', 'seed', 'setstate', 'shuffle', 'triangular', 'uniform',
'vonmisesvariate', 'weibullvariate']


That's another 58 ASCII strings. Let's pick one of those:

py> dir(random.Random)
['VERSION', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__',
'__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__',
'__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__',
'__ne__', '__new__', '__qualname__', '__reduce__', '__reduce_ex__',
'__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__',
'__subclasshook__', '__weakref__', '_randbelow', 'betavariate', 'choice',
'expovariate', 'gammavariate', 'gauss', 'getrandbits', 'getstate',
'lognormvariate', 'normalvariate', 'paretovariate', 'randint', 'random',
'randrange', 'sample', 'seed', 'setstate', 'shuffle', 'triangular',
'uniform', 'vonmisesvariate', 'weibullvariate']

That's another 51 ASCII strings. Let's pick one of them:

py> dir(random.Random.shuffle)
['__annotations__', '__call__', '__class__', '__closure__', '__code__',
'__defaults__', '__delattr__', '__dict__', '__dir__', '__doc__',
'__eq__', '__format__', '__ge__', '__get__', '__getattribute__',
'__globals__', '__gt__', '__hash__', '__init__', '__kwdefaults__',
'__le__', '__lt__', '__module__', '__name__', '__ne__', '__new__',
'__qualname__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__',
'__sizeof__', '__str__', '__subclasshook__']

And another 34 ASCII strings.

So to get access to just *one* method of *one* class of *one* module, we
have already seen up to 144 ASCII strings. (Some of them will be
duplicated.)

Even if every one of *your* classes, methods, functions, modules and
variables are using non-ASCII names, you will still use ASCII strings for
built-in functions and standard library modules.


> What should a Python user think, if he sees his strings are consuming
> more memory just because he uses non ascii characters

WRONG!

His strings are consuming just as much memory as they need to. You cannot
fit ten thousand different characters into a single byte. A single byte
can represent only 2**8 = 256 characters. Two bytes can only represent
65536 characters at most. Four bytes can represent the entire range of
every character ever represented in human history, and more, but it is
terribly wasteful: most strings do not use a billion different
characters, and so use of a four-byte character encoding uses up to four
times as much memory as necessary.


You are imagining that non-ASCII users are being discriminated against,
with their strings being unfairly bloated. But that is not the case.
Their strings would be equally large in a Python wide-build, give or take
whatever overhead of the string object that change from version to
version. If you are not comparing a wide-build of Python to Python 3.3,
then your comparison is faulty. You are comparing "buggy Unicode, cannot
handle the supplementary planes" with "fixed Unicode, can handle the
supplementary planes". Python 3.2 narrow builds save memory by
introducing bugs into Unicode strings. Python 3.3 fixes those bugs and
still saves memory.


--
Steven

Terry Reedy

Dec 20, 2012, 5:59:31 PM
to pytho...@python.org
On 12/20/2012 2:19 PM, wxjm...@gmail.com wrote:

> My feeling is that most of the people are defending this FSR simply
> because it exists, not because of its intrinsic quality.

The fact, contrary to your feeling, is that I was initially dubious that
it could be made to work as well as it does. I was only really convinced
when I ran stringbench in response to your over-generalized assertions.

It is also a fact that I proposed on the tracker and pydev list a
different method of fixing the length and index bugs in narrow builds.
It only saved space relative to wide builds but did not have the
additional space-saving of the new scheme for ascii and latin-1 text.

--
Terry Jan Reedy

Ian Kelly

Dec 20, 2012, 7:34:03 PM
to Python
On Thu, Dec 20, 2012 at 12:19 PM, <wxjm...@gmail.com> wrote:
> The first (and it should be quite obvious) consequence is that
> you create bloated, unnecessary and useless code. I simplify
> the flexible string representation (FSR) and will use an "ascii" /
> "non-ascii" model/terminology.
>
> If you are an "ascii" user, a FSR model has no sense. An
> "ascii" user will use, per definition, only "ascii characters".
>
> If you are a "non-ascii" user, the FSR model is also a non
> sense, because you are per definition a n"on-ascii" user of
> "non-ascii" character. Any optimisation for "ascii" user just
> become irrelevant.
>
> In one sense, to escape from this, you have to be at the same time
> a non "ascii" user and a non "non-ascii" user. Impossible.
> In both cases, a FSR model is useless and in both cases you are
> forced to use bloated and unnecessary code.

As Terry and Steven have already pointed out, there is no such thing
as a "non-ascii" user. Here I will take the complementary approach
and point out that there is also no such thing as an "ascii" user.
There are only users whose strings are 99.99% (or more) ASCII. A user
may think that his program will never be given any non-ASCII input to
deal with, but experience tells us that this thought is probably
wrong.

Suppose you were to split the Unicode representation into separate
"ASCII-only" and "wide" data types. Then which data type is the
correct one to choose for an "ascii" user? The correct answer is
*always* the wide data type, for the reason stated above. If the user
chooses the ASCII-only data type, then as soon his program encounters
non-ASCII data, it breaks. The only users of the ASCII-only data type
then would be the authors of buggy programs. The same issue applies
to narrow (UTF-16) data types. So there really are only two viable,
non-buggy options for Unicode representations: FSR, or always wide
(UTF-32). The latter is wildly inefficient in many cases, so Python
went with FSR.

A third option might be proposed, which would be to have a build
switch between FSR or always wide, with the promise that the two will
be indistinguishable at the Python level (apart from the amount of
memory used). This is probably not on the table, however, as it would
have a non-negligible maintenance cost, and it's not clear that
anybody other than you would actually want it.

> A solution à la FSR cannot work, or at least not work in an optimized
> way. It is not a coding scheme, it is a composite of coding schemes
> handling several character sets. Hard to imagine something worse.

It is not a composite of coding schemes. The str type deals with
exactly *one* character set -- the UCS. The different representations
are not different coding schemes. They are *all* UTF-32. The only
significant difference between the representations is that the leading
zero bytes of each character are made implicit (i.e. truncated) if the
nature of the string allows it.

> Contrary to what has been said, the bad cases I presented here are
> not corner cases.

The only significantly regressive case that you've presented here has
been str.replace on inputs engineered for bad performance. That's why
people characterize them as corner cases -- because that's exactly
what they are.

> There is practically and systematically a regression
> in Py33 compared to Py32.
> That's very easy to test. I did all my tests in the light of what
> I explained above. It was not a surprise for me to see this expectedly
> bad behaviour.

Have you run stringbench.py yet? When I ran it on my system, the full
set of Unicode benchmarks ran in 268.15 seconds for Python 3.2 versus
198.77 seconds for Python 3.3. That's a 26% overall speedup for the
covered benchmarks, which seem reasonably thorough. That does not
demonstrate a "systematic regression". If anything, that shows a
systematic improvement.

Your cherry-picking of benchmarks is like a driver who has two routes
to their destination; one takes ten minutes on average but has one
annoyingly long traffic light, while the second takes fifteen minutes
on average but has no traffic lights (and a correspondingly higher
accident rate). Yet for some reason you insist that the second route
is better because the traffic light makes the first route
"systematically" slower.

Serhiy Storchaka

Dec 27, 2012, 2:00:37 PM
to pytho...@python.org
On 19.12.12 17:40, Chris Angelico wrote:
> Interestingly, IDLE on my Windows box can't handle the bolded
> characters very well...
>
>>>> s="\U0001d407\U0001d41e\U0001d425\U0001d425\U0001d428, \U0001d430\U0001d428\U0001d42b\U0001d425\U0001d41d!"
>>>> print(s)
> Traceback (most recent call last):
> File "<pyshell#2>", line 1, in <module>
> print(s)
> UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001d407'
> in position 0: Non-BMP character not supported in Tk
>
> I think this is most likely a case of "yeah, Windows XP just sucks".
> But I have no reason or inclination to get myself a newer Windows to
> find out if it's any different.

No, this is a Tcl/Tk limitation (I don't know if this was fixed in 8.6).
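
For reference, the Tcl/Tk version a given Python is linked against can be
checked from the interpreter (the version numbers shown are only
illustrative):

>>> import tkinter
>>> tkinter.TkVersion
8.5
>>> tkinter.Tcl().eval('info patchlevel')
'8.5.11'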

wxjm...@gmail.com

Dec 27, 2012, 2:36:48 PM
to pytho...@python.org
-----


This is a strange error message. Remember: a coding scheme
covers a *set of characters*.
The guilty code point corresponds to a character which
is not part of the ucs-2 character set!

jmf


wxjm...@gmail.com

Dec 29, 2012, 2:16:55 PM
to Stefan Krah, pytho...@python.org
On Wednesday, 19 December 2012 16:33:50 UTC+1, Christian Heimes wrote:
> I think Python 3.3+ is using uppercase mapping (uc) instead of simple
> upper case (suc).

I think you are thinking correctly. This is a clever answer.

Note: I do not care about the uc / suc choice. As long as
there is consistency, I'm fine with the choice. Anyway, the
only valid "programming technique" in that field is to create
a dedicated lib for a given script (esp. French!)

jmf



>
> Some background:
>
> The old German Fractur has three variants of the letter S:
>
> capital s: S
> long s: ſ
> round s: s.
>
> ß is a ligature of ſs. ſ is usually used at the beginning or middle of a
> syllable while s is used at the end of a syllable. Compare Wachſtube
> (Wach-Stube == guard room) to Wachstube (Wachs-Tube == tube of wax). :)
>
> Christian
