
Python Unicode handling wins again -- mostly


Steven D'Aprano

unread,
Nov 29, 2013, 7:44:13 PM11/29/13
to
There's a recent blog post complaining about the lousy support for
Unicode text in most programming languages:

http://mortoray.com/2013/11/27/the-string-type-is-broken/

The author, Mortoray, gives nine basic tests to understand how well the
string type in a language works. The first four involve "user-perceived
characters", also known as grapheme clusters.


(1) Does the decomposed string "noe\u0308l" print correctly? Notice that
the accented letter ë has been decomposed into a pair of code points,
U+0065 (LATIN SMALL LETTER E) and U+0308 (COMBINING DIAERESIS).

Python 3.3 passes this test:

py> print("noe\u0308l")
noël

although I expect that depends on the terminal you are running in.


(2) If you reverse that string, does it give "lëon"? The implication of
this question is that strings should operate on grapheme clusters rather
than code points. Python fails this test:

py> print("noe\u0308l"[::-1])
leon

Some terminals may display the umlaut over the l, or following the l.

I'm not completely sure it is fair to expect a string type to operate on
grapheme clusters (collections of decomposed characters) as the author
expects. I think that is going above and beyond what a basic string type
should be expected to do. I would expect a solid Unicode implementation
to include support for grapheme clusters, and in that regard Python is
lacking functionality.


(3) What are the first three characters? The author suggests that the
answer should be "noë", in which case Python fails again:

py> print("noe\u0308l"[:3])
noe

but again I'm not convinced that slicing should operate across decomposed
strings in this way. Surely the point of decomposing the string like that
is in order to count the base character e and the accent "\u0308"
separately?


(4) Likewise, what is the length of the decomposed string? The author
expects 4, but Python gives 5:

py> len("noe\u0308l")
5

So far, Python passes only one of the four tests, but I'm not convinced
that the three failed tests are fair for a string type. If strings
operated on grapheme clusters, these would be good tests, but it is not a
given that strings should.

The next few tests have to do with characters in the Supplementary
Multilingual Planes, and this is where Python 3.3 shines. (In older
versions, wide builds would also pass, but narrow builds would fail.)

(5) What is the length of "😸😾"?

Both characters U+1F638 (GRINNING CAT FACE WITH SMILING EYES) and U+1F63E
(POUTING CAT FACE) are outside the Basic Multilingual Plane, which means
they require more than two bytes each. Most programming languages using
UTF-16 encodings internally (including Javascript and Java) fail this
test. Python 3.3 passes:

py> s = '😸😾'
py> len(s)
2

(Older versions of Python distinguished between *narrow builds*, which
used UTF-16 internally and *wide builds*, which used UTF-32. Narrow
builds would also fail this test.)

This makes Python one of a very few programming languages which can
easily handle so-called "astral characters" from the Supplementary
Multilingual Planes while still having O(1) indexing operations.
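A rough way to see what a UTF-16 based language is counting is to encode the
same string to UTF-16 and count the 16-bit code units (an illustrative
session, using the same two cat faces):

py> s = '\U0001F638\U0001F63E'
py> len(s)
2
py> len(s.encode('utf-16-le')) // 2
4

Each astral character costs two UTF-16 code units (a surrogate pair), which
is why such languages report a length of 4 here.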


(6) What is the substring after the first character? The right answer is
a single character POUTING CAT FACE, and Python gets that correct:

py> import unicodedata
py> unicodedata.name(s[1:])
'POUTING CAT FACE'

UTF-16 languages invariably end up with broken, invalid strings
containing half of a surrogate pair.


(7) What is the reverse of the string?

Python passes this test too:

py> print(s[::-1])
😾😸
py> for c in s[::-1]:
... unicodedata.name(c)
...
'POUTING CAT FACE'
'GRINNING CAT FACE WITH SMILING EYES'

UTF-16 based languages typically break, again getting invalid strings
containing surrogate pairs in the wrong order.


The next test involves ligatures. Ligatures are pairs, or triples, of
characters which have been moved closer together in order to look better.
Normally you would expect the type-setter to handle ligatures by
adjusting the spacing between characters, but there are a few pairs (such
as "fi" <=> "ﬁ") where type designers provided them as custom-designed
single characters, and Unicode includes them as legacy characters.

(8) What's the uppercase of "baﬄe" spelled with an ffl ligature?

Like most other languages, Python 3.2 fails:

py> 'baﬄe'.upper()
'BAﬄE'

but Python 3.3 passes:

py> 'baﬄe'.upper()
'BAFFLE'


Lastly, Mortoray returns to noël, and compares the composed and
decomposed versions of the string:

(9) Does "noël" equal "noe\u0308l"?

Python (correctly, in my opinion) reports that they do not:

py> "noël" == "noe\u0308l"
False

Again, one might argue whether a string type should report these as equal
or not; I believe Python is doing the right thing here. As the author
points out, any decent Unicode-aware language should at least offer the
ability to convert between normalisation forms, and Python passes this
test:

py> unicodedata.normalize("NFD", "noël") == "noe\u0308l"
True
py> "noël" == unicodedata.normalize("NFC", "noe\u0308l")
True


Out of the nine tests, Python 3.3 passes six, with three tests being
failures or dubious. If you believe that the native string type should
operate on code-points, then you'll think that Python does the right
thing. If you think it should operate on grapheme clusters, as the author
of the blog post does, then you'll think Python fails those three tests.


A call to arms
==============

As the Unicode Consortium itself acknowledges, sometimes you want to
operate on an array of code points, and sometimes on an array of
graphemes ("user-perceived characters"). Python 3.3 is now halfway there,
having excellent support for code-points across the entire Unicode
character set, not just the BMP.

The next step is to provide either a data type, or a library, for working
on grapheme clusters. The Unicode Consortium provides a detailed
discussion of this issue here:

http://www.unicode.org/reports/tr29/

If anyone is looking for a meaty project to work on, providing support
for grapheme clusters could be it. And if not, hopefully you've learned
something about Unicode and the limitations of Python's Unicode support.
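As a very rough sketch of what such a library might offer (nowhere near the
full TR29 rules, just attaching combining marks to the preceding base
character):

import unicodedata

def graphemes(text):
    # Crude approximation of grapheme clusters: attach each combining
    # mark to the preceding base character. Real segmentation (TR29)
    # also handles Hangul jamo, ZWJ sequences, regional indicators, etc.
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(graphemes("noe\u0308l"))                      # ['n', 'o', 'ë', 'l']
print(len(graphemes("noe\u0308l")))                 # 4
print(''.join(graphemes("noe\u0308l")[:3]))         # noë
print(''.join(reversed(graphemes("noe\u0308l"))))   # lëon

Even this toy version gives the answers the blog author expects for tests
(2), (3) and (4), while leaving the code-point behaviour of str untouched.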


--
Steven

Mark Lawrence

unread,
Nov 29, 2013, 8:07:29 PM11/29/13
to pytho...@python.org
On 30/11/2013 00:44, Steven D'Aprano wrote:
>
> (5) What is the length of "😸😾"?
>
> Both characters U+1F638 (GRINNING CAT FACE WITH SMILING EYES) and U+1F63E
> (POUTING CAT FACE) are outside the Basic Multilingual Plane, which means
> they require more than two bytes each. Most programming languages using
> UTF-16 encodings internally (including Javascript and Java) fail this
> test. Python 3.3 passes:
>
> py> s = '😸😾'
> py> len(s)
> 2
>

I couldn't care less if it passes, it's too slow and uses too much
memory[1], so please get the completely bug ridden Python 2 unicode
implementation restored at the earliest possible opportunity :)

[1]because I say so although I don't actually have any evidence to
support my case. :) :)

--
Python is the second best programming language in the world.
But the best has yet to be invented. Christian Tismer

Mark Lawrence

Roy Smith

unread,
Nov 29, 2013, 9:08:49 PM11/29/13
to
In article <529934dc$0$29993$c3e8da3$5496...@news.astraweb.com>,
Steven D'Aprano <steve+comp....@pearwood.info> wrote:

> (8) What's the uppercase of "baffle" spelled with an ffl ligature?
>
> Like most other languages, Python 3.2 fails:
>
> py> 'baffle'.upper()
> 'BAfflE'
>
> but Python 3.3 passes:
>
> py> 'baffle'.upper()
> 'BAFFLE'

I disagree.

The whole idea of ligatures like fi is purely typographic. The crossbar
on the "f" (at least in some fonts) runs into the dot on the "i".
Likewise, the top curl on an "f" run into the serif on top of the "l"
(and similarly for ffl).

There is no such thing as a "FFL" ligature, because the upper case
letterforms don't run into each other like the lower case ones do.
Thus, I would argue that it's wrong to say that calling upper() on an
ffl ligature should yield FFL.

I would certainly expect, x.lower() == x.upper().lower(), to be True for
all values of x over the set of valid unicode codepoints. Having
u"\uFB04".upper() ==> "FFL" breaks that. I would also expect len(x) ==
len(x.upper()) to be True.

Chris Angelico

unread,
Nov 29, 2013, 9:12:34 PM11/29/13
to pytho...@python.org
On Sat, Nov 30, 2013 at 1:08 PM, Roy Smith <r...@panix.com> wrote:
> I would certainly expect, x.lower() == x.upper().lower(), to be True for
> all values of x over the set of valid unicode codepoints. Having
> u"\uFB04".upper() ==> "FFL" breaks that. I would also expect len(x) ==
> len(x.upper()) to be True.

That's a nice theory, but the Unicode consortium disagrees with you on
both points.

ChrisA
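For example, in Python 3.3 (illustrative characters, picked for the
purpose):

py> x = '\N{LATIN SMALL LETTER SHARP S}'    # ß
py> x.lower() == x.upper().lower()
False
py> y = '\N{LATIN SMALL LIGATURE FFL}'
py> len(y), len(y.upper())
(1, 3)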

Roy Smith

unread,
Nov 29, 2013, 9:28:47 PM11/29/13
to
In article <mailman.3417.1385777...@python.org>,
Harumph.

Dave Angel

unread,
Nov 29, 2013, 10:06:21 PM11/29/13
to pytho...@python.org
And they were already false long before Unicode. I don’t know
specifics but there are many cases where there are no uppercase
equivalents for a particular lowercase character. And others where
the uppercase equivalent takes multiple characters.
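A couple of concrete cases (picked for illustration, under Python 3.3):

py> '\N{ARABIC LETTER ALEF}'.upper() == '\N{ARABIC LETTER ALEF}'
True
py> '\N{LATIN SMALL LETTER N PRECEDED BY APOSTROPHE}'.upper()
'ʼN'
py> len('\N{LATIN SMALL LETTER N PRECEDED BY APOSTROPHE}'.upper())
2

Arabic has no case at all, so upper() leaves the letter alone, while ʼn
uppercases to two characters.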

--
DaveA

Steven D'Aprano

unread,
Nov 29, 2013, 11:21:49 PM11/29/13
to
On Fri, 29 Nov 2013 21:08:49 -0500, Roy Smith wrote:

> In article <529934dc$0$29993$c3e8da3$5496...@news.astraweb.com>,
> Steven D'Aprano <steve+comp....@pearwood.info> wrote:
>
>> (8) What's the uppercase of "baffle" spelled with an ffl ligature?
>>
>> Like most other languages, Python 3.2 fails:
>>
>> py> 'baffle'.upper()
>> 'BAfflE'

You edited my text to remove the ligature? That's... unfortunate.



>> but Python 3.3 passes:
>>
>> py> 'baffle'.upper()
>> 'BAFFLE'
>
> I disagree.
>
> The whole idea of ligatures like fi is purely typographic.

In English, that's correct. I'm not sure if we can generalise that to all
languages that have ligatures. It also partly depends on how you define
ligatures. For example, would you consider that ampersand & to be a
ligature? These days, I would consider & to be a distinct character, but
originally it began as a ligature for "et" (Latin for "and").

But let's skip such corner cases, as they provide much heat but no
illumination, and I'll agree that when it comes to ligatures like fl, fi
and ffl, they are purely typographic.


> The crossbar
> on the "f" (at least in some fonts) runs into the dot on the "i".
> Likewise, the top curl on an "f" run into the serif on top of the "l"
> (and similarly for ffl).
>
> There is no such thing as a "FFL" ligature, because the upper case
> letterforms don't run into each other like the lower case ones do. Thus,
> I would argue that it's wrong to say that calling upper() on an ffl
> ligature should yield FFL.

Your conclusion doesn't follow from the argument you are making. Since
the ffl ligature ﬄ is purely a typographical feature, then it should
uppercase to FFL (there being no typographic feature for an uppercase FFL
ligature).

Consider the examples shown above, where you or your software
unfortunately edited out the ligature and replaced it with ASCII "ffl".
Or perhaps I should say *fortunately*, since it demonstrates the problem.

Since we agree that the ffl ligature is merely a typographic artifact of
some type-designer's whimsy, we can expect that the word "baﬄe" is
semantically exactly the same as the word "baffle". How foolish Python
would look if it did this:

py> 'baffle'.upper()
'BAfflE'


Replace the 'ffl' with the ligature, and the conclusion remains:

py> 'baﬄe'.upper()
'BAﬄE'

would be equally wrong.

Now, I accept that this picture isn't entirely black and white. For
example, we might argue that if ffl is purely typographical in nature,
surely we would also want 'baffle' == 'baﬄe' too? Or maybe not. This
indicates that capturing *all* the rules for text across the many
languages, writing systems and conventions is impossible.

There are some circumstances where we would want 'baffle' and 'baﬄe' to
compare equal, and others where we would want them to compare as distinct.
Python gives us both:

py> "bapy> "baffle" == "baffle"
False
ffle" == unicodedata.normalize("NFKC", "baffle")
True


but frankly I'm baffled *wink* that you think there are any circumstances
where you would want the uppercase of ffl to be anything but FFL.


> I would certainly expect, x.lower() == x.upper().lower(), to be True for
> all values of x over the set of valid unicode codepoints.

You would expect wrongly. You are over-generalising from English, and if
you include ligatures and other special cases, not even all of English.

See, for example:

http://www.unicode.org/faq/casemap_charprop.html#7a

Apart from ligatures, some examples of troublesome characters with regard
to case are:

* German Eszett (sharp-S) ß can be uppercased to SS, SZ or ẞ depending
on context, particularly when dealing with placenames and family names.

(That last character, LATIN CAPITAL LETTER SHARP S, goes back to at
least the 1930s, although the official rules of German orthography
still insist on uppercasing ß to SS.)

* The English long-s ſ is uppercased to regular S.

* Turkish dotted and dotless I (İ and i, I and ı) uses the same Latin
letters I and i but the case conversion rules are different.

* Both the Greek sigma σ and final sigma ς uppercase to Σ.


That last one is especially interesting: Python 3.3 gets it right, while
older Pythons do not. In Python 3.2:

py> 'Ὀδυσσεύς (Odysseus)'.upper().title()
'Ὀδυσσεύσ (Odysseus)'

while in 3.3 it roundtrips correctly:

py> 'Ὀδυσσεύς (Odysseus)'.upper().title()
'Ὀδυσσεύς (Odysseus)'


So... case conversions are not as simple as they appear at first glance.
They aren't always reversible, nor do they always roundtrip. Titlecase is
not necessarily the same as "uppercase the first letter and lowercase the
rest". Case conversions can be context or locale sensitive.

Anyway... even if you disagree with everything I have said, it is a fact
that Python has committed to following the Unicode standard, and the
Unicode standard requires that certain ligatures, including FFL, FL and
FI, are decomposed when converted to uppercase.



--
Steven

Roy Smith

unread,
Nov 29, 2013, 11:30:22 PM11/29/13
to
In article <529967dc$0$29993$c3e8da3$5496...@news.astraweb.com>,
Steven D'Aprano <steve+comp....@pearwood.info> wrote:

> You edited my text to remove the ligature? That's... unfortunate.

It was un-ligated by the time it reached me.

Zero Piraeus

unread,
Nov 30, 2013, 12:05:59 AM11/30/13
to pytho...@python.org
:

On Sat, Nov 30, 2013 at 04:21:49AM +0000, Steven D'Aprano wrote:
> On Fri, 29 Nov 2013 21:08:49 -0500, Roy Smith wrote:
> > The whole idea of ligatures like fi is purely typographic.
>
> In English, that's correct. I'm not sure if we can generalise that to
> all languages that have ligatures. It also partly depends on how you
> define ligatures. For example, would you consider that ampersand & to
> be a ligature? These days, I would consider & to be a distinct
> character, but originally it began as a ligature for "et" (Latin for
> "and").
>
> But let's skip such corner cases, as they provide much heat but no
> illumination, [...]

In the interest of warmth (I know it's winter in some parts of the
world) ...

As I understand it, "&" has always been used to replace the word "et"
specifically, rather than the letter-pair e,t (no-one has ever written
"k&tle" other than ironically), which makes it a logogram rather than a
ligature (like "@").

(I happen to think the presence of ligatures in Unicode is insane, but
my dictator-of-the-world certificate appears to have gotten lost in the
post, so fixing that will have to wait).

-[]z.

--
Zero Piraeus: inter caetera
http://etiol.net/pubkey.asc

Gene Heskett

unread,
Nov 30, 2013, 12:25:48 AM11/30/13
to pytho...@python.org
On Saturday 30 November 2013 00:23:22 Zero Piraeus did opine:

> On Sat, Nov 30, 2013 at 04:21:49AM +0000, Steven D'Aprano wrote:
> > On Fri, 29 Nov 2013 21:08:49 -0500, Roy Smith wrote:
> > > The whole idea of ligatures like fi is purely typographic.
> >
> > In English, that's correct. I'm not sure if we can generalise that to
> > all languages that have ligatures. It also partly depends on how you
> > define ligatures. For example, would you consider that ampersand & to
> > be a ligature? These days, I would consider & to be a distinct
> > character, but originally it began as a ligature for "et" (Latin for
> > "and").
> >
> > But let's skip such corner cases, as they provide much heat but no
> > illumination, [...]
>
> In the interest of warmth (I know it's winter in some parts of the
> world) ...
>
> As I understand it, "&" has always been used to replace the word "et"
> specifically, rather than the letter-pair e,t (no-one has ever written
> "k&tle" other than ironically), which makes it a logogram rather than a
> ligature (like "@").

Whereas in these here parts, the "&" has always been read as a single
character shortcut for the word "and".
>
> (I happen to think the presence of ligatures in Unicode is insane, but
> my dictator-of-the-world certificate appears to have gotten lost in the
> post, so fixing that will have to wait).
>
> -[]z.


Cheers, Gene
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>

"I remember when I was a kid I used to come home from Sunday School and
my mother would get drunk and try to make pancakes."
-- George Carlin
A pen in the hand of this president is far more
dangerous than 200 million guns in the hands of
law-abiding citizens.

Roy Smith

unread,
Nov 30, 2013, 12:37:17 AM11/30/13
to
In article <529967dc$0$29993$c3e8da3$5496...@news.astraweb.com>,
Steven D'Aprano <steve+comp....@pearwood.info> wrote:

> > The whole idea of ligatures like fi is purely typographic.
>
> In English, that's correct. I'm not sure if we can generalise that to all
> languages that have ligatures. It also partly depends on how you define
> ligatures.

I was speaking specifically of "ligatures like fi" (or, if you prefer,
"ligatures like ό". By which I mean those things printers invented
because some letter combinations look funny when typeset as two distinct
letters.

There are other kinds of ligatures. For example, œ is a dipthong. It
makes sense (well, to me, anyway) that upper case œ is Έ.

Well, anyway, that's the truth according to me. Apparently the Unicode
Consortium disagrees. So, who am I to argue with the people who decided
that I needed to be able to type a "PILE OF POO" character. Which, by
the way, I can find in my "Character Viewer" input helper, but which MT
Newswatcher doesn't appear to be willing to insert into text. I guess
Basic Multilingual Poo would have been OK but Astral Poo is too much for
it.

Ian Kelly

unread,
Nov 30, 2013, 1:00:27 AM11/30/13
to Python
On Fri, Nov 29, 2013 at 10:37 PM, Roy Smith <r...@panix.com> wrote:
> I was speaking specifically of "ligatures like fi" (or, if you prefer,
> "ligatures like ό". By which I mean those things printers invented
> because some letter combinations look funny when typeset as two distinct
> letters.

I think the encoding of your email is incorrect, because GREEK SMALL
LETTER OMICRON WITH TONOS is not a ligature.

> There are other kinds of ligatures. For example, oe is a dipthong. It
> makes sense (well, to me, anyway) that upper case oe is Έ.

As above. I can't fathom why would it make sense for the upper case of
LATIN SMALL LIGATURE OE to be GREEK CAPITAL LETTER EPSILON WITH TONOS.

Steven D'Aprano

unread,
Nov 30, 2013, 1:25:49 AM11/30/13
to
On Sat, 30 Nov 2013 02:05:59 -0300, Zero Piraeus wrote:

> (I happen to think the presence of ligatures in Unicode is insane, but
> my dictator-of-the-world certificate appears to have gotten lost in the
> post, so fixing that will have to wait).

You're probably right, but we live in an insane world of dozens of insane
legacy encodings, and part of the mission of Unicode is to include every
single character that those legacy encodings did. Since some of them
included ligatures, so must Unicode. Sad but true.

(Unicode is intended as a replacement for the insanity of dozens of
multiply incompatible character sets. It cannot hope to replace them if
it cannot represent every distinct character they represent.)


--
Steven

Steven D'Aprano

unread,
Nov 30, 2013, 2:11:59 AM11/30/13
to
On Fri, 29 Nov 2013 23:00:27 -0700, Ian Kelly wrote:

> On Fri, Nov 29, 2013 at 10:37 PM, Roy Smith <r...@panix.com> wrote:
>> I was speaking specifically of "ligatures like fi" (or, if you prefer,
>> "ligatures like ό". By which I mean those things printers invented
>> because some letter combinations look funny when typeset as two
>> distinct letters.
>
> I think the encoding of your email is incorrect, because GREEK SMALL
> LETTER OMICRON WITH TONOS is not a ligature.

Roy's post, which is sent via Usenet not email, doesn't have an encoding
set. Since he's sending from a Mac, his software may believe that the
entire universe understands the Mac Roman encoding, which makes a certain
amount of sense since if I recall correctly the fi and fl ligatures
originally appeared in early Mac fonts.

I'm going to give Roy the benefit of the doubt and assume he actually
entered the fi ligature at his end. If his software was using Mac Roman,
it would insert a single byte DE into the message:

py> '\N{LATIN SMALL LIGATURE FI}'.encode('macroman')
b'\xde'


But that's not what his post includes. The message actually includes two
bytes CF8C, in other words:

'\N{LATIN SMALL LIGATURE FI}'.encode('who the hell knows')
=> b'\xCF\x8C'


Since nearly all of his post is in single bytes, it's some variable-width
encoding, but not UTF-8.

With no encoding set, our newsreader software starts off assuming that
the post uses UTF-8 ('cos that's the only sensible default), and those
two bytes happen to encode to ό GREEK SMALL LETTER OMICRON WITH TONOS.
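That decoding step is easy to check from the interpreter (an illustrative
session):

py> import unicodedata
py> b'\xcf\x8c'.decode('utf-8')
'ό'
py> unicodedata.name(b'\xcf\x8c'.decode('utf-8'))
'GREEK SMALL LETTER OMICRON WITH TONOS'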

I'm not surprised that Roy has a somewhat jaundiced view of Unicode, when
the tools he uses are apparently so broken. But it isn't Unicode's fault,
it's the tools.

The really bizarre thing is that apparently Roy's software, MT-
NewsWatcher, knows enough Unicode to normalise ﬄ LATIN SMALL LIGATURE FFL
(sent in UTF-8 and therefore appearing as bytes b'\xef\xac\x84') to the
ASCII letters "ffl". That's astonishingly weird.

That is really a bizarre error. I suppose it is not entirely impossible
that the software is actually being clever rather than dumb. Having
correctly decoded the UTF-8 bytes, perhaps it realised that there was no
glyph for the ligature, and rather than display a MISSING CHAR glyph
(usually one of those empty boxes you sometimes see), it normalized it to
ASCII. But if it's that clever, why the hell doesn't it set an encoding
line in posts?????


--
Steven

Steven D'Aprano

unread,
Nov 30, 2013, 2:41:16 AM11/30/13
to
On Sat, 30 Nov 2013 00:37:17 -0500, Roy Smith wrote:

> So, who am I to argue with the people who decided that I needed to be
> able to type a "PILE OF POO" character.

Blame the Japanese for that. Apparently some of the biggest users of
Unicode are the various Japanese mobile phone manufacturers, TV stations,
map makers and similar. So there's a large number of symbols and emoji
(emoticons) specifically added for them, presumably because they pay big
dollars to the Unicode Consortium and therefore get a lot of influence in
what gets added.


--
Steven

Mark Lawrence

unread,
Nov 30, 2013, 3:07:38 AM11/30/13
to pytho...@python.org
http://bugs.python.org/issue19819 talks about these beasties. Please
don't come back to me as I haven't got a clue!!!

wxjm...@gmail.com

unread,
Nov 30, 2013, 2:11:13 PM11/30/13
to
On Saturday 30 November 2013 at 03:08:49 UTC+1, Roy Smith wrote:
> The whole idea of ligatures like fi is purely typographic. The crossbar
> on the "f" (at least in some fonts) runs into the dot on the "i".
> Likewise, the top curl on an "f" run into the serif on top of the "l"
> (and similarly for ffl).


And do you know the origin of this typographical feature?
Because, mechanically, the dot of the "i" broke too often.

I can't prove that it's true, but I have read this many times in
the literature on typography and on Unicode.

In my opinion, a very plausible explanation.

jmf

Gregory Ewing

unread,
Nov 30, 2013, 5:37:30 PM11/30/13
to
wxjm...@gmail.com wrote:
> And do you know the origin of this typographical feature?
> Because, mechanically, the dot of the "i" broke too often.
>
> In my opinion, a very plausible explanation.

It doesn't sound very plausible to me, because there
are a lot more stand-alone 'i's in English text than
there are ones following an f. What is there to stop
them from breaking?

It's more likely to be simply a kerning issue. You
want to get the stems of the f and the i close together,
and the only practical way to do that with mechanical
type is to merge them into one piece of metal.

Which makes it even sillier to have an 'ffi' character
in this day and age, when you can simply space the
characters so that they overlap.

--
Greg

Gregory Ewing

unread,
Nov 30, 2013, 5:41:28 PM11/30/13
to
Steven D'Aprano wrote:
> On Sat, 30 Nov 2013 00:37:17 -0500, Roy Smith wrote:
>
>>So, who am I to argue with the people who decided that I needed to be
>>able to type a "PILE OF POO" character.
>
> Blame the Japanese for that. Apparently some of the biggest users of
> Unicode are the various Japanese mobile phone manufacturers, TV stations,
> map makers and similar.

Also there's apparently a pun in Japanese involving the
words for 'poo' and 'luck'. So putting a poo symbol in
your text message means 'good luck'. Given that, it's
not *quite* as silly as it seems.

--
Best of poo,
Greg

Ned Batchelder

unread,
Nov 30, 2013, 6:07:36 PM11/30/13
to pytho...@python.org
The fi ligature was created because visually, an f and i wouldn't work
well together: the crossbar of the f was near, but not connected to, the
serif of the i, and the terminal bulb of the f was close to, but not
coincident with, the dot of the i.

This article goes into great detail, and has a good illustration of how
an f and i can clash, and how an fi ligature can fix the problem:
http://opentype.info/blog/2012/11/20/whats-a-ligature/ . Note the second
fi illustration, which demonstrates using a ligature to make the letters
appear *less* connected than they would individually!

This is also why "simply spacing the characters" isn't a solution: a
specially designed ligature looks better than a separate f and i, no
matter how minutely kerned.

It's unfortunate that Unicode includes presentation alternatives like
the fi (and ff, fl, ffi, and ffl) ligatures. It was done to be a
superset of existing encodings.

Many typefaces have other non-encoded ligatures as well, especially
display faces, which also have alternate glyphs. Unicode is a funny mix
in that it includes some forms of alternates, but can't include all of
them, so we have to put up with both an ad-hoc Unicode that includes
presentational variants, and also some other way to specify variants
because Unicode can't include all of them.

--Ned.

Steven D'Aprano

unread,
Nov 30, 2013, 7:22:28 PM11/30/13
to
On Sun, 01 Dec 2013 11:37:30 +1300, Gregory Ewing wrote:

> Which makes it even sillier to have an 'ffi' character in this day and
> age, when you can simply space the characters so that they overlap.

It's in Unicode to support legacy character sets that included it[1].
There are a bunch of similar cases:

* LATIN CAPITAL LETTER A WITH RING ABOVE versus ANGSTROM SIGN
* KELVIN SIGN versus LATIN CAPITAL LETTER A
* DEGREE CELSIUS and DEGREE FAHRENHEIT
* the whole set of full-width and half-width forms

On the other hand, there are cases which to a naive reader might look
like needless duplication but actually aren't. For example, there are a
bunch of visually indistinguishable characters[2] in European languages,
like AΑА and BΒВ. The reason for this becomes more obvious[3] when you
lowercase them:

py> 'AΑА BΒВ'.lower()
'aαа bβв'

Sorting and case-conversion rules would become insanely complicated, and
context-sensitive, if Unicode only included a single code point per thing-
that-looks-the-same.
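The three lookalikes really are three distinct characters from three
distinct scripts, which is easy to confirm (an illustrative check):

py> import unicodedata
py> [unicodedata.name(c) for c in 'AΑА']
['LATIN CAPITAL LETTER A', 'GREEK CAPITAL LETTER ALPHA', 'CYRILLIC CAPITAL LETTER A']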

The rules for deciding what is and what isn't a distinct character can be
quite complex, and often politically charged. There's a lot of opposition
to Unicode in East Asian countries because it unifies Han ideograms that
look and behave the same in Chinese, Japanese and Korean. The reason they
do this is for the same reason that Unicode doesn't distinguish between
(say) English A, German A and French A. One reason some East Asians want
it to is for the same reason you or I might wish to flag a section of
text as English and another section of text as German, and have them
displayed in slightly different typefaces and spell-checked with a
different dictionary. The Unicode Consortium's answer to that is, this is
beyond the remit of the character set, and is best handled by markup or
higher-level formatting.

(Another reason for opposing Han unification is, let's be frank, pure
nationalism.)



[1] As far as I can tell, the only character supported by legacy
character sets which is not included in Unicode is the Apple logo from
Mac charsets.

[2] The actual glyphs depends on the typeface used.

[3] Again, modulo the typeface you're using to view them.



--
Steven

Tim Chase

unread,
Nov 30, 2013, 7:52:48 PM11/30/13
to pytho...@python.org
On 2013-12-01 00:22, Steven D'Aprano wrote:
> * KELVIN SIGN versus LATIN CAPITAL LETTER A

I should hope so ;-)

-tkc


Steven D'Aprano

unread,
Nov 30, 2013, 7:54:18 PM11/30/13
to
I blame my keyboard, where letters A and K are practically right next to
each other, only seven letters apart. An easy typo to make.



--
Stpvpn

Tim Chase

unread,
Nov 30, 2013, 8:05:01 PM11/30/13
to pytho...@python.org
I suppose I should have modified my attribution-quote to read "Steven
D'Kprano wrote" then :-)

-tkc



Chris Angelico

unread,
Nov 30, 2013, 8:13:54 PM11/30/13
to pytho...@python.org
“It’s an easy mistake to make” the PFY concurs “Many’s the time I’ve
picked up a cattle prod thinking it was a lint remover as I’ve helped
groom one of your predecessors before an important board meeting about
slashing the IT budget.”

http://www.theregister.co.uk/2010/11/26/bofh_2010_episode_18/

ChrisA

Roy Smith

unread,
Nov 30, 2013, 8:27:57 PM11/30/13
to
In article <mailman.3431.1385860...@python.org>,
What means "PFY"? The only thing I can think of is "Poor F---ing
Yankee" :-)

Chris Angelico

unread,
Nov 30, 2013, 8:31:42 PM11/30/13
to pytho...@python.org
On Sun, Dec 1, 2013 at 12:27 PM, Roy Smith <r...@panix.com> wrote:
>> http://www.theregister.co.uk/2010/11/26/bofh_2010_episode_18/
>>
>> ChrisA
>
> What means "PFY"? The only thing I can think of is "Poor F---ing
> Yankee" :-)

In the context of the BOFH, it stands for Pimply-Faced Youth and means
BOFH's assistant.

ChrisA

wxjm...@gmail.com

unread,
Dec 1, 2013, 11:57:44 AM12/1/13
to
I'm speaking about those times when some of the "characters" were
not even made of metal, but of wood (see Garamond, Bodoni).

---------

Unicode is "only" collecting "characters" in the sense "abstract
entities". What is supposed to be a "character" is one problem.
How a tool is supposed to handle these "characters" is a problem
too, but a different one.

"Unicode" is not a coding scheme, it is a "repertoire".

Illustrative examples instead of explanations.

The ffl ligature is a "character" because it has always
existed.

The & and œ are considered today as unique "characters".
They were historically "ligaturated forms".

The Fahrenheit, Kelvin and Celsius signs are considered as
"characters", despite the fact that Fahrenheit and Kelvin are "letters".

Text justification. Calculating the space between "words"
in "rendering units" makes sense. Using a specific "character"
like a thin space to force a predefined space makes sense too.

The miscellaneous zeroes one may see, like an uppercase O, an O with
a dot in the center or a struck-through O, are all the same zero, but
with stylistic variants, => a single "character" in the unicode
table.

... but this medieval "character" existing in two forms (I do not
remember which one) was finally registered as two "characters",
and not as a stylistic variant of a single "character".

There are no "characters" for the symbols of the chemical elements,
a latin script is good enough.

The QPlainTextEdit widget from Qt does not know '\n'. It uses
only the paragraph separator and the line separator. To render
a paragraph separator, it uses yet another "character", the
pilcrow.

The µ "character" in the iso-8859-1 coding scheme is a greek
letter, it must be used or percieved as a SI unit prefix.
Unicode category: Ll, unicode name: micro sign.

How to place an arrow (vector) on top of an ê, if one can't
decompose it?

Related, there are dotless variants of i and j.

STIX fonts with the huge number of math symbols, not
yet in the unicode repertoire but present in the PUA.

etc.

Unicode is quite open. It's a good idea to keep that
openness for the developer. In short, if a coder decomposes
a "character" like "â" into an "a" plus a "^", it's up to
the developer to know what to do when reversing such a
string and to count this sequence as two real "characters".

jmf



Serhiy Storchaka

unread,
Dec 1, 2013, 1:00:21 PM12/1/13
to pytho...@python.org
30.11.13 02:44, Steven D'Aprano wrote:
> (2) If you reverse that string, does it give "lëon"? The implication of
> this question is that strings should operate on grapheme clusters rather
> than code points. Python fails this test:
>
> py> print("noe\u0308l"[::-1])
> leon

>>> print(unicodedata.normalize('NFC', "noe\u0308l")[::-1])
lëon

> (3) What are the first three characters? The author suggests that the
> answer should be "noë", in which case Python fails again:
>
> py> print("noe\u0308l"[:3])
> noe

>>> print(unicodedata.normalize('NFC', "noe\u0308l")[:3])
noë

> (4) Likewise, what is the length of the decomposed string? The author
> expects 4, but Python gives 5:
>
> py> len("noe\u0308l")
> 5

>>> print(len(unicodedata.normalize('NFC', "noe\u0308l")))
4


wxjm...@gmail.com

unread,
Dec 1, 2013, 3:15:57 PM12/1/13
to
30.11.13 02:44, Steven D'Aprano wrote:
> (2) If you reverse that string, does it give "lëon"? The implication of
> this question is that strings should operate on grapheme clusters rather
> than code points. ...
>

BTW, a grapheme cluster *is* a cluster of code points.

jmf

Tim Delaney

unread,
Dec 1, 2013, 3:54:48 PM12/1/13
to Python-List
Anyone with a decent level of reading comprehension would have understood that Steven knows that. The implied word is "individual" i.e. "... rather than [individual] code points".

Why am I responding to a troll? Probably because out of all his baseless complaints about the FSR, he *did* have one valid point about performance that has now been fixed.

Tim Delaney

Mark Lawrence

unread,
Dec 1, 2013, 5:06:10 PM12/1/13
to pytho...@python.org
On 01/12/2013 20:54, Tim Delaney wrote:
> On 2 December 2013 07:15, <wxjm...@gmail.com> wrote:
> Anyone with a decent level of reading comprehension would have
> understood that Steven knows that. The implied word is "individual" i.e.
> "... rather than [individual] code points".
>
> Why am I responding to a troll? Probably because out of all his baseless
> complaints about the FSR, he *did* have one valid point about
> performance that has now been fixed.
>
> Tim Delaney
>
>

I don't remember him ever having a valid point, so FTR can we have a
reference please. I do remember Steven D'Aprano showing that there was
a regression which I flagged up here http://bugs.python.org/issue16061.
It was fixed by Serhiy Storchaka, who appears to have forgotten more
about Python than I'll ever know, grrr!!! :)

Tim Delaney

unread,
Dec 1, 2013, 5:29:51 PM12/1/13
to Python-List
On 2 December 2013 09:06, Mark Lawrence <bream...@yahoo.co.uk> wrote:
I don't remember him ever having a valid point, so FTR can we have a reference please.  I do remember Steven D'Aprano showing that there was a regression which I flagged up here http://bugs.python.org/issue16061.  It was fixed by Serhiy Storchaka, who appears to have forgotten more about Python than I'll ever know, grrr!!! :)

From your own bug report (quoting Steven): "Nevertheless, I think there is something here. The consequences are nowhere near as dramatic as jmf claims ..."

His initial postings did lead to a regression being found.

Tim Delaney

Mark Lawrence

unread,
Dec 1, 2013, 6:10:33 PM12/1/13
to pytho...@python.org
On 01/12/2013 22:29, Tim Delaney wrote:
> On 2 December 2013 09:06, Mark Lawrence <bream...@yahoo.co.uk
> <mailto:bream...@yahoo.co.uk>> wrote:
>
> I don't remember him ever having a valid point, so FTR can we have a
> reference please. I do remember Steven D'Aprano showing that there
> was a regression which I flagged up here
> http://bugs.python.org/__issue16061
> <http://bugs.python.org/issue16061>. It was fixed by Serhiy
> Storchaka, who appears to have forgotten more about Python than I'll
> ever know, grrr!!! :)
>
>
> From your own bug report (quoting Steven): "Nevertheless, I think there
> is something here. The consequences are nowhere near as dramatic as jmf
> claims ..."
>
> His initial postings did lead to a regression being found.
>
> Tim Delaney
>
>

I'll begrudgingly concede that point, but must state that it was an
edge case that is unlikely to have too much impact in the real world.
Unfortunately he's still making his ridiculous claims about the FSR,
hence my nickname of "Joseph McCarthy". I'll admit to liking that, it
just feels right to me, YMMV.

What also really riles me is that he uses double spaced google crap,
despite repeated requests from various people here for others to fix how
they use it, or get a decent email client.

Ethan Furman

unread,
Dec 1, 2013, 5:50:53 PM12/1/13
to pytho...@python.org
On 12/01/2013 02:06 PM, Mark Lawrence wrote:
>
> I don't remember him [jmf] ever having a valid point, so FTR can we have a reference please. I do remember Steven D'Aprano
> showing that there was a regression which I flagged up here http://bugs.python.org/issue16061. It was fixed by Serhiy
> Storchaka, who appears to have forgotten more about Python than I'll ever know, grrr!!! :)

The initial complaint came, unsurprisingly, from jmf. But don't worry much, even a stopped clock has a better track
record... it's at least right twice a day. ;)

--
~Ethan~

Mark Lawrence

unread,
Dec 1, 2013, 7:43:43 PM12/1/13
to pytho...@python.org
I had to chuckle, "initial complaint" indeed!!! He first started
complaining in August 2012 in this thread
https://mail.python.org/pipermail/python-list/2012-August/628650.html.
Then he continued in September 2012 in this thread
https://mail.python.org/pipermail/python-list/2012-September/631613.html, which
led to issue 16061. He's been continuing to moan on and off ever
since, but funnily enough has *NEVER* produced a single shred of
evidence to back his claims. We'll have to wait until the cows come
home before he does.

Contrast that to the Victor Stinner statement here
http://bugs.python.org/issue16061#msg171413 "Python 3.3 is 2x faster
than Python 3.2 to replace a character with another if the string only
contains the character 3 times. This is not acceptable, Python 3.3 must
be as slow as Python 3.2!" Thinking about that I really do want the
Python 2 code back. Apart from the PEP 393 implementation being faster,
using less memory and being correct, it has nothing to offer. Now what
Python sketch does that remind me of? :)

wxjm...@gmail.com

unread,
Dec 2, 2013, 7:39:26 AM12/2/13
to
My English is far too be perfect, I think I understood
it correctly.

The point is not in the words "grapheme" or "code point",
nor in "individual", ;-), the point is in "rather".

If one wishes to work on a set of graphemes, one can
only work with the set of the corresponding code points.


To complete Serhiy Storchaka's example:

>>> len(unicodedata.normalize('NFKD', '\ufdfa')) == 18
True

is correct.

jmf

PS I did not even speak about the FSR.

Mark Lawrence

unread,
Dec 2, 2013, 9:46:28 AM12/2/13
to pytho...@python.org
On 02/12/2013 12:39, wxjm...@gmail.com wrote:
>
> My English is far too be perfect, I think I understood
> it correctly.
>
> PS I did not even speak about the FSR.
>

1) Your English is far from perfect as you clearly do not understand the
repeated requests *NOT* to send us double spaced crap via google groups.

2) You can't speak about the FSR as you know precisely nothing about it,
but as they say, ignorance is bliss.

Ned Batchelder

unread,
Dec 2, 2013, 10:22:43 AM12/2/13
to pytho...@python.org
On 12/2/13 9:46 AM, Mark Lawrence wrote:
> On 02/12/2013 12:39, wxjm...@gmail.com wrote:
>>
>> My English is far too be perfect, I think I understood
>> it correctly.
>>
>> PS I did not even speak about the FSR.
>>
>
> 1) Your English is far from perfect as you clearly do not understand the
> repeated requests *NOT* to send us double spaced crap via google groups.
>
> 2) You can't speak about the FSR as you know precisely nothing about it,
> but as they say, ignorance is bliss.
>

As annoying as baseless claims against the FSR were, wxjmfauth is
right: he didn't even mention the FSR in this thread. There's really no
point dragging this thread into that territory.

--Ned.

Mark Lawrence

unread,
Dec 2, 2013, 10:45:18 AM12/2/13
to pytho...@python.org
On 02/12/2013 15:22, Ned Batchelder wrote:
> On 12/2/13 9:46 AM, Mark Lawrence wrote:
>> On 02/12/2013 12:39, wxjm...@gmail.com wrote:
>>>
>>> My English is far too be perfect, I think I understood
>>> it correctly.
>>>
>>> PS I did not even speak about the FSR.
>>>
>>
>> 1) Your English is far from perfect as you clearly do not understand the
>> repeated requests *NOT* to send us double spaced crap via google groups.
>>
>> 2) You can't speak about the FSR as you know precisely nothing about it,
>> but as they say, ignorance is bliss.
>>
>
> As annoying as baseless claims against the FSR were, wxjmafauth is
> right: he didn't even mention the FSR in this thread. There's really no
> point dragging this thread into that territory.
>
> --Ned.
>

He's quite deliberately dragged it up by using p.s. Without doubt he's
the worst loser in the world and I'm *NOT* stopping getting at him. I
find his behaviour, continuously and groundlessly insulting the Python
core developers, quite disgusting.

Chris Angelico

unread,
Dec 2, 2013, 10:49:34 AM12/2/13
to pytho...@python.org
On Tue, Dec 3, 2013 at 2:45 AM, Mark Lawrence <bream...@yahoo.co.uk> wrote:
> He's quite deliberately dragged it up by using p.s. Without doubt he's the
> worst loser in the world and I'm *NOT* stopping getting at him. I find his
> behaviour, continuously and groundlessly insulting the Python core
> developers, quite disgusting.

What he does is make very sure that the awesomeness of Python 3.3+ is
constantly being brought up on python-list. New users of Python who
come here will, within a fairly short time, learn that Python actually
gets Unicode right, unlike most languages out there, and that it's
efficient and high performance.

ChrisA

Ned Batchelder

unread,
Dec 2, 2013, 10:58:26 AM12/2/13
to pytho...@python.org
On 12/2/13 10:45 AM, Mark Lawrence wrote:
> On 02/12/2013 15:22, Ned Batchelder wrote:
>> On 12/2/13 9:46 AM, Mark Lawrence wrote:
>>> On 02/12/2013 12:39, wxjm...@gmail.com wrote:
>>>>
>>>> My English is far too be perfect, I think I understood
>>>> it correctly.
>>>>
>>>> PS I did not even speak about the FSR.
>>>>
>>>
>>> 1) Your English is far from perfect as you clearly do not understand the
>>> repeated requests *NOT* to send us double spaced crap via google groups.
>>>
>>> 2) You can't speak about the FSR as you know precisely nothing about it,
>>> but as they say, ignorance is bliss.
>>>
>>
>> As annoying as baseless claims against the FSR were, wxjmafauth is
>> right: he didn't even mention the FSR in this thread. There's really no
>> point dragging this thread into that territory.
>>
>> --Ned.
>>
>
> He's quite deliberately dragged it up by using p.s. Without doubt he's
> the worst loser in the world and I'm *NOT* stopping getting at him. I
> find his behaviour, continuously and groundlessly insulting the Python
> core developers, quite disgusting.
>

His PS is in reference to you, Ethan, and Tim reminiscing about his past
complaints against the FSR. He made three posts to this thread before
you started in on him, and none of them mentioned the FSR. Tim first
mentioned it.

There's no need to call him "the worst loser in the world." Nothing
good will come from that kind of attack. It doesn't make this community
better, and it will not change his behavior.

He said nothing in this thread that insulted the Python core developers.
His posts in this thread are not about the FSR, and yet you dragged the
old fights into it. You are being the troll here.

--Ned.

Terry Reedy

unread,
Dec 2, 2013, 3:26:09 PM12/2/13
to pytho...@python.org
On 12/2/2013 10:45 AM, Mark Lawrence wrote:

> the worst loser in the world

Mark, I consider your continual direct personal attacks on other posters
to be a violation of the PSF Code of Conduct, which *does* apply to
python-list. Please stop.

--
Terry Jan Reedy, one of multiple list moderators

Mark Lawrence

unread,
Dec 2, 2013, 3:45:10 PM12/2/13
to pytho...@python.org
On 02/12/2013 20:26, Terry Reedy wrote:
> On 12/2/2013 10:45 AM, Mark Lawrence wrote:
>
>> the worst loser in the world
>
> Mark, I consider your continual direct personal attacks on other posters
> to be a violation of the PSF Code of Conduct, which *does* apply to
> python-list. Please stop.
>

The attacks that "Joseph McCarthy" has been launching on the core
developers for the last 15 months are in my view now perfectly
acceptable. This is excellent news. Everybody can now say what they
like about the core developers and there's no comeback.

You can also stuff the code of conduct, it's quite clearly only brought
into play when it suits. Never, ever aim it at somebody who goes out of
their way to stir things up, always target it at the people who fight
back *IS THE RULE HERE*.

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

Ethan Furman

unread,
Dec 2, 2013, 3:38:09 PM12/2/13
to pytho...@python.org
On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
>
> Out of the nine tests, Python 3.3 passes six, with three tests being
> failures or dubious. If you believe that the native string type should
> operate on code-points, then you'll think that Python does the right
> thing.

I think Python is doing it correctly. If I want to operate on "clusters" I'll normalize the string first.

Thanks for this excellent post.

--
~Ethan~

Ned Batchelder

unread,
Dec 2, 2013, 4:14:13 PM12/2/13
to pytho...@python.org
This is where my knowledge about Unicode gets fuzzy. Isn't it the case
that some grapheme clusters (or whatever the right word is) can't be
normalized down to a single code point? Characters can accept many
accents, for example. In that case, you can't always normalize and use
the existing string methods, but would need more specialized code.

--Ned.

Chris Angelico

unread,
Dec 2, 2013, 4:23:02 PM12/2/13
to pytho...@python.org
On Tue, Dec 3, 2013 at 8:14 AM, Ned Batchelder <n...@nedbatchelder.com> wrote:
> This is where my knowledge about Unicode gets fuzzy. Isn't it the case that
> some grapheme clusters (or whatever the right word is) can't be normalized
> down to a single code point? Characters can accept many accents, for
> example.

You can't normalize everything down to a single code point, but you
can normalize the other way by breaking out everything that can be
broken out.

>>> print(ascii(unicodedata.normalize("NFKC", "ä")))
'\xe4'
>>> print(ascii(unicodedata.normalize("NFKD", "ä")))
'a\u0308'

ChrisA

MRAB

unread,
Dec 2, 2013, 4:27:23 PM12/2/13
to pytho...@python.org
On 02/12/2013 21:14, Ned Batchelder wrote:
> On 12/2/13 3:38 PM, Ethan Furman wrote:
>> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
>>>
>>> Out of the nine tests, Python 3.3 passes six, with three tests being
>>> failures or dubious. If you believe that the native string type should
>>> operate on code-points, then you'll think that Python does the right
>>> thing.
>>
>> I think Python is doing it correctly. If I want to operate on
>> "clusters" I'll normalize the string first.
>>
>> Thanks for this excellent post.
>>
>> --
>> ~Ethan~
>
> This is where my knowledge about Unicode gets fuzzy. Isn't it the case
> that some grapheme clusters (or whatever the right word is) can't be
> normalized down to a single code point? Characters can accept many
> accents, for example. In that case, you can't always normalize and use
> the existing string methods, but would need more specialized code.
>
A better way of saying it is that there are codepoints for some grapheme
clusters. Those 'precomposed' codepoints exist because some legacy
character sets contained them, and having a one-to-one mapping
encouraged Unicode's adoption.
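A quick illustration of Ned's point, using a base letter that has no
precomposed form with that accent (an invented example):

py> import unicodedata
py> s = 'q\u0302'    # q + COMBINING CIRCUMFLEX ACCENT: no precomposed code point exists
py> len(unicodedata.normalize('NFC', s))
2
py> len(unicodedata.normalize('NFC', 'e\u0301'))    # é does have a precomposed form
1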

Ned Batchelder

unread,
Dec 2, 2013, 4:44:03 PM12/2/13
to pytho...@python.org
On 12/2/13 3:45 PM, Mark Lawrence wrote:
> On 02/12/2013 20:26, Terry Reedy wrote:
>> On 12/2/2013 10:45 AM, Mark Lawrence wrote:
>>
>>> the worst loser in the world
>>
>> Mark, I consider your continual direct personal attacks on other posters
>> to be a violation of the PSF Code of Conduct, which *does* apply to
>> python-list. Please stop.
>>
>
> The attacks that "Joseph McCarthy" has been launching on the core
> developers for the last 15 months are in my view now perfectly
> acceptable. This is excellent news. Everybody can now say what they
> like about the core developers and there's no comeback.
>
> You can also stuff the code of conduct, it's quite clearly only brought
> into play when it suits. Never, ever aim it at somebody who goes out of
> their way to stir things up, always target it at the people who fight
> back *IS THE RULE HERE*.
>

The point is that in this thread, no one was making attacks on core
developers. You were bringing up old animosity here for no reason at
all, and making them personal attacks to boot.

I don't see how you think wxjmfauth was "going out of his way to stir
things up" in *this* thread. He made three comments, none of which
mentioned the FSR or any other controversial topic. Can't we respond to
the content of posts, and not to past offenses by the poster?

Additionally, wxjmfauth's past complaints about the flexible string
representation were not personal. He didn't say, "Joe Smith is the
worst loser in the world for writing the FSR". He complained about a
feature of CPython, baselessly, but he never attacked the people doing
the work. His continued complaints were aggravating, I agree. I don't
know that they rose to the level of "disrespectful".

I know that your behavior here is disrespectful.

As to when the code of conduct is brought up, it's only fairly recently
that it has been mentioned in this forum. There have clearly been posts
in recent memory (the last year) which could have been examined in light
of the code of conduct, and were not. I think we are using it more
uniformly now. You helped me realize better how to apply it to this
forum, and I thank you for that. I welcome your help in applying it
better still. But it applies to you as well and I don't think it's too
much to ask that you abide by it.

The way to improve this list is to respectfully point to and demonstrate
community norms and ask people to conform to them. Spewing vitriol
isn't going to fix anything.

--Ned.


Ethan Furman

unread,
Dec 2, 2013, 4:27:08 PM12/2/13
to pytho...@python.org
On 12/02/2013 01:23 PM, Chris Angelico wrote:
> On Tue, Dec 3, 2013 at 8:14 AM, Ned Batchelder <n...@nedbatchelder.com> wrote:
>> This is where my knowledge about Unicode gets fuzzy. Isn't it the case that
>> some grapheme clusters (or whatever the right word is) can't be normalized
>> down to a single code point? Characters can accept many accents, for
>> example.
>
> You can't normalize everything down to a single code point, but you
> can normalize the other way by breaking out everything that can be
> broken out.
>
>>>> print(ascii(unicodedata.normalize("NFKC", "ä")))
> '\xe4'
>>>> print(ascii(unicodedata.normalize("NFKD", "ä")))
> 'a\u0308'

Well, Steven was right then! There's room for a library to handle this situation. Or is there one already?

--
~Ethan~

Ethan Furman

unread,
Dec 2, 2013, 4:25:55 PM12/2/13
to pytho...@python.org
On 12/02/2013 12:45 PM, Mark Lawrence wrote:
> On 02/12/2013 20:26, Terry Reedy wrote:
>> On 12/2/2013 10:45 AM, Mark Lawrence wrote:
>>
>>> the worst loser in the world
>>
>> Mark, I consider your continual direct personal attacks on other posters
>> to be a violation of the PSF Code of Conduct, which *does* apply to
>> python-list. Please stop.
>
> The attacks that "Joseph McCarthy" has been launching on the core developers for the last 15 months are in my view now
> perfectly acceptable. This is excellent news. Everybody can now say what they like about the core developers and
> there's no comeback.
>
> You can also stuff the code of conduct, it's quite clearly only brought into play when it suits. Never, ever aim it at
> somebody who goes out of their way to stir things up, always target it at the people who fight back *IS THE RULE HERE*.

Mark, I sympathize with your feelings. jmf is certainly a troll, and it doesn't feel like anything has been, or is
being, done about that situation (or for that matter, the help vampire situation... although I haven't seen any threads
from that one lately -- did he give up, or has he been moderated away?). However, I would suggest that when you are
venting, you write the email and then just delete it. I personally don't mind the light and humorous posts, but when
the name-calling starts it makes the list an unfriendly place to be. And, to be clear, the coddling of trolls and
help-vampires also makes the list an unfriendly place to be.

Terry, would it be appropriate to share some of what the moderators do do for us on this list and the others? And what
does the Code of Conduct have to say about trolls and help-vampires?

--
~Ethan~

Mark Lawrence

unread,
Dec 2, 2013, 5:04:34 PM12/2/13
to pytho...@python.org
I deleted the first really spiteful reply, but the hypocrisy that
continues to be shown gets right up both of my nostrils, hence I
couldn't resist the above, greatly toned down response. This will
surely give an indication of how strongly I feel on issues such as this.
Rules are rules to be applied evenly, not on a pick and choose basis.

Ned Batchelder

unread,
Dec 2, 2013, 5:23:35 PM12/2/13
to pytho...@python.org
We have pointed help-vampires at the Code of Conduct:
https://mail.python.org/pipermail/python-list/2013-November/660343.html

He's also banned from the mailing list, which reduces the number of
people who see his questions, and helps keep threads from exploding. For
example, this message to the newsgroup
https://groups.google.com/d/msg/comp.lang.python/fdhF_Fr4fX0/9B0iK8jGigkJ (sorry
for the groups link, didn't know how else to link to a post) doesn't
appear at all in the mailing list, and therefore, in gmane.

But the mailing list ban isn't why you aren't seeing posts from him: he
hasn't posted again since that linked message, on Nov 21.

I think he's not posting in part because we adopted a uniform stance of
politely refusing to answer his questions, or even completely ignoring
his questions.

Of course, he could be back at any time. I hope we'll continue to
present a calm unified front.

--Ned.

Ned Batchelder

unread,
Dec 2, 2013, 5:24:37 PM12/2/13
to pytho...@python.org
On 12/2/13 4:44 PM, Ned Batchelder wrote:
> On 12/2/13 3:45 PM, Mark Lawrence wrote:
>> On 02/12/2013 20:26, Terry Reedy wrote:
>>> On 12/2/2013 10:45 AM, Mark Lawrence wrote:
>>>
>>>> the worst loser in the world
>>>
>>> Mark, I consider your continual direct personal attacks on other posters
>>> to be a violation of the PSF Code of Conduct, which *does* apply to
>>> python-list. Please stop.
>>>
>>
>> The attacks that "Joseph McCarthy" has been launching on the core
>> developers for the last 15 months are in my view now perfectly
>> acceptable. This is excellent news. Everybody can now say what they
>> like about the core developers and there's no comeback.
>>
>> You can also stuff the code of conduct, it's quite clearly only brought
>> into play when it suits. Never, ever aim it at somebody who goes out of
>> their way to stir things up, always target it at the people who fight
>> back *IS THE RULE HERE*.
>>
>
BTW: I think Mark has kill-filed me, so if anyone agrees enough with me
here to want Mark to see it, someone else will have to respond before he
gets the text.

--Ned.

Mark Lawrence

unread,
Dec 2, 2013, 5:32:09 PM12/2/13
to pytho...@python.org
I've kill-filed you on my personal email address, which I asked you
specifically *NOT* to message me on. You completely ignored that
request. FTR you're only the second person I've ever done that to, the
other being a pot smoking hippy who thankfully hasn't been seen for years.

Ned Batchelder

unread,
Dec 2, 2013, 5:53:28 PM12/2/13
to pytho...@python.org
Yes, I've apologized for that faux pas. I hope that you can forgive me.
Someday I hope to understand why it angered you so much. Good to hear
that we can communicate here.

--Ned.

Ben Finney

unread,
Dec 2, 2013, 5:56:57 PM12/2/13
to pytho...@python.org
Ned Batchelder <n...@nedbatchelder.com> writes:

> This is where my knowledge about Unicode gets fuzzy. Isn't it the
> case that some grapheme clusters (or whatever the right word is) can't
> be normalized down to a single code point? Characters can accept many
> accents, for example.

That's true, but doesn't affect the point being made: that one can have
both “sequence of Unicode code points” in Python's ‘unicode’ (now ‘str’)
type, and also deal with “sequence of text the reader will see”.

> In that case, you can't always normalize and use the existing string
> methods, but would need more specialized code.

Specialised code may not be needed. It will at least be true that “any
two code-point sequences which normalise to the same value will be
visually the same for the reader”, which is an important assertion for
addressing the complaints from Mortoray's article.
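
For example (a quick sketch, using nothing beyond the standard
unicodedata module):

import unicodedata

decomposed = "noe\u0308l"   # 'e' followed by U+0308 COMBINING DIAERESIS
composed = "no\u00ebl"      # single code point U+00EB

print(decomposed == composed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(len(unicodedata.normalize("NFC", decomposed)))         # 4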

--
\ “Pray, v. To ask that the laws of the universe be annulled in |
`\ behalf of a single petitioner confessedly unworthy.” —Ambrose |
_o__) Bierce, _The Devil's Dictionary_, 1906 |
Ben Finney

Ben Finney

unread,
Dec 2, 2013, 6:11:41 PM12/2/13
to pytho...@python.org
Mark Lawrence <bream...@yahoo.co.uk> writes:

> […] the hypocrisy that continues to be shown gets right up both of my
> nostrils, hence I couldn't resist the above, greatly toned down
> response. This will surely give an indication of how strongly I feel
> on issues such as this. Rules are rules to be applied evenly, not on a
> pick and choose basis.

This forum doesn't have authorised moderators, and we don't have a body
of state employees charged with meting out justice evenly to all
parties. If you perceive uneven application of our code of conduct, that
will go a long way to explaining it.

What we do have is a community of volunteers whom we expect to both
uphold the code of conduct and self-apply it to the extent feasible.

This works only if we acknowledge both that we are human and will be
inconsistent and make errors, and conversely that what we *intend* to do
matters less than the actual and potential effects of our actions.

Anyone who feels compelled to be vitriolic here needs to find a way to
stop it, regardless how they perceive the treatment of others. We all
need each other's efforts to keep this community healthy.

--
\ “I don't know half of you half as well as I should like, and I |
`\ like less than half of you half as well as you deserve.” —Bilbo |
_o__) Baggins |
Ben Finney

Ethan Furman

unread,
Dec 2, 2013, 5:41:14 PM12/2/13
to pytho...@python.org
On 12/02/2013 02:32 PM, Mark Lawrence wrote:
>
> ... the other being a pot smoking hippy who ...

Please trim your posts. You comment a lot on people sending double-spaced google posts -- not trimming is nearly as bad.

The above is a good example of unnecessary name calling.

I value your good posts. Please keep a light-hearted and respectful tone. When light-hearted doesn't cut it, you can
still be respectful (of the other readers, even if the offender doesn't deserve it).

--
~Ethan~

Roy Smith

unread,
Dec 2, 2013, 8:38:35 PM12/2/13
to
In article <mailman.3485.1386021...@python.org>,
Mark Lawrence <bream...@yahoo.co.uk> wrote:

> My fellow Pythonistas, ask not what our language can do for you, ask
> what you can do for our language.

"I believe that Pythonistas should commit themselves to achieving the
goal, before this decade is out, of making Python 3 the default version
and having everybody be cool with unicode."

Ethan Furman

unread,
Dec 2, 2013, 8:56:14 PM12/2/13
to pytho...@python.org
On 12/02/2013 05:38 PM, Roy Smith wrote:
> Mark Lawrence wrote:
>>
>> My fellow Pythonistas, ask not what our language can do for you, ask
>> what you can do for our language.
>
> "I believe that Pythonistas should commit themselves to achieving the
> goal, before this decade is out, of making Python 3 the default version
> and having everybody be cool with unicode."

Hear, Hear!

+1000! :D

--
~Ethan~

Terry Reedy

unread,
Dec 2, 2013, 10:22:21 PM12/2/13
to pytho...@python.org
On 12/2/2013 4:25 PM, Ethan Furman wrote:
> jmf is certainly a troll

No, he is a person who discovered a minor performance regression in the
FSR, which we fixed. Unfortunately, he then continued for a year with a
strange troll-like anti-FSR crusade. But his posts in the Unicode
handling thread were not part of that. It seems to me that continually
beating someone over the head with the past discourages changed
behavior. To me, the point of asking someone to 'stop' is to persuade
them to stop. The reward for stopping should be to let the issue go.

> haven't seen any threads from that one lately -- did he give up, or has
> he been moderated away?).

Action was taken, including changing the usenet (c.l.p) to mailing-list
gateway. (I already mentioned this twice here.) This was done by one of
the mailman infrastructure people at the request of the list
owner/moderators. The people who stuck their necks out to privately
contact the person in question displeased him and got privately
mail-bombed with repeated insults. I guess he subsequently gave up.

> the coddling of trolls and help-vampires also makes the list an
> unfriendly place to be.

I agree with that as a statement, but not the implication. Was I
hallucinating, or did you not recently participate in the discussion and
decision to stop coddling our most obnoxious 'troll' in the community?

> Terry, would it be appropriate to share some of what the moderators do
> do for us on this list and the others?

Python-list moderators discard perhaps one spam post a day. You already
noticed a recent major benefit.

> And what does the Code of
> Conduct have to say about trolls and help-vampires?

I need to re-read it to really answer that adequately. The term and
defined concept 'help-vampire' is new to me (as of a month ago) and
probably to the CoC writers. However, the behavior strikes me as
disrespectful of the community, and that *is* generically covered.

--
Terry Jan Reedy

Terry Reedy

unread,
Dec 2, 2013, 10:39:53 PM12/2/13
to pytho...@python.org
On 12/2/2013 6:11 PM, Ben Finney wrote:

> This forum doesn't have authorised moderators,

At least some PSF mailing lists have 1 or more PSF-authorized moderators
(currently 4 for python-list) who pretty thanklessly check the initial
posts of new subscribers and posts flagged by the spam detector as
possible spam, or with other problems. We do not have 'every-post'
moderation.

> If you perceive uneven application of our code of conduct,

As far as I know, there has been just one non-spam application of CoC
to python-list: Nikos. I do not see how anyone could call that uneven or
unfair.

--
Terry Jan Reedy

Grant Edwards

unread,
Dec 2, 2013, 11:32:13 PM12/2/13
to
On 2013-12-03, Roy Smith <r...@panix.com> wrote:

> "I believe that Pythonistas should commit themselves to achieving the
> goal, before this decade is out, of making Python 3 the default version
> and having everybody be cool with unicode."

I'm cool with Unicode as long as it "just works" without me ever
having to understand it and I can interact effortlessly with plain old
ASCII files. Every time I start to read anything about Unicode with any
technical detail at all, I start to get dizzy and bleed from the ears.

--
Grant

Ethan Furman

unread,
Dec 2, 2013, 11:11:47 PM12/2/13
to pytho...@python.org
On 12/02/2013 07:22 PM, Terry Reedy wrote:
> On 12/2/2013 4:25 PM, Ethan Furman wrote:
>> jmf is certainly a troll
>
> No, he is a person who discovered a minor performance regression in the FSR, which we fixed. Unfortunately, he then
> continued for a year with a strange troll-like anti-FSR crusade. But his posts in the Unicode handling thread were not
> part of that. It seems to me that continually beating someone over the head with the past discourages changed behavior.
> To me, the point of asking someone to 'stop' is to persuade them to stop. The reward for stopping should be to let the
> issue go.

I remember it slightly differently, but you're right -- we should let it drop.


>> the coddling of trolls and help-vampires also makes the list an
>> unfriendly place to be.
>
> I agree with that as a statement, but not the implication. Was I hallucinating, or did you not recently participate
> in the discussion and decision to stop coddling our most obnoxious 'troll' in the community?

I'm afraid I don't see the point you are trying to make. I'm against coddling those who refuse to learn and to
participate respectfully with the rest of us, and I did vote to stop such coddling [1] of a certain troll. I don't see the discrepancy.

All that aside, thank you to you and the other moderators for your time and efforts.

[1] Coddling can be an offensive word, and I wish to make clear that initial efforts to educate and help newcomers are
appropriate and warranted. However, after some time has passed and the newcomer is no longer a newcomer and is still
exhibiting rude and ignorant behavior, further attempts to help most likely won't, and that is when I would classify
such attempts as coddling.

--
~Ethan~

Steven D'Aprano

unread,
Dec 3, 2013, 12:06:26 AM12/3/13
to
That is correct.

If Unicode had a distinct code point for every possible combination of
base character plus an arbitrary number of diacritics or accents, the
1,114,112 available code points (U+0000 through U+10FFFF) wouldn't be
anywhere near enough.

I see over 300 diacritics used just in the first 5000 code points. Let's
pretend there are only 100, and that you can use at most 5 at a time.
That gives 79375496 combinations per base character, far more than the
total number of Unicode code points.

If anyone wishes to check my logic:

# count distinct combining chars
import unicodedata
s = ''.join(chr(i) for i in range(33, 5000))
s = unicodedata.normalize('NFD', s)
t = [c for c in s if unicodedata.combining(c)]
len(set(t))

# calculate the number of combinations
def comb(r, n):
    """Combinations nCr"""
    p = 1
    for i in range(r+1, n+1):
        p *= i
    for i in range(1, n-r+1):
        p /= i
    return p

sum(comb(i, 100) for i in range(6))
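
(With the float division in comb(), that last expression evaluates to
79375496.0, which is where the 79375496 figure above comes from.)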


I'm not suggesting that all of those accents are necessarily in use in
the real world, but there are languages which construct arbitrary
combinations of accents. (Or so I have been lead to believe.)


--
Steven

Steven D'Aprano

unread,
Dec 3, 2013, 12:41:07 AM12/3/13
to
On Tue, 03 Dec 2013 04:32:13 +0000, Grant Edwards wrote:

> On 2013-12-03, Roy Smith <r...@panix.com> wrote:
>
>> "I believe that Pythonistas should commit themselves to achieving the
>> goal, before this decade is out, of making Python 3 the default version
>> and having everybody be cool with unicode."
>
> I'm cool with Unicode as long as it "just works" without me ever having
> to understand it

That will never happen. Unicode is a bit like floating point maths:
there's always *some* odd corner case that will lead to annoyance and
confusion and even murder:

http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail

And then there are legacy encodings. There are three things in life that
are inevitable: death, taxes, and text with the wrong encoding. Anyone
dealing with text they didn't generate themselves is going to have to
deal with mojibake at some point.

Having said that, if you control the text and always use UTF-8 for
storage and transmission, Unicode isn't that hard. Decode bytes to
Unicode as early as possible, do all your work in text rather than bytes,
then encode back to bytes as late as possible, and you'll be fine.
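
A minimal sketch of that decode-early / encode-late pattern (the file
names here are just placeholders):

with open("input.txt", "rb") as f:
    raw = f.read()                 # bytes come in from the outside world
text = raw.decode("utf-8")         # decode as early as possible
text = text.upper()                # do all the real work on str, not bytes
with open("output.txt", "wb") as f:
    f.write(text.encode("utf-8"))  # encode as late as possible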


> and I can interact effortlessly with plain old ASCII files.

That at least is easy, provided you can guarantee that what you think is
plain ol' ASCII actually is plain ol' ASCII. That isn't as easy as it
sounds, given that an awful lot of people think "extended ASCII" is a
thing and that you ought to be able to deal with it just like ASCII.


> Every time I start to read anything about Unicode with any
> technical detail at all, I start to get dizzy and bleed from the ears.

Heh, the standard certainly covers a lot of ground.


--
Steven

joe

unread,
Dec 3, 2013, 2:35:29 AM12/3/13
to Steven D'Aprano, pytho...@python.org

How would a grapheme library work? Would basic cluster combination be enough, or would other algorithms (line breaking, normalizing to a "canonical" form) also be necessary?

How do people use grapheme clusters in non-rendering situations? Or perhaps here's a better question: does anyone know any non-Latin-script (Japanese and Arabic come to mind) speakers who use Python to process text in their own language? They could perhaps tell us what most bugs them about Python's current API and which standard libraries need work.
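
For what it's worth, a very naive first cut at "basic cluster combination"
-- my own sketch, not an existing library API -- just attaches each
combining mark to the preceding base character, ignoring the real UAX #29
rules (Hangul jamo, ZWJ sequences, regional indicators and so on):

import unicodedata

def naive_clusters(s):
    """Rough approximation of grapheme clustering: glue marks (category
    Mn/Mc/Me) onto the preceding base character."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(naive_clusters("noe\u0308l"))        # 4 clusters; e + U+0308 stay together
print(naive_clusters("noe\u0308l")[::-1])  # reversing no longer splits the accent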

Mark Lawrence

unread,
Dec 3, 2013, 7:14:52 AM12/3/13
to pytho...@python.org
I'm pleased to see that I'm not the only one who suffers in this way :)

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

Mark Lawrence

unread,
Dec 3, 2013, 7:11:50 AM12/3/13
to pytho...@python.org
I like that, thank you.

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

Neil Cerutti

unread,
Dec 3, 2013, 8:47:42 AM12/3/13
to pytho...@python.org
On 2013-12-02, Ethan Furman <et...@stoneleaf.us> wrote:
> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
>> Out of the nine tests, Python 3.3 passes six, with three tests
>> being failures or dubious. If you believe that the native
>> string type should operate on code-points, then you'll think
>> that Python does the right thing.
>
> I think Python is doing it correctly. If I want to operate on
> "clusters" I'll normalize the string first.

Normalizing doesn't resolve the issues the blog brings up; NFC
can't condense every multi-code-point sequence into one, and
normalizing can lose or mangle information. There are good
examples here: http://unicode.org/reports/tr15/
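
A quick illustration (the "q" example is mine; as far as I know there is
no precomposed "q with dot above", so NFC has to leave it as two code
points):

import unicodedata

print(len(unicodedata.normalize("NFC", "e\u0308")))  # 1: composes to U+00EB
print(len(unicodedata.normalize("NFC", "q\u0307")))  # 2: no precomposed form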

> Thanks for this excellent post.

Agreed.

--
Neil Cerutti

Ethan Furman

unread,
Dec 3, 2013, 9:26:45 AM12/3/13
to pytho...@python.org
On 12/02/2013 12:38 PM, Ethan Furman wrote:
> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
>>
>> Out of the nine tests, Python 3.3 passes six, with three tests being
>> failures or dubious. If you believe that the native string type should
>> operate on code-points, then you'll think that Python does the right
>> thing.
>
> I think Python is doing it correctly. If I want to operate on "clusters" I'll normalize the string first.

Hrmm, well, after being educated ;) I think I may have to reverse my position. Given that not every cluster can be
normalized to a single code point perhaps Python is doing it the best possible way. On the other hand, we have a
uni*code* type, not a uni*char* type. Maybe 3.5 can have that. ;)

At any rate, definitely good to be aware of the issue.

--
~Ethan~

wxjm...@gmail.com

unread,
Dec 3, 2013, 1:34:59 PM12/3/13
to
from one of my libs, bmp only

>>> import fourbiunicode5
>>> print(len(fourbiunicode5.AllCombiningMarks))
240


jmf

wxjm...@gmail.com

unread,
Dec 4, 2013, 8:52:34 AM12/4/13
to
------


You intuitively pointed out a very important feature
of "unicode". However, it is not necessary; this is
exactly what unicode does (when used properly).

jmf

Mark Lawrence

unread,
Dec 4, 2013, 9:07:34 AM12/4/13
to pytho...@python.org
On 04/12/2013 13:52, wxjm...@gmail.com wrote:

[snip all the double spaced stuff]

>
> You intuitively pointed out a very important feature
> of "unicode". However, it is not necessary; this is
> exactly what unicode does (when used properly).
>
> jmf
>

Presumably using unicode correctly prevents messages being sent across
the ether with superfluous, extremely irritating double spacing? Or is
that down to poor tools in combination with the ignorance of their users?

Neil Cerutti

unread,
Dec 4, 2013, 9:38:40 AM12/4/13
to pytho...@python.org
On 2013-12-04, wxjm...@gmail.com <wxjm...@gmail.com> wrote:
> Yon intuitively pointed a very important feature of "unicode".
> However, it is not necessary, this is exactly what unicode does
> (when used properly).

Unicode only provides character sets. It's not a natural language
parsing facility.

--
Neil Cerutti
