
'Straße' ('Strasse') and Python 2


wxjm...@gmail.com

Jan 12, 2014, 2:50:55 AM
to
>>> sys.version
2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
>>> s = 'Straße'
>>> assert len(s) == 6
>>> assert s[5] == 'e'
>>>

jmf

Peter Otten

Jan 12, 2014, 3:31:28 AM
to pytho...@python.org
Signifying nothing. (Macbeth)

Python 2.7.2+ (default, Jul 20 2012, 22:15:08)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = "Straße"
>>> assert len(s) == 6
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AssertionError
>>> assert s[5] == "e"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AssertionError


Stefan Behnel

Jan 12, 2014, 4:00:58 AM
to pytho...@python.org
Peter Otten, 12.01.2014 09:31:
> Signifying nothing. (Macbeth)
>
> Python 2.7.2+ (default, Jul 20 2012, 22:15:08)
> [GCC 4.6.1] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> s = "Straße"
> >>> assert len(s) == 6
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> AssertionError
> >>> assert s[5] == "e"
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> AssertionError

The point I think he was trying to make is that Linux is better than
Windows, because the latter fails to fail on these assertions for some reason.

Stefan :o)


Ned Batchelder

Jan 12, 2014, 7:17:18 AM
to pytho...@python.org
On 1/12/14 2:50 AM, wxjm...@gmail.com wrote:
>>>> sys.version
> 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
>>>> s = 'Straße'
>>>> assert len(s) == 6
>>>> assert s[5] == 'e'
>>>>
>
> jmf
>

Dumping random snippets of Python sessions here is useless. If you are
trying to make a point, you have to put some English around it. You
know what is in your head, but we do not.

--
Ned Batchelder, http://nedbatchelder.com

Mark Lawrence

Jan 12, 2014, 7:33:07 AM
to pytho...@python.org
On 12/01/2014 09:00, Stefan Behnel wrote:
> Peter Otten, 12.01.2014 09:31:
>> wxjm...@gmail.com wrote:
>>
>> Signifying nothing. (Macbeth)
>>
>> Python 2.7.2+ (default, Jul 20 2012, 22:15:08)
>> [GCC 4.6.1] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>>>>> s = "Straße"
>>>>> assert len(s) == 6
>> Traceback (most recent call last):
>> File "<stdin>", line 1, in <module>
>> AssertionError
>>>>> assert s[5] == "e"
>> Traceback (most recent call last):
>> File "<stdin>", line 1, in <module>
>> AssertionError
>
> The point I think he was trying to make is that Linux is better than
> Windows, because the latter fails to fail on these assertions for some reason.
>
> Stefan :o)
>
>

The point he's trying to make is that he also reads the pythondev
mailing list, where Steven D'Aprano posted this very example, stating it
is "Python 2 nonsense". Fixed in Python 3. Don't mention... :)

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

MRAB

Jan 12, 2014, 1:33:03 PM
to pytho...@python.org, pytho...@python.org
On 2014-01-12 08:31, Peter Otten wrote:
> Signifying nothing. (Macbeth)
>
> Python 2.7.2+ (default, Jul 20 2012, 22:15:08)
> [GCC 4.6.1] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> s = "Straße"
>>>> assert len(s) == 6
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> AssertionError
>>>> assert s[5] == "e"
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> AssertionError
>
>
The point is that in Python 2 'Straße' is a bytestring and its length
depends on the encoding of the source file. If the source file is UTF-8
then 'Straße' is a string literal with 7 bytes between the single
quotes.
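
For concreteness, a minimal sketch assuming Python 2 and a UTF-8 source file (the commented values follow from the UTF-8 encoding of 'ß'):

# -*- coding: utf-8 -*-
s = 'Straße'       # byte string: under UTF-8, 'ß' is the two bytes \xc3\x9f
u = u'Straße'      # unicode string: 'ß' is the single code point U+00DF
print len(s)       # 7
print len(u)       # 6, regardless of the source encoding
print repr(s)      # 'Stra\xc3\x9fe'
print repr(u)      # u'Stra\xdfe'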

Thomas Rachel

Jan 13, 2014, 3:27:46 AM
to
On 12.01.2014 08:50, wxjm...@gmail.com wrote:
>>>> sys.version
> 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
> >>>> s = 'Straße'
>>>> assert len(s) == 6
>>>> assert s[5] == 'e'
>>>>

Wow. You just found one of the major differences between Python 2 and 3.

Your assertions are just wrong, as s = 'Straße' leads - provided you use
UTF8 - to a representation of 'Stra\xc3\x9fe', obviously leading to a
length of 7.


Thomas

wxjm...@gmail.com

Jan 13, 2014, 4:54:21 AM
to
On Monday, January 13, 2014 09:27:46 UTC+1, Thomas Rachel wrote:
> On 12.01.2014 08:50, wxjm...@gmail.com wrote:
> >>>> sys.version
> > 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
> >>>> s = 'Straße'
> >>>> assert len(s) == 6
> >>>> assert s[5] == 'e'
> >>>>
>
> Wow. You just found one of the major differences between Python 2 and 3.
>
> Your assertions are just wrong, as s = 'Straße' leads - provided you use
> UTF8 - to a representation of 'Stra\xc3\x9fe', obviously leading to a
> length of 7.


Not at all. I'm afraid I'm understanding Python (on this
aspect very well).

Do you belong to this group of people who are naively
writing wrong Python code (usually not properly working)
during more than a decade?

'ß' is the fourth character in that text "Straße"
(base index 0).

These assertions are correct (byte string and unicode).

>>> sys.version
'2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]'
>>> assert 'Straße'[4] == 'ß'
>>> assert u'Straße'[4] == u'ß'
>>>

jmf

PS Nothing to do with Py2/Py3.

Chris Angelico

Jan 13, 2014, 5:26:01 AM
to pytho...@python.org
On Mon, Jan 13, 2014 at 8:54 PM, <wxjm...@gmail.com> wrote:
> These assertions are correct (byte string and unicode).
>
>>>> sys.version
> '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]'
>>>> assert 'Straße'[4] == 'ß'
>>>> assert u'Straße'[4] == u'ß'
>>>>
>
> jmf
>
> PS Nothing to do with Py2/Py3.

This means that either your source encoding happens to include that
character, or you have assertions disabled. It does NOT mean that you
can rely on writing this string out to a file and having someone else
read it in and understand it the same way.
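
A sketch of the explicit round trip that avoids this trap, assuming the data is kept as unicode internally (the file name and the choice of UTF-8 are illustrative only):

# -*- coding: utf-8 -*-
import io

text = u'Straße'
with io.open('example.txt', 'w', encoding='utf-8') as f:
    f.write(text)                  # encode to bytes only at the boundary

with io.open('example.txt', 'r', encoding='utf-8') as f:
    assert f.read() == u'Straße'   # the reader must use the same encoding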

ChrisA

Steven D'Aprano

Jan 13, 2014, 5:38:03 AM
to
On Mon, 13 Jan 2014 01:54:21 -0800, wxjmfauth wrote:

>>>> sys.version
> '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]'
>>>> assert 'Straße'[4] == 'ß'
>>>> assert u'Straße'[4] == u'ß'

I think you are using "from __future__ import unicode_literals".
Otherwise, that cannot happen in Python 2.x. Using a narrow build:


# on my machine "ando"
py> sys.version
'2.7.2 (default, May 18 2012, 18:25:10) \n[GCC 4.1.2 20080704 (Red Hat
4.1.2-52)]'
py> sys.maxunicode
65535
py> assert 'Straße'[4] == 'ß'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AssertionError
py> list('Straße')
['S', 't', 'r', 'a', '\xc3', '\x9f', 'e']


Using a wide build is the same:


# on my machine "orac"
>>> sys.maxunicode
1114111
>>> assert 'Straße'[4] == 'ß'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AssertionError


But once you run the "from __future__" line, the behaviour changes to
what you show:

py> from __future__ import unicode_literals
py> list('Straße')
[u'S', u't', u'r', u'a', u'\xdf', u'e']
py> assert 'Straße'[4] == 'ß'
py>


But I still don't understand the point you are trying to make.



--
Steven

Chris Angelico

Jan 13, 2014, 5:57:28 AM
to pytho...@python.org
On Mon, Jan 13, 2014 at 9:38 PM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> I think you are using "from __future__ import unicode_literals".
> Otherwise, that cannot happen in Python 2.x.
>

Alas, not true.

>>> sys.version
'2.7.4 (default, Apr 6 2013, 19:54:46) [MSC v.1500 32 bit (Intel)]'
>>> sys.maxunicode
65535
>>> assert 'Straße'[4] == 'ß'
>>> list('Straße')
['S', 't', 'r', 'a', '\xdf', 'e']

That's Windows XP. Presumably Latin-1 (or CP-1252, they both have that
char at 0xDF). He happens to be correct, *as long as the source code
encoding matches the output encoding and is one that uses 0xDF to mean
U+00DF*. Otherwise, he's not.

ChrisA

Michael Torrie

Jan 13, 2014, 10:58:50 AM
to pytho...@python.org
On 01/13/2014 02:54 AM, wxjm...@gmail.com wrote:
> Not at all. I'm afraid I'm understanding Python (on this
> aspect very well).

Are you sure about that? Seems to me you're still confused as to the
difference between unicode and encodings.

>
> Do you belong to this group of people who are naively
> writing wrong Python code (usually not properly working)
> during more than a decade?
>
> 'ß' is the fourth character in that text "Straße"
> (base index 0).
>
> These assertions are correct (byte string and unicode)

How can they be? They are only true for the default encoding and
character set you are using, which happens to have 'ß' as a single byte.
Hence your little python 2.7 snippet is not using unicode at all, in
any form. It's using a non-unicode character set. There are methods
which can decode your character set to unicode and encode from unicode.
But let's be clear. Your byte streams are not unicode!

If the default byte encoding is UTF-8, which uses a variable number of
bytes per character, your assertions are completely wrong. Maybe it's
time you stopped programming in Windows and use OS X or Linux which
throw out the random single-byte character sets and instead provide a
UTF-8 terminal environment to support non-latin characters.
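
A small sketch of the decode/encode boundary described above, assuming the bytes really are UTF-8 (Python 2):

raw = 'Stra\xc3\x9fe'               # 7 bytes of UTF-8 data, not unicode
text = raw.decode('utf-8')          # now a unicode object
assert len(text) == 6
assert text[4] == u'\xdf'           # U+00DF, LATIN SMALL LETTER SHARP S
assert text.encode('utf-8') == raw  # back to the original bytes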

>
>>>> sys.version
> '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]'
>>>> assert 'Straße'[4] == 'ß'
>>>> assert u'Straße'[4] == u'ß'

wxjm...@gmail.com

Jan 13, 2014, 11:24:09 AM
to
You are right. It's on Windows. It is only showing how
Python can be a holy mess.

The funny aspect is when I'm reading "*YOUR* assertions
are false" when I'm presenting *PYTHON* assertions!

jmf

Mark Lawrence

Jan 13, 2014, 12:02:20 PM
to pytho...@python.org
On 13/01/2014 16:24, wxjm...@gmail.com wrote:
>
> You are right. It's on Windows. It is only showing how
> Python can be a holy mess.
>

Regarding unicode, Python 2 was a holy mess; fixed in Python 3.

Thomas Rachel

Jan 13, 2014, 1:37:23 PM
to
On 13.01.2014 10:54, wxjm...@gmail.com wrote:

> Not at all. I'm afraid I'm understanding Python (on this
> aspect very well).

IBTD.

> Do you belong to this group of people who are naively
> writing wrong Python code (usually not properly working)
> during more than a decade?

Why should I be?

> 'ß' is the fourth character in that text "Straße"
> (base index 0).

Character-wise, yes. But not byte-string-wise. In a byte string, this
depends on the character set used.

On CP 437, 850, 12xx (whatever Windows uses) or latin1, you are right,
but not on the widely used UTF8.

>>>> sys.version
> '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]'
>>>> assert 'Straße'[4] == 'ß'
>>>> assert u'Straße'[4] == u'ß'

Linux box at home:

Python 2.7.3 (default, Apr 14 2012, 08:58:41) [GCC] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> assert 'Straße'[4] == 'ß'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AssertionError
>>> assert u'Straße'[4] == u'ß'

Python 3.3.0 (default, Oct 01 2012, 09:13:30) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> assert 'Straße'[4] == 'ß'
>>> assert u'Straße'[4] == u'ß'

Windows box at work:

Python 2.7.5 (default, May 15 2013, 22:44:16) [MSC v.1500 64 bit
(AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> assert 'Straße'[4] == 'ß'
>>> assert u'Straße'[4] == u'ß'

> PS Nothing to do with Py2/Py3.

As bytes and unicode and str stuff is heavily changed between them, of
course it has to do.

And I think you know that and try to confuse and FUD us all - to no avail.


Thomas

Terry Reedy

Jan 13, 2014, 6:05:04 PM
to pytho...@python.org
On 1/13/2014 4:54 AM, wxjm...@gmail.com wrote:

> I'm afraid I'm understanding Python (on this
> aspect very well).

Really?

> Do you belong to this group of people who are naively
> writing wrong Python code (usually not properly working)
> during more than a decade?

To me, the important question is whether this and previous similar posts
are intentional trolls designed to stir up the flurry of responses they
get or 'innocently' misleading or even erroneous. If your claim of
understanding Python and Unicode is true, then this must be a troll
post. Either way, please desist, or your access to python-list from
google-groups may be removed.

> 'ß' is the fourth character in that text "Straße"
> (base index 0).

As others have said, in the *unicode* text "Straße", 'ß' is the fifth
character, at character index 4, ...

> These assertions are correct (byte string and unicode).

whereas, when the text is encoded into bytes, the byte index depends on
the encoding and the assertion that it is always 4 is incorrect. Did you
know this or were you truly ignorant?

>>>> sys.version
> '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]'
>>>> assert 'Straße'[4] == 'ß'

Sometimes true, sometimes not.

>>>> assert u'Straße'[4] == u'ß'

> PS Nothing to do with Py2/Py3.

This issue has everything to do with Py2, where 'Straße' is encoded
bytes, versus Py3, where 'Straße' is unicode text where each character
of that word takes one code unit, whether each is 2 bytes or 4 bytes.

If you replace 'ß' with any astral (non-BMP) character, this issue
appears even for unicode text in 3.2-, where an astral character
requires 2, not 1, code units on narrow builds, thereby screwing up
indexing, just as can happen for encoded bytes. In 3.3+, all characters
use 1 code unit and indexing (and slicing) always works properly. This
is another unicode issue where you appear not to understand, but might
just be trolling.
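
A sketch of the astral-character case; the commented values assume a 3.3+ (or wide) build versus a 2.x/3.2 narrow build, and the particular character is just an example:

import sys
emoji = u'\U0001F600'        # an astral (non-BMP) character
print(sys.maxunicode)        # 1114111 on 3.3+/wide builds, 65535 on narrow builds
print(len(emoji))            # 1 on 3.3+/wide builds, 2 on narrow builds
print(len(u'a' + emoji))     # 2 on 3.3+, 3 on a narrow build, so indexing shifts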

--
Terry Jan Reedy



Robin Becker

Jan 15, 2014, 7:00:51 AM
to pytho...@python.org
On my utf8 based system


> robin@everest ~:
> $ cat ooo.py
> if __name__=='__main__':
>     import sys
>     s='A̅B'
>     print('version_info=%s\nlen(%s)=%d' % (sys.version_info,s,len(s)))
> robin@everest ~:
> $ python ooo.py
> version_info=sys.version_info(major=3, minor=3, micro=3, releaselevel='final', serial=0)
> len(A̅B)=3
> robin@everest ~:
> $


so two 'characters' are 3 (or 2 or more) codepoints. If I want to isolate
so-called graphemes I need an algorithm even for python's unicode, i.e. when it
really matters, python3 str is just another encoding.
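
A rough sketch of such an algorithm using only the standard library (Python 3; this merely glues combining marks to their base character, a simplification of the full grapheme-cluster rules in UAX #29):

import unicodedata

def rough_graphemes(s):
    # naive: attach each combining mark to the preceding base character
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

s = 'A\u0305B'                 # the 'A̅B' above: 3 codepoints, 2 perceived characters
assert len(s) == 3
assert rough_graphemes(s) == ['A\u0305', 'B']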
--
Robin Becker

Ned Batchelder

Jan 15, 2014, 7:13:36 AM
to pytho...@python.org
On 1/15/14 7:00 AM, Robin Becker wrote:
> On 12/01/2014 07:50, wxjm...@gmail.com wrote:
> On my utf8 based system
>
>
>> robin@everest ~:
>> $ cat ooo.py
>> if __name__=='__main__':
>>     import sys
>>     s='A̅B'
>>     print('version_info=%s\nlen(%s)=%d' % (sys.version_info,s,len(s)))
>> robin@everest ~:
>> $ python ooo.py
>> version_info=sys.version_info(major=3, minor=3, micro=3,
>> releaselevel='final', serial=0)
>> len(A̅B)=3
>> robin@everest ~:
>> $
>
>
> so two 'characters' are 3 (or 2 or more) codepoints. If I want to
> isolate so called graphemes I need an algorithm even for python's
> unicode ie when it really matters, python3 str is just another encoding.

You are right that more than one codepoint makes up a grapheme, and that
you'll need code to deal with the correspondence between them. But let's
not muddy these already confusing waters by referring to that mapping as
an encoding.

In Unicode terms, an encoding is a mapping between codepoints and bytes.
Python 3's str is a sequence of codepoints.
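
A sketch of that distinction for the string above (Python 3; the byte counts follow from the UTF-8 and UTF-32 codecs):

s = 'A\u0305B'                  # the same 'A̅B'
print(len(s))                   # 3 codepoints
print(len(s.encode('utf-8')))   # 4 bytes: U+0305 encodes to two bytes
print(len(s.encode('utf-32')))  # 16 bytes: a BOM plus three 4-byte code units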

Robin Becker

Jan 15, 2014, 7:50:10 AM
to pytho...@python.org
On 15/01/2014 12:13, Ned Batchelder wrote:
........
>> On my utf8 based system
>>
>>
>>> robin@everest ~:
>>> $ cat ooo.py
>>> if __name__=='__main__':
>>>     import sys
>>>     s='A̅B'
>>>     print('version_info=%s\nlen(%s)=%d' % (sys.version_info,s,len(s)))
>>> robin@everest ~:
>>> $ python ooo.py
>>> version_info=sys.version_info(major=3, minor=3, micro=3,
>>> releaselevel='final', serial=0)
>>> len(A̅B)=3
>>> robin@everest ~:
>>> $
>>
>>
........
> You are right that more than one codepoint makes up a grapheme, and that you'll
> need code to deal with the correspondence between them. But let's not muddy
> these already confusing waters by referring to that mapping as an encoding.
>
> In Unicode terms, an encoding is a mapping between codepoints and bytes. Python
> 3's str is a sequence of codepoints.
>
Semantics is everything. For me graphemes are the endpoint (or should be); to
get a proper rendering of a sequence of graphemes I can use either a sequence of
bytes or a sequence of codepoints. They are both encodings of the graphemes;
what unicode says is an encoding doesn't define what encodings are, i.e. mappings
from some source alphabet to a target alphabet.
--
Robin Becker

wxjm...@gmail.com

Jan 15, 2014, 9:55:45 AM
to
On Wednesday, January 15, 2014 13:13:36 UTC+1, Ned Batchelder wrote:

>
> ... more than one codepoint makes up a grapheme ...

No

> In Unicode terms, an encoding is a mapping between codepoints and bytes.

No

jmf

Chris Angelico

Jan 15, 2014, 10:14:38 AM
to pytho...@python.org
On Thu, Jan 16, 2014 at 1:55 AM, <wxjm...@gmail.com> wrote:
> On Wednesday, January 15, 2014 13:13:36 UTC+1, Ned Batchelder wrote:
>
>>
>> ... more than one codepoint makes up a grapheme ...
>
> No

Yes.
http://www.unicode.org/faq/char_combmark.html

>> In Unicode terms, an encoding is a mapping between codepoints and bytes.
>
> No

Yes.
http://www.unicode.org/reports/tr17/
Specifically:
"Character Encoding Form: a mapping from a set of nonnegative integers
that are elements of a CCS to a set of sequences of particular code
units of some specified width, such as 32-bit integers"

Or are you saying that www.unicode.org is wrong about the definitions
of Unicode terms?

ChrisA

Travis Griggs

Jan 15, 2014, 11:28:49 AM
to pytho...@python.org
But you’re talking about two levels of encoding. One runs on top of the other. So insisting that you be able to call them all encodings, makes the term pointless, because now it’s ambiguous as to what you’re referring to. Are you referring to encoding in the sense of representing code points with bytes? Or are you referring to what the unicode guys call “forms”?

For example, the NFC form of 'ñ' is '\u00F1'. The NFD form represents the exact same grapheme, but is '\u006e\u0303'. You can call them encodings if you want, but I echo Ned's sentiment that you keep that to yourself. Conventionally, they're different forms, not different encodings. You can encode either form with an encoding, e.g.

'\u00F1'.encode('utf8')
'\u00F1'.encode('utf16')

'\u006e\u0303'.encode('utf8')
'\u006e\u0303'.encode('utf16')
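
A sketch tying the two levels together with the standard library (Python 3; the code points are the ones quoted above):

import unicodedata

nfc = '\u00F1'            # 'ñ' as one precomposed code point
nfd = '\u006e\u0303'      # 'n' followed by COMBINING TILDE
assert unicodedata.normalize('NFD', nfc) == nfd
assert unicodedata.normalize('NFC', nfd) == nfc

# same grapheme, different forms, hence different byte encodings
assert nfc.encode('utf8') == b'\xc3\xb1'
assert nfd.encode('utf8') == b'n\xcc\x83'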

Robin Becker

Jan 15, 2014, 11:55:31 AM
to pytho...@python.org
On 15/01/2014 16:28, Travis Griggs wrote:
>> ........ of a sequence of graphemes I can use either a sequence of bytes or a
>> sequence of codepoints. They are both encodings of the graphemes; what unicode
>> says is an encoding doesn't define what encodings are ie mappings from some
>> source alphabet to a target alphabet.
>
> But you’re talking about two levels of encoding. One runs on top of the other. So insisting that you be able to call them all encodings, makes the term pointless, because now it’s ambiguous as to what you’re referring to. Are you referring to encoding in the sense of representing code points with bytes? Or are you referring to what the unicode guys call “forms”?
>
> For example, the NFC form of ‘ñ’ is ’\u00F1’. ‘nThe NFD form represents the exact same grapheme, but is ‘\u006e\u0303’. You can call them encodings if you want, but I echo Ned’s sentiment that you keep that to yourself. Conventionally, they’re different forms, not different encodings. You can encode either form with an encoding, e.g.
>
> '\u00F1'.encode('utf8’)
> '\u00F1'.encode('utf16’)
>
> '\u006e\u0303'.encode('utf8’)
> '\u006e\u0303'.encode('utf16')
>

I think about these as encodings, because that's what they are mathematically,
logically & practically. I can encode the target grapheme sequence as a sequence
of bytes using a particular 'unicode encoding' eg utf8 or a sequence of code points.

The fact that unicoders want to take over the meaning of encoding is not relevant.

In my utf8 bash shell the python print() takes one encoding (python3 str) and
translates that to the stdout encoding which happens to be utf8 and passes that
to the shell which probably does a lot of work to render the result as graphical
symbols (or graphemes).

I'm not anti unicode, that's just an assignment of identity to some symbols.
Coding the values of the ids is a separate issue. It's my belief that we don't
need more than the byte level encoding to represent unicode. One of the claims
made for python3 unicode is that it somehow eliminates the problems associated
with other encodings eg utf8, but in fact they will remain until we force
printers/designers to stop using complicated multi-codepoint graphemes. I
suspect that won't happen.
--
Robin Becker

Chris Angelico

Jan 15, 2014, 12:14:57 PM
to pytho...@python.org
On Thu, Jan 16, 2014 at 3:55 AM, Robin Becker <ro...@reportlab.com> wrote:
> I think about these as encodings, because that's what they are
> mathematically, logically & practically. I can encode the target grapheme
> sequence as a sequence of bytes using a particular 'unicode encoding' eg
> utf8 or a sequence of code points.

By that definition, you can equally encode it as a bitmapped image, or
as a series of lines and arcs, and those are equally well "encodings"
of the character. This is not the normal use of that word.

http://en.wikipedia.org/wiki/Character_encoding

ChrisA

Robin Becker

Jan 15, 2014, 12:28:53 PM
to pytho...@python.org
Actually I didn't use the term 'character encoding', but that doesn't alter the
argument. If I chose to embed the final graphemes as images encoded as bytes or
lists of numbers that would still be an encoding; it just wouldn't be
very easily usable (lots of typing).
--
Robin Becker

Ian Kelly

Jan 15, 2014, 1:32:19 PM
to Python
On Wed, Jan 15, 2014 at 9:55 AM, Robin Becker <ro...@reportlab.com> wrote:
> The fact that unicoders want to take over the meaning of encoding is not
> relevant.

A virus is a small infectious agent that replicates only inside the
living cells of other organisms. In the context of computing however,
that definition is completely false, and if you insist upon it when
trying to talk about computers, you're only going to confuse people as
to what you mean. Somehow, I haven't seen any biologists complaining
that computer users want to take over the meaning of virus.

Terry Reedy

Jan 15, 2014, 7:27:35 PM
to pytho...@python.org
On 1/15/2014 11:55 AM, Robin Becker wrote:

> The fact that unicoders want to take over the meaning of encoding is not
> relevant.

I agree with you that 'encoding' should not be limited to 'byte encoding
of a (subset of) unicode characters'. For instance, .jpg and .png are
byte encodings of images. On the other hand, it is common in human
discourse to omit qualifiers in particular contexts. 'Computer virus'
gets condensed to 'virus' in computer contexts.

The problem with graphemes is that there is no fixed set of unicode
graphemes. Which is to say, the effective set of graphemes is
context-specific. Just limiting ourselves to English, 'fi' is usually 2
graphemes when printing to screen, but often just one when printing to
paper. This is why the Unicode consortium punted 'graphemes' to
'application' code.

> I'm not anti unicode, that's just an assignment of identity to some
> symbols. Coding the values of the ids is a separate issue. It's my
> belief that we don't need more than the byte level encoding to represent
> unicode. One of the claims made for python3 unicode is that it somehow
> eliminates the problems associated with other encodings eg utf8,

The claim is true for the following problems of the way-too-numerous
unicode byte encodings.

Subsetting: only a subset of characters can be encoded.

Shifting: the meaning of a byte depends on a preceding shift character,
which might be as far back as the beginning of the sequence.

Varying size: the number of bytes to encode a character depends on the
character.

Both of the last two problems can turn O(1) operations into O(n)
operations. 3.3+ eliminates all these problems.
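
A sketch of the varying-size point (Python 3.3+; the sample strings are arbitrary and the byte counts follow from UTF-8):

samples = ['abc', 'Straße', '\u65e5\u672c\u8a9e']   # ASCII, Latin-1 range, CJK
for text in samples:
    data = text.encode('utf-8')
    print(len(text), len(data))   # 3 3, 6 7, 3 9: bytes per character vary

# indexing the str is a constant-time code point lookup in 3.3+, whereas
# finding the Nth character inside the UTF-8 bytes means scanning them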

--
Terry Jan Reedy

Steven D'Aprano

Jan 15, 2014, 7:32:25 PM
to
On Thu, 16 Jan 2014 02:14:38 +1100, Chris Angelico wrote:

> On Thu, Jan 16, 2014 at 1:55 AM, <wxjm...@gmail.com> wrote:
>> On Wednesday, January 15, 2014 13:13:36 UTC+1, Ned Batchelder wrote:
>>
>>
>>> ... more than one codepoint makes up a grapheme ...
>>
>> No
>
> Yes.
> http://www.unicode.org/faq/char_combmark.html
>
>>> In Unicode terms, an encoding is a mapping between codepoints and
>>> bytes.
>>
>> No
>
> Yes.
> http://www.unicode.org/reports/tr17/
> Specifically:
> "Character Encoding Form: a mapping from a set of nonnegative integers
> that are elements of a CCS to a set of sequences of particular code
> units of some specified width, such as 32-bit integers"

Technically Unicode talks about mapping code points and code *units*, but
since code units are defined in terms of bytes, I think it is fair to cut
out one layer of indirection and talk about mapping code points to bytes.
For instance, UTF-32 uses 4-byte code units, and every code point U+0000
through U+10FFFF is mapped to a single code unit, which is always a four-
byte quantity. UTF-8, on the other hand, uses single-byte code units, and
maps code points to a variable number of code units, so UTF-8 maps code
points to either 1, 2, 3 or 4 bytes.
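
A sketch of those sizes (Python 3; the -le variants avoid counting a BOM, and the four sample code points are arbitrary):

for cp in ['A', '\xdf', '\u20ac', '\U0001F600']:   # U+0041, U+00DF, U+20AC, U+1F600
    print(hex(ord(cp)),
          len(cp.encode('utf-8')),      # 1, 2, 3, 4 bytes
          len(cp.encode('utf-16-le')),  # 2, 2, 2, 4 bytes (surrogate pair)
          len(cp.encode('utf-32-le')))  # always 4 bytes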


> Or are you saying that www.unicode.org is wrong about the definitions of
> Unicode terms?

No, I think he is saying that he doesn't know Unicode anywhere near as
well as he thinks he does. The question is, will he cherish his
ignorance, or learn from this thread?




--
Steven

Steven D'Aprano

Jan 15, 2014, 7:43:21 PM
to
On Wed, 15 Jan 2014 12:00:51 +0000, Robin Becker wrote:

> so two 'characters' are 3 (or 2 or more) codepoints.

Yes.


> If I want to isolate so called graphemes I need an algorithm even
> for python's unicode

Correct. Graphemes are language dependent, e.g. in Dutch "ij" is usually
a single grapheme, in English it would be counted as two. Likewise, in
Czech, "ch" is a single grapheme. The Latin form of Serbo-Croation has
two two-letter graphemes, Dž and Nj (it used to have three, but Dj is now
written as Đ).

Worse, linguists sometimes disagree as to what counts as a grapheme. For
instance, some authorities consider the English "sh" to be a separate
grapheme. As a native English speaker, I'm not sure about that. Certainly
it isn't a separate letter of the alphabet, but on the other hand I can't
think of any words containing "sh" that should be considered as two
graphemes "s" followed by "h". Wait, no, that's not true... compound
words such as "glasshouse" or "disheartened" are counter examples.


> ie when it really matters, python3 str is just another encoding.

I'm not entirely sure how a programming language data type (str) can be
considered a transformation.



--
Steven

Chris Angelico

Jan 15, 2014, 8:26:20 PM
to pytho...@python.org
On Thu, Jan 16, 2014 at 11:43 AM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> Worse, linguists sometimes disagree as to what counts as a grapheme. For
> instance, some authorities consider the English "sh" to be a separate
> grapheme. As a native English speaker, I'm not sure about that. Certainly
> it isn't a separate letter of the alphabet, but on the other hand I can't
> think of any words containing "sh" that should be considered as two
> graphemes "s" followed by "h". Wait, no, that's not true... compound
> words such as "glasshouse" or "disheartened" are counter examples.

Digression: When I was taught basic English during my school days, my
mum used Spalding's book and the 70 phonograms. 25 of them are single
letters (Q is not a phonogram - QU is), and the others are mostly
pairs (there are a handful of 3- and 4-letter phonograms). Not every
instance of "s" followed by "h" is the phonogram "sh" - only the times
when it makes the single sound "sh" (which it doesn't in "glasshouse"
or "disheartened").

Thing is, you can't define spelling and pronunciation in terms of each
other, because you'll always be bitten by corner cases. Everyone knows
how "Thames" is pronounced... right? Well, no. There are (at least)
two rivers of that name, the famous one in London [1] and another one
further north [2]. The obscure one is pronounced the way the word
looks, the famous one isn't. And don't even get started on English
family names... Majorinbanks, Meux and Cholmodeley, as lampshaded [3]
in this song [4]! Even without names, though, there are the tricky
cases and the ones where different localities pronounce the same word
very differently; Unicode shouldn't have to deal with that by changing
whether something's a single character or two. Considering that
phonograms aren't even ligatures (though there is overlap, eg "Th"),
it's much cleaner to leave them as multiple characters.

ChrisA

[1] https://en.wikipedia.org/wiki/River_Thames
[2] Though it's better known as the Isis. https://en.wikipedia.org/wiki/The_Isis
[3] http://tvtropes.org/pmwiki/pmwiki.php/Main/LampshadeHanging
[4] http://www.stagebeauty.net/plays/th-arca2.html - "Mosh-banks",
"Mow", and "Chumley" are the pronunciations used

Robin Becker

Jan 16, 2014, 5:51:42 AM
to pytho...@python.org
On 16/01/2014 00:32, Steven D'Aprano wrote:
>> >Or are you saying that www.unicode.org is wrong about the definitions of
>> >Unicode terms?
> No, I think he is saying that he doesn't know Unicode anywhere near as
> well as he thinks he does. The question is, will he cherish his
> ignorance, or learn from this thread?

I assure you that I fully understand my ignorance of unicode. Until recently I
didn't even know that the unicode in python 2.x is considered broken and that
str in python 3.x is considered 'better'.

I can say that having made a lot of reportlab work in both 2.7 & 3.3 I don't
understand why the latter seems slower, especially since we try to convert early
to unicode/str as a desirable internal form. Probably I have some horrible error
going on (e.g. one of the C extensions is working in 2.7 and not in 3.3).
-stupidly yrs-
Robin Becker

Chris Angelico

Jan 16, 2014, 5:58:56 AM
to pytho...@python.org
On Thu, Jan 16, 2014 at 9:51 PM, Robin Becker <ro...@reportlab.com> wrote:
> On 16/01/2014 00:32, Steven D'Aprano wrote:
>>>
>>> >Or are you saying that www.unicode.org is wrong about the definitions of
>>> >Unicode terms?
>>
>> No, I think he is saying that he doesn't know Unicode anywhere near as
>> well as he thinks he does. The question is, will he cherish his
>> ignorance, or learn from this thread?
>
>
> I assure you that I fully understand my ignorance of unicode. Until recently
> I didn't even know that the unicode in python 2.x is considered broken and
> that str in python 3.x is considered 'better'.

Your wisdom, if I may paraphrase Master Foo, is that you know you are a fool.

http://catb.org/esr/writings/unix-koans/zealot.html

ChrisA

Frank Millman

unread,
Jan 16, 2014, 7:06:35 AM1/16/14
to pytho...@python.org

"Robin Becker" <ro...@reportlab.com> wrote in message
news:52D7B9BE...@chamonix.reportlab.co.uk...
> On 16/01/2014 00:32, Steven D'Aprano wrote:
>>> >Or are you saying that www.unicode.org is wrong about the definitions
>>> >of
>>> >Unicode terms?
>> No, I think he is saying that he doesn't know Unicode anywhere near as
>> well as he thinks he does. The question is, will he cherish his
>> ignorance, or learn from this thread?
>
> I assure you that I fully understand my ignorance of unicode. Until
> recently I didn't even know that the unicode in python 2.x is considered
> broken and that str in python 3.x is considered 'better'.
>

Hi Robin

I am pretty sure that Steven was referring to the original post from
jmfauth, not to anything that you wrote.

May I say that I am delighted that you are putting in the effort to port
ReportLab to python3, and I trust that you will get plenty of support from
the gurus here in achieving this.

Frank Millman



Robin Becker

unread,
Jan 16, 2014, 8:03:18 AM1/16/14
to pytho...@python.org
On 16/01/2014 12:06, Frank Millman wrote:
..........
>> I assure you that I fully understand my ignorance of unicode. Until
>> recently I didn't even know that the unicode in python 2.x is considered
>> broken and that str in python 3.x is considered 'better'.
>>
>
> Hi Robin
>
> I am pretty sure that Steven was referring to the original post from
> jmfauth, not to anything that you wrote.
>

unfortunately my ignorance remains even in the absence of criticism

> May I say that I am delighted that you are putting in the effort to port
> ReportLab to python3, and I trust that you will get plenty of support from
> the gurus here in achieving this.
........
I have had a lot of support from the gurus thanks to all of them :)
--
Robin Becker

Steven D'Aprano

unread,
Jan 16, 2014, 9:07:59 AM1/16/14
to
On Thu, 16 Jan 2014 10:51:42 +0000, Robin Becker wrote:

> On 16/01/2014 00:32, Steven D'Aprano wrote:
>>> >Or are you saying that www.unicode.org is wrong about the definitions
>>> >of Unicode terms?
>> No, I think he is saying that he doesn't know Unicode anywhere near as
>> well as he thinks he does. The question is, will he cherish his
>> ignorance, or learn from this thread?
>
> I assure you that I fully understand my ignorance of unicode.

Robin, while I'm very happy to see that you have a good grasp of what you
don't know, I'm afraid that you're misrepresenting me. You deleted the
part of my post that made it clear that I was referring to our resident
Unicode crank, JMF <wxjm...@gmail.com>.


> Until
> recently I didn't even know that the unicode in python 2.x is considered
> broken and that str in python 3.x is considered 'better'.

No need for scare quotes.

The unicode type in Python 2.x is less-good because:

- it is not the default string type (you have to prefix the string
with a u to get Unicode);

- it is missing some functionality, e.g. casefold;

- there are two distinct implementations, narrow builds and wide builds;

- wide builds take up to four times more memory per string than needed;

- narrow builds take up to two times more memory per string than needed;

- worse, narrow builds have very naive (possibly even "broken")
handling of code points in the Supplementary Multilingual Planes.

The unicode string type in Python 3 is better because:

- it is the default string type;

- it includes more functionality;

- starting in Python 3.3, it gets rid of the distinction between
narrow and wide builds;

- which reduces the memory overhead of strings by up to a factor
of four in many cases (a rough sketch of this is below);

- and fixes the issue of SMP code points.
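
A rough way to see the 3.3+ flexible representation at work; sys.getsizeof results vary by platform and version, so the commented storage widths are indicative only:

import sys
tests = ['a' * 1000,                  # pure ASCII: 1 byte per character
         'a' * 999 + '\xff',          # Latin-1 range: still 1 byte per character
         'a' * 999 + '\u20ac',        # BMP character forces 2 bytes per character
         'a' * 999 + '\U0001F600']    # astral character forces 4 bytes per character
for s in tests:
    print(len(s), sys.getsizeof(s))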


> I can say that having made a lot of reportlab work in both 2.7 & 3.3 I
> don't understand why the latter seems slower especially since we try to
> convert early to unicode/str as a desirable internal form.

*shrug*

Who knows? Is it slower or does it only *seem* slower? Is the performance
regression platform specific? Have you traded correctness for speed, that
is, does the 2.7 version break when given astral characters on a narrow build?

Earlier in January, you commented in another thread that

"I'm not sure if we have any non-bmp characters in the tests."

If you don't, you should have some.

There's all sorts of reasons why your code might be slower under 3.3,
including the possibility of a non-trivial performance regression. If you
can demonstrate a test case with a significant slowdown for real-world
code, I'm sure that a bug report will be treated seriously.
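
For what it's worth, a minimal sketch of how such a test case might be pinned down with timeit; the workload here is invented for illustration, and a real report would use the actual reportlab code path:

import timeit

setup = "s = u'Stra\\xdfe' * 1000"
stmt = "s.replace(u'\\xdf', u'ss'); s.upper(); s[500:1500]"
# run the same snippet under 2.7 and 3.3 and compare the timings
print(timeit.repeat(stmt, setup=setup, number=10000, repeat=3))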


> Probably I
> have some horrible error going on(eg one of the C extensions is working
> in 2.7 and not in 3.3).

Well that might explain a slowdown.

But really, one should expect that moving from single byte strings to up
to four-byte strings will have *some* cost. It's exchanging functionality
for time. The same thing happened years ago, people used to be extremely
opposed to using floating point doubles instead of singles because of
performance. And, I suppose it is true that back when 64K was considered
a lot of memory, using eight whole bytes per floating point number (let
alone ten like the IEEE Extended format) might have seemed the height of
extravagance. But today we use doubles by default, and if singles would
be a tiny bit faster, who wants to go back to the bad old days of single
precision?

I believe the same applies to Unicode versus single-byte strings.



--
Steven

Tim Chase

Jan 16, 2014, 10:24:01 AM
to pytho...@python.org
On 2014-01-16 14:07, Steven D'Aprano wrote:
> The unicode type in Python 2.x is less-good because:
>
> - it is missing some functionality, e.g. casefold;

Just for the record, str.casefold() wasn't added until 3.3, so
earlier 3.x versions (such as the 3.2.3 that is the default python3
on Debian Stable) don't have it either.

-tkc



Travis Griggs

Jan 16, 2014, 4:30:00 PM
to pytho...@python.org

On Jan 16, 2014, at 2:51 AM, Robin Becker <ro...@reportlab.com> wrote:

> I assure you that I fully understand my ignorance of ...

Robin, don’t take this personally, I totally got what you meant.

At the same time, I got a real chuckle out of this line. That beats “army intelligence” any day.