
How do I display unicode value stored in a string variable using ord()


Charles Jensen

Aug 16, 2012, 6:09:47 PM
to
Everyone knows that the python command

ord(u'…')

will output the number 8230, which is the Unicode code point for the horizontal ellipsis.

How would I use ord() to find the unicode value of a string stored in a variable?

So the following 2 lines of code will give me the ascii value of the variable a. How do I specify ord to give me the unicode value of a?

a = '…'
ord(a)

Chris Angelico

Aug 16, 2012, 6:20:15 PM
to pytho...@python.org
On Fri, Aug 17, 2012 at 8:09 AM, Charles Jensen
<hopefull...@gmail.com> wrote:
> How would I use ord() to find the unicode value of a string stored in a variable?
>
> So the following 2 lines of code will give me the ascii value of the variable a. How do I specify ord to give me the unicode value of a?
>
> a = '…'
> ord(a)

I presume you're talking about Python 2, because in Python 3 your
string variable is a Unicode string and will behave as you describe
above.

You'll need to look into what the encoding is, and figure it out from there.
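
A minimal sketch of that approach in Python 2, assuming the bytes happen to be
UTF-8 encoded (substitute whatever encoding actually produced them):

a = '\xe2\x80\xa6'       # the UTF-8 bytes for HORIZONTAL ELLIPSIS
u = a.decode('utf-8')    # decode to a length-1 unicode object first
print ord(u)             # 8230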

ChrisA

Dave Angel

Aug 16, 2012, 6:47:17 PM
to Charles Jensen, pytho...@python.org
On 08/16/2012 06:09 PM, Charles Jensen wrote:
> Everyone knows that the python command
>
> ord(u'…')
>
> will output the number 8230 which is the unicode character for the horizontal ellipsis.
>
> How would I use ord() to find the unicode value of a string stored in a variable?
>
> So the following 2 lines of code will give me the ascii value of the variable a. How do I specify ord to give me the unicode value of a?
>
> a = '…'
> ord(a)

You omitted the print statement. You also didn't specify what version
of Python you're using; I'll assume Python 2.x because in Python 3.x,
the u"xx" notation would have been a syntax error.

To get the ord of a unicode variable, you do it the same as a unicode
literal:

a = u"j" #note: for this to work reliably, you probably
need the correct Unicode declaration in line 2 of the file
print ord(a)

But if you have a byte string containing some binary bits, and you want
to get a unicode character value out of it, you'll need to explicitly
convert it to unicode.

First, decide which encoding the byte string was encoded with. If you specify
the wrong encoding, you'll likely get an exception, or maybe just a
nonsense answer.

a = "\xc1\xc1" #I just made this value up; it's not
valid utf8
b = a.decode("utf-8")
print ord(b)



--

DaveA

Terry Reedy

Aug 16, 2012, 7:59:31 PM
to pytho...@python.org
>>> a = '…'
>>> print(ord(a))
8230
Most things with unicode are easier in 3.x, and some are even better in
3.3. The current beta is good enough for most informal work. 3.3.0 will
be out in a month.

--
Terry Jan Reedy


Alister

Aug 17, 2012, 2:30:41 AM
to
The same way you did in your original example, by defining the string as
unicode:

a = u'…'
ord(a)
--
Keep on keepin' on.

wxjm...@gmail.com

Aug 17, 2012, 1:49:51 PM
to pytho...@python.org
Slightly off topic.

The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
is one of these characters existing in the cp1252, mac-roman
coding schemes and not in iso-8859-1 (latin-1) and obviously
not in ascii. It causes Py3.3 to work a few 100% slower
than Py<3.3 versions due to the flexible string representation
(ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).

>>> '…'.encode('cp1252')
b'\x85'
>>> '…'.encode('mac-roman')
b'\xc9'
>>> '…'.encode('iso-8859-1') # latin-1
Traceback (most recent call last):
File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026'
in position 0: ordinal not in range(256)

If one could neglect this (typographically important) glyph, what
to say about the characters of the European scripts (languages)
present in cp1252 or in mac-roman but not in latin-1 (eg. the
French script/language)?

Very nice. Python 2 was built for ascii user, now Python 3 is
*optimized* for, let say, ascii user!

The future is bright for Python. French users are better
served with Apple or MS products, simply because these
corporates know you can not write French with iso-8859-1.

PS When "TeX" moved from the ascii encoding to iso-8859-1
and the so called Cork encoding, "they" know this and provided
all the complementary packages to circumvent this. It was
in 199? (Python was not even born).

Ditto for the foundries (Adobe, Linotype, ...)

jmf

wxjm...@gmail.com

Aug 17, 2012, 1:49:51 PM
to comp.lan...@googlegroups.com, pytho...@python.org
On Friday, August 17, 2012 at 01:59:31 UTC+2, Terry Reedy wrote:

Jerry Hill

Aug 17, 2012, 2:21:34 PM
to pytho...@python.org
On Fri, Aug 17, 2012 at 1:49 PM, <wxjm...@gmail.com> wrote:
> The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
> is one of these characters existing in the cp1252, mac-roman
> coding schemes and not in iso-8859-1 (latin-1) and obviously
> not in ascii. It causes Py3.3 to work a few 100% slower
> than Py<3.3 versions due to the flexible string representation
> (ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).
>
>>>> '…'.encode('cp1252')
> b'\x85'
>>>> '…'.encode('mac-roman')
> b'\xc9'
>>>> '…'.encode('iso-8859-1') # latin-1
> Traceback (most recent call last):
> File "<eta last command>", line 1, in <module>
> UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026'
> in position 0: ordinal not in range(256)
>
> If one could neglect this (typographically important) glyph, what
> to say about the characters of the European scripts (languages)
> present in cp1252 or in mac-roman but not in latin-1 (eg. the
> French script/language)?

So... python should change the longstanding definition of the latin-1
character set? This isn't some sort of python limitation, it's just
the reality of legacy encodings that actually exist in the real world.


> Very nice. Python 2 was built for ascii user, now Python 3 is
> *optimized* for, let say, ascii user!
>
> The future is bright for Python. French users are better
> served with Apple or MS products, simply because these
> corporates know you can not write French with iso-8859-1.
>
> PS When "TeX" moved from the ascii encoding to iso-8859-1
> and the so called Cork encoding, "they" know this and provided
> all the complementary packages to circumvent this. It was
> in 199? (Python was not even born).
>
> Ditto for the foundries (Adobe, Linotype, ...)


I don't understand what any of this has to do with Python. Just
output your text in UTF-8 like any civilized person in the 21st
century, and none of that is a problem at all. Python makes that easy.
It also makes it easy to interoperate with older encodings if you
have to.
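
As a small sketch of that interoperability in Python 3 (the cp1252 input below
is just an assumed example):

legacy = b'Le caf\xe9 co\xfbte 3 \x80'   # bytes written under cp1252 ('€' is 0x80 there)
text = legacy.decode('cp1252')           # -> 'Le café coûte 3 €'
data = text.encode('utf-8')              # portable UTF-8 bytes for output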

--
Jerry

wxjm...@gmail.com

Aug 17, 2012, 2:45:02 PM
to comp.lan...@googlegroups.com, pytho...@python.org
Sorry, you missed the point.

My comment had nothing to do with the code source coding,
the coding of a Python "string" in the code source or with
the display of a Python3 <str>.
I wrote about the *internal* Python "coding", the
way Python keeps "strings" in memory. See PEP 393.

jmf

wxjm...@gmail.com

Aug 17, 2012, 2:45:02 PM
to pytho...@python.org
On Friday, August 17, 2012 at 20:21:34 UTC+2, Jerry Hill wrote:

Dave Angel

Aug 17, 2012, 4:55:14 PM
to wxjm...@gmail.com, pytho...@python.org
On 08/17/2012 02:45 PM, wxjm...@gmail.com wrote:
> On Friday, August 17, 2012 at 20:21:34 UTC+2, Jerry Hill wrote:
>> <SNIP>
>>
>> I don't understand what any of this has to do with Python. Just
>>
>> output your text in UTF-8 like any civilized person in the 21st
>>
>> century, and none of that is a problem at all. Python makes that easy.
>>
>> It also makes it easy to interoperate with older encodings if you
>>
>> have to.
>>
> Sorry, you missed the point.
>
> My comment had nothing to do with the code source coding,
> the coding of a Python "string" in the code source or with
> the display of a Python3 <str>.
> I wrote about the *internal* Python "coding", the
> way Python keeps "strings" in memory. See PEP 393.
>
> jmf

The internal coding described in PEP 393 has nothing to do with latin-1
encoding. So what IS your point? Make it clearly, without all the
snide side-comments.



--

DaveA

Dave Angel

Aug 17, 2012, 11:30:22 PM
to Ian Kelly, Python
On 08/17/2012 08:21 PM, Ian Kelly wrote:
> On Aug 17, 2012 2:58 PM, "Dave Angel" <d...@davea.name> wrote:
>> The internal coding described in PEP 393 has nothing to do with latin-1
>> encoding.
> It certainly does. PEP 393 provides for Unicode strings to be represented
> internally as any of Latin-1, UCS-2, or UCS-4, whichever is smallest and
> sufficient to contain the data. I understand the complaint to be that while
> the change is great for strings that happen to fit in Latin-1, it is less
> efficient than previous versions for strings that do not.

That's not the way I interpreted the PEP 393. It takes a pure unicode
string, finds the largest code point in that string, and chooses 1, 2 or
4 bytes for every character, based on how many bits it'd take for that
largest code point. Further i read it to mean that only 00 bytes would
be dropped in the process, no other bytes would be changed. I take it
as a coincidence that it happens to match latin-1; that's the way
Unicode happened historically, and is not Python's fault. Am I reading
it wrong?

I also figure this is going to be more space efficient than Python 3.2
for any string which had a max code point of 65535 or less (in Windows),
or 4billion or less (in real systems). So unless French has code points
over 64k, I can't figure that anything is lost.

I have no idea about the times involved, so i wanted a more specific
complaint.

> I don't know how much merit there is to this claim. It would seem to me
> that even in non-western locales, most strings are likely to be Latin-1 or
> even ASCII, e.g. class and attribute and function names.
>
>

The jmfauth rant I was responding to was saying that French isn't
efficiently encoded, and that the performance of some vague operations was
somehow reduced severalfold. I was just trying to get him to be
more specific.



--

DaveA

Steven D'Aprano

Aug 17, 2012, 11:59:47 PM
to
On Fri, 17 Aug 2012 11:45:02 -0700, wxjmfauth wrote:

> On Friday, August 17, 2012 at 20:21:34 UTC+2, Jerry Hill wrote:
>> On Fri, Aug 17, 2012 at 1:49 PM, <wxjm...@gmail.com> wrote:
>>
>> > The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
>> > is one of these characters existing in the cp1252, mac-roman
>> > coding schemes and not in iso-8859-1 (latin-1) and obviously
>> > not in ascii. It causes Py3.3 to work a few 100% slower
>> > than Py<3.3 versions due to the flexible string representation
>> > (ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).
[...]
> Sorry, you missed the point.
>
> My comment had nothing to do with the code source coding, the coding of
> a Python "string" in the code source or with the display of a Python3
> <str>.
> I wrote about the *internal* Python "coding", the way Python keeps
> "strings" in memory. See PEP 393.


The PEP does not support your claim that flexible string storage is 100%
to 1000% slower. It claims 1% - 30% slowdown, with a saving of up to 60%
of the memory used for strings.

I don't really understand what message you are trying to give here. Are
you saying that PEP 393 is a good thing or a bad thing?

In Python 1.x, there was no support for Unicode at all. You could only
work with pure byte strings. Support for non-ascii characters like … ∞ é ñ
£ π Ж ش was purely by accident -- if your terminal happened to be set to
an encoding that supported a character, and you happened to use the
appropriate byte value, you might see the character you wanted.

In Python 2.2, Python gained support for Unicode. You could now guarantee
support for any Unicode character in the Basic Multilingual Plane (BMP)
by writing your strings using the u"..." style. In Python 3, you no
longer need the leading U, all strings are unicode.

But there is a problem: if your Python interpreter is a "narrow build",
it *only* supports Unicode characters in the BMP. When Python is a "wide
build", compiled with support for the additional character planes, then
strings take much more memory, even if they are in the BMP, or are simple
ASCII strings.

PEP 393 fixes this problem and gets rid of the distinction between narrow
and wide builds. From Python 3.3 onwards, all Python compilers will have
the same support for unicode, rather than most being BMP-only. Each
individual string's internal storage will use only as many bytes-per-
character as needed to store the largest character in the string.

This will save a lot of memory for those using mostly ASCII or Latin-1
but a few multibyte characters. While the increased complexity causes a
small slowdown, the increased functionality makes it well worthwhile.



--
Steven

Steven D'Aprano

Aug 18, 2012, 12:10:30 AM
to
On Fri, 17 Aug 2012 23:30:22 -0400, Dave Angel wrote:

> On 08/17/2012 08:21 PM, Ian Kelly wrote:
>> On Aug 17, 2012 2:58 PM, "Dave Angel" <d...@davea.name> wrote:
>>> The internal coding described in PEP 393 has nothing to do with
>>> latin-1 encoding.
>> It certainly does. PEP 393 provides for Unicode strings to be
>> represented internally as any of Latin-1, UCS-2, or UCS-4, whichever is
>> smallest and sufficient to contain the data.

Unicode strings are not represented as Latin-1 internally. Latin-1 is a
byte encoding, not a unicode internal format. Perhaps you mean to say
that they are represented as a single byte format?

>> I understand the complaint
>> to be that while the change is great for strings that happen to fit in
>> Latin-1, it is less efficient than previous versions for strings that
>> do not.
>
> That's not the way I interpreted the PEP 393. It takes a pure unicode
> string, finds the largest code point in that string, and chooses 1, 2 or
> 4 bytes for every character, based on how many bits it'd take for that
> largest code point.

That's how I interpret it too.


> Further i read it to mean that only 00 bytes would
> be dropped in the process, no other bytes would be changed.

Just to clarify, you aren't talking about the \0 character, but only to
extraneous "padding" 00 bytes.


> I also figure this is going to be more space efficient than Python 3.2
> for any string which had a max code point of 65535 or less (in Windows),
> or 4billion or less (in real systems). So unless French has code points
> over 64k, I can't figure that anything is lost.

I think that on narrow builds, it won't make terribly much difference.
The big savings are for wide builds.


--
Steven

wxjm...@gmail.com

Aug 18, 2012, 4:09:26 AM
to
>>> sys.version
'3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]'
>>> timeit.timeit("('ab…' * 1000).replace('…', '……')")
37.32762490493721
timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
0.8158757139801764

>>> sys.version
'3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:02:36) [MSC v.1600 32 bit
(Intel)]'
>>> imeit.timeit("('ab…' * 1000).replace('…', '……')")
61.919225272152346
>>> timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
1.2918679017971044

timeit.timeit("('ab…' * 10).replace('…', '€…')")
1.2484133226156757

* I intuitively and empirically noticed, this happens for
cp1252 or mac-roman characters and not characters which are
elements of the latin-1 coding scheme.

* Bad luck, such characters are usual characters in French scripts
(and in some other European language).

* I do not recall the extreme cases I found. Believe me, when
I'm speaking about a few 100%, I do not lie.

My take of the subject.

This is a typical Python disease. Do not solve a problem, but
find a way, a workaround, which is expected to solve a problem
and which finally solves nothing. As far as I know, to break
the "BMP limit", the tools are here. They are called utf-8 or
ucs-4/utf-32.

One day, I came across a very, very old mail message, dating from the
time of the introduction of the unicode type in Python 2.
If I recall correctly it was from Victor Stinner. He wrote
something like this "Let's go with ucs-4, and the problems
are solved for ever". He was so right.

I have been watching the dev-list for years; my feeling is that
there is always a latent and permanent conflict between
"ascii users" and "non ascii users" (see the unicode
literal reintroduction).

Please, do not get me wrong. As a non-computer scientist,
I'm very happy with Python. But if I try to take a step
back, I become more and more sceptical.

PS Py3.3b2 is still crashing, silently exiting, with
cp65001.

jmf

Steven D'Aprano

Aug 18, 2012, 8:27:23 AM
to
On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:

>>>> sys.version
> '3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]'
>>>> timeit.timeit("('ab…' * 1000).replace('…', '……')")
> 37.32762490493721
> timeit.timeit("('ab…' * 10).replace('…', 'œ…')") 0.8158757139801764
>
>>>> sys.version
> '3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:02:36) [MSC v.1600 32
> bit (Intel)]'
>>>> imeit.timeit("('ab…' * 1000).replace('…', '……')")
> 61.919225272152346

"imeit"?

It is hard to take your results seriously when you have so obviously
edited your timing results, not just copied and pasted them.


Here are my results, on my laptop running Debian Linux. First, testing on
Python 3.2:

steve@runes:~$ python3.2 -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 50.2 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 45.3 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 51.3 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 47.6 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 45.9 usec per loop
steve@runes:~$ python3.2 -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 57.5 usec per loop
steve@runes:~$ python3.2 -m timeit "('XYZ' * 1000).replace('Y', 'πЖ')"
10000 loops, best of 3: 49.7 usec per loop


As you can see, the timing results are all consistently around 50
microseconds per loop, regardless of which characters I use, whether they
are in Latin-1 or not. The differences between one test and another are
not meaningful.


Now I do them again using Python 3.3:

steve@runes:~$ python3.3 -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 64.3 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 67.8 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 66 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 67.6 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 68.3 usec per loop
steve@runes:~$ python3.3 -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 67.9 usec per loop
steve@runes:~$ python3.3 -m timeit "('XYZ' * 1000).replace('Y', 'πЖ')"
10000 loops, best of 3: 66.9 usec per loop

The results are all consistently around 67 microseconds. So Python's
string handling is about 30% slower in the examples shown here.

If you can consistently replicate a 100% to 1000% slowdown in string
handling, please report it as a performance bug:


http://bugs.python.org/

Don't forget to report your operating system.



> My take of the subject.
>
> This is a typical Python disease. Do not solve a problem, but find a
> way, a workaround, which is expected to solve a problem and which
> finally solves nothing. As far as I know, to break the "BMP limit", the
> tools are here. They are called utf-8 or ucs-4/utf-32.

The problem with UCS-4 is that every character requires four bytes.
Every. Single. One.

So under UCS-4, the pure-ascii string "hello world" takes 44 bytes plus
the object overhead. Under UCS-2, it takes half that space: 22 bytes, but
of course UCS-2 can only represent characters in the BMP. A pure ASCII
string would only take 11 bytes, but we're not going back to pure ASCII.

(There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
using two 16-bit code units. This is fragile and doesn't work very well,
because string-handling methods can break the surrogate pairs apart,
leaving you with an invalid unicode string. Not good.)

The difference between 44 bytes and 22 bytes for one little string is not
very important, but when you double the memory required for every single
string it becomes huge. Remember that every class, function and method
has a name, which is a string; every attribute and variable has a name,
all strings; functions and classes have doc strings, all strings. Strings
are used everywhere in Python, and doubling the memory needed by Python
means that it will perform worse.

With PEP 393, each Python string will be stored in the most efficient
format possible:

- if it only contains ASCII characters, it will be stored using 1 byte
per character;

- if it only contains characters in the BMP, it will be stored using
UCS-2 (2 bytes per character);

- if it contains non-BMP characters, the string will be stored using
UCS-4 (4 bytes per character).
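
A quick way to see those three cases from code (sys.getsizeof figures vary by
platform and include the object header, so treat the absolute numbers as
indicative only):

import sys
print(sys.getsizeof('abcd'))            # ASCII only     -> 1 byte per character
print(sys.getsizeof('abc…'))            # BMP character  -> 2 bytes per character
print(sys.getsizeof('abc\U00010000'))   # non-BMP        -> 4 bytes per character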



--
Steven

wxjm...@gmail.com

Aug 18, 2012, 11:07:05 AM
to
On Saturday, August 18, 2012 at 14:27:23 UTC+2, Steven D'Aprano wrote:
> [...]
> The problem with UCS-4 is that every character requires four bytes.
> [...]

I'm aware of this (and all the blah blah blah you are
explaining). This always the same song. Memory.

Let me ask. Is Python an 'american" product for us-users
or is it a tool for everybody [*]?
Is there any reason why non ascii users are somehow penalized
compared to ascii users?

This flexible string representation is a regression (ascii users
or not).

I recognize in practice the real impact is for many users
closed to zero (including me) but I have shown (I think) that
this flexible representation is, by design, not as optimal
as it is supposed to be. This is in my mind the relevant point.

[*] This not even true, if we consider the €uro currency
symbol used all around the world (banking, accounting
applications).

jmf

Ian Kelly

Aug 18, 2012, 11:18:39 AM
to Python
(Resending this to the list because I previously sent it only to
Steven by mistake. Also showing off a case where top-posting is
reasonable, since this bit requires no context. :-)

On Sat, Aug 18, 2012 at 1:41 AM, Ian Kelly <ian.g...@gmail.com> wrote:
>
> On Aug 17, 2012 10:17 PM, "Steven D&apos;Aprano"
> <steve+comp....@pearwood.info> wrote:
>>
>> Unicode strings are not represented as Latin-1 internally. Latin-1 is a
>> byte encoding, not a unicode internal format. Perhaps you mean to say
>> that they are represented as a single byte format?
>
> They are represented as a single-byte format that happens to be equivalent
> to Latin-1, because Latin-1 is a proper subset of Unicode; every character
> representable in Latin-1 has a byte value equal to its Unicode codepoint.
> This talk of whether it's a byte encoding or a 1-byte Unicode representation
> is then just semantics. Even the PEP refers to the 1-byte representation as
> Latin-1.
>
>>
>> >> I understand the complaint
>> >> to be that while the change is great for strings that happen to fit in
>> >> Latin-1, it is less efficient than previous versions for strings that
>> >> do not.
>> >
>> > That's not the way I interpreted the PEP 393. It takes a pure unicode
>> > string, finds the largest code point in that string, and chooses 1, 2 or
>> > 4 bytes for every character, based on how many bits it'd take for that
>> > largest code point.
>>
>> That's how I interpret it too.
>
> I don't see how this is any different from what I described. Using all 4
> bytes of the code point, you get UCS-4. Truncating to 2 bytes, you get
> UCS-2. Truncating to 1 byte, you get Latin-1.
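
A one-line check of why the 1-byte case coincides with Latin-1 (for the first
256 code points, the Latin-1 byte value and the Unicode code point are equal):

>>> all(ord(bytes([i]).decode('latin-1')) == i for i in range(256))
True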

Mark Lawrence

Aug 18, 2012, 11:25:47 AM
to pytho...@python.org
On 18/08/2012 16:07, wxjm...@gmail.com wrote:
> On Saturday, August 18, 2012 at 14:27:23 UTC+2, Steven D'Aprano wrote:
>> [...]
>> The problem with UCS-4 is that every character requires four bytes.
>> [...]
>
> I'm aware of this (and all the blah blah blah you are
> explaining). This always the same song. Memory.
>
> Let me ask. Is Python an 'american" product for us-users
> or is it a tool for everybody [*]?
> Is there any reason why non ascii users are somehow penalized
> compared to ascii users?
>
> This flexible string representation is a regression (ascii users
> or not).
>
> I recognize in practice the real impact is for many users
> closed to zero (including me) but I have shown (I think) that
> this flexible representation is, by design, not as optimal
> as it is supposed to be. This is in my mind the relevant point.
>
> [*] This not even true, if we consider the €uro currency
> symbol used all around the world (banking, accounting
> applications).
>
> jmf
>

Sorry but you've got me completely baffled. Could you please explain in
words of one syllable or less so I can attempt to grasp what the hell
you're on about?

--
Cheers.

Mark Lawrence.

Chris Angelico

Aug 18, 2012, 11:36:01 AM
to pytho...@python.org
On Sun, Aug 19, 2012 at 1:07 AM, <wxjm...@gmail.com> wrote:
> I'm aware of this (and all the blah blah blah you are
> explaining). This always the same song. Memory.
>
> Let me ask. Is Python an 'american" product for us-users
> or is it a tool for everybody [*]?
> Is there any reason why non ascii users are somehow penalized
> compared to ascii users?

Regardless of your own native language, "len" is the name of a popular
Python function. And "dict" is a well-used class. Both those names are
representable in ASCII, even if every quoted string in your code
requires more bytes to store.

And memory usage has significance in many other areas, too. CPU cache
utilization turns a space saving into a time saving. That's why
structure packing still exists, even though member alignment has other
advantages.

You'd be amazed how many non-USA strings still fit inside seven bits,
too. Are you appending a space to something? Splitting on newlines?
You'll have lots of strings that are going now to be space-optimized.
Of course, the performance gains from shortening some of the strings
may be offset by costs when comparing one-byte and multi-byte strings,
but presumably that's all been gone into in great detail elsewhere.

ChrisA

Ian Kelly

Aug 18, 2012, 11:51:37 AM
to Python
On Sat, Aug 18, 2012 at 9:07 AM, <wxjm...@gmail.com> wrote:
> On Saturday, August 18, 2012 at 14:27:23 UTC+2, Steven D'Aprano wrote:
>> [...]
>> The problem with UCS-4 is that every character requires four bytes.
>> [...]
>
> I'm aware of this (and all the blah blah blah you are
> explaining). This always the same song. Memory.
>
> Let me ask. Is Python an 'american" product for us-users
> or is it a tool for everybody [*]?
> Is there any reason why non ascii users are somehow penalized
> compared to ascii users?

The change does not just benefit ASCII users. It primarily benefits
anybody using a wide unicode build with strings mostly containing only
BMP characters. Even for narrow build users, there is the benefit
that with approximately the same amount of memory usage in most cases,
they no longer have to worry about non-BMP characters sneaking in and
breaking their code.

There is some additional benefit for Latin-1 users, but this has
nothing to do with Python. If Python is going to have the option of a
1-byte representation (and as long as we have the flexible
representation, I can see no reason not to), then it is going to be
Latin-1 by definition, because that's what 1-byte Unicode (UCS-1, if
you will) is. If you have an issue with that, take it up with the
designers of Unicode.

>
> This flexible string representation is a regression (ascii users
> or not).
>
> I recognize in practice the real impact is for many users
> closed to zero (including me) but I have shown (I think) that
> this flexible representation is, by design, not as optimal
> as it is supposed to be. This is in my mind the relevant point.

You've shown nothing of the sort. You've demonstrated only one out of
many possible benchmarks, and other users on this list can't even
reproduce that.

wxjm...@gmail.com

Aug 18, 2012, 12:38:09 PM
to Python
Sorry guys, I'm not stupid (I think). I can open IDLE with
Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
always slower. Period.

Now, the reason. I think it is due to the "flexible representation".

Deeper reason. The "boss" does not wish to hear of a (pure)
ucs-4/utf-32 "engine" (this has been discussed I do not know
how many times).

jmf

wxjm...@gmail.com

Aug 18, 2012, 12:38:09 PM
to comp.lan...@googlegroups.com, Python

Chris Angelico

Aug 18, 2012, 12:57:33 PM
to pytho...@python.org
On Sun, Aug 19, 2012 at 2:38 AM, <wxjm...@gmail.com> wrote:
> Sorry guys, I'm not stupid (I think). I can open IDLE with
> Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
> always slower. Period.

Ah, but what about all those other operations that use strings under
the covers? As mentioned, namespace lookups do, among other things.
And how is performance in the (very real) case where a C routine wants
to return a value to Python as a string, where the data is currently
guaranteed to be ASCII (previously using PyUnicode_FromString, now
able to use PyUnicode_FromKindAndData)? Again, I'm sure this has been
gone into in great detail before the PEP was accepted (am I
negative-bikeshedding here? "atomic reactoring"???), and I'm sure that
the gains outweigh the costs.

ChrisA

Mark Lawrence

Aug 18, 2012, 1:28:26 PM
to pytho...@python.org
On 18/08/2012 17:38, wxjm...@gmail.com wrote:
> Sorry guys, I'm not stupid (I think). I can open IDLE with
> Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
> always slower. Period.

Proof that is acceptable to everybody please, not just yourself.

>
> Now, the reason. I think it is due to the "flexible representation".
>
> Deeper reason. The "boss" does not wish to hear of a (pure)
> ucs-4/utf-32 "engine" (this has been discussed I do not know
> how many times).
>
> jmf
>

--
Cheers.

Mark Lawrence.

Steven D'Aprano

Aug 18, 2012, 1:59:18 PM
to
On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:

> On Saturday, August 18, 2012 at 14:27:23 UTC+2, Steven D'Aprano wrote:
>> [...]
>> The problem with UCS-4 is that every character requires four bytes.
>> [...]
>
> I'm aware of this (and all the blah blah blah you are explaining). This
> always the same song. Memory.

Exactly. The reason it is always the same song is because it is an
important song.


> Let me ask. Is Python an 'american" product for us-users or is it a tool
> for everybody [*]?

It is a product for everyone, which is exactly why PEP 393 is so
important. PEP 393 means that users who have only a few non-BMP
characters don't have to pay the cost of UCS-4 for every single string in
their application, only for the ones that actually require it. PEP 393
means that using Unicode strings is now cheaper for everybody.

You seem to be arguing that the way forward is not to make Unicode
cheaper for everyone, but to make ASCII strings more expensive so that
everyone suffers equally. I reject that idea.


> Is there any reason why non ascii users are somehow penalized compared
> to ascii users?

Of course there is a reason.

If you want to represent 1114111 different characters in a string, as
Unicode supports, you can't use a single byte per character, or even two
bytes. That is a fact of basic mathematics. Supporting 1114111 characters
must be more expensive than supporting 128 of them.

But why should you carry the cost of 4-bytes per character just because
someday you *might* need a non-BMP character?



> This flexible string representation is a regression (ascii users or
> not).

No it is not. It is a great step forward to more efficient Unicode.

And it means that now Python can correctly deal with non-BMP characters
without the nonsense of UTF-16 surrogates:

steve@runes:~$ python3.3 -c "print(len(chr(1114000)))" # Right!
1
steve@runes:~$ python3.2 -c "print(len(chr(1114000)))" # Wrong!
2

without doubling the storage of every string.

This is an important step towards making the full range of Unicode
available more widely.


> I recognize in practice the real impact is for many users closed to zero

Then what's the problem?


> (including me) but I have shown (I think) that this flexible
> representation is, by design, not as optimal as it is supposed to be.

You have not shown any real problem at all.

You have shown untrustworthy, edited timing results that don't match what
other people are reporting.

Even if your timing results are genuine, you haven't shown that they make
any difference for real code that does useful work.



--
Steven

wxjm...@gmail.com

Aug 18, 2012, 2:05:07 PM
to pytho...@python.org
On Saturday, August 18, 2012 at 19:28:26 UTC+2, Mark Lawrence wrote:
>
> Proof that is acceptable to everybody please, not just yourself.
>
>
I can't, I'm only facing the fact it works slower on my
Windows platform.

As I understand (I think) the underlying mechanism, I
can only say, it is not a surprise that it happens.

Imagine an editor, I type an "a", internally the text is
saved as ascii, then I type an "é", the text can only
be saved in at least latin-1. Then I enter an "€", the text
becomes an internal ucs-4 "string". Then remove the "€" and so
on.

Intuitively I expect there is some kind of slowdown caused by
all these "string" conversions.

When I tested this flexible representation a few months
ago, at the first alpha release, this is precisely what
I tested: string manipulations which force this internal
change. I concluded the result is not brilliant. Really,
a factor of 0.n up to 10.

These are simply my conclusions.

Related question.

Does anybody know a way to get the size of the internal
"string" in bytes? In a narrow or wide build it is easy,
I can encode with the "unicode_internal" codec. In Py 3.3,
I attempted to toy with sizeof and struct, but without
success.

jmf

wxjm...@gmail.com

Aug 18, 2012, 2:05:07 PM
to comp.lan...@googlegroups.com, pytho...@python.org
On Saturday, August 18, 2012 at 19:28:26 UTC+2, Mark Lawrence wrote:
>
> Proof that is acceptable to everybody please, not just yourself.
>
>

Paul Rubin

Aug 18, 2012, 2:26:21 PM
to
Steven D'Aprano <steve+comp....@pearwood.info> writes:
> (There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
> using two 16-bit code units. This is fragile and doesn't work very well,
> because string-handling methods can break the surrogate pairs apart,
> leaving you with an invalid unicode string. Not good.)
...
> With PEP 393, each Python string will be stored in the most efficient
> format possible:

Can you explain the issue of "breaking surrogate pairs apart" a little
more? Switching between encodings based on the string contents seems
silly at first glance. Strings are immutable so I don't understand why
not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in
Latin-based alphabets and UTF-16 may be more efficient for some other
languages. I think even UCS-4 doesn't completely fix the surrogate pair
issue if it means the only thing I can think of.

wxjm...@gmail.com

Aug 18, 2012, 2:30:19 PM
to
On Saturday, August 18, 2012 at 19:59:18 UTC+2, Steven D'Aprano wrote:
> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
>
>
>
> > On Saturday, August 18, 2012 at 14:27:23 UTC+2, Steven D'Aprano wrote:
>
> >> [...]
>
> >> The problem with UCS-4 is that every character requires four bytes.
>
> >> [...]
>
> >
>
> > I'm aware of this (and all the blah blah blah you are explaining). This
>
> > always the same song. Memory.
>
>
>
> Exactly. The reason it is always the same song is because it is an
>
> important song.
>
>
No offense here. But this is an *american* answer.

The same story as the coding of text files, where "utf-8 == ascii"
and the rest of the world doesn't count.

jmf

MRAB

Aug 18, 2012, 2:34:50 PM
to pytho...@python.org
On 18/08/2012 19:05, wxjm...@gmail.com wrote:
> On Saturday, August 18, 2012 at 19:28:26 UTC+2, Mark Lawrence wrote:
>>
>> Proof that is acceptable to everybody please, not just yourself.
>>
>>
> I can't, I'm only facing the fact it works slower on my
> Windows platform.
>
> As I understand (I think) the underlying mechanism, I
> can only say, it is not a surprise that it happens.
>
> Imagine an editor, I type an "a", internally the text is
> saved as ascii, then I type an "é", the text can only
> be saved in at least latin-1. Then I enter an "€", the text
> becomes an internal ucs-4 "string". Then remove the "€" and so
> on.
>
[snip]

"a" will be stored as 1 byte/codepoint.

Adding "�", it will still be stored as 1 byte/codepoint.

Adding "�", it will still be stored as 2 bytes/codepoint.

But then you wouldn't be adding them one at a time in Python, you'd be
building a list and then joining them together in one operation.
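
In other words, something along these lines (a minimal sketch):

pieces = []
for ch in ('a', 'é', '€'):    # collect the keystrokes first
    pieces.append(ch)
text = ''.join(pieces)        # one final str; its internal width is decided once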

rusi

Aug 18, 2012, 2:40:23 PM
to
On Aug 18, 10:59 pm, Steven D'Aprano <steve
+comp.lang.pyt...@pearwood.info> wrote:
> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
> > Is there any reason why non ascii users are somehow penalized compared
> > to ascii users?
>
> Of course there is a reason.
>
> If you want to represent 1114111 different characters in a string, as
> Unicode supports, you can't use a single byte per character, or even two
> bytes. That is a fact of basic mathematics. Supporting 1114111 characters
> must be more expensive than supporting 128 of them.
>
> But why should you carry the cost of 4-bytes per character just because
> someday you *might* need a non-BMP character?

I am reminded of: http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605

Original above does not open for me but here's a copy that does:

http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html

MRAB

Aug 18, 2012, 2:59:32 PM
to pytho...@python.org
On a narrow build, codepoints outside the BMP are stored as a surrogate
pair (2 codepoints). On a wide build, all codepoints can be represented
without the need for surrogate pairs.

The problem with strings containing surrogate pairs is that you could
inadvertently slice the string in the middle of the surrogate pair.

Mark Lawrence

Aug 18, 2012, 3:45:53 PM
to pytho...@python.org
On 18/08/2012 19:30, wxjm...@gmail.com wrote:
> On Saturday, August 18, 2012 at 19:59:18 UTC+2, Steven D'Aprano wrote:
>> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
>>
>>
>>
>>> On Saturday, August 18, 2012 at 14:27:23 UTC+2, Steven D'Aprano wrote:
>>
>>>> [...]
>>
>>>> The problem with UCS-4 is that every character requires four bytes.
>>
>>>> [...]
>>
>>>
>>
>>> I'm aware of this (and all the blah blah blah you are explaining). This
>>
>>> always the same song. Memory.
>>
>>
>>
>> Exactly. The reason it is always the same song is because it is an
>>
>> important song.
>>
>>
> No offense here. But this is an *american* answer.
>
> The same story as the coding of text files, where "utf-8 == ascii"
> and the rest of the world doesn't count.
>
> jmf
>

Thinking about it I entirely agree with you. Steven D'Aprano strikes me
as typically American, in the same way that I'm typically Brazilian :)

--
Cheers.

Mark Lawrence.

Mark Lawrence

Aug 18, 2012, 3:50:58 PM
to pytho...@python.org
ROFLMAO doesn't adequately sum up how much I laughed.

--
Cheers.

Mark Lawrence.

Terry Reedy

Aug 18, 2012, 4:09:14 PM
to pytho...@python.org
On 8/18/2012 12:38 PM, wxjm...@gmail.com wrote:
> Sorry guys, I'm not stupid (I think). I can open IDLE with
> Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
> always slower. Period.

You have not tried enough tests ;-).

On my Win7-64 system:
from timeit import timeit

print(timeit(" 'a'*10000 "))
3.3.0b2: .5
3.2.3: .8

print(timeit("c in a", "c = '…'; a = 'a'*10000"))
3.3: .05 (independent of len(a)!)
3.2: 5.8 100 times slower! Increase len(a) and the ratio can be made as
high as one wants!

print(timeit("a.encode()", "a = 'a'*1000"))
3.2: 1.5
3.3: .26

Similar with encoding='utf-8' added to call.

Jim, please stop the ranting. It does not help improve Python. utf-32 is
not a panacea; it has problems of time, space, and system compatibility
(Windows and others). Victor Stinner, whatever he may have once thought
and said, put a *lot* of effort into making the new implementation both
correct and fast.

On your replace example
>>> imeit.timeit("('ab…' * 1000).replace('…', '……')")
> 61.919225272152346
>>> timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
> 1.2918679017971044

I do not see the point of changing both length and replacement. For me,
the time is about the same for either replacement. I do see about the
same slowdown ratio for 3.3 versus 3.2. I also see it for pure search
without replacement.

print(timeit("c in a", "c = '…'; a = 'a'*1000+c"))
# .6 in 3.2.3, 1.2 in 3.3.0

This does not make sense to me and I will ask about it.

--
Terry Jan Reedy


wxjm...@gmail.com

Aug 18, 2012, 4:22:00 PM
to
I think it's time to leave the discussion and to go to bed.

You can take the problem the way you wish, Python 3.3 is "slower"
than Python 3.2.

If you see the present status as an optimisation, I'm considering
this as a regression.

I'm pretty sure a pure ucs-4/utf-32 can only be, by nature,
the correct solution.

To be extreme, tools using pure utf-16 or utf-32 are, at least,
considering all the citizen on this planet in the same way.

jmf

Mark Lawrence

Aug 18, 2012, 5:37:13 PM
to pytho...@python.org
On 18/08/2012 21:22, wxjm...@gmail.com wrote:
> On Saturday, August 18, 2012 at 20:40:23 UTC+2, rusi wrote:
>> On Aug 18, 10:59 pm, Steven D'Aprano <steve
>>
>> +comp.lang.pyt...@pearwood.info> wrote:
>>
>>> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
>>
>>>> Is there any reason why non ascii users are somehow penalized compared
>>
>>>> to ascii users?
>>
>>>
>>
>>> Of course there is a reason.
>>
>>>
>>
>>> If you want to represent 1114111 different characters in a string, as
>>
>>> Unicode supports, you can't use a single byte per character, or even two
>>
>>> bytes. That is a fact of basic mathematics. Supporting 1114111 characters
>>
>>> must be more expensive than supporting 128 of them.
>>
>>>
>>
>>> But why should you carry the cost of 4-bytes per character just because
>>
>>> someday you *might* need a non-BMP character?
>>
>>
>>
>> I am reminded of: http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605
>>
>>
>>
>> Original above does not open for me but here's a copy that does:
>>
>>
>>
>> http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html
>
> I think it's time to leave the discussion and to go to bed.

In plain English, duck out cos I'm losing.

>
> You can take the problem the way you wish, Python 3.3 is "slower"
> than Python 3.2.

I'll ask for the second time. Provide proof that is acceptable to
everybody and not just yourself.

>
> If you see the present status as an optimisation, I'm considering
> this as a regression.

Considering does not equate to proof. Where are the figures which back
up your claim?

>
> I'm pretty sure a pure ucs-4/utf-32 can only be, by nature,
> the correct solution.

I look forward to seeing your patch on the bug tracker. If and only if
you can find something that needs patching, which from the course of
this thread I think is highly unlikely.


>
> To be extreme, tools using pure utf-16 or utf-32 are, at least,
> considering all the citizen on this planet in the same way.
>
> jmf
>


--
Cheers.

Mark Lawrence.

Chris Angelico

Aug 18, 2012, 8:46:18 PM
to pytho...@python.org
On Sun, Aug 19, 2012 at 4:26 AM, Paul Rubin <no.e...@nospam.invalid> wrote:
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more? Switching between encodings based on the string contents seems
> silly at first glance. Strings are immutable so I don't understand why
> not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in
> Latin-based alphabets and UTF-16 may be more efficient for some other
> languages. I think even UCS-4 doesn't completely fix the surrogate pair
> issue if it means the only thing I can think of.

UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
few thousand bytes, how do you locate the 273rd character? You have to
scan from the beginning. The same applies when surrogate pairs are
used to represent single characters, unless the representation leaks
and a surrogate is indexed as two - which is where the breaking-apart
happens.
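
To make that cost concrete, here is a rough sketch of what "locate the 273rd
character" means over a raw UTF-8 buffer (the helper name is made up):

def utf8_index(data, n):
    """Byte offset of character n in UTF-8 bytes, found by scanning."""
    count = -1
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:       # a lead byte starts a new character
            count += 1
            if count == n:
                return offset
    raise IndexError(n)

buf = 'ab…'.encode('utf-8') * 1000
print(utf8_index(buf, 272))           # an O(n) walk just to find one offset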

ChrisA

Paul Rubin

Aug 18, 2012, 10:11:38 PM
to
Chris Angelico <ros...@gmail.com> writes:
> UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
> few thousand bytes, how do you locate the 273rd character?

How often do you need to do that, as opposed to traversing the string by
iteration? Anyway, you could use a rope-like implementation, or an
index structure over the string.

Chris Angelico

Aug 18, 2012, 10:19:00 PM
to pytho...@python.org
Well, imagine if Python strings were stored in UTF-8. How would you slice it?

>>> "asdfqwer"[4:]
'qwer'

That's a not uncommon operation when parsing strings or manipulating
data. You'd need to completely rework your algorithms to maintain a
position somewhere.

ChrisA

Paul Rubin

Aug 18, 2012, 10:35:44 PM
to
Chris Angelico <ros...@gmail.com> writes:
>>>> "asdfqwer"[4:]
> 'qwer'
>
> That's a not uncommon operation when parsing strings or manipulating
> data. You'd need to completely rework your algorithms to maintain a
> position somewhere.

Scanning 4 characters (or a few dozen, say) to peel off a token in
parsing a UTF-8 string is no big deal. It gets more expensive if you
want to index far more deeply into the string. I'm asking how often
that is done in real code. Obviously one can concoct hypothetical
examples that would suffer.

Chris Angelico

Aug 18, 2012, 11:01:46 PM
to pytho...@python.org
Sure, four characters isn't a big deal to step through. But it still
makes indexing and slicing operations O(N) instead of O(1), plus you'd
have to zark the whole string up to where you want to work. It'd be
workable, but you'd have to redo your algorithms significantly; I
don't have a Python example of parsing a huge string, but I've done it
in other languages, and when I can depend on indexing being a cheap
operation, I'll happily do exactly that.

ChrisA

Paul Rubin

Aug 18, 2012, 11:10:35 PM
to
Chris Angelico <ros...@gmail.com> writes:
> Sure, four characters isn't a big deal to step through. But it still
> makes indexing and slicing operations O(N) instead of O(1), plus you'd
> have to zark the whole string up to where you want to work.

I know some systems chop the strings into blocks of (say) a few
hundred chars, so you can immediately get to the correct
block, then scan into the block to get to the desired char offset.
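
A rough sketch of that block-index idea over a UTF-8 buffer (the names are
invented for illustration): record the byte offset of every K-th character, so
a lookup only has to scan within one block.

K = 256

def build_block_index(data):
    """Byte offsets of characters 0, K, 2K, ... in UTF-8 bytes."""
    offsets = []
    count = 0
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:           # lead byte: start of a character
            if count % K == 0:
                offsets.append(pos)
            count += 1
    return offsets

def char_to_byte(data, index, n):
    """Byte offset of character n, scanning at most K characters."""
    pos = index[n // K]
    for _ in range(n % K):                # walk forward within the block
        pos += 1
        while data[pos] & 0xC0 == 0x80:   # skip continuation bytes
            pos += 1
    return pos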

> I don't have a Python example of parsing a huge string, but I've done
> it in other languages, and when I can depend on indexing being a cheap
> operation, I'll happily do exactly that.

I'd be interested to know what the context was, where you parsed
a big unicode string in a way that required random access to
the nth character in the string.

Terry Reedy

Aug 18, 2012, 11:12:50 PM
to pytho...@python.org
On 8/18/2012 4:09 PM, Terry Reedy wrote:

> print(timeit("c in a", "c = '…'; a = 'a'*1000+c"))
> # .6 in 3.2.3, 1.2 in 3.3.0
>
> This does not make sense to me and I will ask about it.

I did ask on the pydev list, and paraphrased responses include:
1. 'My system gives opposite ratios.'
2. 'With a default of 1000000 repetitions in a loop, the reported times
are microseconds per operation and thus not practically significant.'
3. 'There is a stringbench.py with a large number of such micro benchmarks.'

I believe there are also whole-application benchmarks that try to mimic
real-world mixtures of operations.

People making improvements must consider performance on multiple systems
and multiple benchmarks. If someone wants to work on search speed, they
cannot just optimize that one operation on one system.

--
Terry Jan Reedy


Chris Angelico

Aug 18, 2012, 11:31:21 PM
to pytho...@python.org
On Sun, Aug 19, 2012 at 1:10 PM, Paul Rubin <no.e...@nospam.invalid> wrote:
> Chris Angelico <ros...@gmail.com> writes:
>> I don't have a Python example of parsing a huge string, but I've done
>> it in other languages, and when I can depend on indexing being a cheap
>> operation, I'll happily do exactly that.
>
> I'd be interested to know what the context was, where you parsed
> a big unicode string in a way that required random access to
> the nth character in the string.

It's something I've done in C/C++ fairly often. Take one big fat
buffer, slice it and dice it as you get the information you want out
of it. I'll retain and/or calculate indices (when I'm not using
pointers, but that's a different kettle of fish). Generally, I'm
working with pure ASCII, but port those same algorithms to Python and
you'll easily be able to read in a file in some known encoding and
manipulate it as Unicode.

It's not so much 'random access to the nth character' as an efficient
way of jumping forward. For instance, if I know that the next thing is
a literal string of n characters (that I don't care about), I want to
skip over that and keep parsing. The Adobe Message Format is
particularly noteworthy in this, but it's a stupid format and I don't
recommend people spend too much time reading up on it (unless you like
that sensation of your brain trying to escape through your ear).

ChrisA

Paul Rubin

Aug 19, 2012, 1:58:14 AM
to
Chris Angelico <ros...@gmail.com> writes:
> Generally, I'm working with pure ASCII, but port those same algorithms
> to Python and you'll easily be able to read in a file in some known
> encoding and manipulate it as Unicode.

If it's pure ASCII, you can use the bytes or bytearray type.

> It's not so much 'random access to the nth character' as an efficient
> way of jumping forward. For instance, if I know that the next thing is
> a literal string of n characters (that I don't care about), I want to
> skip over that and keep parsing.

I don't understand how this is supposed to work. You're going to read a
large unicode text file (let's say it's UTF-8) into a single big string?
So the runtime library has to scan the encoded contents to find the
highest numbered codepoint (let's say it's mostly ascii but has a few
characters outside the BMP), expand it all (in this case) to UCS-4
giving 4x memory bloat and requiring decoding all the UTF-8 regardless,
and now we should worry about the efficiency of skipping n characters?

Since you have to decode the n characters regardless, I'd think this
skipping part should only be an issue if you have to do it a lot of
times.

Steven D'Aprano

Aug 19, 2012, 2:09:49 AM
to
This is a long post. If you don't feel like reading an essay, skip to the
very bottom and read my last few paragraphs, starting with "To recap".


On Sat, 18 Aug 2012 11:26:21 -0700, Paul Rubin wrote:

> Steven D'Aprano <steve+comp....@pearwood.info> writes:
>> (There is an extension to UCS-2, UTF-16, which encodes non-BMP
>> characters using two code points. This is fragile and doesn't work very
>> well, because string-handling methods can break the surrogate pairs
>> apart, leaving you with invalid unicode string. Not good.)
> ...
>> With PEP 393, each Python string will be stored in the most efficient
>> format possible:
>
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more? Switching between encodings based on the string contents seems
> silly at first glance.

Forget encodings! We're not talking about encodings. Encodings are used
for converting text as bytes for transmission over the wire or storage on
disk. PEP 393 talks about the internal representation of text within
Python, the C-level data structure.

In 3.2, that data structure depends on a compile-time switch. In a
"narrow build", text is stored using two-bytes per character, so the
string "len" (as in the name of the built-in function) will be stored as

006c 0065 006e

(or possibly 6c00 6500 6e00, depending on whether your system is
LittleEndian or BigEndian), plus object-overhead, which I shall ignore.

Since most identifiers are ASCII, that's already using twice as much
memory as needed. This standard data structure is called UCS-2, and it
only handles characters in the Basic Multilingual Plane, the BMP (roughly
the first 64000 Unicode code points). I'll come back to that.

In a "wide build", text is stored as four-bytes per character, so "len"
is stored as either:

0000006c 00000065 0000006e
6c000000 65000000 6e000000

Now memory is cheap, but it's not *that* cheap, and no matter how much
memory you have, you can always use more.

This system is called UCS-4, and it can handle the entire Unicode
character set, for now and forever. (If we ever need more than four bytes'
worth of characters, it won't be called Unicode.)

Remember I said that UCS-2 can only handle the 64K characters
[technically: code points] in the Basic Multilingual Plane? There's an
extension to UCS-2 called UTF-16 which extends it to the entire Unicode
range. Yes, that's the same name as the UTF-16 encoding, because it's
more or less the same system.

UTF-16 says "let's represent characters in the BMP by two bytes, but
characters outside the BMP by four bytes." There's a neat trick to this:
the BMP doesn't use the entire two-byte range, so there are some byte
pairs which are illegal in UCS-2 -- they don't correspond to *any*
character. UTF-16 used those byte pairs to signal "this is half a
character, you need to look at the next pair for the rest of the
character".

Nifty hey? These pairs-of-pseudocharacters are called "surrogate pairs".
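
You can see a surrogate pair being produced by asking for the UTF-16 encoding
of a non-BMP character (shown little-endian here):

py> '\U00010000'.encode('utf-16-le')
b'\x00\xd8\x00\xdc'

That is two 16-bit units, 0xD800 (the high surrogate) followed by 0xDC00 (the
low surrogate), standing for the single code point U+10000.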

Except this comes at a big cost: you can no longer tell how long a string
is by counting the number of bytes, which is fast, because sometimes four
bytes is two characters and sometimes it's one and you can't tell which
it will be until you actually inspect all four bytes.

Copying sub-strings now becomes either slow, or buggy. Say you want to
grab the 5th character in a string. The fast way using UCS-2 is to
simply grab bytes 8 and 9 (remember characters are pairs of bytes and we
start counting at zero) and you're done. Fast and safe if you're willing
to give up the non-BMP characters.

It's also fast and safe if you use UCS-4, but then everything takes twice
as much space, so you probably end up spending so much time copying null
bytes that you're slower anyway. Especially when your OS starts
paging memory like mad.

But in UTF-16, indexing can be fast or safe but not both. Maybe bytes 8
and 9 are half of a surrogate pair, and you've now split the pair and
ended up with an invalid string. That's what Python 3.2 does, it fails to
handle surrogate pairs properly:

py> s = chr(0xFFFF + 1)
py> a, b = s
py> a
'\ud800'
py> b
'\udc00'


I've just split a single valid Unicode character into two invalid
characters. Python 3.2 will (probably) mindlessly process those two non-
characters, and the only sign I have that I did something wrong is that
my data is now junk.

Since any character can be a surrogate pair, you have to scan every pair
of bytes in order to index a string, or work out its length, or copy a
substring. It's not enough to just check if the last pair is a surrogate.

When you don't, you have bugs like this from Python 3.2:

py> s = "01234" + chr(0xFFFF + 1) + "6789"
py> s[9] == '9'
False
py> s[9], len(s)
('8', 11)

Which is now fixed in Python 3.3.

So variable-width data structures like UTF-8 or UTF-16 are crap for the
internal representation of strings -- they are either fast or correct but
cannot be both.

But UCS-2 is sub-optimal, because it can only handle the BMP, and UCS-4
is too because ASCII-only strings like identifiers end up being four
times as big as they need to be. 1-byte schemes like Latin-1 are
unspeakable because they only handle 256 characters, fewer if you don't
count the C0 and C1 control codes.

PEP 393 to the rescue! What if you could encode pure-ASCII strings like
"len" using one byte per character, and BMP strings using two bytes per
character (UCS-2), and fall back to four bytes (UCS-4) only when you
really need it?

The benefits are:

* Americans and English-Canadians and Australians and other barbarians of
that ilk who only use ASCII save a heap of memory;

* people who mostly use BMP characters only pay the cost of four-
bytes per character for strings that actually *need* four-bytes per
character;

* people who use lots of non-BMP characters are no worse off.

The costs are:

* string routines need to be smarter -- they have to handle three
different data structures (ASCII, UCS-2, UCS-4) instead of just one;

* there's a certain amount of overhead when creating a string -- you have
to work out which in-memory format to use, and that's not necessarily
trivial, but at least it's a once-off cost when you create the string;

* people who misunderstand what's going on get all upset over micro-
benchmarks.


> Strings are immutable so I don't understand why
> not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in
> Latin-based alphabets and UTF-16 may be more efficient for some other
> languages. I think even UCS-4 doesn't completely fix the surrogate pair
> issue if it means the only thing I can think of.

To recap:

* Variable-byte formats like UTF-8 and UTF-16 mean that basic string
operations are not O(1) but are O(N). That means they are slow, or buggy,
pick one.

* Fixed width UCS-2 doesn't handle the full Unicode range, only the BMP.
That's better than it sounds: the BMP supports most character sets, but
not all. Still, there are people who need the supplementary planes, and
UCS-2 lets them down.

* Fixed width UCS-4 does handle the full Unicode range, without
surrogates, but at the cost of using 2-4 times more string memory for the
vast majority of users.

* PEP 393 doesn't use variable-width characters, but variable-width
strings. Instead of choosing between 1, 2 and 4 bytes per character, it
chooses *per string*. This keeps basic string operations O(1) instead of
O(N), saves memory where possible, while still supporting the full
Unicode range without a compile-time option.
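
You can see the per-string choice indirectly with sys.getsizeof on a 3.3
build. A quick loop like this makes the 1/2/4-byte progression visible
(I won't quote exact numbers, since they vary by platform and build):

py> import sys
py> for s in ('len', 'caf\u00e9', '\u20ac', '\U0001F40D'):
...     print(ascii(s), sys.getsizeof(s))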



--
Steven

Steven D'Aprano

unread,
Aug 19, 2012, 2:13:29 AM8/19/12
to
On Sat, 18 Aug 2012 11:30:19 -0700, wxjmfauth wrote:

>> > I'm aware of this (and all the blah blah blah you are explaining).
>> > This always the same song. Memory.
>>
>>
>>
>> Exactly. The reason it is always the same song is because it is an
>> important song.
>>
>>
> No offense here. But this is an *american* answer.

I am not American.

I am not aware that computers outside of the USA, and Australia, have
unlimited amounts of memory. You must be very lucky.


> The same story as the coding of text files, where "utf-8 == ascii" and
> the rest of the world doesn't count.

UTF-8 is not ASCII.



--
Steven

Steven D'Aprano

unread,
Aug 19, 2012, 2:30:54 AM8/19/12
to
On Sat, 18 Aug 2012 11:05:07 -0700, wxjmfauth wrote:

> As I understand (I think) the underlying mechanism, I can only say, it is
> not a surprise that it happens.
>
> Imagine an editor, I type an "a", internally the text is saved as ascii,
> then I type an "é", the text can only be saved in at least latin-1. Then
> I enter an "€", the text becomes an internal ucs-4 "string". Then remove
> the "€" and so on.

Firstly, that is not what Python does. For starters, € is in the BMP, and
so is nearly every character you're ever going to use unless you are
Asian or a historian using some obscure ancient script. NONE of the
examples you have shown in your emails have included 4-byte characters,
they have all been ASCII or UCS-2.

You are suffering from a misunderstanding about what is going on and
misinterpreting what you have seen.


In *both* Python 3.2 and 3.3, both é and € are represented by two bytes.
That will not change. There is a tiny amount of fixed overhead for
strings, and that overhead is slightly different between the versions,
but you'll never notice the difference.

Secondly, how a text editor or word processor chooses to store the text
that you type is not the same as how Python does it. A text editor is not
going to be creating a new immutable string after every key press. That
will be slow slow SLOW. The usual way is to keep a buffer for each
paragraph, and add and subtract characters from the buffer.


> Intuitively I expect there is some kind slow down between all these
> "strings" conversion.

Your intuition is wrong. Strings are not converted from ASCII to USC-2 to
USC-4 on the fly, they are converted once, when the string is created.

The tests we ran earlier, e.g.:

('ab…' * 1000).replace('…', 'œ…')

show the *worst possible case* for the new string handling, because all
we do is create new strings. First we create a string 'ab…', then we
create another string 'ab…'*1000, then we create two new strings '…' and
'œ…', and finally we call replace and create yet another new string.

But in real applications, once you have created a string, you don't just
immediately create a new one and throw the old one away. You likely do
work with that string:

steve@runes:~$ python3.2 -m timeit "s = 'abcœ…'*1000; n = len(s); flag =
s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.41 usec per loop

steve@runes:~$ python3.3 -m timeit "s = 'abcœ…'*1000; n = len(s); flag =
s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.29 usec per loop

Once you start doing *real work* with the strings, the overhead of
deciding whether they should be stored using 1, 2 or 4 bytes begins to
fade into the noise.


> When I tested this flexible representation, a few months ago, at the
> first alpha release. This is precisely what I tested. String
> manipulations which are forcing this internal change and I concluded the
> result is not brilliant. Really, a factor 0.n up to 10.

Like I said, if you really think that there is a significant, repeatable
slow-down on Windows, report it as a bug.


> Does anybody know a way to get the size of the internal "string" in
> bytes?

sys.getsizeof(some_string)

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size
('abcœ…'*1000))"
10030
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size
('abcœ…'*1000))"
10038


As I said, there is a *tiny* overhead difference. But identifiers will
generally be smaller:

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size
(size.__name__))"
48
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size
(size.__name__))"
34

You can check the object overhead by looking at the size of the empty
string.
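
For example (the exact figure varies between builds and platforms, so I
won't quote one here):

py> import sys
py> sys.getsizeof("")   # compare the result on 3.2 and 3.3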



--
Steven

Steven D'Aprano

unread,
Aug 19, 2012, 2:33:28 AM8/19/12
to
On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:

> The change does not just benefit ASCII users. It primarily benefits
> anybody using a wide unicode build with strings mostly containing only
> BMP characters.

Just to be clear:

If you have many strings which are *mostly* BMP, but have one or two non-
BMP characters in *each* string, you will see no benefit.

But if you have many strings which are all BMP, and only a few strings
containing non-BMP characters, then you will see a big benefit.


> Even for narrow build users, there is the benefit that
> with approximately the same amount of memory usage in most cases, they
> no longer have to worry about non-BMP characters sneaking in and
> breaking their code.

Yes! +1000 on that.


> There is some additional benefit for Latin-1 users, but this has nothing
> to do with Python. If Python is going to have the option of a 1-byte
> representation (and as long as we have the flexible representation, I
> can see no reason not to),

The PEP explicitly states that it only uses a 1-byte format for ASCII
strings, not Latin-1:

"ASCII-only Unicode strings will again use only one byte per character"

and later:

"If the maximum character is less than 128, they use the PyASCIIObject
structure"

and:

"The data and utf8 pointers point to the same memory if the string uses
only ASCII characters (using only Latin-1 is not sufficient)."


> then it is going to be Latin-1 by definition,

Certainly not, either in fact or in principle. There are a large number
of 1-byte encodings, Latin-1 is hardly the only one.


> because that's what 1-byte Unicode (UCS-1, if you will) is. If you have
> an issue with that, take it up with the designers of Unicode.

The designers of Unicode have never created a standard "1-byte Unicode"
or UCS-1, as far as I can determine.

The Unicode standard refers to some multiple million code points, far too
many to fit in a single byte. There is some historical justification for
using "Unicode" to mean UCS-2, but with the standard being extended
beyond the BMP, that is no longer valid.

See http://www.cl.cam.ac.uk/~mgk25/unicode.html for more details.


I think what you are trying to say is that the Unicode designers
deliberately matched the Latin-1 standard for Unicode's first 256 code
points. That's not the same thing though: there is no Unicode standard
mapping to a single byte format.
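
That much is easy to verify in any Python 3 interpreter:

py> all(bytes([i]).decode('latin-1') == chr(i) for i in range(256))
True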


--
Steven

Steven D'Aprano

unread,
Aug 19, 2012, 2:35:11 AM8/19/12
to
On Sat, 18 Aug 2012 19:34:50 +0100, MRAB wrote:

> "a" will be stored as 1 byte/codepoint.
>
> Adding "é", it will still be stored as 1 byte/codepoint.

Wrong. It will be 2 bytes, just like it already is in Python 3.2.

I don't know where people are getting this myth that PEP 393 uses Latin-1
internally, it does not. Read the PEP, it explicitly states that 1-byte
formats are only used for ASCII strings.


> Adding "€", it will still be stored as 2 bytes/codepoint.

That is correct.



--
Steven

Steven D'Aprano

unread,
Aug 19, 2012, 3:17:10 AM8/19/12
to
On Sat, 18 Aug 2012 19:59:32 +0100, MRAB wrote:

> The problem with strings containing surrogate pairs is that you could
> inadvertently slice the string in the middle of the surrogate pair.

That's the *least* of the problems with surrogate pairs. That would be
easy to fix: check the point of the slice, and back up or forward if
you're on a surrogate pair. But that's not good enough, because the
surrogates could be anywhere in the string. You have to touch every
single character in order to know how many there are.

The problem with surrogate pairs is that they make basic string
operations O(N) instead of O(1).



--
Steven

Peter Otten

unread,
Aug 19, 2012, 3:43:13 AM8/19/12
to pytho...@python.org
Steven D'Aprano wrote:

> On Sat, 18 Aug 2012 19:34:50 +0100, MRAB wrote:
>
>> "a" will be stored as 1 byte/codepoint.
>>
>> Adding "é", it will still be stored as 1 byte/codepoint.
>
> Wrong. It will be 2 bytes, just like it already is in Python 3.2.
>
> I don't know where people are getting this myth that PEP 393 uses Latin-1
> internally, it does not. Read the PEP, it explicitly states that 1-byte
> formats are only used for ASCII strings.

From

Python 3.3.0a4+ (default:10a8ad665749, Jun 9 2012, 08:57:51)
[GCC 4.6.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> [sys.getsizeof("é"*i) for i in range(10)]
[49, 74, 75, 76, 77, 78, 79, 80, 81, 82]
>>> [sys.getsizeof("e"*i) for i in range(10)]
[49, 50, 51, 52, 53, 54, 55, 56, 57, 58]
>>> sys.getsizeof("é"*101)-sys.getsizeof("é")
100
>>> sys.getsizeof("e"*101)-sys.getsizeof("e")
100
>>> sys.getsizeof("€"*101)-sys.getsizeof("€")
200

I infer that

(1) both ASCII and Latin1 strings require one byte per character.
(2) Latin1 strings have a constant overhead of 24 bytes (on a 64bit system)
over ASCII-only.

Steven D'Aprano

unread,
Aug 19, 2012, 4:01:46 AM8/19/12
to
On Sat, 18 Aug 2012 19:35:44 -0700, Paul Rubin wrote:

> Scanning 4 characters (or a few dozen, say) to peel off a token in
> parsing a UTF-8 string is no big deal. It gets more expensive if you
> want to index far more deeply into the string. I'm asking how often
> that is done in real code.

It happens all the time.

Let's say you've got a bunch of text, and you use a regex to scan through
it looking for a match. Let's ignore the regular expression engine, since
it has to look at every character anyway. But you've done your search and
found your matching text and now want everything *after* it. That's not
exactly an unusual use-case.

mo = re.search(pattern, text)
if mo:
    start, end = mo.span()
    result = text[end:]


Easy-peasy, right? But behind the scenes, you have a problem: how does
Python know where text[end:] starts? With fixed-size characters, that's
O(1): Python just moves forward end*width bytes into the string. Nice and
fast.

With a variable-sized characters, Python has to start from the beginning
again, and inspect each byte or pair of bytes. This turns the slice
operation into O(N) and the combined op (search + slice) into O(N**2),
and that starts getting *horrible*.

As always, "everything is fast for small enough N", but you *really*
don't want O(N**2) operations when dealing with large amounts of data.

Insisting that the regex functions only ever return offsets to valid
character boundaries doesn't help you, because the string slice method
cannot know where the indexes came from.

I suppose you could have a "fast slice" and a "slow slice" method, but
really, that sucks, and besides all that does is pass responsibility for
tracking character boundaries to the developer instead of the language,
and you know damn well that they will get it wrong and their code will
silently do the wrong thing and they'll say that Python sucks and we
never used to have this problem back in the good old days with ASCII. Boo
sucks to that.

UCS-4 is an option, since that's fixed-width. But it's also bulky. For
typical users, you end up wasting memory. That is the complaint driving
PEP 393 -- memory is cheap, but it's not so cheap that you can afford to
multiply your string memory by four just in case somebody someday gives
you a character in one of the supplementary planes.

If you have oodles of memory and small data sets, then UCS-4 is probably
all you'll ever need. I hear that the club for people who have all the
memory they'll ever need is holding their annual general meeting in a
phone-booth this year.

You could say "Screw the full Unicode standard, who needs more than 64K
different characters anyway?" Well apart from Asians, and historians, and
a bunch of other people. If you can control your data and make sure no
non-BMP characters are used, UCS-2 is fine -- except Python doesn't
actually use that.

You could do what Python 3.2 narrow builds do: use UTF-16 and leave it up
to the individual programmer to track character boundaries, and we know
how well that works. Luckily the supplementary planes are only rarely
used, and people who need them tend to buy more memory and use wide
builds. People who only need a few non-BMP characters in a narrow build
generally just cross their fingers and hope for the best.

You could add a whole lot more heavyweight infrastructure to strings,
turn them into souped-up ropes-on-steroids. All those extra indexes mean
that you don't save any memory. Because the objects are so much bigger
and more complex, your CPU cache goes to the dogs and your code still
runs slow.

Which leaves us right back where we started, PEP 393.


> Obviously one can concoct hypothetical examples that would suffer.

If you think "slicing at arbitrary indexes" is a hypothetical example, I
don't know what to say.



--
Steven

Paul Rubin

unread,
Aug 19, 2012, 4:04:25 AM8/19/12
to
Steven D'Aprano <steve+comp....@pearwood.info> writes:
> This is a long post. If you don't feel like reading an essay, skip to the
> very bottom and read my last few paragraphs, starting with "To recap".

I'm very flattered that you took the trouble to write that excellent
exposition of different Unicode encodings in response to my post. I can
only hope some readers will benefit from it. I regret that I wasn't
more clear about the perspective I posted from, i.e. that I'm already
familiar with how those encodings work.

After reading all of it, I still have the same skepticism on the main
point as before, but I think I see what the issue in contention is, and
some differences in perspective. First of all, you wrote:

> This standard data structure is called UCS-2 ... There's an extension
> to UCS-2 called UTF-16

My own understanding is UCS-2 simply shouldn't be used any more.
Unicode was historically supposed to be a 16-bit character set, but that
turned out to not be enough, so the supplementary planes were added.
UCS-2 thus became obsolete and UTF-16 superseded it in 1996. UTF-16 in
turn is rather clumsy and the later UTF-8 is better in a lot of ways,
but both of these are at least capable of encoding all the character
codes.

On to the main issue:

> * Variable-byte formats like UTF-8 and UTF-16 mean that basic string
> operations are not O(1) but are O(N). That means they are slow, or buggy,
> pick one.

This I don't see. What are the basic string operations?

* Examine the first character, or first few characters ("few" = "usually
bounded by a small constant") such as to parse a token from an input
stream. This is O(1) with either encoding.

* Slice off the first N characters. This is O(N) with either encoding
if it involves copying the chars. I guess you could share references
into the same string, but if the slice reference persists while the
big reference is released, you end up not freeing the memory until
later than you really should.

* Concatenate two strings. O(N) either way.

* Find length of string. O(1) either way since you'd store it in
the string header when you build the string in the first place.
Building the string has to have been an O(N) operation in either
representation.

And finally:

* Access the nth char in the string for some large random n, or maybe
get a small slice from some random place in a big string. This is
where fixed-width representation is O(1) while variable-width is O(N).

What I'm not convinced of, is that the last thing happens all that
often.

Meanwhile, an example of the 393 approach failing: I was involved in a
project that dealt with terabytes of OCR data of mostly English text.
So the chars were mostly ascii, but there would be occasional non-ascii
chars including supplementary plane characters, either because of
special symbols that were really in the text, or the typical OCR
confusion emitting those symbols due to printing imprecision. That's a
natural for UTF-8 but the PEP-393 approach would bloat up the memory
requirements by a factor of 4.

py> s = chr(0xFFFF + 1)
py> a, b = s

That looks like Python 3.2 is buggy and that sample should just throw an
error. s is a one-character string and should not be unpackable.

I realize the folks who designed and implemented PEP 393 are very smart
cookies and considered stuff carefully, while I'm just an internet user
posting an immediate impression of something I hadn't seen before (I
still use Python 2.6), but I still have to ask: if the 393 approach
makes sense, why don't other languages do it?

Ropes of UTF-8 segments seems like the most obvious approach and I
wonder if it was considered. By that I mean pick some implementation
constant k (say k=128) and represent the string as a UTF-8 encoded byte
array, accompanied by a vector n//k pointers into the byte array, where
n is the number of codepoints in the string. Then you can reach any
offset analogously to reading a random byte on a disk, by seeking to the
appropriate block, and then reading the block and getting the char you
want within it. Random access is then O(1) though the constant is
higher than it would be with fixed width encoding.
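
In case it is clearer as code, here is a very rough sketch of that scheme
in pure Python (illustrative only; the block size, names and data layout
are all just made up for the example):

K = 128   # code points per index block

def build(text):
    data = text.encode('utf-8')
    index, offset = [], 0
    for i, ch in enumerate(text):
        if i % K == 0:
            index.append(offset)            # byte offset of code point i
        offset += len(ch.encode('utf-8'))
    return data, index

def char_at(data, index, n):
    block = n // K
    start = index[block]
    end = index[block + 1] if block + 1 < len(index) else len(data)
    return data[start:end].decode('utf-8')[n % K]

Random access then decodes at most one block, so it is O(K) -- constant
for a fixed K -- at the price of the extra index and a per-access decode.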

Paul Rubin

unread,
Aug 19, 2012, 4:11:56 AM8/19/12
to
Steven D'Aprano <steve+comp....@pearwood.info> writes:
> result = text[end:]

if end not near the end of the original string, then this is O(N)
even with fixed-width representation, because of the char copying.

if it is near the end, by knowing where the string data area
ends, I think it should be possible to scan backwards from
the end, recognizing what bytes can be the beginning of code points and
counting off the appropriate number. This is O(1) if "near the end"
means "within a constant".

> You could say "Screw the full Unicode standard, who needs more than 64K

No if you're claiming the language supports unicode it should be
the whole standard.

> You could do what Python 3.2 narrow builds do: use UTF-16 and leave it
> up to the individual programmer to track character boundaries,

I'm surprised the Python 3 implementers even considered that approach
much less went ahead with it. It's obviously wrong.

> You could add a whole lot more heavyweight infrastructure to strings,
> turn them into suped-up ropes-on-steroids.

I'm not persuaded that PEP 393 isn't even worse.

Chris Angelico

unread,
Aug 19, 2012, 4:24:57 AM8/19/12
to pytho...@python.org
On Sun, Aug 19, 2012 at 6:11 PM, Paul Rubin <no.e...@nospam.invalid> wrote:
> Steven D'Aprano <steve+comp....@pearwood.info> writes:
>> result = text[end:]
>
> if end not near the end of the original string, then this is O(N)
> even with fixed-width representation, because of the char copying.
>
> if it is near the end, by knowing where the string data area
> ends, I think it should be possible to scan backwards from
> the end, recognizing what bytes can be the beginning of code points and
> counting off the appropriate number. This is O(1) if "near the end"
> means "within a constant".

Only if you know exactly where the end is (which requires storing and
maintaining a character length - this may already be happening, I
don't know). But that approach means you need to have code for both
ways (forward search or reverse), and of course it relies on your
encoding being reverse-scannable in this way (as UTF-8 is, but not
all).

And of course, taking the *entire* rest of the string isn't the only
thing you do. What if you want to take the next six characters after
that index? That would be constant time with a fixed-width storage
format.

ChrisA

Paul Rubin

unread,
Aug 19, 2012, 4:44:44 AM8/19/12
to
Chris Angelico <ros...@gmail.com> writes:
> And of course, taking the *entire* rest of the string isn't the only
> thing you do. What if you want to take the next six characters after
> that index? That would be constant time with a fixed-width storage
> format.

How often is this an issue in practice?

I wonder how other languages deal with this. The examples I can think
of are poor role models:

1. C/C++ - unicode impaired, other than a wchar type

2. Java - bogus UCS-2-like(?) representation for historical reasons
Also has some modified UTF-8 for reasons that made no sense and
that I don't remember

3. Haskell - basic string type is a linked list of code points.
"hello" is five list nodes. New Data.Text library (much more
efficient) uses something like ropes, I think, with UTF-16 underneath.

4. Erlang - I think like Haskell. Efficiently handles byte blocks.

5. Perl 6 -- ???

6. Ruby - ??? (but probably quite slow like the rest of Ruby)

7. Objective C -- ???

8, 9 ... (any other important ones?)

wxjm...@gmail.com

unread,
Aug 19, 2012, 4:54:44 AM8/19/12
to
About the examples contested by Steven:

eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")


And it is good enough to show the problem. Period. The
rest (you have to do this, you should not do this, why
are you using these characters - amazing and stupid
question -) does not count.

The real problem is elsewhere. *Americans* do not wish
a character occupies 4 bytes in *their* memory. The rest
of the world does not count.

The same thing happens with the utf-8 coding scheme.
Technically, it is fine. But after n years of usage,
one should recognize it just became an ascii2. Especially
for those who understand nothing in that field and are
not even aware, characters are "coded". I'm the first
to think, this is legitimate.

Memory or "ability to treat all text in the same and equal
way"?

End note. This kind of discussion is not specific to
Python, it always happen when there is some kind of
conflict between ascii and non ascii users.

Have a nice day.

jmf

Steven D'Aprano

unread,
Aug 19, 2012, 4:56:36 AM8/19/12
to
On Sun, 19 Aug 2012 09:43:13 +0200, Peter Otten wrote:

> Steven D'Aprano wrote:

>> I don't know where people are getting this myth that PEP 393 uses
>> Latin-1 internally, it does not. Read the PEP, it explicitly states
>> that 1-byte formats are only used for ASCII strings.
>
> From
>
> Python 3.3.0a4+ (default:10a8ad665749, Jun 9 2012, 08:57:51) [GCC
> 4.6.1] on linux
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import sys
>>>> [sys.getsizeof("é"*i) for i in range(10)]
> [49, 74, 75, 76, 77, 78, 79, 80, 81, 82]

Interesting. Say, I don't suppose you're using a 64-bit build? Because
that would explain why your sizes are so much larger than mine:

py> [sys.getsizeof("é"*i) for i in range(10)]
[25, 38, 39, 40, 41, 42, 43, 44, 45, 46]


py> [sys.getsizeof("€"*i) for i in range(10)]
[25, 40, 42, 44, 46, 48, 50, 52, 54, 56]

py> c = chr(0xFFFF + 1)
py> [sys.getsizeof(c*i) for i in range(10)]
[25, 44, 48, 52, 56, 60, 64, 68, 72, 76]


On re-reading the PEP more closely, it looks like I did misunderstand the
internal implementation, and strings which fit exactly in Latin-1 will
also use 1 byte per character. There are three structures used:

PyASCIIObject
PyCompactUnicodeObject
PyUnicodeObject

and the third one comes in three variant forms, for 1-byte, 2-byte and 4-
byte data. So I stand corrected.


--
Steven

wxjm...@gmail.com

unread,
Aug 19, 2012, 5:24:04 AM8/19/12
to
On Sunday, 19 August 2012 at 10:56:36 UTC+2, Steven D'Aprano wrote:
>
> internal implementation, and strings which fit exactly in Latin-1 will
>

And this is the crucial point. latin-1 is an obsolete and non usable
coding scheme (esp. for european languages).

We fall on the point I mentionned above. Microsoft know this, ditto
for Apple, ditto for "TeX", ditto for the foundries.
Even, "ISO" has recognized its error and produced iso-8859-15.

The question? Why is it still used?

jmf



Peter Otten

unread,
Aug 19, 2012, 5:37:09 AM8/19/12
to pytho...@python.org
Steven D'Aprano wrote:

> On Sun, 19 Aug 2012 09:43:13 +0200, Peter Otten wrote:
>
>> Steven D'Aprano wrote:
>
>>> I don't know where people are getting this myth that PEP 393 uses
>>> Latin-1 internally, it does not. Read the PEP, it explicitly states
>>> that 1-byte formats are only used for ASCII strings.
>>
>> From
>>
>> Python 3.3.0a4+ (default:10a8ad665749, Jun 9 2012, 08:57:51) [GCC
>> 4.6.1] on linux
>> Type "help", "copyright", "credits" or "license" for more information.
>>>>> import sys
>>>>> [sys.getsizeof("é"*i) for i in range(10)]
>> [49, 74, 75, 76, 77, 78, 79, 80, 81, 82]
>
> Interesting. Say, I don't suppose you're using a 64-bit build? Because
> that would explain why your sizes are so larger than mine:
>
> py> [sys.getsizeof("é"*i) for i in range(10)]
> [25, 38, 39, 40, 41, 42, 43, 44, 45, 46]
>
>
> py> [sys.getsizeof("€"*i) for i in range(10)]
> [25, 40, 42, 44, 46, 48, 50, 52, 54, 56]

Yes, I am using a 64-bit build. I thought that

>> (2) Latin1 strings have a constant overhead of 24 bytes (on a 64bit
>> system) over ASCII-only.

would convey that. The corresponding data structure

typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;
    char *utf8;
    Py_ssize_t wstr_length;
} PyCompactUnicodeObject;

makes for 12 extra bytes on 32 bit, and both Py_ssize_t and pointers double
in size (from 4 to 8 bytes) on 64 bit. I'm sure you can do the maths for the
embedded PyASCIIObject yourself.

lipska the kat

unread,
Aug 19, 2012, 6:13:11 AM8/19/12
to
On 19/08/12 07:09, Steven D'Aprano wrote:
> This is a long post. If you don't feel like reading an essay, skip to the
> very bottom and read my last few paragraphs, starting with "To recap".

Thank you for this excellent post,
it has certainly cleared up a few things for me

[snip]

incidentally

> But in UTF-16, ...

[snip]

> py> s = chr(0xFFFF + 1)
> py> a, b = s
> py> a
> '\ud800'
> py> b
> '\udc00'

in IDLE

Python 3.2.3 (default, May 3 2012, 15:51:42)
[GCC 4.6.3] on linux2
Type "copyright", "credits" or "license()" for more information.
==== No Subprocess ====
>>> s = chr(0xFFFF + 1)
>>> a, b = s
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    a, b = s
ValueError: need more than 1 value to unpack

At a terminal prompt

[lipska@ubuntu ~]$ python3.2
Python 3.2.3 (default, Jul 17 2012, 14:23:10)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = chr(0xFFFF + 1)
>>> a, b = s
>>> a
'\ud800'
>>> b
'\udc00'
>>>

The date stamp is different but the Python version is the same

No idea why this is happening, I just thought it was interesting

lipska

--
Lipska the Kat©: Troll hunter, sandbox destroyer
and farscape dreamer of Aeryn Sun

Chris Angelico

unread,
Aug 19, 2012, 6:19:11 AM8/19/12
to pytho...@python.org
On Sun, Aug 19, 2012 at 8:13 PM, lipska the kat
<lipska...@yahoo.co.uk> wrote:
> The date stamp is different but the Python version is the same

Check out what 'sys.maxunicode' is in each of those Pythons. It's
possible that one is a wide build and the other narrow.

ChrisA

wxjm...@gmail.com

unread,
Aug 19, 2012, 6:19:23 AM8/19/12
to comp.lan...@googlegroups.com, pytho...@python.org
On Sunday, 19 August 2012 at 11:37:09 UTC+2, Peter Otten wrote:


You know, the technical aspect is one thing. Understanding
the coding of the characters as a whole is something
else. The important point is not the coding per se, the
relevant point is the set of characters a coding may
represent.

You can build the most sophisticated mechanism you wish;
if it does not take that point into account, it will
always fail or be not optimal.

This is precisely the weak point of this flexible
representation. It uses latin-1 and latin-1 is for
most users simply unusable.

Fascinating, isn't it? Devs are developing sophisticated
tools based on a non-working basis.

jmf

Chris Angelico

unread,
Aug 19, 2012, 6:26:44 AM8/19/12
to pytho...@python.org
On Sun, Aug 19, 2012 at 8:19 PM, <wxjm...@gmail.com> wrote:
> This is precisely the weak point of this flexible
> representation. It uses latin-1 and latin-1 is for
> most users simply unusable.

No, it uses Unicode, and as an optimization, attempts to store the
codepoints in less than four bytes for most strings. The fact that a
one-byte storage format happens to look like latin-1 is rather
coincidental.

ChrisA

Mark Lawrence

unread,
Aug 19, 2012, 6:46:14 AM8/19/12
to pytho...@python.org
On 19/08/2012 09:54, wxjm...@gmail.com wrote:
> About the examples contested by Steven:
>
> eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")

Roughly translated. "I've been shot to pieces and having seen Monty
Python and the Holy Grail I know what to do. Run away, run away"

--
Cheers.

Mark Lawrence.

lipska the kat

unread,
Aug 19, 2012, 6:49:21 AM8/19/12
to
Ah ...

I built my local version from source
and no, I didn't read the makefile so I didn't configure for a wide
build :-( not that I would have known the difference at that time.

[lipska@ubuntu ~]$ python3.2
Python 3.2.3 (default, Jul 17 2012, 14:23:10)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
65535
>>>

Later, I did an apt-get install idle3 which pulled
down a precompiled IDLE from the Ubuntu repos
This was obviously compiled 'wide'

Python 3.2.3 (default, May 3 2012, 15:51:42)
[GCC 4.6.3] on linux2
Type "copyright", "credits" or "license()" for more information.
==== No Subprocess ====
>>> import sys
>>> sys.maxunicode
1114111
>>>

All very interesting and enlightening

Thanks

Steven D'Aprano

unread,
Aug 19, 2012, 6:51:26 AM8/19/12
to
On Sun, 19 Aug 2012 01:11:56 -0700, Paul Rubin wrote:

> Steven D'Aprano <steve+comp....@pearwood.info> writes:
>> result = text[end:]
>
> if end not near the end of the original string, then this is O(N) even
> with fixed-width representation, because of the char copying.

Technically, yes. But it's a straight copy of a chunk of memory, which
means it's fast: your OS and hardware tries to make straight memory
copies as fast as possible. Big-Oh analysis frequently glosses over
implementation details like that.

Of course, that assumption gets shaky when you start talking about extra
large blocks, and it falls apart completely when your OS starts paging
memory to disk.

But if it helps to avoid irrelevant technical details, change it to
text[end:end+10] or something.


> if it is near the end, by knowing where the string data area ends, I
> think it should be possible to scan backwards from the end, recognizing
> what bytes can be the beginning of code points and counting off the
> appropriate number. This is O(1) if "near the end" means "within a
> constant".

You know, I think you are misusing Big-Oh analysis here. It really
wouldn't be helpful for me to say "Bubble Sort is O(1) if you only sort
lists with a single item". Well, yes, that is absolutely true, but that's
a special case that doesn't give you any insight into why using Bubble
Sort as your general purpose sort routine is a terrible idea.

Using variable-sized strings like UTF-8 and UTF-16 for in-memory
representations is a terrible idea because you can't assume that people
will only every want to index the first or last character. On average,
you need to scan half the string, one character at a time. In Big-Oh, we
can ignore the factor of 1/2 and just say we scan the string, O(N).

That's why languages tend to use fixed character arrays for strings.
Haskell is an exception, using linked lists which require traversing the
string to jump to an index. The manual even warns:

[quote]
If you think of a Text value as an array of Char values (which it is
not), you run the risk of writing inefficient code.

An idiom that is common in some languages is to find the numeric offset
of a character or substring, then use that number to split or trim the
searched string. With a Text value, this approach would require two O(n)
operations: one to perform the search, and one to operate from wherever
the search ended.
[end quote]

http://hackage.haskell.org/packages/archive/text/0.11.2.2/doc/html/Data-Text.html



--
Steven

wxjm...@gmail.com

unread,
Aug 19, 2012, 8:14:21 AM8/19/12
to pytho...@python.org
And this is the common basic mistake. You do not push your
argumentation far enough. A character may "fall" accidentally in a latin-1.
The problem lies in these european characters, which can not fall in this
coding. This *is* the cause of the negative side effects.
If you are using a correct coding scheme, like cp1252, mac-roman or
iso-8859-15, you will never see such a negative side effect.
Again, the problem is not the result, the encoded character. The critical
part is the character which may cause this side effect.
You should think "character set" and not encoded "code point", considering
this kind of expression has a sense in 8-bits coding scheme.

jmf

Dave Angel

unread,
Aug 19, 2012, 8:29:17 AM8/19/12
to wxjm...@gmail.com, pytho...@python.org
On 08/19/2012 08:14 AM, wxjm...@gmail.com wrote:
> On Sunday, 19 August 2012 at 12:26:44 UTC+2, Chris Angelico wrote:
But that choice was made decades ago when Unicode picked its second 128
characters. The internal form used in this PEP is simply the low-order
byte of the Unicode code point. Trying to scan the string deciding if
converting to cp1252 (for example) would be a much more expensive
operation than seeing how many bytes it'd take for the largest code point.





--

DaveA

Dave Angel

unread,
Aug 19, 2012, 8:35:23 AM8/19/12
to wxjm...@gmail.com, pytho...@python.org
(pardon the resend, but I accidentally omitted a couple of words)
On 08/19/2012 08:14 AM, wxjm...@gmail.com wrote:
> On Sunday, 19 August 2012 at 12:26:44 UTC+2, Chris Angelico wrote:
>> <SNIP>
>>
>>
>> No, it uses Unicode, and as an optimization, attempts to store the
>> codepoints in less than four bytes for most strings. The fact that a
>> one-byte storage format happens to look like latin-1 is rather
>> coincidental.
>>
> And this is the common basic mistake. You do not push your
> argumentation far enough. A character may "fall" accidentally in a latin-1.
> The problem lies in these european characters, which can not fall in this
> coding. This *is* the cause of the negative side effects.
> If you are using a correct coding scheme, like cp1252, mac-roman or
> iso-8859-15, you will never see such a negative side effect.
> Again, the problem is not the result, the encoded character. The critical
> part is the character which may cause this side effect.
> You should think "character set" and not encoded "code point", considering
> this kind of expression has a sense in 8-bits coding scheme.
>
> jmf

But that choice was made decades ago when Unicode picked its second 128
characters. The internal form used in this PEP is simply the low-order
byte of the Unicode code point. Trying to scan the string deciding if
converting to cp1252 (for example) would work, would be a much more
expensive operation than seeing how many bytes it'd take for the largest
code point.

The 8 bit form is used if all the code points are less than 256. That
is a simple description, and simple code. As several people have said,
the fact that this byte matches one of the DECODED forms is coincidence.
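
Conceptually the decision is nothing more than this (a rough Python
sketch of the choice the C code makes when a string object is built, not
the actual implementation):

def bytes_per_char(s):
    m = max(map(ord, s)) if s else 0
    if m < 256:
        return 1        # the PEP's 1-byte form
    if m < 0x10000:
        return 2        # BMP only, 2 bytes per code point
    return 4            # full range, 4 bytes per code point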

--

DaveA

wxjm...@gmail.com

unread,
Aug 19, 2012, 8:59:51 AM8/19/12
to wxjm...@gmail.com, pytho...@python.org, d...@davea.name
On Sunday, 19 August 2012 at 14:29:17 UTC+2, Dave Angel wrote:
> On 08/19/2012 08:14 AM, wxjm...@gmail.com wrote:
>
> > On Sunday, 19 August 2012 at 12:26:44 UTC+2, Chris Angelico wrote:
You are absolutely right. (I'm quite comfortable with Unicode.)
If Python wishes to perpetuate this, let's call it a design mistake
or annoyance, it will continue to live with problems.

People (tools) who chose pure utf-16 or utf-32 are not suffering
from this issue.

*My* final comment on this thread.

In August 2012, after 20 years of development, Python is not
able to display a piece of text correctly on a Windows console
(eg cp65001).

I downloaded the go language, zero experience, I did not succeed
to display incorrectly a piece of text. (This is by the way *the*
reason why I tested it). Where the problems are coming from, I have
no idea.

I find this situation quite comic. Python is able to
produce this:

>>> (1.1).hex()
'0x1.199999999999ap+0'

but it is not able to display a piece of text!

Try to convince end users IEEE 754 is more important than the
ability to read/write a piece of text, which a 6-year-old kid has learned
at school :-)

(I'm not suffering from this kind of effect; as a Windows user,
I'm always working via a gui. It still remains that the problem exists.)

Regards,
jmf

Steven D'Aprano

unread,
Aug 19, 2012, 9:25:14 AM8/19/12
to
On Sun, 19 Aug 2012 01:04:25 -0700, Paul Rubin wrote:

> Steven D'Aprano <steve+comp....@pearwood.info> writes:

>> This standard data structure is called UCS-2 ... There's an extension
>> to UCS-2 called UTF-16
>
> My own understanding is UCS-2 simply shouldn't be used any more.

Pretty much. But UTF-16 with lax support for surrogates (that is,
surrogates are included but treated as two characters) is essentially
UCS-2 with the restriction against surrogates lifted. That's what Python
currently does, and Javascript.

http://mathiasbynens.be/notes/javascript-encoding

The reality is that support for the Unicode supplementary planes is
pretty poor. Even when applications support it, most fonts don't have
glyphs for the characters. Anything which makes handling of Unicode
supplementary characters better is a step forward.


>> * Variable-byte formats like UTF-8 and UTF-16 mean that basic string
>> operations are not O(1) but are O(N). That means they are slow, or
>> buggy, pick one.
>
> This I don't see. What are the basic string operations?

The ones I'm specifically referring to are indexing and copying
substrings. There may be others.


> * Examine the first character, or first few characters ("few" = "usually
> bounded by a small constant") such as to parse a token from an input
> stream. This is O(1) with either encoding.

That's actually O(K), for K = "a few", whatever "a few" means. But we
know that anything is fast for small enough N (or K in this case).


> * Slice off the first N characters. This is O(N) with either encoding
> if it involves copying the chars. I guess you could share references
> into the same string, but if the slice reference persists while the
> big reference is released, you end up not freeing the memory until
> later than you really should.

As a first approximation, memory copying is assumed to be free, or at
least constant time. That's not strictly true, but Big Oh analysis is
looking at algorithmic complexity. It's not a substitute for actual
benchmarks.


> Meanwhile, an example of the 393 approach failing: I was involved in a
> project that dealt with terabytes of OCR data of mostly English text.

I assume that this wasn't one giant multi-terabyte string.

> So
> the chars were mostly ascii, but there would be occasional non-ascii
> chars including supplementary plane characters, either because of
> special symbols that were really in the text, or the typical OCR
> confusion emitting those symbols due to printing imprecision. That's a
> natural for UTF-8 but the PEP-393 approach would bloat up the memory
> requirements by a factor of 4.

Not necessarily. Presumably you're scanning each page into a single
string. Then only the pages containing a supplementary plane char will be
bloated, which is likely to be rare. Especially since I don't expect your
OCR application would recognise many non-BMP characters -- what does
U+110F3, "SORA SOMPENG DIGIT THREE", look like? If the OCR software
doesn't recognise it, you can't get it in your output. (If you do, the
OCR software has a nasty bug.)

Anyway, in my ignorant opinion the proper fix here is to tell the OCR
software not to bother trying to recognise Imperial Aramaic, Domino
Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't
expecting them in your source material. Not only will the scanning go
faster, but you'll get fewer wrong characters.


[...]
> I realize the folks who designed and implemented PEP 393 are very smart
> cookies and considered stuff carefully, while I'm just an internet user
> posting an immediate impression of something I hadn't seen before (I
> still use Python 2.6), but I still have to ask: if the 393 approach
> makes sense, why don't other languages do it?

There has to be a first time for everything.


> Ropes of UTF-8 segments seems like the most obvious approach and I
> wonder if it was considered.

Ropes have been considered and rejected because while they are
asymptotically fast, in common cases the added complexity actually makes
them slower. Especially for immutable strings where you aren't inserting
into the middle of a string.

http://mail.python.org/pipermail/python-dev/2000-February/002321.html

PyPy has revisited ropes and uses, or at least used, ropes as their
native string data structure. But that's ropes of *bytes*, not UTF-8.

http://morepypy.blogspot.com.au/2007/11/ropes-branch-merged.html


--
Steven

Steven D'Aprano

unread,
Aug 19, 2012, 9:33:05 AM8/19/12
to
On Sun, 19 Aug 2012 03:19:23 -0700, wxjmfauth wrote:

> This is precisely the weak point of this flexible representation. It
> uses latin-1 and latin-1 is for most users simply unusable.

That's very funny.

Are you aware that your post is entirely Latin-1?


> Fascinating, isn't it? Devs are developing sophisticed tools based on a
> non working basis.

At the end of the day, PEP 393 fixes some major design limitations of the
Unicode implementation in the "narrow build" Python, while saving memory
for people using the "wide build". Everybody wins here. Your objection
appears to be based on some sort of philosophical objection to Latin-1
than on any genuine problem.


--
Steven

Mark Lawrence

unread,
Aug 19, 2012, 9:46:34 AM8/19/12
to pytho...@python.org
On 19/08/2012 13:59, wxjm...@gmail.com wrote:
> On Sunday, 19 August 2012 at 14:29:17 UTC+2, Dave Angel wrote:
>> On 08/19/2012 08:14 AM, wxjm...@gmail.com wrote:
>>
>>> On Sunday, 19 August 2012 at 12:26:44 UTC+2, Chris Angelico wrote:
Please give a precise description of the design mistake and what you
would do to correct it.

>
> People (tools) who chose pure utf-16 or utf-32 are not suffering
> from this issue.
>
> *My* final comment on this thread.
>
> In August 2012, after 20 years of development, Python is not
> able to display a piece of text correctly on a Windows console
> (eg cp65001).

Examples please.

>
> I downloaded the go language, zero experience, I did not succeed
> to display incorrectly a piece of text. (This is by the way *the*
> reason why I tested it). Where the problems are coming from, I have
> no idea.
>
> I find this situation quite comic. Python is able to
> produce this:
>
>>>> (1.1).hex()
> '0x1.199999999999ap+0'
>
> but it is not able to display a piece of text!

So you keep saying, but when asked for examples or evidence nothing gets
produced.

>
> Try to convince end users IEEE 754 is more important than the
> ability to read/wirite a piece a text, a 6-years kid has learned
> at school :-)
>
> (I'm not suffering from this kind of effect, as a Windows user,
> I'm always working via gui, it still remains, the problem exists.

Windows is a law unto itself. Its problems are hardly specific to Python.

>
> Regards,
> jmf
>

Now two or three times you've said you're going but have come back. If
you come again could you please provide examples and/or evidence of what
you're on about, because you still have me baffled.

--
Cheers.

Mark Lawrence.

wxjm...@gmail.com

unread,
Aug 19, 2012, 10:09:14 AM8/19/12
to comp.lan...@googlegroups.com, pytho...@python.org
On Sunday, 19 August 2012 at 15:46:34 UTC+2, Mark Lawrence wrote:
> On 19/08/2012 13:59, wxjm...@gmail.com wrote:
>
> > On Sunday, 19 August 2012 at 14:29:17 UTC+2, Dave Angel wrote:
> >> On 08/19/2012 08:14 AM, wxjm...@gmail.com wrote:
> >>> On Sunday, 19 August 2012 at 12:26:44 UTC+2, Chris Angelico wrote:
Yesterday, I went to bed.
More seriously.

I can not give you more numbers than those I gave.
As an end user, I noticed and found through experimentation that my random tests
are always slower in Py3.3 than in Py3.2 on my Windows platform.

It is up to you, the core developers, to give an explanation
about this behaviour.

As I understand a little bit the coding of the characters,
I pointed out, this is most probably due to this flexible
string representation (with arguments appearing randomly
in the misc. messages, mainly latin-1).

I can not do more.

(I stupidly spoke about factors 0.1 to ..., you should
read of course, 1.1, to ...)

jmf

Mark Lawrence

unread,
Aug 19, 2012, 10:48:48 AM8/19/12
to pytho...@python.org
On 19/08/2012 15:09, wxjm...@gmail.com wrote:

>
> I can not give you more numbers than those I gave.
> As a end user, I noticed and experimented my random tests
> are always slower in Py3.3 than in Py3.2 on my Windows platform.

Once again you refuse to supply anything to back up what you say.

>
> It is up to you, the core developers to give an explanation
> about this behaviour.

Core developers cannot give an explanation for something that doesn't
exist, except in your imagination. Unless you can produce the evidence
that supports your claims, including details of OS, benchmarks used and
so on and so forth.

>
> As I understand a little bit the coding of the characters,
> I pointed out, this is most probably due to this flexible
> string representation (with arguments appearing randomly
> in the misc. messages, mainly latin-1).
>
> I can not do more.
>
> (I stupidly spoke about factors 0.1 to ..., you should
> read of course, 1.1, to ...)
>
> jmf
>

I suspect that I'll be dead and buried long before you can produce
anything concrete in the way of evidence. I've thrown down the gauntlet
several times, do you now have the courage to pick it up, or are you
going to resort to the FUD approach that you've been using throughout
this thread?

--
Cheers.

Mark Lawrence.

DJC

unread,
Aug 19, 2012, 11:32:06 AM8/19/12
to
On 19/08/12 15:25, Steven D'Aprano wrote:

> Not necessarily. Presumably you're scanning each page into a single
> string. Then only the pages containing a supplementary plane char will be
> bloated, which is likely to be rare. Especially since I don't expect your
> OCR application would recognise many non-BMP characters -- what does
> U+110F3, "SORA SOMPENG DIGIT THREE", look like? If the OCR software
> doesn't recognise it, you can't get it in your output. (If you do, the
> OCR software has a nasty bug.)
>
> Anyway, in my ignorant opinion the proper fix here is to tell the OCR
> software not to bother trying to recognise Imperial Aramaic, Domino
> Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't
> expecting them in your source material. Not only will the scanning go
> faster, but you'll get fewer wrong characters.

Consider the automated recognition of a CAPTCHA. As the chars have to be
entered by the user on a keyboard, only the most basic charset can be
used, so the problem of which chars are possible is quite limited.

wxjm...@gmail.com

unread,
Aug 19, 2012, 12:19:31 PM8/19/12
to comp.lan...@googlegroups.com, pytho...@python.org
I do not remember the tests I have done at the 1st alpha release
time. It was with an interactive interpreter. I paid particular
attention to testing these chars you can find in the range 128..256
in all 8-bit coding schemes. Chars I suspected to be problematic.

Here is a short test again, a random single test, the first
idea coming in my mind.

Py 3.2.3
>>> timeit.timeit("('aœ€'*100).replace('a', 'œ€é')")
4.99396356635981

Py 3.3b2
>>> timeit.timeit("('aœ€'*100).replace('a', 'œ€é')")
7.560455708007855

Maybe not so demonstrative. It shows at least that we
are far away from the 10-30% "announced".

>>> 7.56 / 5
1.512
>>> 5 / (7.56 - 5) * 100
195.31250000000003


jmf


Terry Reedy

unread,
Aug 19, 2012, 12:31:50 PM8/19/12
to pytho...@python.org
On 8/19/2012 4:54 AM, wxjm...@gmail.com wrote:
> About the examples contested by Steven:
> eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
> And it is good enough to show the problem. Period.

Repeating a false claim over and over does not make it true. Two people
on pydev claim that 3.3 is *faster* on their systems (one unspecified,
one OSX10.8).

--
Terry Jan Reedy


Blind Anagram

unread,
Aug 19, 2012, 1:03:34 PM8/19/12
to
"Steven D'Aprano" wrote in message
news:502f8a2a$0$29978$c3e8da3$5496...@news.astraweb.com...

On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:

[...]
If you can consistently replicate a 100% to 1000% slowdown in string
handling, please report it as a performance bug:

http://bugs.python.org/

Don't forget to report your operating system.

====================================================
For interest, I ran your code snippets on my laptop (Intel core-i7 1.8GHz)
running Windows 7 x64.

Running Python from a Windows command prompt, I got the following on Python
3.2.3 and 3.3 beta 2:

python33\python" -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 39.3 usec per loop
python33\python" -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 51.8 usec per loop
python33\python" -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 52 usec per loop
python33\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 50.3 usec per loop
python33\python" -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 51.6 usec per loop
python33\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 38.3 usec per loop
python33\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
10000 loops, best of 3: 50.3 usec per loop

python32\python" -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 24.5 usec per loop
python32\python" -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 24.7 usec per loop
python32\python" -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 24.8 usec per loop
python32\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 24 usec per loop
python32\python" -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 24.1 usec per loop
python32\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 24.4 usec per loop
python32\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
10000 loops, best of 3: 24.3 usec per loop

This is an average slowdown by a factor of close to two on 3.3 when compared
with 3.2.

I am not posting this to perpetuate this thread but simply to ask whether,
as you suggest, I should report this as a possible problem with the beta?

Terry Reedy

unread,
Aug 19, 2012, 1:34:09 PM8/19/12
to pytho...@python.org
On 8/19/2012 4:04 AM, Paul Rubin wrote:


> Meanwhile, an example of the 393 approach failing:

I am completely baffled by this, as this example is one where the 393
approach potentially wins.

> I was involved in a
> project that dealt with terabytes of OCR data of mostly English text.
> So the chars were mostly ascii,

3.3 stores ascii pages 1 byte/char rather than 2 or 4.

> but there would be occasional non-ascii
> chars including supplementary plane characters, either because of
> special symbols that were really in the text, or the typical OCR
> confusion emitting those symbols due to printing imprecision.

I doubt that there are really any non-bmp chars. As Steven said, reject
such false identifications.

> That's a natural for UTF-8

3.3 would convert to utf-8 for storage on disk.

> but the PEP-393 approach would bloat up the memory
> requirements by a factor of 4.

3.2- wide builds would *always* use 4 bytes/char. Is not occasionally
better than always?

> py> s = chr(0xFFFF + 1)
> py> a, b = s
>
> That looks like Python 3.2 is buggy and that sample should just throw an
> error. s is a one-character string and should not be unpackable.

That looks like a 3.2- narrow build. Such builds treat unicode strings as
sequences of code units rather than sequences of codepoints. Not an
implementation bug, but a compromise design that goes back about a decade
to when unicode was added to Python. At that time, there were only a few
defined non-BMP chars and their usage was extremely rare. There are now
more extended chars than BMP chars and usage will become more common
even in English text.

Pre 3.3, there are really 2 sub-versions of every Python version: a
narrow build and a wide build version, with not very well documented
different behaviors for any string with extended chars. That is and
would have become an increasing problem as extended chars are
increasingly used. If you want to say that what was once a practical
compromise has become a design bug, I would not argue. In any case, 3.3
fixes that split and returns Python to being one cross-platform language.

> I realize the folks who designed and implemented PEP 393 are very smart
> cookies and considered stuff carefully, while I'm just an internet user
> posting an immediate impression of something I hadn't seen before (I
> still use Python 2.6), but I still have to ask: if the 393 approach
> makes sense, why don't other languages do it?

Python has often copied or borrowed, with adjustments. This time it is
the first. We will see how it goes, but it has been tested for nearly a
year already.

> Ropes of UTF-8 segments seems like the most obvious approach and I
> wonder if it was considered. By that I mean pick some implementation
> constant k (say k=128) and represent the string as a UTF-8 encoded byte
> array, accompanied by a vector n//k pointers into the byte array, where
> n is the number of codepoints in the string. Then you can reach any
> offset analogously to reading a random byte on a disk, by seeking to the
> appropriate block, and then reading the block and getting the char you
> want within it. Random access is then O(1) though the constant is
> higher than it would be with fixed width encoding.

I would call it O(k), where k is a selectable constant. Slowing access
by a factor of 100 is hardly acceptable to me. For strings less than k,
access is O(len). I believe slicing would require re-indexing.
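
For concreteness, a rough sketch of such a rope (purely illustrative; the
class name is invented here, and k=128 is the constant suggested above),
which shows where the O(k) decoding cost of each access comes from:

class Utf8Rope:
    """Sketch only: UTF-8 bytes plus a byte offset every k codepoints."""

    def __init__(self, text, k=128):
        self.k = k
        self.length = len(text)
        self.data = text.encode('utf-8')
        self.offsets = []          # byte offset of codepoints 0, k, 2k, ...
        pos = 0
        for i, ch in enumerate(text):
            if i % k == 0:
                self.offsets.append(pos)
            pos += len(ch.encode('utf-8'))

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        block, within = divmod(i, self.k)
        start = self.offsets[block]
        end = (self.offsets[block + 1] if block + 1 < len(self.offsets)
               else len(self.data))
        # decode at most k codepoints -- this is the O(k) part of each access
        return self.data[start:end].decode('utf-8')[within]

>>> r = Utf8Rope('ab…' * 1000)
>>> r[2]
'…'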

As 393 was near adoption, I proposed a scheme using utf-16 (narrow
builds) with a supplementary index of extended chars when there are any.
That makes access O(1) if there are none and O(log(k)), where k is the
number of extended chars in the string, if there are some.
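
And a toy version of that supplementary-index idea (the function names are
invented for illustration only): keep a sorted list of the codepoint
positions of the extended chars, then map a codepoint index to a UTF-16
code-unit index with a bisect, i.e. in O(log k):

import bisect

def extended_positions(text):
    # codepoint indices of the non-BMP ("extended") characters
    return [i for i, ch in enumerate(text) if ord(ch) > 0xFFFF]

def codepoint_to_utf16_unit(ext, i):
    # every extended char before codepoint i adds one extra code unit
    return i + bisect.bisect_left(ext, i)

>>> s = 'ab' + chr(0x10000) + 'cd'
>>> ext = extended_positions(s)       # [2]
>>> codepoint_to_utf16_unit(ext, 3)   # 'c' sits at code unit 4 in UTF-16
4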

--
Terry Jan Reedy

wxjm...@gmail.com

unread,
Aug 19, 2012, 1:33:25 PM8/19/12
to
I use Win7 Pro 32-bit on an Intel machine.

Thanks for reporting these numbers.
To be clear: I'm not complaining, but the fact that
there is a slow down is a clear indication (in my mind),
there is a point somewhere.

jmf

Paul Rubin

unread,
Aug 19, 2012, 1:48:06 PM8/19/12
to
Terry Reedy <tjr...@udel.edu> writes:
>> Meanwhile, an example of the 393 approach failing:
> I am completely baffled by this, as this example is one where the 393
> approach potentially wins.

What? The 393 approach is supposed to avoid memory bloat and that
does the opposite.

>> I was involved in a project that dealt with terabytes of OCR data of
>> mostly English text. So the chars were mostly ascii,
> 3.3 stores ascii pages 1 byte/char rather than 2 or 4.

But they are not ascii pages, they are (as stated) MOSTLY ascii.
E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
a much more memory-expensive encoding than UTF-8.
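
One way to see the effect being described, as a sketch (the exact getsizeof
figure depends on the build, so it is only indicative):

>>> import sys
>>> s = 'a' * 99 + chr(0x10000)   # 99% ascii plus one supplementary char
>>> len(s.encode('utf-8'))        # utf-8 stays close to 1 byte/char
103
>>> sys.getsizeof(s)              # PEP 393 widens the whole string to
...                               # 4 bytes/char: well over 400 bytes here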

> I doubt that there are really any non-bmp chars.

You may be right about this. I thought about it some more after
posting and I'm not certain that there were supplemental characters.

> As Steven said, reject such false identifications.

Reject them how?

>> That's a natural for UTF-8
> 3.3 would convert to utf-8 for storage on disk.

They are already in utf-8 on disk though that doesn't matter since
they are also compressed.

>> but the PEP-393 approach would bloat up the memory
>> requirements by a factor of 4.
> 3.2- wide builds would *always* use 4 bytes/char. Is not occasionally
> better than always?

The bloat is in comparison with utf-8, in that example.

> That looks like a 3.2- narrow build. Such which treat unicode strings
> as sequences of code units rather than sequences of codepoints. Not an
> implementation bug, but compromise design that goes back about a
> decade to when unicode was added to Python.

I thought the whole point of Python 3's disruptive incompatibility with
Python 2 was to clean up past mistakes and compromises, of which unicode
headaches were near the top of the list. So I'm surprised they seem to
have repeated a mistake there.

> I would call it O(k), where k is a selectable constant. Slowing access
> by a factor of 100 is hardly acceptable to me.

If k is constant then O(k) is the same as O(1). That is how O notation
works. I wouldn't believe the 100x figure without seeing it measured in
real-world applications.

Terry Reedy

unread,
Aug 19, 2012, 1:48:35 PM8/19/12
to pytho...@python.org
On 8/19/2012 10:09 AM, wxjm...@gmail.com wrote:

> I cannot give you more numbers than those I gave.
> As an end user, I noticed in my experiments that my random tests
> are always slower in Py3.3 than in Py3.2 on my Windows platform.

And I gave other examples where 3.3 is *faster* on my Windows, which you
have thus far not even acknowledged, let alone tried.

> It is up to you, the core developers to give an explanation
> about this behaviour.

System variation, unimportance of sub-microsecond variations, and
attention to more important issues.

Other developers say 3.3 is generally faster on their
systems (OSX 10.8, and unspecified). To talk about speed sensibly, one
must run the full stringbench.py benchmark and real applications on
multiple Windows, *nix, and Mac systems. Python is not optimized for
your particular current computer.

--
Terry Jan Reedy

Ian Kelly

unread,
Aug 19, 2012, 1:50:12 PM8/19/12
to Steven D'Aprano, pytho...@python.org
On Sun, Aug 19, 2012 at 12:33 AM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:
>> There is some additional benefit for Latin-1 users, but this has nothing
>> to do with Python. If Python is going to have the option of a 1-byte
>> representation (and as long as we have the flexible representation, I
>> can see no reason not to),
>
> The PEP explicitly states that it only uses a 1-byte format for ASCII
> strings, not Latin-1:

I think you misunderstand the PEP then, because that is empirically false.

Python 3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:23:35) [MSC
v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getsizeof(bytes(range(256)).decode('latin1'))
329

The constructed string contains all 256 Latin-1 characters, so if
Latin-1 strings must be stored in the 2-byte format, then the size
should be at least 512 bytes. It is not, so I think it must be using
the 1-byte encoding.


> "ASCII-only Unicode strings will again use only one byte per character"

This says nothing one way or the other about non-ASCII Latin-1 strings.

> "If the maximum character is less than 128, they use the PyASCIIObject
> structure"

Note that this only describes the structure of "compact" string
objects, which I have to admit I do not fully understand from the PEP.
The wording suggests that it only uses the PyASCIIObject structure,
not the derived structures. It then says that for compact ASCII
strings "the UTF-8 data, the UTF-8 length and the wstr length are the
same as the length of the ASCII data." But these fields are part of
the PyCompactUnicodeObject structure, not the base PyASCIIObject
structure, so they would not exist if only PyASCIIObject were used.
It would also imply that compact non-ASCII strings are stored
internally as UTF-8, which would be surprising.

> and:
>
> "The data and utf8 pointers point to the same memory if the string uses
> only ASCII characters (using only Latin-1 is not sufficient)."

This says that if the data are ASCII, then the 1-byte representation
and the utf8 pointer will share the same memory. It does not imply
that the 1-byte representation is not used for Latin-1, only that it
cannot also share memory with the utf8 pointer.
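
A quick, rough way to check that empirically on a 3.3 build (byte counts
vary slightly between builds, so treat the comments as indicative only):

>>> import sys
>>> sys.getsizeof('a' * 1000)   # ASCII: about 1000 bytes plus a small header
>>> sys.getsizeof('é' * 1000)   # Latin-1: still about 1 byte/char, slightly larger header
>>> sys.getsizeof('€' * 1000)   # U+20AC needs the 2-byte form: about 2000 bytes plus header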

wxjm...@gmail.com

unread,
Aug 19, 2012, 1:51:08 PM8/19/12
to pytho...@python.org
Just for the story.

Five minutes after I closed my interactive interpreter windows,
the day I tested this stuff, I thought:
"Too bad I did not note the extremely bad cases I found; I'm pretty
sure this problem will arrive on the table".

jmf

Terry Reedy

unread,
Aug 19, 2012, 1:56:24 PM8/19/12
to pytho...@python.org
On 8/19/2012 8:59 AM, wxjm...@gmail.com wrote:

> In August 2012, after 20 years of development, Python is not able to
> display a piece of text correctly on a Windows console (eg cp65001).

cp65001 is known to not work right. It has been very frustrating. Bug
Microsoft about it, and indeed their whole policy of still dividing the
world into code page regions, even in their next version, instead of
moving toward unicode and utf-8, at least as an option.

> I downloaded the Go language, with zero experience, and I did not manage to
> display a piece of text incorrectly. (This is by the way *the* reason
> why I tested it). Where the problems are coming from, I have no
> idea.

If go can display all unicode chars on a Windows console, perhaps you
can do some research and find out how they do so. Then we could consider
copying it.

--
Terry Jan Reedy

Blind Anagram

unread,
Aug 19, 2012, 2:04:49 PM8/19/12
to
wrote in message
news:5dfd1779-9442-4858...@googlegroups.com...

On Sunday, 19 August 2012 19:03:34 UTC+2, Blind Anagram wrote:

> "Steven D'Aprano" wrote in message
> news:502f8a2a$0$29978$c3e8da3$5496...@news.astraweb.com...
>
> On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:
>
> [...]
> If you can consistently replicate a 100% to 1000% slowdown in string
> handling, please report it as a performance bug:
>
> http://bugs.python.org/
>
> Don't forget to report your operating system.
>
> ====================================================
> For interest, I ran your code snippets on my laptop (Intel core-i7 1.8GHz)
> running Windows 7 x64.
>
> Running Python from a Windows command prompt, I got the following on Python
> 3.2.3 and 3.3 beta 2:
>
> python33\python" -m timeit "('abc' * 1000).replace('c', 'de')"
> 10000 loops, best of 3: 39.3 usec per loop
> python33\python" -m timeit "('ab…' * 1000).replace('…', '……')"
> 10000 loops, best of 3: 51.8 usec per loop
> python33\python" -m timeit "('ab…' * 1000).replace('…', 'x…')"
> 10000 loops, best of 3: 52 usec per loop
> python33\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')"
> 10000 loops, best of 3: 50.3 usec per loop
> python33\python" -m timeit "('ab…' * 1000).replace('…', '€…')"
> 10000 loops, best of 3: 51.6 usec per loop
> python33\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')"
> 10000 loops, best of 3: 38.3 usec per loop
> python33\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
> 10000 loops, best of 3: 50.3 usec per loop
>
> python32\python" -m timeit "('abc' * 1000).replace('c', 'de')"
> 10000 loops, best of 3: 24.5 usec per loop
> python32\python" -m timeit "('ab…' * 1000).replace('…', '……')"
> 10000 loops, best of 3: 24.7 usec per loop
> python32\python" -m timeit "('ab…' * 1000).replace('…', 'x…')"
> 10000 loops, best of 3: 24.8 usec per loop
> python32\python" -m timeit "('ab…' * 1000).replace('…', 'œ…')"
> 10000 loops, best of 3: 24 usec per loop
> python32\python" -m timeit "('ab…' * 1000).replace('…', '€…')"
> 10000 loops, best of 3: 24.1 usec per loop
> python32\python" -m timeit "('XYZ' * 1000).replace('X', 'éç')"
> 10000 loops, best of 3: 24.4 usec per loop
> python32\python" -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
> 10000 loops, best of 3: 24.3 usec per loop
>
> This is an average slowdown by a factor of close to 2.3 on 3.3 when compared
> with 3.2.
>
> I am not posting this to perpetuate this thread but simply to ask whether,
> as you suggest, I should report this as a possible problem with the beta?

I use Win7 Pro 32-bit on an Intel machine.

Thanks for reporting these numbers.
To be clear: I'm not complaining, but the fact that
there is a slow down is a clear indication (in my mind),
there is a point somewhere.

====================================
I may be reading your input wrongly, but it seems to me that you are not
only reporting a slowdown but you are also suggesting that this slowdown is
the result of bad design decisions by the Python development team.

I don't want to get involved in the latter part of your argument because I
am convinced that the Python team are doing their very best to find a good
compromise between the various design constraints that they face in meeting
these needs.

Nevertheless, the post that I responded to contained the suggestion that
slowdowns above 100% (which I took as a factor of 2) would be worth
reporting as a possible bug. So I thought that it was worth asking about
this as I may have misunderstood the level of slowdown that is worth
reporting. There is also a potential problem in timings on laptops with
turbo-boost (as I have), although the times look fairly consistent.

Mark Lawrence

unread,
Aug 19, 2012, 2:09:38 PM8/19/12
to pytho...@python.org
How convenient.

--
Cheers.

Mark Lawrence.

Dave Angel

unread,
Aug 19, 2012, 2:05:48 PM8/19/12
to Blind Anagram, pytho...@python.org
Using your measurement numbers, I get an average of 1.95, not 2.3.



--

DaveA

wxjm...@gmail.com

unread,
Aug 19, 2012, 2:11:21 PM8/19/12
to
On Sunday, 19 August 2012 19:48:06 UTC+2, Paul Rubin wrote:
>
>
> But they are not ascii pages, they are (as stated) MOSTLY ascii.
>
> E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
>
> a much more memory-expensive encoding than UTF-8.
>
>

Imagine a US banking application, everything in ascii,
except ... the € currency symbol, code point 0x20ac.

Well, it seems some software producers know what they
are doing.

>>> '€'.encode('cp1252')
b'\x80'
>>> '€'.encode('mac-roman')
b'\xdb'
>>> '€'.encode('iso-8859-1')
Traceback (most recent call last):
File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
in position 0: ordinal not in range(256)

jmf