
Performance of int/long in Python 3


Chris Angelico

Mar 25, 2013, 5:51:07 PM
to pytho...@python.org
The Python 3 merge of int and long has effectively penalized
small-number arithmetic by removing an optimization. As we've seen
from PEP 393 strings (jmf aside), there can be huge benefits from
having a single type with multiple representations internally. Is
there value in making the int type have a machine-word optimization in
the same way?

The cost is clear. Compare these methods for calculating the sum of
all numbers up to 65535, which stays under 2^31:

def range_sum(n):
    return sum(range(n+1))

def forloop(n):
    tot=0
    for i in range(n+1):
        tot+=i
    return tot

def forloop_offset(n):
    tot=1000000000000000
    for i in range(n+1):
        tot+=i
    return tot-1000000000000000

import timeit
import sys
print(sys.version)
print("inline: %d"%sum(range(65536)))
print(timeit.timeit("sum(range(65536))",number=1000))
for func in ['range_sum','forloop','forloop_offset']:
    print("%s: %r"%(func,(globals()[func](65535))))
    print(timeit.timeit(func+"(65535)","from __main__ import "+func,number=1000))


Windows XP:
C:\>python26\python inttime.py
2.6.5 (r265:79096, Mar 19 2010, 21:48:26) [MSC v.1500 32 bit (Intel)]
inline: 2147450880
2.36770455463
range_sum: 2147450880
2.61778550067
forloop: 2147450880
7.91409131608
forloop_offset: 2147450880L
23.3116954809

C:\>python33\python inttime.py
3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit (Intel)]
inline: 2147450880
5.25038713020789
range_sum: 2147450880
5.412975112758745
forloop: 2147450880
17.875799577879313
forloop_offset: 2147450880
19.31672544974291

Debian Wheezy:
rosuav@sikorsky:~$ python inttime.py
2.7.3 (default, Jan 2 2013, 13:56:14)
[GCC 4.7.2]
inline: 2147450880
1.92763710022
range_sum: 2147450880
1.93409109116
forloop: 2147450880
5.14633893967
forloop_offset: 2147450880
5.13459300995
rosuav@sikorsky:~$ python3 inttime.py
3.2.3 (default, Feb 20 2013, 14:44:27)
[GCC 4.7.2]
inline: 2147450880
2.884124994277954
range_sum: 2147450880
2.6586129665374756
forloop: 2147450880
7.660192012786865
forloop_offset: 2147450880
8.11817193031311


On 2.6/2.7, there's a massive penalty for switching to longs; on
3.2/3.3, the two for-loop versions are nearly identical in time.

(Side point: I'm often seeing that 3.2 on Linux is marginally faster
calling my range_sum function than doing the same thing inline. I do
not understand this. If anyone can explain what's going on there, I'm
all ears!)

Python 3's int is faster than Python 2's long, but slower than Python
2's int. So the question really is, would a two-form representation be
beneficial, and if so, is it worth the coding trouble?

ChrisA

Ethan Furman

Mar 25, 2013, 7:16:05 PM
to pytho...@python.org
On 03/25/2013 02:51 PM, Chris Angelico wrote:
> Python 3's int is faster than Python 2's long, but slower than Python
> 2's int. So the question really is, would a two-form representation be
> beneficial, and if so, is it worth the coding trouble?

I'm inclined to say it's not worth the trouble. If you're working with numbers, and speed is an issue, you really
should be using one of the numeric or scientific packages out there.
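For instance, here's a minimal sketch of the benchmark's sum done with
numpy (assuming numpy is installed; the dtype keeps the whole
computation in fixed-width machine words instead of boxed Python ints):

import numpy as np

# Sum 0..65535 using machine integers throughout.
print(np.arange(65536, dtype=np.int64).sum())   # 2147450880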

--
~Ethan~

Cousin Stanley

Mar 25, 2013, 7:35:41 PM

Chris Angelico wrote:

> The Python 3 merge of int and long has effectively penalized
> small-number arithmetic by removing an optimization.
> ....
> The cost is clear.
> ....

The cost isn't quite as clear
under Debian Wheezy here ....

Stanley C. Kitching
Debian Wheezy

python inline range_sum forloop forloop_offset

2.7.3 3.1359 3.0725 9.0778 15.6475

3.2.3 2.8226 2.8074 13.47624 13.6430


# ---------------------------------------------------------

Chris Angelico
Debian Wheezy

python inline range_sum forloop forloop_offset

2.7.3 1.9276 1.9341 5.1463 5.1346

3.2.3 2.8841 2.6586 7.6602 8.1182


--
Stanley C. Kitching
Human Being
Phoenix, Arizona

Dan Stromberg

Mar 25, 2013, 8:12:10 PM
to Cousin Stanley, pytho...@python.org
On Mon, Mar 25, 2013 at 4:35 PM, Cousin Stanley <cousin...@gmail.com> wrote:

Chris Angelico wrote:

> The Python 3 merge of int and long has effectively penalized
> small-number arithmetic by removing an optimization.
> ....
> The cost is clear.
> ....
 
I thought I heard that Python 3.x will use machine words for small integers, and automatically coerce internally to a 2.x long as needed.

Either way, it's better to have a small performance cost to avoid problems when computers move from 32 to 64 bit words, or 64 bit to 128 bit words.  With 3.x ints, you don't have to worry about a new crop of CPUs breaking your code.

Steven D'Aprano

Mar 25, 2013, 8:17:04 PM
Or PyPy, which will probably optimize it just fine.

Also, speaking as somebody who remembers a time when ints were not
automatically promoted to longs (introduced in Python 2.2, I think?) let
me say that having a single unified int type is *fantastic*, and managing
ints/longs by hand is a right royal PITA.

What I would like to see though is a module where I can import fixed-
width signed and unsigned integers that behave like in C, complete with
overflow, for writing code that matches the same behaviour as other
languages.


--
Steven

Chris Angelico

Mar 25, 2013, 8:28:25 PM
to pytho...@python.org
On Tue, Mar 26, 2013 at 11:17 AM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> Also, speaking as somebody who remembers a time when ints were not
> automatically promoted to longs (introduced in Python 2.2, I think?) let
> me say that having a single unified int type is *fantastic*, and managing
> ints/longs by hand is a right royal PITA.

Oh, I absolutely agree! I'm just looking at performance here, but
definitely the int/long unification is a Good Thing.

ChrisA

Oscar Benjamin

Mar 25, 2013, 8:49:42 PM
to Steven D'Aprano, Python List
On 26 March 2013 00:17, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> On Mon, 25 Mar 2013 16:16:05 -0700, Ethan Furman wrote:
>
[snip]
>> If you're working with
>> numbers, and speed is an issue, you really should be using one of the
>> numeric or scientific packages out there.
>
[snip]
> What I would like to see though is a module where I can import fixed-
> width signed and unsigned integers that behave like in C, complete with
> overflow, for writing code that matches the same behaviour as other
> languages.

Numpy can do this:

>>> import numpy
>>> a = numpy.array([0], numpy.uint32)
>>> a
array([0], dtype=uint32)
>>> a[0] = -1
>>> a
array([4294967295], dtype=uint32)

Unfortunately it doesn't work with numpy "scalars", so to use this
without the index syntax you'd need a wrapper class. Also it uses
Python style floor rounding rather than truncation as in C (actually I
seem to remember discovering that in C this is implementation
defined).

Presumably ctypes has something like this as well.
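For what it's worth, a minimal sketch of the ctypes route (every
operation has to be wrapped by hand, so it's more a proof of concept
than a usable fixed-width type):

import ctypes

# Out-of-range values wrap around like C unsigned arithmetic:
print(ctypes.c_uint32(-1).value)        # 4294967295

# A thin helper that gives wrapping 32-bit addition:
def add_u32(a, b):
    return ctypes.c_uint32(a + b).value

print(add_u32(0xFFFFFFFF, 1))           # 0 -- wraps at 2**32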


Oscar

Roy Smith

Mar 25, 2013, 8:55:03 PM
In article <5150e900$0$29998$c3e8da3$5496...@news.astraweb.com>,
Steven D'Aprano <steve+comp....@pearwood.info> wrote:

> Also, speaking as somebody who remembers a time when ints were not
> automatically promoted to longs (introduced in Python 2.2, I think?) let
> me say that having a single unified int type is *fantastic*,

And incredibly useful when solving Project Euler problems :-)

[I remember when strings didn't have methods]

Steven D'Aprano

Mar 26, 2013, 1:01:41 AM
No string methods? You were lucky. When I were a lad, you couldn't even
use "" delimiters for strings.

>>> "b string"
Parsing error: file <stdin>, line 1:
"b string"
^
Unhandled exception: run-time error: syntax error


Python 0.9.1.


--
Steven

Chris Angelico

Mar 26, 2013, 2:12:06 AM
to pytho...@python.org
On Tue, Mar 26, 2013 at 4:01 PM, Steven D'Aprano wrote:
> No string methods? You were lucky. When I were a lad, you couldn't even
> use "" delimiters for strings.
>
>>>> "b string"
> Parsing error: file <stdin>, line 1:
> "b string"
> ^
> Unhandled exception: run-time error: syntax error
>
>
> Python 0.9.1.

Well of course that's an error. Anyone can see it should have been:
>>> "a string"

*duck*

ChrisA

Chris Angelico

Mar 26, 2013, 2:26:31 AM
to pytho...@python.org
On Tue, Mar 26, 2013 at 10:35 AM, Cousin Stanley
<cousin...@gmail.com> wrote:
>
> Chris Angelico wrote:
>
>> The Python 3 merge of int and long has effectively penalized
>> small-number arithmetic by removing an optimization.
>> ....
>> The cost is clear.
>> ....
>
> The cost isn't quite as clear
> under Debian Wheezy here ....
>
> Stanley C. Kitching
> Debian Wheezy
>
> python inline range_sum forloop forloop_offset
>
> 2.7.3 3.1359 3.0725 9.0778 15.6475
>
> 3.2.3 2.8226 2.8074 13.47624 13.6430

Interesting, so your 3.x sum() is optimizing something somewhere.
Strange. Are we both running the same Python? I got those from
apt-get, aiming for consistency (rather than building a 3.3 from
source).

The cost is still visible in the for-loop versions, though, and you're
still seeing the <2^31 and >2^31 for-loops behave the same way in 3.x
but perform quite differently in 2.x. So it's looking like things are
mostly the same.

ChrisA

Roy Smith

Mar 26, 2013, 9:18:25 AM
In article <51512bb5$0$29973$c3e8da3$5496...@news.astraweb.com>,
Steven D'Aprano <steve+comp....@pearwood.info> wrote:

> On Mon, 25 Mar 2013 20:55:03 -0400, Roy Smith wrote:
>
> > In article <5150e900$0$29998$c3e8da3$5496...@news.astraweb.com>,
> > Steven D'Aprano <steve+comp....@pearwood.info> wrote:
> >
> >> Also, speaking as somebody who remembers a time when ints were not
> >> automatically promoted to longs (introduced in Python 2.2, I think?)
> >> let me say that having a single unified int type is *fantastic*,
> >
> > And incredibly useful when solving Project Euler problems :-)
> >
> > [I remember when strings didn't have methods]
>
> No string methods? You were lucky. When I were a lad, you couldn't even
> use "" delimiters for strings.
>
> >>> "b string"
> Parsing error: file <stdin>, line 1:
> "b string"
> ^
> Unhandled exception: run-time error: syntax error
>
>
> Python 0.9.1.

OK, you've got me beat. For Python. Am I going to have to go dig out
my old IBM-1130 assembler decks?

Cousin Stanley

Mar 26, 2013, 9:38:39 AM
Chris Angelico wrote:

> Interesting, so your 3.x sum() is optimizing something somewhere.
> Strange. Are we both running the same Python ?
>
> I got those from apt-get
> ....

I also installed python here under Debian Wheezy
via apt-get and our versions look to be the same ....

-sk-

2.7.3 (default, Jan 2 2013, 16:53:07) [GCC 4.7.2]

3.2.3 (default, Feb 20 2013, 17:02:41) [GCC 4.7.2]

CPU : Intel(R) Celeron(R) D CPU 3.33GHz


-ca-

2.7.3 (default, Jan 2 2013, 13:56:14) [GCC 4.7.2]

3.2.3 (default, Feb 20 2013, 14:44:27) [GCC 4.7.2]

CPU : ???


Could differences in underlying CPU architecture
lead to our differing python integer results ?

Chris Angelico

Mar 26, 2013, 10:08:17 AM
to pytho...@python.org
Doubtful. I have Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz quad-core
with hyperthreading, but I'm only using one core for this job. I've
run the tests several times and each time, Py2 is a shade under two
seconds for inline/range_sum, and Py3 is about 2.5 seconds for each.
Fascinating.

Just for curiosity's sake, I spun up the tests on my reiplophobic
server, still running Ubuntu Karmic. Pentium(R) Dual-Core CPU
E6500 @ 2.93GHz.

gideon@gideon:~$ python inttime.py
2.6.4 (r264:75706, Dec 7 2009, 18:45:15)
[GCC 4.4.1]
inline: 2147450880
2.7050409317
range_sum: 2147450880
2.64918494225
forloop: 2147450880
6.58765792847
forloop_offset: 2147450880L
16.5167789459
gideon@gideon:~$ python3 inttime.py
3.1.1+ (r311:74480, Nov 2 2009, 14:49:22)
[GCC 4.4.1]
inline: 2147450880
4.44533085823
range_sum: 2147450880
4.37314105034
forloop: 2147450880
12.4834370613
forloop_offset: 2147450880
13.5000522137

Once again, Py3 is slower on small integers than Py2. So where's the
difference with your system? This is really weird! I assume you can
repeat the tests and get the same result every time?

ChrisA

Cousin Stanley

Mar 26, 2013, 12:41:13 PM

Chris Angelico wrote:

> Once again, Py3 is slower on small integers than Py2.

Chris Angelico
Ubuntu Karmic.
Pentium(R) Dual-Core CPU E6500 @ 2.93GHz.

python inline range_sum forloop forloop_offset

2.6.4 2.7050 2.6492 6.5877 16.5168

3.1.1 4.4453 4.3731 12.4834 13.5001

You do seem to have a slight py3 improvement
under ubuntu for the forloop_offset case ....


> So where's the difference with your system ?

CPU ????


> This is really weird !

Yep ...


> I assume you can repeat the tests
> and get the same result every time ?

Yes ....

First lines of numbers below are from yesterday
while second lines are from today ....

Stanley C. Kitching
Debian Wheezy
Intel(R) Celeron(R) D CPU 3.33GHz Single Core

python inline range_sum forloop forloop_offset

2.7.3 3.1359 3.0725 9.0778 15.6475
2.7.3 3.0382 3.1452 9.8799 16.8579

3.2.3 2.8226 2.8074 13.47624 13.6430
3.2.3 2.8331 2.8228 13.54151 13.8716

Chris Angelico

Mar 26, 2013, 12:54:47 PM
to pytho...@python.org
On Wed, Mar 27, 2013 at 3:41 AM, Cousin Stanley <cousin...@gmail.com> wrote:
>
> Chris Angelico wrote:
>
>> Once again, Py3 is slower on small integers than Py2.
>
> Chris Angelico
> Ubuntu Karmic.
> Pentium(R) Dual-Core CPU E6500 @ 2.93GHz.
>
> python inline range_sum forloop forloop_offset
>
> 2.6.4 2.7050 2.6492 6.5877 16.5168
>
> 3.1.1 4.4453 4.3731 12.4834 13.5001
>
> You do seem to have a slight py3 improvement
> under ubuntu for the forloop_offset case ....

Yes, that's correct. The forloop_offset one is using long integers in
all cases. (Well, on Py2 it's adding a series of ints to a long, but
the arithmetic always has to be done with longs.) Python 3 has had
some improvements done, but the main thing is that there's a massive
spike in the Py2 time, while Py3 has _already paid_ that cost - as
evidenced by the closeness of the forloop and forloop_offset times on
Py3.

ChrisA

Terry Reedy

Mar 26, 2013, 2:24:34 PM
to pytho...@python.org
On 3/26/2013 12:41 PM, Cousin Stanley wrote:

>> So where's the difference with your system ?
>
> CPU ????

Compilers and compiler settings can also make a difference.

--
Terry Jan Reedy

jmfauth

Mar 26, 2013, 2:50:14 PM
On Mar 25, 22:51, Chris Angelico <ros...@gmail.com> wrote:
> The Python 3 merge of int and long has effectively penalized
> small-number arithmetic by removing an optimization. As we've seen
> from PEP 393 strings (jmf aside), there can be huge benefits from
> having a single type with multiple representations internally ...

------

A character is not an integer (short form).

jmf

Chris Angelico

Mar 26, 2013, 3:03:01 PM
to pytho...@python.org
So?

ChrisA

jmfauth

Mar 26, 2013, 4:44:01 PM
On Mar 26, 20:03, Chris Angelico <ros...@gmail.com> wrote:
A character is not an integer.

jmf

Mark Lawrence

Mar 26, 2013, 4:50:28 PM
to pytho...@python.org
But you are an idiot.

--
If you're using GoogleCrap™ please read this
http://wiki.python.org/moin/GoogleGroupsPython.

Mark Lawrence

Chris Angelico

Mar 26, 2013, 4:52:09 PM
to pytho...@python.org
Yes, I heard you the first time. And I repeat: A needle pulling thread?

You have not made any actual, uhh, _point_.

ChrisA

Grant Edwards

Mar 26, 2013, 5:08:55 PM
On 2013-03-26, Mark Lawrence <bream...@yahoo.co.uk> wrote:
> On 26/03/2013 20:44, jmfauth wrote:
>>>
>>>> A character is not an integer (short form).
>>>
>>> So?
>>
>> A character is not an integer.
>>
>> jmf
>
> But you are an idiot.

I think we all agree that jmf is a character.

So we've established that no characters are integers, but some
characters are idiots.

Does that allow us to determine whether integers are idiots or not?

--
Grant Edwards               grant.b.edwards        Yow! All of life is a blur
                                  at               of Republicans and meat!
                                gmail.com

Chris Angelico

Mar 26, 2013, 5:14:46 PM
to pytho...@python.org
On Wed, Mar 27, 2013 at 8:08 AM, Grant Edwards <inv...@invalid.invalid> wrote:
> On 2013-03-26, Mark Lawrence <bream...@yahoo.co.uk> wrote:
>> On 26/03/2013 20:44, jmfauth wrote:
>>>>
>>>>> A character is not an integer (short form).
>>>>
>>>> So?
>>>
>>> A character is not an integer.
>>>
>>> jmf
>>
>> But you are an idiot.
>
> I think we all agree that jmf is a character.
>
> So we've established that no characters are integers, but some
> characters are idiots.
>
> Does that allow us to determine whether integers are idiots or not?

No, it doesn't. I'm fairly confident that most of them are not...
however, I have my eye on 42. He gets around, a bit, but never seems
to do anything very useful. I'd think twice before hiring him.

But 1, now, he's a good fellow. Even when things get divisive, he's
the voice of unity.

ChrisA

Mark Lawrence

Mar 26, 2013, 5:26:30 PM
to pytho...@python.org
On 26/03/2013 21:14, Chris Angelico wrote:
> On Wed, Mar 27, 2013 at 8:08 AM, Grant Edwards <inv...@invalid.invalid> wrote:
>> On 2013-03-26, Mark Lawrence <bream...@yahoo.co.uk> wrote:
>>> On 26/03/2013 20:44, jmfauth wrote:
>>>>>
>>>>>> A character is not an integer (short form).
>>>>>
>>>>> So?
>>>>
>>>> A character is not an integer.
>>>>
>>>> jmf
>>>
>>> But you are an idiot.
>>
>> I think we all agree that jmf is a character.
>>
>> So we've established that no characters are integers, but some
>> characters are idiots.
>>
>> Does that allow us to determine whether integers are idiots or not?
>
> No, it doesn't. I'm fairly confident that most of them are not...
> however, I have my eye on 42. He gets around, a bit, but never seems
> to do anything very useful. I'd think twice before hiring him.
>
> But 1, now, he's a good fellow. Even when things get divisive, he's
> the voice of unity.
>
> ChrisA
>

Which reminds me, why do people on newsgroups often refer to 101, my
favourite number? I mean, do we really care about the number of a room
that Eric Blair worked in when he was at the BBC?

Dave Angel

Mar 26, 2013, 5:28:41 PM
to pytho...@python.org
On 03/26/2013 05:14 PM, Chris Angelico wrote:
> <snip>
>> Does that allow us to determine whether integers are idiots or not?
>
> No, it doesn't. I'm fairly confident that most of them are not...
> however, I have my eye on 42. He gets around, a bit, but never seems
> to do anything very useful. I'd think twice before hiring him.

Ah, 42, the "Answer to Life, the Universe, and Everything"

--
DaveA

Gregory Ewing

Mar 26, 2013, 7:10:10 PM
> On Wed, Mar 27, 2013 at 8:08 AM, Grant Edwards <inv...@invalid.invalid> wrote:
>
>>Does that allow us to determine whether integers are idiots or not?
>
> No, it doesn't. I'm fairly confident that most of them are not...
> however, I have my eye on 42.

He thought he was equal to 6 x 9 at one point, which
seems pretty idiotic to me.

--
Greg

Ned Deily

Mar 26, 2013, 8:00:43 PM
to pytho...@python.org
In article <kit1kg$g2u$1...@ger.gmane.org>,
Mark Lawrence <bream...@yahoo.co.uk> wrote:
> But you are an idiot.

I repeat the friendly reminder I posted a few weeks ago and I'll be a
little less oblique: please avoid gratuitous personal attacks here. It
reflects badly on the group and especially on those people making them.
We can disagree strongly about technical opinions without resorting to
such.

On Mon, 11 Mar 2013 11:13:16 -0700, I posted:
> A friendly reminder that this forum is for general discussion and
> questions about Python.
>
> "Pretty much anything Python-related is fair game for discussion, and
> the group is even fairly tolerant of off-topic digressions; there have
> been entertaining discussions of topics such as floating point, good
> software design, and other programming languages such as Lisp and Forth."
>
> But ...
>
> "Rudeness and personal attacks, even in reaction to blatant flamebait,
> are strongly frowned upon. People may strongly disagree on an issue, but
> usually discussion remains civil. In case of an actual flamebait
> posting, you can ignore it, quietly plonk the offending poster in your
> killfile or mail filters, or write a sharp but still-polite response,
> but at all costs resist the urge to flame back."
>
> http://www.python.org/community/lists/
>
> It's up to all of us to help keep this group/list a place where people
> enjoy participating, without fear of gratuitous personal sniping.
> Thanks!

--
Ned Deily,
n...@acm.org

Mark Lawrence

Mar 26, 2013, 8:20:14 PM
to pytho...@python.org
I suggest you spend more time telling the troll that he's a troll and
less time moaning at me.

Ned Deily

Mar 26, 2013, 9:31:50 PM
to pytho...@python.org
In article <kitdqr$4m4$2...@ger.gmane.org>,
Mark Lawrence <bream...@yahoo.co.uk> wrote:
> On 27/03/2013 00:00, Ned Deily wrote:
[...]
I suggest you re-read the group charter. He may be saying things that
most of us disagree with but he does it without personal attacks. He's
made his position clear and it doesn't seem likely to change. Ignoring,
plonking, or polite responses are all fine responses. Flaming is not.
That's not the kind of group most of us want to see.

--
Ned Deily,
n...@acm.org


ru...@yahoo.com

Mar 27, 2013, 12:31:44 AM
On Tuesday, March 26, 2013 6:00:43 PM UTC-6, Ned Deily wrote:
> In article <kit1kg$g2u$1...@ger.gmane.org>,
> Mark Lawrence <bream...@yahoo.co.uk> wrote:
> > But you are an idiot.
>
> I repeat the friendly reminder I posted a few weeks ago and I'll be a
> little less oblique: please avoid gratuitous personal attacks here. It
> reflects badly on the group and especially on those people making them.
> We can disagree strongly about technical opinions without resorting to
> such.
>[..]

+1, thank you for posting that.

Mark Lawrence

Mar 27, 2013, 7:51:07 AM
to pytho...@python.org
On 27/03/2013 01:31, Ned Deily wrote:
> In article <kitdqr$4m4$2...@ger.gmane.org>,
> Mark Lawrence <bream...@yahoo.co.uk> wrote:
>> On 27/03/2013 00:00, Ned Deily wrote:
> [...]
>>> I repeat the friendly reminder I posted a few weeks ago and I'll be a
>>> little less oblique: please avoid gratuitous personal attacks here. It
>>> reflects badly on the group and especially on those people making them.
>>> We can disagree strongly about technical opinions without resorting to
>>> such.
>>>
He's not going to change so neither am I.

I also suggest you go and moan at Steven D'Aprano who called the idiot a
liar. Although thinking about it, I prefer Steven's comment to my own
as being more accurate.

Dave Angel

Mar 26, 2013, 7:19:15 PM
to pytho...@python.org
Not in base 13.


--
DaveA

jmfauth

Mar 27, 2013, 4:30:24 PM
On Mar 26, 22:08, Grant Edwards <inva...@invalid.invalid> wrote:

>
> I think we all agree that jmf is a character.
>
------

The characters are also "intrinsic characteristics" of a
group in Group Theory.

If you are not a mathematician, but e.g. a scientist in
need of these characters, they are available in
"precalculated" tables, which one simply calls ... "Tables of
characters"!
(My booklet of the tables is titled "Tables for Group Theory")


Example in chemistry, mainly "quantum chemistry":

Group Theory and its Application to Chemistry
http://chemwiki.ucdavis.edu/Physical_Chemistry/Symmetry/Group_Theory%3A_Application

(Copied link from Firefox).

jmf

Steven D'Aprano

Mar 27, 2013, 9:47:25 PM
On Wed, 27 Mar 2013 11:51:07 +0000, Mark Lawrence defending an
unproductive post flaming a troll:

> He's not going to change so neither am I.

"He's a troll disrupting the newsgroup, therefore I'm going to be a troll
disrupting the newsgroup too, so nyah!!!"


> I also suggest you go and moan at Steven D'Aprano who called the idiot a
> liar. Although thinking about it, I prefer Steven's comment to my own
> as being more accurate.


Yes I did, I suggest you reflect on the difference in content between
your post and mine, and why yours can be described as abusive flaming and
mine shouldn't be.


--
Steven

Ethan Furman

Mar 27, 2013, 11:18:36 PM
to pytho...@python.org
On 03/27/2013 06:47 PM, Steven D'Aprano wrote:
> On Wed, 27 Mar 2013 11:51:07 +0000, Mark Lawrence defending an
> unproductive post flaming a troll:

I wouldn't call it unproductive -- a half-dozen amusing posts followed because of Mark's initial post, and they were a
great relief from the tedium and (dare I say it?) idiocy of jmf's posts.


>> He's not going to change so neither am I.
>
> "He's a troll disrupting the newsgroup, therefore I'm going to be a troll
> disrupting the newsgroup too, so nyah!!!"

So long as Mark doesn't start cussing and swearing I'm not going to get worked up about it. I find jmf's posts far more
aggravating.


>> I also suggest you go and moan at Steven D'Aprano who called the idiot a
>> liar. Although thinking about it, I prefer Steven's comment to my own
>> as being more accurate.
>
> Yes I did, I suggest you reflect on the difference in content between
> your post and mine, and why yours can be described as abusive flaming and
> mine shouldn't be.

Mark's post was not, in my not-so-humble opinion, abusive. jmf's (again IMNSHO) was.

Your post (Steven's) was possibly more accurate, but Mark's was more amusing, and generated more amusing responses.

Clearly, jmf is not going to change his thread-hijacking unicode-whining behavior, whether faced with the cold rational
responses or the hotter fed-up responses.

So I guess what I'm saying is: Don't Feed The Trolls (Anyone!) ;)

Of course, somebody still has to reply so a newcomer doesn't get taken in by him.

Has anybody else thought that his last few responses are starting to sound bot'ish?

--
~Ethan~

Chris Angelico

Mar 27, 2013, 11:40:17 PM
to pytho...@python.org
On Thu, Mar 28, 2013 at 2:18 PM, Ethan Furman <et...@stoneleaf.us> wrote:
> Has anybody else thought that [jmf's] last few responses are starting to sound
> bot'ish?

Yes, I did wonder. It's like he and Dihedral have been trading
accounts sometimes. Hey, Dihedral, I hear there's a discussion of
Unicode and PEP 393 and Python 3.3 and Unicode and lots of keywords
for you to trigger on and Python and bots are funny and this text is
almost grammatical!

There. Let's see if he takes the bait.

ChrisA

rusi

Mar 27, 2013, 11:49:20 PM
On Mar 28, 8:18 am, Ethan Furman <et...@stoneleaf.us> wrote:
>
> So long as Mark doesn't start cussing and swearing I'm not going to get worked up about it.  I
> find jmf's posts far more aggravating.

I support Ned's original gentle reminder -- Please be civil
irrespective of surrounding nonsensical behavior.

In particular "You are a liar" is as bad as "You are an idiot"
The same statement can be made non-abusively thus: "... is not true
because ..."

Steven D'Aprano

Mar 28, 2013, 1:20:19 AM
I accept that criticism, even if I disagree with it. Does that make
sense? I mean it in the sense that I accept that your opinion differs
from mine.

Politeness does not always trump honesty, and stating that somebody's
statement "is not true because..." is not the same as stating that they
are deliberately telling lies (rather than merely being mistaken or
confused).

The world is full of people who deliberately and in complete awareness of
what they are doing lie in order to further their agenda, or for profit,
or to feel good about themselves, or to harm others. There comes a time
where politely ignoring the elephant in the room (the dirty, rotten,
lying scoundrel of an elephant) and giving them the benefit of the doubt
simply makes life worse for everyone except the liars.

We all know this. Unless you've been living in a cave on the top of some
mountain, we all know people whose relationship to the truth is, shall we
say, rather bendy. And yet we collectively muddy the water and inject
uncertainty into debate by politely going along with their lies, or at
least treating them with dignity that they don't deserve, by treating
them as at worst a matter of honest misunderstanding or even mere
difference of opinion.

As an Australian, I am constitutionally required to call a spade a bloody
shovel at least twice a week, so I have no regrets.



--
Steven

rusi

Mar 28, 2013, 1:42:18 AM
On Mar 28, 10:20 am, Steven D'Aprano <steve
+comp.lang.pyt...@pearwood.info> wrote:
> On Wed, 27 Mar 2013 20:49:20 -0700, rusi wrote:
> > On Mar 28, 8:18 am, Ethan Furman <et...@stoneleaf.us> wrote:
>
> >> So long as Mark doesn't start cussing and swearing I'm not going to get
> >> worked up about it.  I find jmf's posts far more aggravating.
>
> > I support Ned's original gentle reminder -- Please be civil irrespective
> > of surrounding nonsensical behavior.
>
> > In particular "You are a liar" is as bad as "You are an idiot" The same
> > statement can be made non-abusively thus: "... is not true because ..."
>
> I accept that criticism, even if I disagree with it. Does that make
> sense? I mean it in the sense that I accept that your opinion differs
> from mine.
>
> Politeness does not always trump honesty, and stating that somebody's
> statement "is not true because..." is not the same as stating that they
> are deliberately telling lies (rather than merely being mistaken or
> confused).
>
> The world is full of people who deliberately and in complete awareness of
> what they are doing lie in order to further their agenda, or for profit,
> or to feel good about themselves, or to harm others. There comes a time
> where politely ignoring the elephant in the room (the dirty, rotten,
> lying scoundrel of an elephant) and giving them the benefit of the doubt
> simply makes life worse for everyone except the liars.

We all subscribe to legal systems that decide the undecidable; eg.
A pulled out a gun and killed B.
Was it murder, manslaughter, just a mistake, euthanasia?
Any lawyer with experience knows that horrible mistakes happen in
making these decisions; yet they (the judges) need to make them.
For the purposes of the python list these ascriptions to personal
motives are OT enough to be out of place.

>
> We all know this. Unless you've been living in a cave on the top of some
> mountain, we all know people whose relationship to the truth is, shall we
> say, rather bendy. And yet we collectively muddy the water and inject
> uncertainty into debate by politely going along with their lies, or at
> least treating them with dignity that they don't deserve, by treating
> them as at worst a matter of honest misunderstanding or even mere
> difference of opinion.
>
> As an Australian, I am constitutionally required to call a spade a bloody
> shovel at least twice a week, so I have no regrets.

If someone has got physically injured by the spade then it's a bloody
spade; else you are a bloody liar :-)

Well… More seriously I've never seen anyone -- cause or person -- aided
by the use of excessively strong language.

IOW I repeat my support for Ned's request: Ad hominem attacks are not
welcome, irrespective of the context/provocation.

Ethan Furman

Mar 28, 2013, 2:12:21 AM
to pytho...@python.org
On 03/27/2013 08:49 PM, rusi wrote:
> In particular "You are a liar" is as bad as "You are an idiot"
> The same statement can be made non-abusively thus: "... is not true
> because ..."

I don't agree. With all the posts and micro benchmarks and other drivel that jmf has inflicted on us, I find it /very/
hard to believe that he forgot -- which means he was deliberately lying.

At some point we have to stop being gentle / polite / politically correct and call a shovel a shovel... er, spade.

--
~Ethan~

Steven D'Aprano

Mar 28, 2013, 3:48:30 AM
On Wed, 27 Mar 2013 22:42:18 -0700, rusi wrote:


> More seriously Ive never seen anyone -- cause or person -- aided by
> the use of excessively strong language.

Of course not. By definition, if it helps, it wasn't *excessively* strong
language.


> IOW I repeat my support for Ned's request: Ad hominem attacks are not
> welcome, irrespective of the context/provocation.

Insults are not ad hominem attacks.

"You sir, are a bounder and a cad. Furthermore, your
argument is wrong, because of reasons."

may very well be an insult, but it also may be correct, and the reasons
logically valid.

"Your argument is wrong, because you are a bounder
and a cad."

is an ad hominem fallacy, because even bounders and cads may tell the
truth occasionally, or be correct by accident.

I find it interesting that nobody has yet felt the need to defend JMF,
and tell me I was factually incorrect about him (as opposed to merely
impolite or mean-spirited).

In any case, I don't want this to be specifically about any one person,
so let's move away from JMF. I disagree that hostile language is *always*
inappropriate, although I agree that it is *usually* inappropriate.

Although even that depends on what you define as "hostile" -- I would
much prefer that people confronted me for being (supposedly) dishonest
than silently shunning me without giving me any way to respond or correct
either my behaviour or their (mis)apprehensions. Quite frankly, I think
that the passive-aggressive silent treatment (kill-filing) is MUCH more
hostile and mean-spirited[1] than honest, respectful, direct criticism,
even when that criticism is about character ("you sir are a lying
scoundrel").

I treat people the way I hope to be treated. As galling as it would be to
be accused of lying, I would rather that you called me a liar to my face
and gave me the opportunity to respond, than for you to ignore everything
I said.

I hope that we all agree that we want a nice, friendly, productive
community where everyone is welcome. But some people simply cannot or
will not behave in ways that are compatible with those community values.
There are some people whom we *do not want here* -- spoilers and messers,
vandals and spammers and cheats and liars and trolls and crackpots of all
sorts. We only disagree as to the best way to make it clear to them that
they are not welcome so long as they continue their behaviour.



[1] Although sadly, given the reality of communication on the Internet,
sometimes kill-filing is the least-worst option.


--
Steven

jmfauth

Mar 28, 2013, 5:03:09 AM
-----------

The problem is elsewhere. Nobody understands the examples
I gave on this list, because nobody understands Unicode.
These examples are not random examples, they are well
thought out.

If you understood the coding of the characters,
Unicode and what this flexible representation does, it
would not be a problem for you to create analogous examples.

So, we are turning in circles.

This flexible representation manages to accumulate in one
shot all the design mistakes it is possible to make when
one wishes to implement Unicode.

Example of a good Unicode understanding.
If you wish 1) to preserve memory, 2) to cover the whole range
of Unicode, 3) to keep maximum performance while preserving the
good work Unicode.org has done (normalization, sorting), there
is only one solution: utf-8. For this you have to understand
what a "unicode transformation format" really is.

Why are all the actors active in the "text field", like Microsoft,
Apple, Adobe, the unicode compliant TeX engines, the foundries,
the "organisation" in charge of the OpenType font specifications,
able to handle all this stuff correctly (understanding +
implementation) and Python not? I should say this is going
beyond my understanding.

Python has certainly and definitively not "revolutionized"
Unicode.

jmf

Ian Foote

Mar 28, 2013, 5:36:19 AM
to pytho...@python.org
You're confusing python's choice of internal string representation with
the programmer's choice of encoding for communicating with other programs.

I think most people agree that utf-8 is usually the best encoding to use
for interoperating with other unicode aware software, but as a
variable-length encoding it has disadvantages that make it unsuitable
for use as an internal representation.

Specifically, indexing a variable-length encoding like utf-8 is not as
efficient as indexing a fixed-length encoding.

Regards,
Ian F

Oscar Benjamin

Mar 28, 2013, 5:47:04 AM
to jmfauth, Python List
On 28 March 2013 09:03, jmfauth <wxjm...@gmail.com> wrote:
>
> The problem is elsewhere. Nobody understands the examples
> I gave on this list, because nobody understands Unicode.
> These examples are not random examples, they are well
> thought out.

There are many people here and among the Python devs who understand
unicode. Similarly they have understood the examples that you have
given. It has been accepted that there are a handful of cases where
performance has been reduced as a result of the change. There are also
many cases where the performance has improved. It is certainly not
clear that there is an *overall* performance reduction for people
using non latin-1 characters as you have often suggested.

The reason your initial posts received a poor reception is that they
were accompanied with pointless rants and arrogant claims that no one
understood the problem. Had you simply reported the timing differences
without the rants then I imagine that you would have received a
response like "Okay, there might be a few regressions. Can you open an
issue on the tracker please?".

Since then you have been relentlessly hijacking unrelated threads and
this is clearly just trolling.

>
> If you understood the coding of the characters,
> Unicode and what this flexible representation does, it
> would not be a problem for you to create analogous examples.
>
> So, we are turning in circles.
>
> This flexible representation manages to accumulate in one
> shot all the design mistakes it is possible to make when
> one wishes to implement Unicode.

This is clearly untrue. The most significant design mistakes are the
ones that lead to incorrect handling of unicode characters. This new
implementation in Python 3.3 has been designed in a way that makes it
possible to handle all unicode characters correctly.

>
> Example of a good Unicode understanding.
> If you wish 1) to preserve memory, 2) to cover the whole range
> of Unicode, 3) to keep maximum performance while preserving the
> good work Unicode.org has done (normalization, sorting), there
> is only one solution: utf-8. For this you have to understand
> what a "unicode transformation format" really is.

Again you pretend that others here don't understand. Most people here
are well aware of what utf-8 is. Your suggestion that "maximum
performance" would be achieved if Python used utf-8 internally ignores
the fact that it would have many negative performance implications for
slicing and indexing and so on.

>
> Why are all the actors active in the "text field", like Microsoft,
> Apple, Adobe, the unicode compliant TeX engines, the foundries,
> the "organisation" in charge of the OpenType font specifications,
> able to handle all this stuff correctly (understanding +
> implementation) and Python not? I should say this is going
> beyond my understanding.
>
> Python has certainly and definitively not "revolutionized"
> Unicode.

Perhaps not, but it does now correctly handle all unicode characters
(unlike many other languages and pieces of software).


Oscar

Chris Angelico

Mar 28, 2013, 6:22:25 AM
to pytho...@python.org
On Thu, Mar 28, 2013 at 4:20 PM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> On Wed, 27 Mar 2013 20:49:20 -0700, rusi wrote:
>
>> In particular "You are a liar" is as bad as "You are an idiot" The same
>> statement can be made non-abusively thus: "... is not true because ..."
>
> I accept that criticism, even if I disagree with it. Does that make
> sense? I mean it in the sense that I accept that your opinion differs
> from mine.
>
> Politeness does not always trump honesty, and stating that somebody's
> statement "is not true because..." is not the same as stating that they
> are deliberately telling lies (rather than merely being mistaken or
> confused).

There comes a time when a bit of rudeness is a small cost to pay for
forum maintenance. Before you criticize someone for nit-picking, think
what happens when someone reads the thread archive. Of course, that
particular example can be done courteously too - cf the "def" vs
"class" nit from a recent thread. But it'd still be of value even if
done rudely, so the hundreds of subsequent readers would have a chance
to know what's going on. I was researching a problem with ALSA a
couple of weeks ago, and came across a forum thread that discussed
exactly what I needed to know. A dozen or so courteous posts delivered
misinformation; finally someone had the guts to be rude and call
people out for posting incorrect points (and got criticized for doing
so), and that one post was the most useful in the whole thread.

I'd rather this list have some vinegar than it devolve into
uselessness. Or, worse, if there's a hard-and-fast rule about
courtesy, devolve into aspartame... everyone's courteous in words but
hates each other underneath. Or am I taking the analogy too far? :)

ChrisA

Chris Angelico

Mar 28, 2013, 6:30:27 AM
to pytho...@python.org
On Thu, Mar 28, 2013 at 8:03 PM, jmfauth <wxjm...@gmail.com> wrote:
> Example of a good Unicode understanding.
> If you wish 1) to preserve memory, 2) to cover the whole range
> of Unicode, 3) to keep maximum performance while preserving the
> good work Unicode.org has done (normalization, sorting), there
> is only one solution: utf-8. For this you have to understand
> what a "unicode transformation format" really is.

You really REALLY need to sort out in your head the difference between
correctness and performance. I still haven't seen one single piece of
evidence from you that Python 3.3 fails on any point of Unicode
correctness. Covering the whole range of Unicode has never been a
problem.

In terms of memory usage and performance, though, there's one obvious
solution. Fork CPython 3.3 (or the current branch head[1]), change the
internal representation of a string to be UTF-8 (by the way, that's
the official spelling), and run the string benchmarks. Then post your
code and benchmark figures so other people can replicate your results.

> Python has certainly and definitvely not "revolutionize"
> Unicode.

This is one place where you're actually correct, though, because PEP
393 isn't the first instance of this kind of format - Pike's had it
for years. Funny though, I don't think that was your point :)

[1] Apologies if my terminology is wrong, I'm a git user and did one
quick Google search to see if hg uses the same term.

ChrisA

Neil Hodgson

Mar 28, 2013, 8:11:55 AM
Ian Foote:

> Specifically, indexing a variable-length encoding like utf-8 is not as
> efficient as indexing a fixed-length encoding.

Many common string operations do not require indexing by character
which reduces the impact of this inefficiency. UTF-8 seems like a
reasonable choice for an internal representation to me. One benefit of
UTF-8 over Python's flexible representation is that it is, on average,
more compact over a wide set of samples.
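A rough way to compare the two on specific samples (a sketch that
counts only the character payload, ignoring object headers):

def payload(s):
    # UTF-8 bytes vs. PEP 393 bytes (width set by the widest codepoint).
    m = max(map(ord, s))
    width = 1 if m < 256 else 2 if m < 65536 else 4
    return len(s.encode('utf-8')), len(s) * width

print(payload('hello world'))        # (11, 11) -- identical
print(payload('h\xe9llo w\xf6rld'))  # (13, 11) -- PEP 393 smaller
print(payload('hello \U0001f600'))   # (10, 28) -- UTF-8 smaller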

Neil

Mark Lawrence

Mar 28, 2013, 8:39:35 AM
to pytho...@python.org
On 28/03/2013 03:18, Ethan Furman wrote:
>
> I wouldn't call it unproductive -- a half-dozen amusing posts followed
> because of Mark's initial post, and they were a great relief from the
> tedium and (dare I say it?) idiocy of jmf's posts.
>
> --
> ~Ethan~

Thanks for those words. They're a tonic as I've just clawed my way out
of bed at 12:00 GMT having slept for 15 hours.

Once the PEP393 unicode debacle has been sorted, does anyone have a cure
for Chronic Fatigue Syndrome? :)

Steven D'Aprano

Mar 28, 2013, 9:01:57 AM
On Thu, 28 Mar 2013 23:11:55 +1100, Neil Hodgson wrote:

> Ian Foote:
>
>> Specifically, indexing a variable-length encoding like utf-8 is not as
>> efficient as indexing a fixed-length encoding.
>
> Many common string operations do not require indexing by character
> which reduces the impact of this inefficiency.

Which common string operations do you have in mind?

Specifically in Python's case, the most obvious counter-example is the
length of a string. But that's only because Python strings are immutable
objects, and include a field that records the length. So once the string
is created, checking its length takes constant time.

Some string operations need to inspect every character, e.g. str.upper().
Even for them, the increased complexity of a variable-width encoding
costs. It's not sufficient to walk the string inspecting a fixed 1, 2 or
4 bytes per character. You have to walk the string grabbing 1 byte at a
time, and then decide whether you need another 1, 2 or 3 bytes. Even
though it's still O(N), the added bit-masking and overhead of variable-
width encoding adds to the overall cost.

Any string method that takes a starting offset requires the method to
walk the string byte-by-byte. I've even seen languages put responsibility
for dealing with that onto the programmer: the "start offset" is given in
*bytes*, not characters. I don't remember what language this was... it
might have been Haskell? Whatever it was, it horrified me.
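To make the cost concrete, here's a minimal sketch of what character
indexing over UTF-8 bytes has to do -- scan every byte up to the
target, since only lead bytes (anything except 0b10xxxxxx) start a
character:

def utf8_char_offset(data, n):
    # Return the byte offset of character n in UTF-8 encoded 'data'.
    count = -1
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:   # skip continuation bytes
            count += 1
            if count == n:
                return offset
    raise IndexError(n)

print(utf8_char_offset('caf\xe9 au lait'.encode('utf-8'), 4))  # 5, not 4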


> UTF-8 seems like a
> reasonable choice for an internal representation to me.

It's not. Haskell, for example, uses UTF-8 internally, and it warns that
this makes string operations O(N) instead of O(1) precisely because of
the need to walk the string inspecting every byte.

Remember, when your string primitives are O(N), it is very easy to write
code that becomes O(N**2). Using UTF-8 internally is just begging for
user-written code to be O(N**2).


> One benefit of
> UTF-8 over Python's flexible representation is that it is, on average,
> more compact over a wide set of samples.

Sure. And over a different set of samples, it is less compact. If you
write a lot of Latin-1, Python will use one byte per character, while
UTF-8 will use two bytes per character.


--
Steven

jmfauth

Mar 28, 2013, 9:34:32 AM
On Mar 28, 11:30, Chris Angelico <ros...@gmail.com> wrote:
> On Thu, Mar 28, 2013 at 8:03 PM, jmfauth <wxjmfa...@gmail.com> wrote:

-----

> You really REALLY need to sort out in your head the difference between
> correctness and performance. I still haven't seen one single piece of
> evidence from you that Python 3.3 fails on any point of Unicode
> correctness.

That's because you are not understanding unicode. Unicode takes
you from the character to the unicode transformation format via
the code point, working with a unique set of characters with
a contiguous range of code points.
Then it is up to the "implementors" (languages, compilers, ...)
to implement this utf.

> Covering the whole range of Unicode has never been a
> problem.

... for all those, who are following the scheme explained above.
And it magically works smoothly. Of course, there are some variations
due to the Character Encoding Form which is later influenced by the
Character Encoding Scheme (the serialization of the character Encoding
Scheme).

Rough explanation in other words.
It does not matter if you are using utf-8, -16, -32, ucs2 or ucs4.
All the single characters are handled in the same way with the "same
algorithm".

---

The flexible string representation takes the problem from the
other side: it attempts to work with the characters by using
their representations, and it can only fail...

PS I never proposed to use utf-8. I only spoke about utf-8
as an example. If you start to discuss indexing, you are off-topic.

jmf


jmfauth

Mar 28, 2013, 10:12:10 AM
On Mar 28, 14:01, Steven D'Aprano <steve
+comp.lang.pyt...@pearwood.info> wrote:
> On Thu, 28 Mar 2013 23:11:55 +1100, Neil Hodgson wrote:
> > Ian Foote:
>
>
> > One benefit of
> > UTF-8 over Python's flexible representation is that it is, on average,
> > more compact over a wide set of samples.
>
> Sure. And over a different set of samples, it is less compact. If you
> write a lot of Latin-1, Python will use one byte per character, while
> UTF-8 will use two bytes per character.
>

This flexible string representation is so absurd that not only
does "it" not know you cannot write Western European languages
with latin-1, "it" penalizes you by just attempting to optimize
latin-1. Shown in my multiple examples.

(This is a case similar to the long/short int question/discussion
Chris Angelico opened.)


PS1: I received plenty of private mails. I'm surprised how the devs
do not understand unicode.

PS2: Question I received once from a registered French Python
developer (in another context): What are those French characters
you can handle with cp1252 and not with latin-1?

jmf


Chris Angelico

Mar 28, 2013, 10:38:07 AM
to pytho...@python.org
On Fri, Mar 29, 2013 at 1:12 AM, jmfauth <wxjm...@gmail.com> wrote:
> This flexible string representation is so absurd that not only
> does "it" not know you cannot write Western European languages
> with latin-1, "it" penalizes you by just attempting to optimize
> latin-1. Shown in my multiple examples.

PEP393 strings have two optimizations, or kinda three:

1a) ASCII-only strings
1b) Latin1-only strings
2) BMP-only strings
3) Everything else

Options 1a and 1b are almost identical - I'm not sure what the detail
is, but there's something flagging those strings that fit inside seven
bits. (Something to do with optimizing encodings later?) Both are
optimized down to a single byte per character.

Option 2 is optimized to two bytes per character.

Option 3 is stored in UTF-32.
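A quick way to see those widths from Python itself (a sketch; the
fixed per-string overhead varies by build, so treat the numbers as
indicative):

import sys

# Growth per extra character exposes the internal width in use.
for label, ch in [('ASCII', 'a'), ('Latin-1', '\xe9'),
                  ('BMP', '\u20ac'), ('astral', '\U0001f600')]:
    width = (sys.getsizeof(ch * 1001) - sys.getsizeof(ch)) / 1000
    print('%-8s %.0f byte(s)/char' % (label, width))

On CPython 3.3 this reports 1, 1, 2 and 4 bytes per character
respectively.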

Once again, jmf, you are forgetting that option 2 is a safe and
bug-free optimization.

ChrisA

MRAB

Mar 28, 2013, 10:51:56 AM
to pytho...@python.org
Implementing the regex module (http://pypi.python.org/pypi/regex) would
have been more difficult if the internal representation had been UTF-8,
because of the need to decode, and the implementation would also have
been slower for that reason.

Chris Angelico

Mar 28, 2013, 11:07:45 AM
to pytho...@python.org
On Fri, Mar 29, 2013 at 1:51 AM, MRAB <pyt...@mrabarnett.plus.com> wrote:
> On 28/03/2013 12:11, Neil Hodgson wrote:
>>
> Implementing the regex module (http://pypi.python.org/pypi/regex) would
> have been more difficult if the internal representation had been UTF-8,
> because of the need to decode, and the implementation would also have
> been slower for that reason.

In fact, nearly ALL string parsing operations would need to be done
differently. The only method that I can think of that wouldn't be
impacted is a linear state-machine parser - something that could be
written inside a "for character in string" loop.

text = []

# 'string' holds the markup to strip; a sample input so the sketch runs as-is.
string = '<b>Hello,</b> <i>world</i>!'

def initial(c):
    global state
    if c=='<': state=tag
    else: text.append(c)

def tag(c):
    global state
    if c=='>': state=initial

state = initial
for character in string:
    state(character)

print(''.join(text))


I'm pretty sure this will run in O(N) time, even with UTF-8 strings.
But it's an *extremely* simple parser.

ChrisA

jmfauth

unread,
Mar 28, 2013, 11:14:43 AM3/28/13
to
On Mar 28, 15:38, Chris Angelico <ros...@gmail.com> wrote:
As long as you are attempting to divide a set of characters into
chunks and try to handle them separately, it will never work.

Read my previous post about the unicode transformation format.
I know what pep393 does.

jmf

Chris Angelico

Mar 28, 2013, 11:21:03 AM
to pytho...@python.org
On Fri, Mar 29, 2013 at 2:14 AM, jmfauth <wxjm...@gmail.com> wrote:
> As long as you are attempting to divide a set of characters into
> chunks and try to handle them separately, it will never work.

Okay. Let's look at integers. To properly represent the Python 3 'int'
type (or the Python 2 'long'), we need to be able to encode ANY
integer. And of course, any attempt to divide them up into chunks will
never work. So we need a single representation that will cover ANY
integer, right?

Perfect. We already have one of those, detailed in RFC 2795. (It's
coming up to its thirteenth anniversary in a day or two. Very
appropriate.)

http://tools.ietf.org/html/rfc2795#section-4

Are you saying Python's integers should be stored as I-TAGs?

ChrisA

jmfauth

unread,
Mar 28, 2013, 11:45:51 AM3/28/13
to
Addendum.

This is what you correctly perceived in another thread.
You qualified it as a "switch". Now you have to understand
where this "switch" comes from.

jmf

Terry Reedy

Mar 28, 2013, 12:01:25 PM
to pytho...@python.org
On 3/28/2013 10:38 AM, Chris Angelico wrote:

> PEP393 strings have two optimizations, or kinda three:
>
> 1a) ASCII-only strings
> 1b) Latin1-only strings
> 2) BMP-only strings
> 3) Everything else
>
> Options 1a and 1b are almost identical - I'm not sure what the detail
> is, but there's something flagging those strings that fit inside seven
> bits. (Something to do with optimizing encodings later?)

Yes. 'Encoding' an ascii-only string to any ascii-compatible encoding
amounts to a simple copy of the internal bytes. I do not know if *all*
the codecs for such encodings are 393-aware, but I do know that the
utf-8 and latin-1 group are. This is one operation that 3.3+ does much
faster than 3.2-


--
Terry Jan Reedy

Ian Kelly

Mar 28, 2013, 12:01:06 PM
to Python
On Thu, Mar 28, 2013 at 7:01 AM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> Any string method that takes a starting offset requires the method to
> walk the string byte-by-byte. I've even seen languages put responsibility
> for dealing with that onto the programmer: the "start offset" is given in
> *bytes*, not characters. I don't remember what language this was... it
> might have been Haskell? Whatever it was, it horrified me.

Go does this. I remember because it came up in one of these threads,
where jmf (or was it Ranting Rick?) was praising Go for just getting
Unicode "right".

Ian Kelly

Mar 28, 2013, 12:11:59 PM
to Python
On Thu, Mar 28, 2013 at 8:38 AM, Chris Angelico <ros...@gmail.com> wrote:
> PEP393 strings have two optimizations, or kinda three:
>
> 1a) ASCII-only strings
> 1b) Latin1-only strings
> 2) BMP-only strings
> 3) Everything else
>
> Options 1a and 1b are almost identical - I'm not sure what the detail
> is, but there's something flagging those strings that fit inside seven
> bits. (Something to do with optimizing encodings later?) Both are
> optimized down to a single byte per character.

The only difference for ASCII-only strings is that they are kept in a
struct with a smaller header. The smaller header omits the utf8
pointer (which optionally points to an additional UTF-8 representation
of the string) and its associated length variable. These are not
needed for ASCII-only strings because an ASCII string can be directly
interpreted as a UTF-8 string for the same result. The smaller header
also omits the "wstr_length" field which, according to the PEP,
"differs from length only if there are surrogate pairs in the
representation." For an ASCII string, of course there would not be
any surrogate pairs.
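Which is easy to check (a tiny sketch; the identity holds for any
ASCII-only text):

s = 'hello'
# The ASCII bytes *are* the UTF-8 bytes, so encoding is a plain copy.
assert s.encode('ascii') == s.encode('utf-8') == b'hello'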

Chris Angelico

Mar 28, 2013, 12:16:55 PM
to pytho...@python.org
On Fri, Mar 29, 2013 at 3:01 AM, Terry Reedy <tjr...@udel.edu> wrote:
> On 3/28/2013 10:38 AM, Chris Angelico wrote:
>
>> PEP393 strings have two optimizations, or kinda three:
>>
>> 1a) ASCII-only strings
>> 1b) Latin1-only strings
>> 2) BMP-only strings
>> 3) Everything else
>>
>> Options 1a and 1b are almost identical - I'm not sure what the detail
>> is, but there's something flagging those strings that fit inside seven
>> bits. (Something to do with optimizing encodings later?)
>
>
> Yes. 'Encoding' an ascii-only string to any ascii-compatible encoding
> amounts to a simple copy of the internal bytes. I do not know if *all* the
> codecs for such encodings are 393-aware, but I do know that the utf-8 and
> latin-1 group are. This is one operation that 3.3+ does much faster than
> 3.2-

Thanks Terry. So that's not so much a representation difference as a
flag that costs little or nothing to retain, and can improve
performance in the encode later on. Sounds like a useful tweak to the
basics of flexible string representation, without being particularly
germane to jmf's complaints.

ChrisA

Ian Kelly

Mar 28, 2013, 12:33:46 PM
to jmfauth, pytho...@python.org
On Thu, Mar 28, 2013 at 7:34 AM, jmfauth <wxjm...@gmail.com> wrote:
> The flexible string representation takes the problem from the
> other side, it attempts to work with the characters by using
> their representations and it (can only) fails...

This is false. As I've pointed out to you before, the FSR does not
divide characters up by representation. It divides them up by
codepoint -- more specifically, by the *bit-width* of the codepoint.
We call the internal format of the string "ASCII" or "Latin-1" or
"UCS-2" for conciseness and a point of reference, but fundamentally
all of the FSR formats are simply byte arrays of *codepoints* -- you
know, those things you keep harping on. The major optimization
performed by the FSR is to consistently truncate the leading zero
bytes from each codepoint when it is possible to do so safely. But
regardless of to what extent this truncation is applied, the string is
*always* internally just an array of codepoints, and the same
algorithms apply for all representations.
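
The per-codepoint width is easy to observe: appending one more
character grows the string by exactly the element size. A sketch
(sizes from a 64-bit 3.3 build):

import sys

for s in 'abcd', 'ébcd', '\u0100bcd', '\U0001F435bcd':
    width = sys.getsizeof(s + 'x') - sys.getsizeof(s)
    print(hex(ord(s[0])), width)   # widths are 1, 1, 2 and 4 bytes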

jmfauth

unread,
Mar 28, 2013, 12:55:46 PM3/28/13
to
Chris,

Your problem with int/long, the start of this thread, is
very interesting.

This is not a demonstration or a proof, rather an illustration.

Assume you have a set of integers {0...9} and an operator,
let's say addition.

Idea.
Just divide this set into two chunks, {0...4} and {5...9},
and work hard to optimize the addition of two operands in
the set {0...4}.

The problems.
- When optimizing "{0...4}", your algorithm will most probably
weaken "{5...9}".
- When using "{5...9}", you do not benefit from your algorithm; you
are penalized just by the fact you have optimized "{0...4}".
- And the first mistake: you are penalized and impacted by the
fact you have to select which subset your operands are in when
working with "{0...9}".

Very interestingly, working with the representation (bytes) of
these integers will not help. You have to consider conceptually
{0..9} as numbers.

Now, replace numbers by characters, bytes by "encoded code points",
and you have qualitatively the flexible string representation.

In Unicode, there is one more level of abstraction: one conceptually
works neither with characters nor with "encoded code points", but
with Unicode transformation format "entities". (See my previous post.)

That means you can work very hard at the "bytes level";
you will never solve the problem, which is one level higher
in the Unicode hierarchy:
character -> code point -> utf -> bytes (implementation)
with the important fact that this construct can only go
from left to right.

---

In fact, by proposing a flexible representation of ints, you may
just fall into the same trap the flexible string representation
presents.

----

All this stuff is explained in good books about the coding of
characters and/or Unicode.
The unicode.org documentation explains it too. It is a little
bit harder to discover, because the doc always presents
this stuff from a "technical" perspective.
You get it when reading a large part of the Unicode doc.

jmf



Chris Angelico

unread,
Mar 28, 2013, 1:13:24 PM3/28/13
to pytho...@python.org
On Fri, Mar 29, 2013 at 3:55 AM, jmfauth <wxjm...@gmail.com> wrote:
> Assume you have a set of integers {0...9} and an operator,
> let's say addition.
>
> Idea.
> Just divide this set into two chunks, {0...4} and {5...9},
> and work hard to optimize the addition of two operands in
> the set {0...4}.
>
> The problems.
> - When optimizing "{0...4}", your algorithm will most probably
> weaken "{5...9}".
> - When using "{5...9}", you do not benefit from your algorithm; you
> are penalized just by the fact you have optimized "{0...4}".
> - And the first mistake: you are penalized and impacted by the
> fact you have to select which subset your operands are in when
> working with "{0...9}".
>
> Very interestingly, working with the representation (bytes) of
> these integers will not help. You have to consider conceptually
> {0..9} as numbers.

Yeah, and there's an easy representation of those numbers. But let's
look at Python's representations of integers. I have a sneaking
suspicion something takes note of how large the number is before
deciding how to represent it. Look!

>>> sys.getsizeof(1)
14
>>> sys.getsizeof(1<<2)
14
>>> sys.getsizeof(1<<4)
14
>>> sys.getsizeof(1<<8)
14
>>> sys.getsizeof(1<<31)
18
>>> sys.getsizeof(1<<30)
18
>>> sys.getsizeof(1<<16)
16
>>> sys.getsizeof(1<<12345)
1660
>>> sys.getsizeof(1<<123456)
16474

Small numbers are represented more compactly than large ones! And it's
not like in REXX, where all numbers are stored as strings.

Go fork CPython and make the changes you suggest. Then run real-world
code on it and see how it performs. Or at the very least, run plausible
benchmarks like the strings benchmark from the standard tests.

My original post about integers was based on two comparisons: Python 2
and Pike. Both languages have an optimization for "small" integers
(where "small" is "within machine word" - on rechecking some of my
stats, I find that I perhaps should have used a larger offset, as the
64-bit Linux Python I used appeared to be a lot faster than it should
have been), which Python 3 doesn't have. Real examples, real
statistics, real discussion. (I didn't include Pike stats in what I
posted, for two reasons: firstly, it would require a reworking of the
code, rather than simply "run this under both interpreters"; and
secondly, Pike performance is completely different from CPython
performance, and is non-comparable. Pike is more similar to PyPy, able
to compile - in certain circumstances - to machine code. So the
comparisons were Py2 vs Py3.)
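
For reference, the machine-word cutoff is directly visible in Python
2 (this snippet is 2.x-only; sys.maxint is gone in 3.x):

import sys
print(type(sys.maxint))       # <type 'int'>  -- machine-word integer
print(type(sys.maxint + 1))   # <type 'long'> -- overflow promotes to long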

ChrisA

jmfauth

unread,
Mar 28, 2013, 1:48:33 PM3/28/13
to
On 28 mar, 17:33, Ian Kelly <ian.g.ke...@gmail.com> wrote:
-----

You know, we can discuss this ad nauseam. What is important
is Unicode.

You have transformed Python back into an ASCII-oriented product.

If Python had implemented Unicode correctly, there would
be no difference in using an "a", "é", "€" or any character,
what the narrow builds did.

If I am practically the only one who speaks/discusses
this, I can assure you, this has been noticed.

Now, it's time to prepare the asparagus, the "jambon cru"
and a good bottle of dry white wine.

jmf




Chris Angelico

unread,
Mar 28, 2013, 1:55:03 PM3/28/13
to pytho...@python.org
On Fri, Mar 29, 2013 at 4:48 AM, jmfauth <wxjm...@gmail.com> wrote:
> If Python had implemented Unicode correctly, there would
> be no difference in using an "a", "é", "€" or any character,
> what the narrow builds did.

I'm not following your grammar perfectly here, but if Python were
implementing Unicode correctly, there would be no difference between
any of those characters, which is the way a *wide* build works. With a
narrow build, there is a difference between BMP and non-BMP
characters.

ChrisA

ru...@yahoo.com

unread,
Mar 28, 2013, 3:54:20 PM3/28/13
to
On 03/28/2013 01:48 AM, Steven D'Aprano wrote:
> On Wed, 27 Mar 2013 22:42:18 -0700, rusi wrote:
>> More seriously, I've never seen anyone -- cause or person -- aided by
>> the use of excessively strong language.
>
> Of course not. By definition, if it helps, it wasn't *excessively* strong
> language.

For someone who delights in pointing out the logical errors
of others you are often remarkably sloppy in your own logic.

Of course language can be both helpful and excessively strong.
That is the case when language less strong would be
equally or more helpful.

>> IOW I repeat my support for Ned's request: Ad hominiem attacks are not
>> welcome, irrespective of the context/provocation.
>
> Insults are not ad hominem attacks.

Insults may or may not be ad hominem attacks. There is nothing
mutually exclusive about those terms.

> "You sir, are a bounder and a cad. Furthermore, your
> argument is wrong, because of reasons."
>
> may very well be an insult, but it also may be correct, and the reasons
> logically valid.

Those are two different statements. The first is an ad hominem
attack and is not welcome here. The second is an acceptable
response.

> "Your argument is wrong, because you are a bounder
> and a cad."
>
> is an ad hominem fallacy, because even bounders and cads may tell the
> truth occasionally, or be correct by accident.

That it is a fallacy does not mean it is not also an attack.

> I find it interesting that nobody has yet felt the need to defend JMF,
> and tell me I was factually incorrect about him (as opposed to merely
> impolite or mean-spirited).

Nothing "interesting" about it at all. Most of us (perhaps
unlike you) are not interested in discussing the personal
characteristics of posters here (in contrast to discussing
the technical opinions they post).

Further, "liar" is both so non-objective and so pejoratively
emotive that it is a word much more likely to be used by
someone interested in trolling than in a serious discussion,
so most sensible people here likely would not bite.

>[...]
> I would rather that you called me a liar to my face
> and gave me the opportunity to respond, than for you to ignore everything
> I said.

Even if you personally would prefer someone to respond by
calling you a liar, your personal preferences do not form
a basis for desirable posting behavior here.

But again you're creating a false dichotomy. Those are not
the only two choices. A third choice is to neither ignore you
nor call you a liar but to factually point out where you are
wrong, or (if it is a matter of opinion) why one holds a
different opinion. That was the point Ned Deily was making,
I believe.

> I hope that we all agree that we want a nice, friendly, productive
> community where everyone is welcome.

I hope so too but it is likely that some people want a place
to develop and assert some sense of influence, engage in verbal
duels, instigate arguments, etc. That can be true of regulars
here as well as drive-by posters.

> But some people simply cannot or
> will not behave in ways that are compatible with those community values.
> There are some people whom we *do not want here*

In other words, everyone is NOT welcome.

> -- spoilers and messers,
> vandals and spammers and cheats and liars and trolls and crackpots of all
> sorts.

Where those terms are defined by you and a handful of other
voracious posters. "Troll" in particular is often used to
mean someone who disagrees with the borg mind here, or who
says anything negative about Python, or who, due to attitude or
lack of full English fluency, does not express themselves in
a sufficiently submissive way.

> We only disagree as to the best way to make it clear to them that
> they are not welcome so long as they continue their behaviour.

No, we disagree on who fits those definitions and even
how tolerant we are to those who do fit the definitions.
The policing that you and a handful of other self-appointed
net-cops try to do is far more obnoxious than the original
posts are.

> [1] Although sadly, given the reality of communication on the Internet,
> sometimes kill-filing is the least-worst option.

Please, please, killfile jmfauth, ranting rick, xaw lee and
anyone else you don't like so that the rest of us can be spared
the orders of magnitude larger, more disruptive and more offensive
posts generated by your (plural) responses to them.

Believe or not, most of the rest of us here are smart enough to
form our own opinions of such posters without you and the other
c.l.p truthsquad members telling us what to think.

Ned Deily

unread,
Mar 28, 2013, 4:23:37 PM3/28/13
to pytho...@python.org
In article
<CAPTjJmoZDHsmUQx7vcpuii2B...@mail.gmail.com>,
Chris Angelico <ros...@gmail.com> wrote:
> I'd rather this list have some vinegar than it devolve into
> uselessness. Or, worse, if there's a hard-and-fast rule about
> courtesy, devolve into aspartame... everyone's courteous in words but
> hates each other underneath. Or am I taking the analogy too far? :)

I think you are positing false choices. No one - at least I'm not - is
advocating to avoid challenging false or misleading statements in the
interests of maintaining some false "see how well we all get along"
facade. The point is we can have meaningful, hard-nosed discussions
without resorting to personal insults, i.e. flaming. I think the
discussion in this topic over the past 24 hours or so demonstrates that.

--
Ned Deily,
n...@acm.org

jmfauth

unread,
Mar 28, 2013, 4:26:57 PM3/28/13
to
On 28 mar, 18:55, Chris Angelico <ros...@gmail.com> wrote:
--------

The wide build (which I never used) is in my mind as correct as
the narrow build. It "just" covers a different range in Unicode
(the whole range).

Claiming that the narrow build is buggy because it does not
cover the whole of Unicode is not correct.

Unicode does not stipulate that one has to cover the whole range.
Unicode expects that every character in a range behaves the same
way. This is clearly not realized with the flexible string
representation. A user should not be somehow penalized
simply because they are not an ASCII user.

If you take the fonts into consideration (btw a problem nobody
is speaking about) and you ensure your application, toolkit, ...
is MES-X or WGL4 compliant, you are also deliberately (and
correctly) working with a restricted Unicode range.

jmf


Benjamin Kaplan

unread,
Mar 28, 2013, 4:29:23 PM3/28/13
to pytho...@python.org
You have yet to explain how Python's string representation is
wrong, only how it isn't optimal for one specific case. Here's how I
understand it:

1) Strings are sequences of stuff. Generally, we talk about strings as
either sequences of bytes or sequences of characters.

2) Unicode is a format used to represent characters. Therefore,
Unicode strings are character strings, not byte strings.

3) Encodings are functions that map characters to bytes. They
typically also define an inverse function that converts from bytes
back to characters.

4) UTF-8 IS NOT UNICODE. It is an encoding, one of those functions I
mentioned in the previous point. It happens to be one of the five
standard encodings that is defined for all characters in the Unicode
standard (the others being the little and big endian variants of
UTF-16 and UTF-32).

5) The internal representation of a character string DOES NOT MATTER.
All that matters is that the API represents it as a string of
characters, regardless of the representation. We could implement
character strings by putting the Unicode code-points in binary-coded
decimal and it would be a Unicode character string.

6) The String type that .NET and Java (and the unicode type in Python
narrow builds) use is not a character string. It is a string of
shorts, each of which corresponds to a UTF-16 code unit. I know this
is the case because in all of these, the length of "\U0001F435" is 2
even though it only consists of one character (see the snippet below).

7) The new string representation in Python 3.3 can successfully
represent all characters in the Unicode standard. The actual number of
bytes that each character consumes is invisible to the user.
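
The check in point 6, in 3.x syntax (output depends on the build
running it):

s = '\U0001F435'   # MONKEY FACE, outside the BMP
print(len(s))      # 1 on 3.3+ and on wide builds; 2 on narrow builds
print(list(s))     # one character, or two surrogate halves on a narrow build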

Ethan Furman

unread,
Mar 28, 2013, 4:31:56 PM3/28/13
to pytho...@python.org
On 03/28/2013 12:54 PM, ru...@yahoo.com wrote:
> On 03/28/2013 01:48 AM, Steven D'Aprano wrote:
>> On Wed, 27 Mar 2013 22:42:18 -0700, rusi wrote:
> For someone who delights in pointing out the logical errors
> of others you are often remarkably sloppy in your own logic.
>
> Of course language can be both helpful and excessively strong.
> That is the case when language less strong would be
> equally or more helpful.

It can also be the case when language less strong would be useless.


> Further, "liar" is both so non-objective and so pejoratively
> emotive that it is a word much more likely to be used by
> someone interested in trolling than in a serious discussion,
> so most sensible people here likely would not bite.

Non-objective? If today poster B says X, and tomorrow poster B says s/he was unaware of X until just now, is not "liar"
a reasonable conclusion?


>> I hope that we all agree that we want a nice, friendly, productive
>> community where everyone is welcome.
>
> I hope so too but it is likely that some people want a place
> to develop and assert some sense of influence, engage in verbal
> duels, instigate arguments, etc. That can be true of regulars
> here as well as drive-by posters.
>
>> But some people simply cannot or
>> will not behave in ways that are compatible with those community values.
>> There are some people whom we *do not want here*
>
> In other words, everyone is NOT welcome.

Correct. Do you not agree?


>> -- spoilers and messers,
>> vandals and spammers and cheats and liars and trolls and crackpots of all
>> sorts.
>
> Where those terms are defined by you and a handful of other
> voracious posters. "Troll" in particular is often used to
> mean someone who disagrees with the borg mind here, or who
> says anything negative about Python, or who due attitude or
> lack of full English fluency do not express themselves in
> a sufficiently submissive way.

I cannot speak for the borg mind, but for myself a troll is anyone who continually posts rants (such as RR & XL) or who
continuously hijacks threads to talk about their pet peeve (such as jmf).


>> We only disagree as to the best way to make it clear to them that
>> they are not welcome so long as they continue their behaviour.
>
> No, we disagree on who fits those definitions and even
> how tolerant we are to those who do fit the definitions.
> The policing that you and a handful of other self-appointed
> net-cops try to do is far more obnoxious that the original
> posts are.

I completely disagree, and I am grateful to those who bother to take the time to continually point out the errors from
those posters and to warn newcomers that those posters should not be believed.


> Believe or not, most of the rest of us here are smart enough to
> form our own opinions of such posters without you and the other
> c.l.p truthsquad members telling us what to think.

If one of my first few posts on c.l.p netted a response from a troll I would greatly appreciate a reply from one of the
regulars saying that was a troll so I didn't waste time trying to use whatever they said, or be concerned that the
language I was trying to use and learn was horribly flawed.

If the "truthsquad" posts are so offensive to you, why don't you kill-file them?

--
~Ethan~

jmfauth

unread,
Mar 28, 2013, 5:11:18 PM3/28/13
to
On 28 mar, 21:29, Benjamin Kaplan <benjamin.kap...@case.edu> wrote:
----------


I showed enough examples. As soon as you are using non-Latin-1 chars,
your "optimization" just becomes irrelevant, and not only that, you
are penalized.

I'm sorry, but saying that Python now covers the whole Unicode
range is not a valuable excuse. I prefer a "correct" version with
a narrower range of chars, especially if this range represents
the "daily used chars".

I can go a step further: if I wish to write an application for
Western European users, I'm better served if I'm using a coding
scheme covering all these languages/scripts. What about cp1252 [*]?
Does this not remind you of something?

Python can do better; it only succeeds in doing worse!

[*] yes, I know, internally ....

jmf

jmfauth

unread,
Mar 28, 2013, 5:33:03 PM3/28/13
to
-----

Addendum.

And you know what? Py34 will suffer from the same disease.
You are spending your time improving chunks of bytes
when the problem is elsewhere.
In fact you are working for peanuts, e.g. the replace method.


If you are not satisfied with my examples, just pick up
the examples of GvR (ascii-string) on the bug tracker, "timeit"
them and you will see there is already a problem.

Better, "timeit" them after having replaced his ascii strings
with non-ascii characters...

jmf


Chris Angelico

unread,
Mar 28, 2013, 5:45:36 PM3/28/13
to pytho...@python.org
On Fri, Mar 29, 2013 at 7:26 AM, jmfauth <wxjm...@gmail.com> wrote:
> The wide build (which I never used) is in my mind as correct as
> the narrow build. It "just" covers a different range in Unicode
> (the whole range).

Actually it does; it covers all of the Unicode range, by using
(effectively) UTF-16. Characters that cannot be represented in one
16-bit number are represented in two. That's not "just" covering a
different range. It's being buggy. And it's creating a way for code to
unexpectedly behave fundamentally differently on Windows and Linux
(since the most common builds for Windows were narrow and for Linux
were wide). This is a Bad Thing for Python.

ChrisA

MRAB

unread,
Mar 28, 2013, 5:50:14 PM3/28/13
to pytho...@python.org
>> > be no difference in using an "a", "é", "€" or any character,
If you're that concerned about it, why don't you modify the source code so
that the string representation chooses between only 2 bytes and 4 bytes per
codepoint, and then see whether you prefer that situation. How do
the memory usage and speed compare?

Benjamin Kaplan

unread,
Mar 28, 2013, 5:52:27 PM3/28/13
to pytho...@python.org
By that logic, we should all be using ASCII because it's "correct" for
the 127 characters that I (as an English speaker) use, and therefore
it's all that we should care about. I don't care if é counts as two
characters, it's faster and more memory efficient for all of my
strings to just count bytes. There are certain domains where
characters outside the basic multilingual plane are used. Python's job
is to be correct in all of those circumstances, not just the ones you
care about.

88888 Dihedral

unread,
Mar 28, 2013, 7:04:10 PM3/28/13
to pytho...@python.org
Chris Angelico wrote on Thursday, March 28, 2013 at 11:40:17 AM UTC+8:
> On Thu, Mar 28, 2013 at 2:18 PM, Ethan Furman <et...@stoneleaf.us> wrote:
> > Has anybody else thought that [jmf's] last few responses are starting to sound
> > bot'ish?
>
> Yes, I did wonder. It's like he and Dihedral have been trading
> accounts sometimes. Hey, Dihedral, I hear there's a discussion of
> Unicode and PEP 393 and Python 3.3 and Unicode and lots of keywords
> for you to trigger on and Python and bots are funny and this text is
> almost grammatical!
>
> There. Let's see if he takes the bait.
>
> ChrisA

Well, we need some cheap RAM to hold 4 bytes per character
in a text segment to be observed.

For those not to be observed or shown, the old way still works.

Windows got this job done right to collect taxes in areas
of different languages.

Terry Reedy

unread,
Mar 28, 2013, 7:12:07 PM3/28/13
to pytho...@python.org
On 3/28/2013 4:26 PM, jmfauth wrote:

Please provide references for your assertions. I have read the unicode
standard, parts more than once, and your assertions contradict my memory.

> Unicode does not stipulate that one has to cover the whole range.

I believe it does. As I remember, the recognized encodings all encode
the entire Unicode codepoint range.

> Unicode expects that every character in a range behaves the same
> way.

I have no idea what you mean by 'same way'. Each codepoint is supposed
to behave differently in some way. That is the reason for having
multiple codepoints. One causes an 'a' to appear, another a 'b'. Indeed,
the standard defines multiple categories of codepoints, and chars in
different categories are supposed to act differently (or be treated
differently). Glyphic chars versus control chars are one example.
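
The categories are queryable from the unicodedata module:

import unicodedata

for ch in 'a', '€', '\x07', '\U0001F435':
    print(hex(ord(ch)), unicodedata.category(ch))
# Ll (lowercase letter), Sc (currency symbol), Cc (control) and
# So (other symbol): different categories, different expected behaviour.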

--
Terry Jan Reedy


Chris Angelico

unread,
Mar 28, 2013, 8:03:07 PM3/28/13
to pytho...@python.org
On Fri, Mar 29, 2013 at 10:53 AM, Dennis Lee Bieber
<wlf...@ix.netcom.com> wrote:
> On Wed, 27 Mar 2013 23:12:21 -0700, Ethan Furman <et...@stoneleaf.us>
> declaimed the following in gmane.comp.python.general:
>
>>
>> At some point we have to stop being gentle / polite / politically correct and call a shovel a shovel... er, spade.
>
> Call it an Instrument For the Transplantation of Dirt
>
> (Is an antique Steam Shovel ever a Steam Spade?)

I don't know, but I'm pretty sure there's a private detective who
wouldn't appreciate being called Sam Shovel.

ChrisA

Mark Lawrence

unread,
Mar 28, 2013, 8:15:59 PM3/28/13
to pytho...@python.org
On 28/03/2013 23:53, Dennis Lee Bieber wrote:
> On Wed, 27 Mar 2013 23:12:21 -0700, Ethan Furman <et...@stoneleaf.us>
> declaimed the following in gmane.comp.python.general:
>
>>
>> At some point we have to stop being gentle / polite / politically correct and call a shovel a shovel... er, spade.
>
> Call it an Instrument For the Transplantation of Dirt
>
> (Is an antique Steam Shovel ever a Steam Spade?)
>

Surely you can spade a lot more things than dirt?

--
If you're using GoogleCrap™ please read this
http://wiki.python.org/moin/GoogleGroupsPython.

Mark Lawrence

Steven D'Aprano

unread,
Mar 28, 2013, 8:35:23 PM3/28/13
to
On Thu, 28 Mar 2013 12:54:20 -0700, rurpy wrote:

> Even if you personally would prefer someone to respond by calling you a
> liar, your personal preferences do not form a basis for desirable
> posting behavior here.

Whereas yours apparently are.

Thanks for the feedback, I'll take it under advisement.


--
Steven

Steven D'Aprano

unread,
Mar 28, 2013, 8:39:57 PM3/28/13
to
I wonder why they need care about surrogate pairs?

ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only
strings. It's only strings in the SMPs that could need surrogate pairs,
and they don't need them in Python's implementation since it's a full 32-
bit implementation. So where do the surrogate pairs come into this?

I also wonder why the implementation bothers keeping a UTF-8
representation. That sounds like premature optimization to me. Surely you
only need it when writing to a file with UTF-8 encoding? For most
strings, that will never happen.



--
Steven

Chris Angelico

unread,
Mar 28, 2013, 8:54:41 PM3/28/13
to pytho...@python.org
On Fri, Mar 29, 2013 at 11:39 AM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only
> strings. It's only strings in the SMPs that could need surrogate pairs,
> and they don't need them in Python's implementation since it's a full 32-
> bit implementation. So where do the surrogate pairs come into this?

PEP 393 says:
"""
wstr_length, wstr: representation in platform's wchar_t
(null-terminated). If wchar_t is 16-bit, this form may use surrogate
pairs (in which cast wstr_length differs form length). wstr_length
differs from length only if there are surrogate pairs in the
representation.

utf8_length, utf8: UTF-8 representation (null-terminated).

data: shortest-form representation of the unicode string. The string
is null-terminated (in its respective representation).

All three representations are optional, although the data form is
considered the canonical representation which can be absent only while
the string is being created. If the representation is absent, the
pointer is NULL, and the corresponding length field may contain
arbitrary data.
"""

If the string was created from a wchar_t string, that string will be
retained, and presumably can be used to re-output the original for a
clean and fast round-trip. Same with...

> I also wonder why the implementation bothers keeping a UTF-8
> representation. That sounds like premature optimization to me. Surely you
> only need it when writing to a file with UTF-8 encoding? For most
> strings, that will never happen.

... the UTF-8 version. It'll keep it if it has it, and not else. A lot
of content will go out in the same encoding it came in in, so it makes
sense to hang onto it where possible.

Though, from the same quote: The UTF-8 representation is
null-terminated. Does this mean that it can't be used if there might
be a \0 in the string?

Minor nitpick, btw:
> (in which cast wstr_length differs form length)
Should be "in which case" and "from". Who has the power to correct
typos in PEPs?

ChrisA

Mark Lawrence

unread,
Mar 28, 2013, 9:03:51 PM3/28/13
to pytho...@python.org
On 29/03/2013 00:54, Chris Angelico wrote:
>
> Minor nitpick, btw:
>> (in which cast wstr_length differs form length)
> Should be "in which case" and "from". Who has the power to correct
> typos in PEPs?
>
> ChrisA
>

Sneak it in here? http://bugs.python.org/issue13604

Chris Angelico

unread,
Mar 28, 2013, 9:10:54 PM3/28/13
to pytho...@python.org
On Fri, Mar 29, 2013 at 12:03 PM, Mark Lawrence <bream...@yahoo.co.uk> wrote:
> On 29/03/2013 00:54, Chris Angelico wrote:
>> Minor nitpick, btw:
>>>
>>> (in which cast wstr_length differs form length)
>>
>> Should be "in which case" and "from". Who has the power to correct
>> typos in PEPs?
>
Ah! Turns out it's already been fixed; a reword of that section, as
shown in the attached files, no longer has the parenthesis, and thus
its typos.

ChrisA

MRAB

unread,
Mar 28, 2013, 10:00:24 PM3/28/13
to pytho...@python.org
On 29/03/2013 00:54, Chris Angelico wrote:
> On Fri, Mar 29, 2013 at 11:39 AM, Steven D'Aprano
> <steve+comp....@pearwood.info> wrote:
>> ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only
>> strings. It's only strings in the SMPs that could need surrogate pairs,
>> and they don't need them in Python's implementation since it's a full 32-
>> bit implementation. So where do the surrogate pairs come into this?
>
> PEP 393 says:
> """
> wstr_length, wstr: representation in platform's wchar_t
> (null-terminated). If wchar_t is 16-bit, this form may use surrogate
> pairs (in which cast wstr_length differs form length). wstr_length
> differs from length only if there are surrogate pairs in the
> representation.
>
> utf8_length, utf8: UTF-8 representation (null-terminated).
>
> data: shortest-form representation of the unicode string. The string
> is null-terminated (in its respective representation).
>
> All three representations are optional, although the data form is
> considered the canonical representation which can be absent only while
> the string is being created. If the representation is absent, the
> pointer is NULL, and the corresponding length field may contain
> arbitrary data.
> """
>
> If the string was created from a wchar_t string, that string will be
> retained, and presumably can be used to re-output the original for a
> clean and fast round-trip. Same with...
>
>> I also wonder why the implementation bothers keeping a UTF-8
>> representation. That sounds like premature optimization to me. Surely you
>> only need it when writing to a file with UTF-8 encoding? For most
>> strings, that will never happen.
>
> ... the UTF-8 version. It'll keep it if it has it, and not else. A lot
> of content will go out in the same encoding it came in in, so it makes
> sense to hang onto it where possible.
>
> Though, from the same quote: The UTF-8 representation is
> null-terminated. Does this mean that it can't be used if there might
> be a \0 in the string?
>
You could ask the same question about any encoding.

It's only an issue if it's passed to a C function which expects a
null-terminated string.

Steven D'Aprano

unread,
Mar 28, 2013, 10:37:55 PM3/28/13
to
On Fri, 29 Mar 2013 11:54:41 +1100, Chris Angelico wrote:

> On Fri, Mar 29, 2013 at 11:39 AM, Steven D'Aprano
> <steve+comp....@pearwood.info> wrote:
>> ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only
>> strings. It's only strings in the SMPs that could need surrogate pairs,
>> and they don't need them in Python's implementation since it's a full
>> 32- bit implementation. So where do the surrogate pairs come into this?
>
> PEP 393 says:
> """
> wstr_length, wstr: representation in platform's wchar_t
> (null-terminated). If wchar_t is 16-bit, this form may use surrogate
> pairs (in which cast wstr_length differs form length). wstr_length
> differs from length only if there are surrogate pairs in the
> representation.
>
> utf8_length, utf8: UTF-8 representation (null-terminated).
>
> data: shortest-form representation of the unicode string. The string is
> null-terminated (in its respective representation).
>
> All three representations are optional, although the data form is
> considered the canonical representation which can be absent only while
> the string is being created. If the representation is absent, the
> pointer is NULL, and the corresponding length field may contain
> arbitrary data.
> """

All the words are in English (well, most of them...) but what does it
mean?

> If the string was created from a wchar_t string, that string will be
> retained, and presumably can be used to re-output the original for a
> clean and fast round-trip.

Under what circumstances will a string be created from a wchar_t string?
How, and why, would such a string be created? Why would Python still
support strings containing surrogates when it now has a nice, shiny,
surrogate-free flexible representation?



>> I also wonder why the implementation bothers keeping a UTF-8
>> representation. That sounds like premature optimization to me. Surely
>> you only need it when writing to a file with UTF-8 encoding? For most
>> strings, that will never happen.
>
> ... the UTF-8 version. It'll keep it if it has it, and not else. A lot
> of content will go out in the same encoding it came in in, so it makes
> sense to hang onto it where possible.

Not to me. That almost doubles the size of the string, on the off-chance
that you'll need the UTF-8 encoding. Which for many uses, you don't, and
even if you do, it seems like premature optimization to keep it around
just in case. Encoding to UTF-8 will be fast for small N, and for large
N, why carry around (potentially) multiple megabytes of duplicated data
just in case the encoded version is needed some time?
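
A rough estimate of that overhead (a sketch; 'é' is 1 byte in the
canonical form but 2 bytes in UTF-8):

import sys

s = 'é' * 10**7                 # ~10 MB of Latin-1 text, 1 byte per char
print(sys.getsizeof(s))         # ~10 MB for the canonical form
print(len(s.encode('utf-8')))   # ~20 MB more, were a UTF-8 copy kept around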

Chris Angelico

unread,
Mar 28, 2013, 10:44:50 PM3/28/13
to pytho...@python.org
On Fri, Mar 29, 2013 at 1:37 PM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> Under what circumstances will a string be created from a wchar_t string?
> How, and why, would such a string be created? Why would Python still
> support strings containing surrogates when it now has a nice, shiny,
> surrogate-free flexible representation?

Strings are created from some form of content. If not from another
Python string, then - most likely - it's from a stream of bytes. If
from a C API that returns wchar_t, then it'd make sense to have that
form around.

ChrisA

Neil Hodgson

unread,
Mar 28, 2013, 11:34:27 PM3/28/13
to
Steven D'Aprano:

> Some string operations need to inspect every character, e.g. str.upper().
> Even for them, the increased complexity of a variable-width encoding
> costs. It's not sufficient to walk the string inspecting a fixed 1, 2 or
> 4 bytes per character. You have to walk the string grabbing 1 byte at a
> time, and then decide whether you need another 1, 2 or 3 bytes. Even
> though it's still O(N), the added bit-masking and overhead of variable-
> width encoding adds to the overall cost.

It does add to implementation complexity but should only add a small
amount of time.

To compare costs, I am using the text of the web site
http://www.mofa.go.jp/mofaj/ since it has a reasonable amount (10%) of
multi-byte characters. Since the document fits in the BMP, Python
would choose a 2-byte wide implementation, so I am emulating that choice
with a very simple 16-bit table-based upper-caser.
conversion code is more concerned with edge cases like Turkic and
Lithuanian locales and Greek combining characters and also allowing for
measurement/reallocation for the cases where the result is
smaller/larger. See, for example, glib's real_toupper in
https://git.gnome.org/browse/glib/tree/glib/guniprop.c

Here is some simplified example code that implements upper-casing
over 16-bit wide (utf16_up) and UTF-8 (utf8_up) buffers:
http://www.scintilla.org/UTF8Up.cxx
Since I didn't want to spend too much time writing code it only
handles the BMP and doesn't have upper-case table entries outside ASCII
for now. If this was going to be worked on further to be made
maintainable, most of the masking and so forth would be in macros
similar to UTF8_COMPUTE/UTF8_GET in glib.
The UTF-8 case ranges from around 5% slower on average in a 32 bit
release build (VC2012 on an i7 870) to averaging a little faster in a
64-bit build. They're both around a billion characters per-second.

C:\u\hg\UpUTF\UpUTF>..\x64\Release\UpUTF.exe
Time taken for UTF8 of 80449=0.006528
Time taken for UTF16 of 71525=0.006610
Relative time taken UTF8/UTF16 0.987581

> Any string method that takes a starting offset requires the method to
> walk the string byte-by-byte. I've even seen languages put responsibility
> for dealing with that onto the programmer: the "start offset" is given in
> *bytes*, not characters. I don't remember what language this was... it
> might have been Haskell? Whatever it was, it horrified me.

It doesn't horrify me - I've been working this way for over 10 years
and it seems completely natural. You can wrap access in iterators that
hide the byte offsets if you like (a minimal sketch follows below). This
then ensures that all operations on those iterators are safe, only
allowing the iterator to point at the start/end of valid characters.
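
A minimal sketch of such an iterator over a UTF-8 byte buffer,
assuming well-formed input:

def utf8_chars(buf):
    # Yield (byte_offset, character) pairs from valid UTF-8 bytes.
    i = 0
    while i < len(buf):
        b = buf[i]
        if b < 0x80:       # 1-byte (ASCII) sequence
            n = 1
        elif b < 0xE0:     # 2-byte sequence
            n = 2
        elif b < 0xF0:     # 3-byte sequence
            n = 3
        else:              # 4-byte sequence
            n = 4
        yield i, buf[i:i+n].decode('utf-8')
        i += n

for offset, ch in utf8_chars('aé€\U0001F435'.encode('utf-8')):
    print(offset, ch)      # byte offsets 0, 1, 3, 6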

> Sure. And over a different set of samples, it is less compact. If you
> write a lot of Latin-1, Python will use one byte per character, while
> UTF-8 will use two bytes per character.

I think you mean writing a lot of Latin-1 characters outside ASCII.
However, even people writing texts in, say, French will find that only a
small proportion of their text is outside ASCII and so the cost of UTF-8
is correspondingly small.

The counter-problem is that a French document that needs to include
one mathematical symbol outside Latin-1 will double in size as a
Python string (and quadruple, if the symbol is an emoji outside the BMP).

Neil

Neil Hodgson

unread,
Mar 28, 2013, 11:57:20 PM3/28/13
to
MRAB:

> Implementing the regex module (http://pypi.python.org/pypi/regex) would
> have been more difficult if the internal representation had been UTF-8,
> because of the need to decode, and the implementation would also have
> been slower for that reason.

One way to build regex support for UTF-8 is to build a fixed width
version of the regex code and then interpose an object that converts
between the UTF-8 representation and that code.

The C++11 standard library contains a regex template that can be
instantiated over a UTF-8 representation in this way.

Neil

Ethan Furman

unread,
Mar 29, 2013, 12:56:05 AM3/29/13
to pytho...@python.org
On 03/28/2013 08:34 PM, Neil Hodgson wrote:
> Steven D'Aprano:
>
>> Any string method that takes a starting offset requires the method to
>> walk the string byte-by-byte. I've even seen languages put responsibility
>> for dealing with that onto the programmer: the "start offset" is given in
>> *bytes*, not characters. I don't remember what language this was... it
>> might have been Haskell? Whatever it was, it horrified me.
>
> It doesn't horrify me - I've been working this way for over 10 years and it seems completely natural.

Horrifying or not, I am willing to give up a small amount of speed for correctness. Heck, I'm willing to give up a lot
of speed for correctness. Once I have my slow but correct prototype going I can recode in a faster language (if needed)
and compare its blazingly fast output with my slowly-generated but known-good output.

> You can wrap
> access in iterators that hide the byte offsets if you like. This then ensures that all operations on those iterators are
> safe only allowing the iterator to point at the start/end of valid characters.

Sure. Or I can let Python handle it for me.


> The counter-problem is that a French document that needs to include one mathematical symbol (or emoji) outside
> Latin-1 will double in size as a Python string.

True. But how often do you have the entire document as a single string? Use readlines() instead of read(). Besides,
memory is cheap.

--
~Ethan~

Chris Angelico

unread,
Mar 29, 2013, 1:33:52 AM3/29/13
to pytho...@python.org
On Fri, Mar 29, 2013 at 2:34 PM, Neil Hodgson <nhod...@iinet.net.au> wrote:
> It doesn't horrify me - I've been working this way for over 10 years and
> it seems completely natural. You can wrap access in iterators that hide the
> byte offsets if you like. This then ensures that all operations on those
> iterators are safe only allowing the iterator to point at the start/end of
> valid characters.

But both this and your example of case conversion are, fundamentally,
iterating over the string. What if you aren't doing that? What if you
want to parse and process?

ChrisA

Neil Hodgson

unread,
Mar 29, 2013, 1:46:14 AM3/29/13
to
Chris Angelico:

> But both this and your example of case conversion are, fundamentally,
> iterating over the string. What if you aren't doing that? What if you
> want to parse and process?

Parsing is also normally a scanning operation. If you want to
process pieces of the string based on the parse then you remember the
positions (as iterators) at the significant places and extract/process
the data based on those positions.

Neil

Ian Kelly

unread,
Mar 29, 2013, 2:11:37 AM3/29/13
to Python
On Thu, Mar 28, 2013 at 8:37 PM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
>>> I also wonder why the implementation bothers keeping a UTF-8
>>> representation. That sounds like premature optimization to me. Surely
>>> you only need it when writing to a file with UTF-8 encoding? For most
>>> strings, that will never happen.
>>
>> ... the UTF-8 version. It'll keep it if it has it, and not else. A lot
>> of content will go out in the same encoding it came in in, so it makes
>> sense to hang onto it where possible.
>
> Not to me. That almost doubles the size of the string, on the off-chance
> that you'll need the UTF-8 encoding. Which for many uses, you don't, and
> even if you do, it seems like premature optimization to keep it around
> just in case. Encoding to UTF-8 will be fast for small N, and for large
> N, why carry around (potentially) multiple megabytes of duplicated data
> just in case the encoded version is needed some time?

From the PEP:

"""
A new function PyUnicode_AsUTF8 is provided to access the UTF-8
representation. It is thus identical to the existing
_PyUnicode_AsString, which is removed. The function will compute the
utf8 representation when first called. Since this representation will
consume memory until the string object is released, applications
should use the existing PyUnicode_AsUTF8String where possible (which
generates a new string object every time). APIs that implicitly
converts a string to a char* (such as the ParseTuple functions) will
use PyUnicode_AsUTF8 to compute a conversion.
"""

So the utf8 representation is not populated when the string is
created, but when a utf8 representation is requested, and only when
requested by the API that returns a char*, not by the API that returns
a bytes object.
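
That split is visible from Python: str.encode (the
PyUnicode_AsUTF8String path) hands back a fresh bytes object on every
call, while the cached char* is only reachable from C. A quick check:

s = 'naïve'
a = s.encode('utf-8')
b = s.encode('utf-8')
print(a == b, a is b)   # True False: equal contents, distinct objects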

Ian Kelly

unread,
Mar 29, 2013, 2:22:08 AM3/29/13
to Python
On Fri, Mar 29, 2013 at 12:11 AM, Ian Kelly <ian.g...@gmail.com> wrote:
> From the PEP:
>
> """
> A new function PyUnicode_AsUTF8 is provided to access the UTF-8
> representation. It is thus identical to the existing
> _PyUnicode_AsString, which is removed. The function will compute the
> utf8 representation when first called. Since this representation will
> consume memory until the string object is released, applications
> should use the existing PyUnicode_AsUTF8String where possible (which
> generates a new string object every time). APIs that implicitly
> converts a string to a char* (such as the ParseTuple functions) will
> use PyUnicode_AsUTF8 to compute a conversion.
> """
>
> So the utf8 representation is not populated when the string is
> created, but when a utf8 representation is requested, and only when
> requested by the API that returns a char*, not by the API that returns
> a bytes object.

Since the PEP specifically mentions ParseTuple string conversion, I am
thinking that this is probably the motivation for caching it. A
string that is passed into a C function (that uses one of the various
UTF-8 char* format specifiers) is perhaps likely to be passed into
that function again at some point, so the UTF-8 representation is kept
around to avoid the need to recompute it on each call.

Grant Edwards

unread,
Mar 29, 2013, 10:52:40 AM3/29/13
to
On 2013-03-28, Ethan Furman <et...@stoneleaf.us> wrote:

> I cannot speak for the borg mind, but for myself a troll is anyone
> who continually posts rants (such as RR & XL) or who continuously
> hijacks threads to talk about their pet peeve (such as jmf).

Assuming jmf actually does care deeply and genuinely about Unicode
implementations, and his postings reflect his actual position/opinion,
then he's not a troll. Traditionally, a troll is someone who posts
statements purely to provoke a response -- they don't really care
about the topic and often don't believe what they're posting.

--
Grant Edwards               grant.b.edwards at gmail.com
                            Yow! BARBARA STANWYCK makes me nervous!!