
O(n^2) is bad - can it be fixed?


Helen Dawson

May 22, 2001, 1:38:47 AM
to
The following script tests appending to a list and to a string.


def AppendTest(listorstring):
    for i in range(0, 100):
        for x in range(0, 2000):
            listorstring += " "
        print i

AppendTest([])
AppendTest("")


The test with the list runs very swiftly and smoothly - the numbers
are printed out at a steady rate. The test with the string runs
slowly, and the speed that the numbers appear slows down - a lot. In
fact, I have never seen the string test end - I always abort it out
of boredom before it gets to twenty five.

Without even looking in the Python internals it is obvious what is
happening. In the string case, each time a character is appended the
entire string is copied. Therefore adding n characters requires
n^2/2 bytes to be copied, making processing large files (where large
is > about 10K) very inefficient - and processing files larger
than about 100K is virtually impossible.

List does much better. Why? It probably does the same thing that the
STL vector typically does - the allocated space doubles whenever it
runs out of room, and then the newly allocated space is filled in
as needed.

So, why can't string do that?

When we could only go string = string + " " then fixing this problem
was probably impossible, but with += it seems like it should be quite
possible - maybe even easy!

Right now if I am producing a text file a character at a time I have
to do it to a list and then use string.join() to convert at the end.
That works, but seems clunky.
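
For concreteness, the list-plus-join idiom looks roughly like this (a minimal
sketch in Python 2.x style; the function and variable names are just
illustrative, not from any real program):

def build_spaces(n):
    chunks = []
    for i in range(n):
        chunks.append(" ")      # amortized O(1) per append
    return "".join(chunks)      # one O(n) copy at the very end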

It's also a landmine for the naive. I recognized the problem
immediately, but many people would simply assume that Python was slow,
when it took an hour to process a one Mbyte file.

I'm using Python 2.1, on Windows.

Bruce Dawson.

P.S. Is this a known issue?


Chris Tavares

May 22, 2001, 2:44:22 AM
to
"Helen Dawson" <hel...@accessone.com> wrote in message
news:3B09FBF9...@accessone.com...

> The following script tests appending to a list and to a string.
>
[... snipped offending code ...]

>
> So, why can't string do that?
>
> When we could only go string = string + " " then fixing this problem
> was probably impossible, but with += it seems like it should be quite
> possible - maybe even easy!
>

You're forgetting one thing - strings are IMMUTABLE. Doing s += " " doesn't
add anything to s, it creates a NEW string whose contents just happen to be
a space concatenated to the old s. And please don't ask for mutable
strings - immutability is the right answer here (just look at all the
complexities that the C++ std lib people went through to make
std::basic_string mutable, and it still doesn't work correctly in many
cases.)

> Right now if I am producing a text file a character at a time I have
> to do it to a list and then use string.join() to convert at the end.
> That works, but seems clunky.
>

This is one correct solution. Strings are not buffers, and should not be
used as such.

Another solution would be to use something that IS expected to act like a
buffer - the cStringIO module. The interface is a little different - this
one works like a file:

import cStringIO

def AppendTest2( io ):
    for i in range(100):
        for j in range(2000):
            io.write( " " )
        print i

This is visibly faster on my box (Win2k, Python 2.1) than either the list or
string versions shown above.
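
One detail the snippet above leaves out is getting the final string back;
cStringIO's getvalue() does that. A tiny illustrative sketch:

import cStringIO

buf = cStringIO.StringIO()
buf.write("Hello, ")
buf.write("world")
print buf.getvalue()        # Hello, world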

--

Chris Tavares

E. Mark Ping

May 22, 2001, 2:46:35 AM
to
In article <3B09FBF9...@accessone.com>,

Helen Dawson <hel...@accessone.com> wrote:
>List does much better. Why? It probably does the same thing that the
>STL vector typically does - the allocated space doubles whenever it
>runs out of room, and then the newly allocated space is filled in
>as needed.

Doubtful. It most likely does what std::list<> does--it allocates
space each time a new entry is added. That's linear, not n^2 time.
std::vector<> uses geometric allocation to get amortized constant
allocation time.

If you need to add one character at a time, choose the data structure
that does what you want. Perhaps a list of char?
--
Mark Ping
ema...@soda.CSUA.Berkeley.EDU

kosh

May 22, 2001, 2:49:04 AM
to
Helen Dawson wrote:


The reason that is slow is that you modify the length of the string many
times. The list one is far more efficient, especially with a few
modifications. For the list one, when you do a .join you only put all of
them together once instead of adding them together many times.

However also consider this case

list = []
for i in somelist:
list.append(i)
string.join(list, "")

A faster way than that is
list = []
append = list.append
for i in somelist:
    append(i)
string.join(list, "")

That works faster because it removes a lookup operation from every iteration
of the loop. I know this doesn't precisely answer your question but it
should enable you to code a much faster solution.

Delaney, Timothy

May 22, 2001, 2:12:51 AM
to pytho...@python.org
> def AppendTest(listorstring):
>     for i in range(0, 100):
>         for x in range(0, 2000):
>             listorstring += " "
>         print i
>
> AppendTest([])
> AppendTest("")
>
> The test with the list runs very swiftly and smoothly - the numbers
> are printed out at a steady rate. The test with the string runs
> slowly, and the speed that the numbers appear slows down - a lot. In
> fact, I have never seen the string test end - I always abort it out
> of boredom before it gets to twenty five.
>
> Without even looking in the Python internals it is obvious what is
> happening. In the string case, each time a character is appended the
> entire string is copied. Therefore adding n characters requires
> n^2/2 bytes to be copied, making processing large files (where large
> is > about 10K) very inefficient - and processing files larger
> than about 100K is virtually impossible.

[snip]

> When we could only go string = string + " " then fixing this problem
> was probably impossible, but with += it seems like it should be quite
> possible - maybe even easy!
>
> Right now if I am producing a text file a character at a time I have
> to do it to a list and then use string.join() to convert at the end.
> That works, but seems clunky.

Strings are immutable. When you append to a string, it creates a new string
(as you correctly surmised).

Having immutable strings means that instances can be shared. The string
doesn't change behind your back. This is a Good Thing (TM). If you have used
Java it has the same thing: strings are immutable. If you want a mutable
string, you have to use a StringBuffer.

There are a number of pythonic ways of doing what you want. One is the one
you discovered: concatenating into a list, then joining.

Another method would be to use the array built-in module. This will probably
save memory over the list version, and is almost the equivalent of the Java
StringBuffer, except it's so much more ;) Testing both of these though,
using a list was *much* faster than using an array.

Perhaps you need to look at what you really want to do. As I understand it,
you wish to:

1. Build a string
2. Write the string to a file.

Is there perhaps a better way to do what you want?

Or possibly you wish to read a large file into memory. Is it text? In that
case, there are several standard ways to do this in python. The main methods
you want to look for are:

readlines()
xreadlines()
readline()

In addition, your basic function has a major problem, aside from the test.
When you pass in a list, you are modifying that list, and you will be able
to use the changed contents outside of the function. But when you pass in a
string, you're not modifying the string - you're creating a new one each
time. Unless you pass the generated string as a return value, you won't be
able to use it outside of the function. All you are doing is rebinding the
local reference to the string.
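
A small sketch of that rebinding point (illustrative code, not from the
original post):

def append_space(s):
    s += " "                    # builds a new string, rebinds the local name s
    return s

text = "abc"
print repr(append_space(text)) # 'abc '
print repr(text)               # 'abc' - the caller's string is untouched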

The difference between mutable and immutable objects is important in any
language. Python is one where it is vital.

Tim Delaney

Delaney, Timothy

May 22, 2001, 2:18:21 AM
to pytho...@python.org
> Another method would be to use the array built-in module. This will
> probably save memory over the list version, and is almost the
> equivalent of the Java StringBuffer, except it's so much more ;)
> Testing both of these though, using a list was *much* faster than
> using an array.

Oops - this is misleading. I stupidly did this test incorrectly - I changed
the function to use a try/except to allow it to use strings, lists and
arrays ...

try:
    listorstring += ' '
except:
    listorstring.append(' ')

Now, who wants to point out the flaw of comparing using a list (which has
.append() as well as direct concatenation), and using an array (which only
has .append()) ...

Yep, I was taking the overhead of throwing an exception on the array test.

When I changed it around

try:
    listorstring.append(' ')
except:
    listorstring += ' '

lists and arrays ran pretty close to the same (I didn't do any formal
timing).

Bah! Get the algorithm right!

Tim Delaney

Courageous

May 22, 2001, 3:32:45 AM
to

>The test with the list runs very swiftly and smoothly - the numbers
>are printed out at a steady rate. The test with the string runs
>slowly, ...

The correct thing to do if performance becomes important
to you in this context is to convert the string to a list,
manipulate it as a list, and then convert it back to a string.
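
In other words, something roughly like this (an illustrative sketch, assuming
the edits really are per-character):

s = "hello world"
chars = list(s)         # string -> list of one-character strings
chars[0] = "H"          # cheap in-place edits on the list
chars.append("!")
s = "".join(chars)      # list -> string again when you're done
print s                 # Hello world!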


C//

Isaac To Kar Keung

May 22, 2001, 3:08:16 AM
to
>>>>> "E" == E Mark Ping <ema...@CSUA.Berkeley.EDU> writes:

E> Doubtful. It most likely does what std::list<> does--it allocates
E> space each time a new entry is added. That's linear, not n^2 time.
E> std::vector<> uses geometric allocation to get amortized constant
E> allocation time.

That's impossible. If allocation were done every time, just the copying that
would be needed would cost quadratic time. Do read the manual. There is
something called the "capacity" of a vector, which can be larger than the
"size" of a vector. The capacity is the size to which the vector must grow
before the next time new space is allocated.

Regards,
Isaac.

Alex Martelli

May 22, 2001, 3:31:42 AM
to
"Helen Dawson" <hel...@accessone.com> wrote in message
news:3B09FBF9...@accessone.com...
> The following script tests appending to a list and to a string.
>
> def AppendTest(listorstring):
>     for i in range(0, 100):
>         for x in range(0, 2000):
>             listorstring += " "
>         print i
>
> AppendTest([])
> AppendTest("")
>
> The test with the list runs very swiftly and smoothly - the numbers
> are printed out at a steady rate. The test with the string runs
> slowly, and the speed that the numbers appear slows down - a lot. In
> fact, I have never seen the string test end - I always abort it out
> of boredom before it gets to twenty five.
>
> Without even looking in the Python internals it is obvious what is
> happening. In the string case, each time a character is appended the
> entire string is copied. Therefore adding n characters requires

Right. String objects are immutable, so it couldn't possibly
be otherwise.

> n^2/2 bytes to be copied, making processing large files (where large
> is > about 10K) very inefficient - and processing files larger
> than about 100K is virtually impossible.
>
> List does much better. Why? It probably does the same thing that the
> STL vector typically does - the allocated space doubles whenever it
> runs out of room, and then the newly allocated space is filled in
> as needed.

Not quite -- Python's sources are pretty clear about that:-). A
list DOES allocate a few extra entries when it grows, but it does
NOT grow geometrically. So IN THEORY appending should be O(N^2)
for lists as well -- but in practice it seems N never gets big
enough to actually measure that effect.

> So, why can't string do that?

A string object is immutable. So, what would it do with any
extra allocated space?


> When we could only go string = string + " " then fixing this problem
> was probably impossible, but with += it seems like it should be quite
> possible - maybe even easy!

Since a string object is immutable, it doesn't define "in-place
addition" -- thus the semantics of the two forms are identical.

Python's engine *MIGHT* perhaps special-case the occasions where
a string object is not interned AND there is just one reference
to it, but currently the engine doesn't do ANY "under the covers
black magic optimization", even for much more relevant issues
such as calling methods on objects, &c. The lack of any tricky
optimization keeps the engine simple, lightweight, etc, etc.


> Right now if I am producing a text file a character at a time I have
> to do it to a list and then use string.join() to convert at the end.
> That works, but seems clunky.

Clunkiness is in the eye of the beholder. Others would consider
"mutable strings" as clunkier, and Java's "solution" of having
separate "string" (immutable) and "string buffer" (mutable) types,
while feasible, hasn't won it great praise either.


> It's also a landmine for the naive. I recognized the problem
> immediately, but many people would simply assume that Python was slow,
> when it took an hour to process a one Mbyte file.

Education seems a better answer to that than black-magic
under-the-cover optimizations that would break as soon as
one had a second reference extant to a string object:-).


> I'm using Python 2.1, on Windows.
>
> Bruce Dawson.
>
> P.S. Is this a known issue?

Yes. http://www.python.org/doc/essays/list2str.html mentions it,
for example, as does FAQ 4.7 (http://www.python.org/doc/FAQ.html).


Alex

Alex Martelli

May 22, 2001, 4:45:53 AM
to
"E. Mark Ping" <ema...@CSUA.Berkeley.EDU> wrote in message
news:9ed20b$jfj$1...@agate.berkeley.edu...

> In article <3B09FBF9...@accessone.com>,
> Helen Dawson <hel...@accessone.com> wrote:
> >List does much better. Why? It probably does the same thing that the
> >STL vector typically does - the allocated space doubles whenever it
> >runs out of room, and then the newly allocated space is filled in
> >as needed.
>
> Doubtful. It most likely does what std::list<> does--it allocates

No need to wonder in a void about such issues... listobject.c is
open for inspection in the Objects directory of any Python source
distribution! A python list is nothing like a C++ std::list --
the latter is a doubly-linked-list (fast insertion/removal at
any point, but O(N) to get the Nth item), the former is actually
an array or vector.

> space each time a new entry is added. That's linear, not n^2 time.
> std::vector<> uses geometric allocation to get amortized constant
> allocation time.
>
> If you need to add on character at a time, choose the data structure
> that does what you want. Perhaps a list of char?

Testing is better than wondering:

import time, array

def AppendTest(listorarray):
    head = repr(listorarray)
    start = time.clock()
    for i in range(0, 100):
        for x in range(0, 2000):
            listorarray.append(' ')
        # print i
    stend = time.clock()
    print "%s: %.2f" % (head, stend-start)

AppendTest([])
AppendTest(array.array('c'))

D:\py21>python oaat.py
[]: 1.97
array('c'): 1.55

D:\py21>python oaat.py
[]: 1.97
array('c'): 1.56

So, an array.array of characters turns out to be
systematically and repeatably 20% faster than a
list for this task on my box (an old clunky P3-266
NT4SP5 at this time:-). Unsurprising:-) -- it
has less overhead since it keeps 1 byte per char
rather than 4 bytes per item (a reference) plus
space for the char itself.

By the way, once your built-one-char-at-a-time
array.array('c') object is ready, just call its
method .tostring() to get the string equivalent
(if you choose to use a list instead, then the
best way to get the string is ''.join(thelist)).
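
For instance (an illustrative snippet, separate from the timing script):

import array

buf = array.array('c')
for ch in "hello":
    buf.append(ch)
print buf.tostring()    # hello  (array -> string)

chunks = ['h', 'e', 'l', 'l', 'o']
print ''.join(chunks)   # hello  (list -> string)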

The dominance of array.array for this shines
out clearly from Guido's essay/anecdote on
optimization, btw...


Alex

Alex Martelli

May 22, 2001, 4:50:44 AM
to
"Chris Tavares" <ctav...@develop.com> wrote in message
news:aVnO6.8300$9D5.8...@newsread2.prod.itd.earthlink.net...
...

> Another solution would be to use something that IS expected to act like a
> buffer - the cStringIO module. The interface is a little different - this
> one works like a file:

Good point! So I added it to my little measurement script:

def AppendTest(listorarray):
    head = repr(listorarray)
    start = time.clock()
    for i in range(0, 100):
        for x in range(0, 2000):
            listorarray.append(' ')
        # print i
    stend = time.clock()
    print "%s: %.2f" % (head, stend-start)

def AppendTest1(listorarray):
    head = repr(listorarray)
    start = time.clock()
    for i in range(0, 100):
        for x in range(0, 2000):
            listorarray.write(' ')
        # print i
    stend = time.clock()
    print "%s: %.2f" % (head, stend-start)

AppendTest([])
AppendTest(array.array('c'))
AppendTest1(cStringIO.StringIO())

and, whaddyaknow...:

D:\py21>python oaat.py
[]: 1.96
array('c'): 1.57
<StringO object at 007F63B0>: 0.86

D:\py21>python oaat.py
[]: 1.96
array('c'): 1.57
<StringO object at 007F63B0>: 0.87

D:\py21>

Bingo, another "almost a factor of 2" speedup.

Interesting!-) Guido's optimization anecdote
may need updating:-).


Alex

Tim Peters

May 22, 2001, 6:11:36 AM
to pytho...@python.org
[Alex Martelli]
> ...

Curiously, an array.array is usually slower than a list, for most things. It
has overhead of a different kind: storing the raw bytes saves storage, but
it takes more layers of internal function calls to extract the raw bytes from
the Python object when storing them, and similarly to stuff the raw bytes
into a Python object again when retrieving them. A list doesn't do any of
that, it just stores and retrieves "the object" (a pointer).

You catch a break in this particular test case because all one-character
literal strings are interned, and so objects are neither created nor
destroyed each time you pass ' '; in the general case, they are, as objects
are created just to pass to array.append(), then destroyed soon after the
array module extracts their guts. list.append() avoids the "then destroyed"
half of that.

array.append() also needs to rededuce the *type* of the array each time it's
called, so it knows which routine to call to extract the raw bytes (there
aren't umpteen distinct array implementations under the covers, just one that
indirects thru a typecode member); and that's another overhead list.append()
avoids.

But an array.array('c') does win big when it comes time for .tostring(),
since that's just a straight memcpy of the internal array buffer; a
string.join(list, "") has mounds more work to do.

But on the fifth hand, note that array.append doesn't do any "over
allocation" on an append, while list.append does. It's at least curious that
the often decried but rarely measured "quadratic time behavior" for
list.append doesn't show up more often for array.append, as arrays don't do
*anything* to mitigate it -- this really all depends on your system realloc()
behavior.

One reasonable conclusion to draw is that there are no generally applicable
conclusions <wink>.


Isaac To Kar Keung

May 22, 2001, 6:34:10 AM
to
>>>>> "Alex" == Alex Martelli <ale...@yahoo.com> writes:

Alex> Not quite -- Python's sources are pretty clear about that:-). A
Alex> list DOES allocate a few extra entries when it grows, but it does
Alex> NOT grow geometrically. So IN THEORY appending should be O(N^2)
Alex> for lists as well -- but in practice it seems N never gets big
Alex> enough to actually measure that effect.

The relevant code:

#define ROUNDUP(n, PyTryBlock) \
    ((((n)+(PyTryBlock)-1)/(PyTryBlock))*(PyTryBlock))

static int
roundupsize(int n)
{
    if (n < 500)
        return ROUNDUP(n, 10);
    else
        return ROUNDUP(n, 100);
}

That means, when size is below 500, a list size will always be multiple of
10. When size is larger, it is always multiple of 100. Of course, if the
object is already of the right size, the system resize() does nothing.
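
A quick Python rendering of that helper, for anyone who wants to poke at the
rounding rule interactively (my own translation of the C above, using Python 2
integer division):

def roundupsize(n):
    if n < 500:
        return ((n + 9) / 10) * 10      # round up to a multiple of 10
    else:
        return ((n + 99) / 100) * 100   # round up to a multiple of 100

print roundupsize(497)   # 500
print roundupsize(503)   # 600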

Seems like magic to me... I run the following program and I didn't see *any*
slowdown, when each of the lists is as large as 8M, eating up 60M memory...

a = []
b = []
i = 0
while 1:
    i = i + 1
    if i % 100000 == 0:
        print i
    a.append(0)
    b.append(0)

(Will it be the system realloc which actually do the geometric scaling, or
is it really that geometric scaling is unnecessary?)

Regards,
Isaac.

Roy Smith

May 22, 2001, 8:54:52 AM
to
"Alex Martelli" <ale...@yahoo.com> wrote:
> Not quite -- Python's sources are pretty clear about that:-). A
> list DOES allocate a few extra entries when it grows, but it does
> NOT grow geometrically. So IN THEORY appending should be O(N^2)
> for lists as well -- but in practice it seems N never gets big
> enough to actually measure that effect.

How does python deal with sparse lists, i.e. what if I did this:

list = []
list[9001] = 'foo'
for i in range(9000):
    list[i] = ' '
del list[9001]

would the list[9001] = 'foo' statement force enough memory to be
pre-allocated for the entire list, allowing the loop to run in linear time?

Alex Martelli

May 22, 2001, 8:52:57 AM
to
"Tim Peters" <tim...@home.com> wrote in message
news:mailman.990526353...@python.org...
> [Alex Martelli]
...

> > So, an array.array of characters turns out to be
> > systematically and repeatably 20% faster than a
> > list for this task on my box (an old clunky P3-266
...

> One reasonable conclusion to draw is that there are no generally
> applicable conclusions <wink>.

...particularly given that cStringIO appears to be
almost twice as fast again as the array.array('c')
for this specific task... as per my followup post.

I think the generally applicable conclusion is:
*MEASURE* where the time is going, if/when speed
is a concern when your application is running. And
if the bottleneck turns out to be a "one char at a
time" string building, try lists, arrays, cStringIO's,
AND powdered bat wings dried in the moonlight over
a hanged man's grave -- one never knows...


Alex

Alex Martelli

May 22, 2001, 9:34:39 AM
to
"Roy Smith" <r...@panix.com> wrote in message
news:roy-916424.0...@news1.panix.com...
...

> How does python deal with sparse lists,

By raising exceptions, at least in your example.

> i.e. what if I did this:
>
> list = []
> list[9001] = 'foo'

This raises an IndexError. Python lists never
automatically "grow" when an assignment statement
attempts to bind an item at an invalid index --
rather, they raise IndexError. To "grow" a list,
you may assign appropriately to a SLICE of it,
but that can never cause "sparsity" either. Or
you can call .append or .extend on the list, but,
again, no 'sparsity' can thereby be caused.

The normal way to say "add 9001 items with value
None to my list" is
mylist.extend(9001*[None])
and this, too, can never cause "sparsity".

To say "extend my list to 9001 items long no
matter how long it is now":
mylist.extend((9001-len(mylist))*[None])

Note that, if len(mylist)>=9001 when you do
this, no problem -- N*[None] produces the
empty list if N<=0, so everything works out
fine - mylist.extend([]) is a quiet no-op:-).


Assigning to a *dictionary* item with a key (index)
that is not yet present does 'grow' the dictionary,
but by one item only -- again, no 'sparsity' issue
can possibly arise.

If you do need to handle "sparse lists" a lot
in your applications, you'll probably want to
wrap the "sparse list" implementation into a
class. One might start from UserList, but I
suspect that little of its implementation would
in fact be reusable -- it's likely that the
best underlying implementation data structure
is a dictionary plus perhaps some auxiliary
information to allow faster computation of,
e.g., __len__.
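
For what it's worth, a bare-bones sketch of such a dictionary-backed class
might look like this (the name and interface are made up for illustration; a
real version would also want slicing, deletion, iteration and so on):

class SparseList:
    def __init__(self, default=None):
        self.items = {}         # index -> value, only for slots actually set
        self.length = 0
        self.default = default
    def __setitem__(self, index, value):
        self.items[index] = value
        if index >= self.length:
            self.length = index + 1
    def __getitem__(self, index):
        return self.items.get(index, self.default)
    def __len__(self):
        return self.length

sl = SparseList()
sl[9001] = 'foo'                   # no 9000 intermediate slots get allocated
print sl[9001], sl[42], len(sl)    # foo None 9002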


Alex

Michael Hudson

May 22, 2001, 1:31:54 PM
to
"Alex Martelli" <ale...@yahoo.com> writes:

> ...particularly given that cStringIO appears to be
> almost twice as fast again as the array.array('c')
> for this specific task... as per my followup post.

One of the reasons array.array often isn't especially quick is the
conversions that must take place every time something is put into or
accessed in the array.

Cheers,
M.

--
You owe The Oracle a TV with an 'intelligence' control - I've
tried 'brightness' but that didn't work.
-- Internet Oracularity #1192-01

Michael Hudson

May 22, 2001, 1:33:01 PM
to
Michael Hudson <m...@python.net> writes:

> "Alex Martelli" <ale...@yahoo.com> writes:
>
> > ...particularly given that cStringIO appears to be
> > almost twice as fast again as the array.array('c')
> > for this specific task... as per my followup post.
>
> One of the reasons array.array often isn't especially quick is the
> conversions that must take place every time something is put into or
> accessed in the array.

Duh! Tim said that in the article you were replying to.

Cheers,
M.

--
There are 'infinite' number of developed artifacts and one cannot
develop appreciation for them all. It would be all right to not
understand something, but it would be imbecilic to put judgements
on things one don't understand. -- Xah, comp.lang.lisp

E. Mark Ping

May 22, 2001, 2:11:15 PM
to
In article <7ik839g...@enark.csis.hku.hk>,

Note please that I was writing about list, not vector when an entry is
added. I clearly stated that vector uses a geometric algorithm (like
doubling capacity when existing capacity is exhausted).
--
Mark Ping
ema...@soda.CSUA.Berkeley.EDU

Ben Hutchings

May 22, 2001, 4:38:03 PM
to
Isaac To Kar Keung <kk...@csis.hku.hk> writes:
<snip>

> That means, when size is below 500, a list size will always be multiple of
> 10. When size is larger, it is always multiple of 100. Of course, if the
> object is already of the right size, the system resize() does nothing.
>
> Seems like magic to me... I run the following program and I didn't see *any*
> slowdown, when each of the lists is as large as 8M, eating up 60M memory...
<snip>

> (Will it be the system realloc which actually do the geometric scaling, or
> is it really that geometric scaling is unnecessary?)

As the list buffer grows, realloc() will at some point find that the
only place to put it is at the top of the heap. Once it's there,
further realloc() calls will grow the heap and will not need to move
the buffer. However, other systems - those with a shared memory
space, or without a smart implementation of realloc() - are likely to
behave differently.

--
Any opinions expressed are my own and not necessarily those of Roundpoint.

Tim Peters

May 22, 2001, 8:55:46 PM
to pytho...@python.org
[Alex Martelli]

> ...particularly given that cStringIO appears to be
> almost twice as fast again as the array.array('c')
> for this specific task... as per my followup post.

[Michael Hudson]


> One of the reasons array.array often isn't especially quick is the
> conversions that must take place every time something is put into or
> accessed in the array.

[Presumably Michael Hudson's better-read twin]


> Duh! Tim said that in the article you were replying to.

True, but buried in so many other complications I can't fault your brother
for missing it in his admirable eagerness to be helpful.

usenet-group-hug-ly y'rs - tim


Tim Peters

May 22, 2001, 9:04:03 PM
to pytho...@python.org
[Isaac To Kar Keung]

> That means, when size is below 500, a list size will always be
> multiple of 10. When size is larger, it is always multiple of
> 100. Of course, if the object is already of the right size, the
> system resize() does nothing.
>
> Seems like magic to me... I run the following program and I
> didn't see *any* slowdown, when each of the lists is as large
> as 8M, eating up 60M memory...
> <snip>
> (Will it be the system realloc which actually do the geometric
> scaling, or is it really that geometric scaling is unnecessary?)

[Ben Hutchings]


> As the list buffer grows, realloc() will at some point find that the
> only place to put it is at the top of the heap. Once it's there,
> further realloc() calls will grow the heap and will not need to move
> the buffer.

Ah, but Isaac (To? Kar? what do you call a man with four names <wink>?)
anticipated that (which is indeed "the usual" dodge). Look at his test
program again:

a = []
b = []
i = 0
while 1:
    i = i + 1
    if i % 100000 == 0:
        print i
    a.append(0)
    b.append(0)

He's appending to *two* lists, and if realloc moves one to "the top of the
heap", how do you account for the other not slowing things down?

My assumption was that he's running on a box where realloc eventually puts
giant things into their own mmap'ed segment. This kind of programming can
definitely kill you across platforms, though, and, e.g., Win95 isn't so slick
with just one list growing forever.


Courageous

May 22, 2001, 10:24:53 PM
to

>My assumption was that he's running on a box where realloc eventually puts
>giant things into their own mmap'ed segment. This kind of programming can
>definitely kill you across platforms, though, and, e.g., Win95 isn't so slick
>with just one list growing forever.

*shrug*. I have a suitable replacement for Python's list that has the following
properties:

insert-at-head: O[1] amortized
append-at-tail: O[1] amortized
random-access: O[1] guaranteed

This is a hybridized vector preallocation strategy that uses both tail _and_
head preallocation. In order to "preallocate at the head" I actually have to
keep a "virtual zero point" in the array, but Python dispatch overhead being
what it is, no Python program would ever notice, and it's only add/compare
in any case.
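
The "virtual zero point" idea can be sketched in pure Python like this (a toy
illustration of the strategy only, certainly not the actual dlist code):

class DoubleEndedList:
    def __init__(self, headroom=8):
        self.buf = [None] * headroom    # spare slots kept at the front
        self.zero = headroom            # index where logical element 0 lives
        self.size = 0
    def append(self, value):            # tail growth: ordinary list append
        self.buf.append(value)
        self.size = self.size + 1
    def appendleft(self, value):        # head growth: move the virtual zero
        if self.zero == 0:              # out of head slack - regrow the front
            pad = [None] * (len(self.buf) + 1)
            self.buf = pad + self.buf
            self.zero = len(pad)
        self.zero = self.zero - 1
        self.buf[self.zero] = value
        self.size = self.size + 1
    def __getitem__(self, i):           # O(1) random access via the offset
        return self.buf[self.zero + i]
    def __len__(self):
        return self.size

d = DoubleEndedList()
d.append('b')
d.appendleft('a')
d.append('c')
print d[0], d[1], d[2], len(d)    # a b c 3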

I'd offer it as a _complete_ replacement for Python's list, except that I never
really felt like filling out the tuple support and implementing sorting (the last
of which intimidated me well and good, sample sort's a bitch). :-)

C//


Isaac To Kar Keung

May 22, 2001, 10:47:48 PM
to
>>>>> "Ben" == Ben Hutchings <ben.hu...@roundpoint.com> writes:

Ben> As the list buffer grows, realloc() will at some point find that
Ben> the only place to put it is at the top of the heap. Once it's
Ben> there, further realloc() calls will grow the heap and will not need
Ben> to move the buffer. However, other systems - those with a shared
Ben> memory space, or without a smart implementation of realloc() - are
Ben> likely to behave differently.

Yes, I postulated that before. But it is proven otherwise by my last
example. (Did you see why I have two arrays growing rather than just one?)

Regards,
Isaac.

Isaac To Kar Keung

May 22, 2001, 10:51:38 PM
to
>>>>> "E" == E Mark Ping <ema...@CSUA.Berkeley.EDU> writes:

E> Note please that I was writing about list, not vector when an entry
E> is added. I clearly stated that vector uses a geometric algorithm
E> (like doubling capacity when existing capacity is exhausted).

Oops. Will try to read posts more carefully next time.

Regards,
Isaac.

Isaac To Kar Keung

May 22, 2001, 11:29:56 PM
to
>>>>> "Tim" == Tim Peters <tim...@home.com> writes:

Tim> My assumption was that he's running on a box where realloc
Tim> eventually puts giant things into their own mmap'ed segment.

Actually, on Debian GNU/Linux 2.4.4 under i386.

Tim> This kind of programming can definitely kill you across platforms,
Tim> though, and, e.g., Win95 isn't so slick with just one list growing
Tim> forever.

So does it mean that you've tried that in Win95 and found it slowing down as
the number grows?

if like_off_topic:

Tim> Ah, but Isaac (To? Kar? what do you call a man with four names
Tim> <wink>?) ...

Name mangling is an unsolvable problem. ;p Most Chinese names consist of
two to four Chinese characters, one or two of which is the surname, one
or two is the first name. For me, "Kar-Keung" is the first name, "To"
is the family name. But most foreigners have difficulty pronouncing
Chinese characters accurately, so it is customary to add a Western name.
Without that strange things can happen, e.g., if I don't put "Isaac"
first, I'll probably be called "Kar", which is half of my first name.
Even worse is that for most people, all siblings with the same sex get
the same Chinese character as the first part of their first name.

The difficulty of locating last names is long known, which is why you'll
see registration forms of international conferences asking you "What's
your last name" and "What's your first name" instead of simply "What's
your name".

The Chinese naming system is not really complicated. In many countries
people have multiple first and last names, parts of which are inherited
from the names of other family members.

Regards,
Isaac.

Ben Wolfson

May 23, 2001, 12:56:14 AM
to
In article <7i8zjog...@enark.csis.hku.hk>, "Isaac To Kar Keung"
<kk...@csis.hku.hk> wrote:

> if like_off_topic:
[snip]

> The Chinese naming system is not really complicated. In many
> countries multiple first and last names, parts of them inherits from
> the names of other family members.

if really_like_off_topic:

F'rinstance, some Spanish-speaking countries. It annoys me
greatly when bookstores file books by Gabriel Garcia Marquez
under the 'M's instead of 'G's.

--
Barnabas T. Rumjuggler
I had a son once, I ain't got no son today
My son is gone now, my son is gone, oy weh
-- Joe Frank, "Woman and Bull in a Paint Factory"

Mark C Favas

May 23, 2001, 6:24:58 AM
to
[Alex adds cStringIO to his tests, and it proves to be faster...]

>and, whaddyaknow...:


and running it on current CVS Python, on Tru64 Unix, I get:

python oaat.py
[]: 1.13
array('c'): 1.55
<StringO object at 1400e4f70>: 1.22

python oaat.py
[]: 1.13
array('c'): 1.58
<StringO object at 1400e4f70>: 1.22

So, platforms are (surprise!) different...

Mark
--
Email - ma...@chem.uwa.edu.au ,-_|\ Mark C Favas
Phone - +61 9 380 3482 / \ Department of Chemistry
Fax - +61 9 380 1005 ---> *_,-._/ The University of Western Australia
v Nedlands
Loc - 31.97 S, 115.81 E Western Australia 6009

Andrew Dalke

May 23, 2001, 6:31:34 AM
to
Ben Wolfson wrote:
> if really_like_off_topic:
>
> F'rinstance, some Spanish-speaking countries. It annoys me
> greatly when bookstores file books by Gabriel Garcia Marquez
> under the 'M's instead of 'G's.

Down that path lies what to me seems madness.

Take a look at Knuth, The Art of Computer Programming, v.3
("Sorting and Searching"), problem 17 on p7.

17. [33] (Library card sorting.) ... The following "alphabetical"
listing indicates many of the procedures recommended in the American
Library Association Rules for Filing Catalog Cards (Chicago: 1942):

Text of card                           Remarks
R. Accademia nazionale dei             Ignore foreign royalty (except British)
  Lincei, Rome.
1812; ein historischer Roman.          Achtzehnhundert zwölf
Bibliothèque d'histoire                Treat apostrophe as space in French
  révolutionnaire.
...
Brown, Mrs. J. Crosby                  Ignore designation of rank
Brown, John                            Names with dates follow those without
Brown, John, mathematician             ... and the latter are subarranged
Brown, John, of Boston                   by descriptive words
Brown, John, 1716-1766                 Arrange identical names by birthdate
BROWN, JOHN, 1715-1766                 Works "about" follow works "By"
Brown, John, d. 1811                   Sometimes birthdate must be estimated
...
Dix, Morgan, 1827-1908                 Names precede words
1812 ouverture                         Dix-huit cent douze
Le XIXe siècle français                Dix-neuvième
The 1847 issue of U.S. stamps          Eighteen forty-seven
1812 overture                          Eighteen twelve
...
al-Khuwārizmī, Muḥammad ibn Mūsā       Ignore initial "al-" in Arabic names
...
McCarthy, John, 1927-                  Mc = Mac
...
Mrs. Dalloway                          "Mrs." = "Mistress"
...
Ste-Marie, Gaston P                    Sainte
...


If each book was assigned a unique number, would you be annoyed?
(Eg, Dewey decimal or Library of Congress ordering.) So do what
I do and treat bookstores as using some arbitrary ordering scheme,
but where the key just happens to be somewhat guessable from
knowledge of the author.

Not so very much different from the skill I once had with Dewey
decimal, where I could figure out at least which case the book
was on given a description of its content. (Granted, it was a
small library, but not all that small.)

Andrew
da...@acm.org

Helen Dawson

May 24, 2001, 2:47:58 AM
to
Thanks for all the replies. I learned several tricks.

However I find myself more worried than ever. I started out
thinking that list was guaranteed to be O(n) for my problem,
but now it appears that it isn't - it's dependent on good realloc
behaviour. I think that's one thing the STL guys recognized
well - incrementing by any (more or less) constant size
can produce arbitrarily bad performance. A worst case
memory penalty of 2x for doubling the size of list each time
seems a fair price to pay for avoiding that
(especially since you can make the memory penalty
arbitrarily small - any ratio above one is theoretically
equivalent). Oh well.

And, to the person who pointed out that my test script
is utterly meaningless for strings - yes it is, but that is what
happens when you simplify your code to its essence. It's
certainly better than posting the original 50 line script
and expecting random readers to sort through the chaff.

Thanks again all.

Bruce/Helen Dawson

Helen Dawson wrote:

> The following script tests appending to a list and to a string.
>
> def AppendTest(listorstring):
>     for i in range(0, 100):
>         for x in range(0, 2000):
>             listorstring += " "
>         print i
>
> AppendTest([])
> AppendTest("")
>
> The test with the list runs very swiftly and smoothly - the numbers
> are printed out at a steady rate. The test with the string runs
> slowly, and the speed that the numbers appear slows down - a lot. In
> fact, I have never seen the string test end - I always abort it out
> of boredom before it gets to twenty five.
>
> Without even looking in the Python internals it is obvious what is
> happening. In the string case, each time a character is appended the
> entire string is copied. Therefore adding n characters requires

> n^2/2 bytes to be copied, making processing large files (where large
> is > about 10K) very inefficient - and processing files larger
> than about 100K is virtually impossible.
>

> List does much better. Why? It probably does the same thing that the
> STL vector typically does - the allocated space doubles whenever it
> runs out of room, and then the newly allocated space is filled in
> as needed.
>

> So, why can't string do that?
>

> When we could only go string = string + " " then fixing this problem
> was probably impossible, but with += it seems like it should be quite
> possible - maybe even easy!
>

> Right now if I am producing a text file a character at a time I have
> to do it to a list and then use string.join() to convert at the end.
> That works, but seems clunky.
>

> It's also a landmine for the naive. I recognized the problem
> immediately, but many people would simply assume that Python was slow,
> when it took an hour to process a one Mbyte file.
>

Alex Martelli

May 24, 2001, 4:36:41 AM
to
"Helen Dawson" <hel...@accessone.com> wrote in message
news:3B0CAF2C...@accessone.com...

> Thanks for all the replies. I learned several tricks.
>
> However I find myself more worried than ever. I started out
> thinking that list was guaranteed to be O(n) for my problem,
> but now it appears that it isn't - it's dependent on good realloc
> behaviour. I think that's one thing the STL guys recognized
> well - incrementing by any (more or less) constant size
> can produce arbitrarily bad performance. A worst case
> memory penalty of 2x for doubling the size of list each time
> seems a fair price to pay for avoiding that

Just one more reflection... often you can get a sensible
estimate of how big your list will need to be, beforehand,
particularly if you're willing to tolerate a 'slop' of a
factor of 2 or so. In this case, you can get substantially
better performance by "preallocating" the list rather than
using .append() on it. If you wrap this into a UserList
subclass, or your own sequence object, it's not too hard
to keep client code blissfully unaware of the issue - it
calls .append() anyway, and the append method implementation
is the place where "the smarts" hide.
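
A rough sketch of the kind of wrapper Alex describes (the names are invented;
a real one would forward the rest of the list interface too):

class PreallocatedList:
    def __init__(self, estimate):
        self.data = [None] * estimate   # grab the estimated space up front
        self.used = 0
    def append(self, item):
        if self.used < len(self.data):
            self.data[self.used] = item # fill a preallocated slot
        else:
            self.data.append(item)      # estimate exhausted: normal growth
        self.used = self.used + 1
    def tolist(self):
        return self.data[:self.used]    # only the slots actually filled

p = PreallocatedList(1000)
for ch in "hello":
    p.append(ch)
print p.tolist()    # ['h', 'e', 'l', 'l', 'o']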

It's unlikely that you'll get an actual performance boost
from this approach, but you might get peace of mind:-).


Alex

Aahz Maruch

May 24, 2001, 10:34:35 AM
to
In article <3B0CAF2C...@accessone.com>,

Helen Dawson <hel...@accessone.com> wrote:
>
>However I find myself more worried than ever. I started out thinking
>that list was guaranteed to be O(n) for my problem, but now it appears
>that it isn't - it's dependent on good realloc behaviour. I think
>that's one thing the STL guys recognized well - incrementing by any
>(more or less) constant size can produce arbitrarily bad performance. A
>worst case memory penalty of 2x for doubling the size of list each
>time seems a fair price to pay for avoiding that (especially since you
>can make the memory penalty arbitrarily small - any ratio above one is
>theoretically equivalent). Oh well.

What you're saying is theoretically true. In practice, very few people
run into the non-O(n) nature of lists, because the constant factor for
expansion is large enough and other operations tend to swamp the list
resize time.
--
--- Aahz <*> (Copyright 2001 by aa...@pobox.com)

Androgynous poly kinky vanilla queer het Pythonista http://www.rahul.net/aahz/
Hugs and backrubs -- I break Rule 6

"You do not make history; you can only hope to survive it." --G'kar

Courageous

May 24, 2001, 12:13:56 PM
to

>What you're saying is theoretically true. In practice, very few people
>run into the non-O(n) nature of lists, because the constant factor for
>expansion is large enough and other operations tend to swamp the list
>resize time.

The dispatch time from Python-to-C alone is enough to mask plenty
of time.

C//

Helen Dawson

May 25, 2001, 1:25:50 AM
to
After discovering that lists actually are O(n^2), and after
hearing a rumour that they behaved badly on Win9x, I decided
to investigate.

It turns out that my previous tests had stopped just before the
quadratic nature of lists starts showing up. In my new test script
(pasted in at the bottom of this message) I append two million
items to a list. On NT you can see the performance steadily
degrade - each set of 10,000 items takes noticeably longer to
add, however it does finish. On Win9x the performance is
constant, until a wall is hit and the performance plummets.

A bit later, on Win9x, with about 1.5 million items in the list,
the program halts with an out of memory error. This happens
despite the fact that Python only has a few tens of Mbytes
allocated on my 128 Mbyte machine. It turns out that Python
actually runs out of address space! 2 GBytes of address space
swallowed up in a few minutes.

Here's what happens. The Python list keeps adding 100 items at
a time. Since Python uses realloc these arrays can grow without
being copied, which is nice and efficient. However, the default heap
size on Win9x is 4Mbytes. When the list doesn't fit then it starts
getting copied. Each time it gets copied the OS (or the C run-time)
allocates a heap that is big enough - barely. So, when the list grows
a few more times (64K probably?) it fills another heap. So, suddenly
the list is being frequently copied.

Worse yet, the old heaps don't get freed. Their memory mostly
does, but the address space doesn't. I don't know if that's a Python
glitch or a C run-time glitch, but pretty soon the address space is
filled with about 500 4 Mbyte heaps, and the script halts. The
line of increasingly large heaps that are all mostly empty is quite
amusing in my little memory viewer.


NT handles it better, but the performance still degrades.


The current behaviour is to add 10 items initially, and then
switch to adding 100 items. This strikes me as being a weak
approximation to proper geometric expansion, and I'm not sure
what the benefits are. Increasing by some fraction of the
current size would be much more reliable. Instead of this:

static int
roundupsize(int n)
{
    if (n < 500)
        return ROUNDUP(n, 10);
    else
        return ROUNDUP(n, 100);
}

why not this:

static int
roundupsize(int n)
{
    if (n < 500)
        return ROUNDUP(n, 10);
    else
        return n + n / 10;
}

This method will avoid the quadratic behaviour, it will avoid
running out of address space on systems with imperfect
allocators, and it will waste, at most, 10% of memory. If
10% is too much to waste, change the /10 to /20.

The current system can run a hundred times slower than it
needs to, and can fail when it needn't. Is there any good
reason not to make the fix? Any program that adds more
than 1000 items to a list in small batches will benefit from
this fix, and there is virtually no downside.
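
One way to see the difference is to count copied elements under each growth
policy with a little model (my own back-of-the-envelope script; it
pessimistically assumes every capacity change copies the whole list, i.e. it
ignores realloc growing blocks in place):

def copies_needed(total, grow):
    copied = 0
    capacity = 0
    size = 0
    while size < total:
        if size == capacity:
            capacity = grow(capacity)
            copied = copied + size      # pay to move the current contents
        size = size + 1
    return copied

print copies_needed(2000000, lambda c: c + 100)        # fixed step: ~2e10 copies
print copies_needed(2000000, lambda c: c + c/10 + 10)  # ~10% growth: ~2e7 copies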

Bruce/Helen Dawson


print """This program demonstrates that Python is very efficient when
it comes to appending to arrays - it takes linear time as you add more
items. However if you keep appending to a list the performance can
get arbitrarily bad - it is O(n^2) which means that appending ten
times as many characters takes one hundred times longer. Lists behave
much better than strings, but still fail unacceptably much earlier
than they need to.
"""

print "Written by Bruce Dawson"
import time

import array
def arrayappendtest(myarray, count=100):
    starttime = time.clock()
    for o in range(count):
        for i in range(10000):
            myarray.append(" ")
        endtime = time.clock()
        print "Array iteration %d/%d - time %f" % (o, count,
                                                   endtime - starttime)
        starttime = endtime
    return myarray

def appendtest(stringorlist, count=100):
    starttime = time.clock()
    for o in range(count):
        for i in range(10000):
            stringorlist += " "
        endtime = time.clock()
        print "List iteration %d/%d - time %f" % (o, count,
                                                  endtime - starttime)
        starttime = endtime
    return stringorlist

print "Appending 150,000 items to a list works fine, but even over"
print "a small number of iterations"
print "you can see that the time to add an item is not constant -"
print "it's proportional to the number of elements in the list."
print "Therefore, the time to add n items is O(n^2) - if n is"
print "reasonably large."
x = appendtest([], 15)
raw_input("Notice that lists slow down - press enter to continue")
print "Arrays take constant time to add items."
x = arrayappendtest(array.array('c'), 15)
raw_input("Notice arrays do not slow down - press enter to continue")
print "Since arrays scale well we can do lots of iterations quickly"
x = arrayappendtest(array.array('c'), 200)
raw_input("Notice arrays are still fast - press enter to continue")
print "This runs out of address space (not memory, address space)"
print "on Win9x. It ends up reserving a huge number of heaps, that"
print "are mostly empty."
print "It typically fails when the outer loop reaches around 290, at"
print "which point all 2 GBytes of user address space is reserved."
print "The performance gets very bad first."
print "A modest amount of memory is in use when the program fails."
x = appendtest([], 200)
raw_input("Success! - you're not running Win9x")

Courageous

May 25, 2001, 2:20:30 AM
to
On Fri, 25 May 2001 05:25:50 GMT, Helen Dawson <hel...@accessone.com> wrote:

>The current behaviour is to add 10 items initially, and then
>switch to adding 100 items. This strikes me as being a weak
>approximation to proper geometric expansion, and I'm not sure
>what the benefits are. Increasing by some fraction of the
>current size would be much more reliable.

And furthermore, using a 100% growth factor enables a provably*
O(1)+k amortized append time. If you don't like the Python list,
use my "dlist" at http://www.sourceforge.net/projects/pythonic.
While it's incomplete (no sorting, for example), it has favorable
growth factors, with the addition of simultaneously supporting
O(1) insert-at-head, O(1)+k append-at-tail, and O(1) random
access. It's a bit of a memory hog because of the allocation
strategy it uses to achieve the fast head and tail performance.

It is *only* available by anonymous cvs at the moment. Follow
the links to cvs, and check out the "pythonic" cvs module.
Then compile the adt submodule and read the test_dlist.py
source example.

If you have any more questions, contact me by email.

C//

*beyond my math skills, but so I've been told.

Isaac To Kar Keung

May 25, 2001, 2:41:08 AM
to
>>>>> "Helen" == Helen Dawson <hel...@accessone.com> writes:

Helen> After discovering that lists actually are O(n^2), and after
Helen> hearing a rumour that they behaved badly on Win9x, I decided to
Helen> investigate.

Actually I don't think that the mmap argument holds a lot of water, and I do
think that the behaviour should be quadratic unless the realloc on Linux does
geometric allocation. The problem is that even if it is mmap'ed, it is
still in the same address space, and thus eventually the address spaces of
the two arrays would touch each other, leading to an actual reallocation.
Yes, we can save the copy of the content, but then we need to copy the VM
tables.

Helen> why not this:

Helen> static int
Helen> roundupsize(int n)
Helen> {
Helen>     if (n < 500)
Helen>         return ROUNDUP(n, 10);
Helen>     else
Helen>         return n + n / 10;
Helen> }

But that won't even get close to working. Currently, Python calls PyMem_RESIZE
every time a resize is needed, but on the other hand there is no member
within the structure to store something like the STL capacity. If you
really changed Python to the above, everything would be the same (or,
actually, worse) than before. Instead of allocations of size 10, 10, 10,
..., 20, 20, 20, ..., 30, 30, ..., 500, 500, ..., 600, ...; the system will
make allocations like 10, 10, 10, ..., 20, 20, 20, ..., 30, ..., 490, 550,
551, 552, 553, 554, 555, 556, 557, 558, 559, 561, ... You'll get a huge
number of reallocations and copies this way. ;p

I don't actually see a good solution, actually, short of storing the
capacity somewhere. The problem is that whatever the mapping, there is a
possible case in which copying and reallocation is done basically everytime.
For example, the current scheme cannot guarantee that the memory system will
be happy with the following code:

a = [None] * 497       # 12 byte header + 497 pointers = 2000 bytes
while 1:
    a.append(None)     # Allocate 2100 bytes
    # do some more interesting thing here
    a[-1:] = []        # Become 2000 bytes

Every time, PyMem_RESIZE is called with a different size, and we now have to
hope that the memory system of the OS does the right thing.

There are two possible attitudes for this problem. We can say that this is
a flaw in the C library, and realloc should really deal with the problem
because it happens so often. Or we can say that this is a Python problem,
that it should always make sure that it runs right in whatever C library.
Perhaps we have to accept that the latter is the answer, given that Win98 is
not doing it right but nobody is going to give you another Win98.

There are two places where we can add the capacity field. PyObject_VAR_HEAD
is my choice: adding 4 bytes of current capacity to the currently 12-byte
header storing the reference count, type and size. This has the additional
benefit of aligning the whole thing on a 16-byte boundary wherever that is
beneficial.

But there is another solution that does not need the 4-byte space overhead.
The C library should know the size of the allocation anyway, so we can actually
reuse that information. This means we would need a library-dependent
PyCore_REALLOC_FUNC. I don't really think that is a good idea, though.

Regards,
Isaac.

Tim Peters

May 25, 2001, 3:52:16 AM
to pytho...@python.org
[Bruce Dawson]

> After discovering that lists actually are O(n^2),

"Actually" is an odd word for something that took two days to provoke after
giving up the first time then being assured it was possible to provoke it.
Is it "actually" the case on Linux too? As has been said, the behavior is
platform dependent.

> and after hearing a rumour that they behaved badly on Win9x, I
> decided to investigate.
>
> It turns out that my previous tests had stopped just before the
> quadratic nature of lists starts showing up. In my new test script
> (pasted in at the bottom of this message) I append two million
> items to a list.

Is this typical of a useful program you're going to run? If so, you know
better now, at least on NT and Win9x. But if not, well, "who cares?" comes
to mind. It's not the only extreme thing you can do to drive Windows insane.

> On NT you can see the performance steadily degrade - each set of
> 10,000 items takes noticeably longer to add, however it does
> finish.

That was my experience two years ago too -- glad to hear NT has remained
stable <wink>.

> On Win9x the performance is constant, until a wall is hit and
> the performance plummets.

Ditto.

> A bit later, on Win9x, with about 1.5 million items in the list,
> the program halts with an out of memory error. This happens
> despite the fact that Python only has a few tens of Mbytes
> allocated on my 128 Mbyte machine. It turns out that Python
> actually runs out of address space! 2 GBytes of address space
> swallowed up in a few minutes.

Yup. It's like trying to avert your eyes from a train wreck!

> ...


> Worse yet, the old heaps don't get freed. Their memory mostly
> does, but the address space doesn't. I don't know if that's a Python
> glitch or a C run-time glitch,

It's hard to say whether it's a C library problem or (more likely, I think) a
deeper OS VM mgmt problem. Python dutifully returns the memory via free(),
but Win9x's VM defragmenter is apparently overwhelmed.

> but pretty soon the address space is filled with about 500 4 Mbyte
> heaps, and the script halts. The line of increasingly large heaps
> that are all mostly empty is quite amusing in my little memory
> viewer.
>
> NT handles it better, but the performance still degrades.

> ...


> Instead of this:
>
> static int
> roundupsize(int n)
> {
>     if (n < 500)
>         return ROUNDUP(n, 10);
>     else
>         return ROUNDUP(n, 100);
> }
>
> why not this:
>
> static int
> roundupsize(int n)
> {
>     if (n < 500)
>         return ROUNDUP(n, 10);
>     else
>         return n + n / 10;
> }
>
> This method will avoid the quadratic behaviour, it will avoid
> running out of address space on systems with imperfect
> allocators,

How do you know that? You didn't actually try it -- if you had, you would
have seen it didn't work as you hoped. Hint: as lists grow larger, you're
now asking the system for strictly more space on *every* append -- the cure
is worse than the disease. You can repair this with enough effort, but
please *try* it before making claims; also make sure it doesn't hurt
performance on Linux and umpteen minority OSes. As an example of something
"that works even on Windows" but hurts too much elsewhere, you *could* try
this:

static int
roundupsize(int n)
{
    int i;

    if (n < 500)
        return ROUNDUP(n, 10);

    i = 0;
    while (n) {
        ++i;
        n >>= 1;
    }
    return 3 << (i - 1);
}

> and it will waste, at most, 10% of memory. If 10% is too much to
> waste, change the /10 to /20.
>
> The current system can run a hundred times slower than it
> needs to, and can fail when it needn't. Is there any good
> reason not to make the fix?

You have to find a fix that actually does more good than harm first, it's not
a problem in practice, and a better long-term solution lies in adopting
Vladimir Marangozov's PyMalloc (Win9x malloc is too poor in general to be
worth the effort of worming around in one endcase that doesn't happen in
practice anyway -- but it's worth replacing it for other reasons).

> Any program that adds more than 1000 items to a list in small batches
> will benefit from this fix, and there is virtually no downside.

mercifully y'rs - tim


Tim Peters

May 25, 2001, 5:15:55 AM
to pytho...@python.org
[Isaac To Kar Keung]
> ...

> I don't actually see a good solution, actually, short of storing the
> capacity somewhere.

We can do that if we use our own malloc, but won't otherwise (we're not going
to add more members to the Python listobject struct, because millions of
lists with a few things are about a million times more common than a few
lists with millions of things).

> The problem is that whatever the mapping, there is a possible case
> in which copying and reallocation is done basically everytime.

We could, e.g., always round the list size up to the next higher power of 2.
That way the capacity is implicit. We're unlikely to, though.
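
The "implicit capacity" idea, sketched in Python (illustration only - this is
not what listobject.c actually does):

def roundup_pow2(n):
    cap = 1
    while cap < n:
        cap = cap * 2    # capacity is always a power of two,
    return cap           # so it can be recomputed from the size alone

print roundup_pow2(1), roundup_pow2(37), roundup_pow2(1000)    # 1 64 1024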

> ...


> There are two possible attitudes for this problem. We can say that
> this is a flaw in the C library, and realloc should really deal with
> the problem because it happens so often.

Except that it rarely happens at all -- don't overblow this. About once each
16 months *somebody* gets excited about this, then loses interest as they
discover it never hurts much except in the artificial cases they've
constructed to provoke it.

> Or we can say that this is a Python problem, that it should always
> make sure that it runs right in whatever C library.

The third possibility is the one you're not ready for yet: Python is neither
a C library nor an OS, neither is it obligated to provide theoretically
optimal performance in every conceivable case. Python lists are wonderfully
zippy and compact as-is, and doing anything that hurts either of those would
be a net loss for the vast majority of Python users. If building
mega-element lists via one append at a time is too slow for your app, fine,
don't do that.

> ...


> There are two places where we can add the capacity field.
> PyObject_VAR_HEAD is my choice.

As above, not a chance.

> ...


> But there is another solution that does not need the 4-bytes space
> overhead. The C library should know the size of allocation
> anyway, so we can actually reuse that information.

Also not a chance, but it becomes possible if we enable Vladimir's PyMalloc
and fiddle it accordingly.


Courageous

May 25, 2001, 12:09:36 PM
to

> because millions of lists with a few things are ...

The use of the word "millions" is hyperbole, yes? I just can't
imagine that adding a 32 bit number to every list in Python is all
that big a deal. Do you have actual numbers here?

>We could, e.g., always round the list size up to the next higher power of 2.
>That way the capacity is implicit. We're unlikely to, though.

Is this a memory usage issue?

>Except that it rarely happens at all -- don't overblow this. About once each
>16 months *somebody* gets excited about this, then loses interest as they
>discover it never hurts much except in the artificial cases they've
>constructed to provoke it.

Indeed.

> If building
>mega-element lists via one append at a time is too slow for your app, fine,
>don't do that.

And there are alternatives; I can't be the only one to have written a list
replacement.

>As above, not a chance.

Fine, but I'm having a hard time believing that it's that big a problem. The
other day I modded my Python distro to do postmortems on all Python
dictionaries after a run of the pystone suite; this forced me to keep them
from ever being destroyed so I could check their collision statistics and
the like (they were very good). Anyway, the point: there really weren't _that_
many dictionaries in use, and this includes all the ones used internally in
Python proper. I can't imagine that there'd be all that many lists in use in
a typical Python run?

>Also not a chance, but it becomes possible if we enable Vladimir's PyMalloc
>and fiddle it accordingly.

Sure.

C//

Bruce Sass

May 25, 2001, 2:34:00 PM
to pytho...@python.org
On Fri, 25 May 2001, Tim Peters wrote:
<>
> We can do that if we use our own malloc, but won't otherwise (we're not going
> to add more members to the Python listobject struct, because millions of
> lists with a few things are about a million times more common than a few
> lists with millions of things).

So, the discussion should be about optimizing the existing stuff for
small lists, and creating an extension lib so one can do, maybe...

long_list = []L
or
l = [] ; ... ; l = longlist(l)

?


- Bruce


Tim Peters

May 25, 2001, 4:17:44 PM
to pytho...@python.org
[Courageous]

> The use of the word "millions" is hyperbole, yes?

Depends on the app, of course. For example, you only need one dict mapping a few million keys to lists of associated
items to get a few million lists. While not "a typical app", that's typical of *some* reasonable apps.

> I just can't imagine that adding a 32 bit number to every list in
> Python is all that big a deal.

Adding a new data member to a builtin type is always a big deal, and has gotten bigger lately due to the emerging
popularity of Python ports to small devices. In any case, Guido rejected this idea the last time it went around (hence
my "no chance" -- nothing material has changed since then, and this is repetition of arguments made many times before).

I'm going to try to stop the Win9X madness via a no-new-data hack, against the day we can do it right (via PyMalloc).
BTW, I'm actually more appalled that the current ROUNDUP macro uses integer division (a very slow operation on some
boxes).

> Do you have actual numbers here?

Every time you run a Python program, it emails detailed runtime statistics to Guido via a secret backdoor in the
code. And that's the real reason list.append() eventually slows down, but only on systems like Windows with poor email
performance <wink>.

> ...


> And there are alternatives; I can't be the only one to have written a
> list replacement.

I don't recall another list replacement implemented *as* a list. More popular is to try some flavor of balanced tree,
or a skip list.

> ...


> The other day I modded my Python distro to do postmortems on all
> Python dictionaries after a run of the pystone suite; this forced
> me to keep them from ever being destroyed

If you do this again, write the info you want to analyze to a text file (opened for append) at the start of
dict_dealloc(). Then you don't need to keep the dicts alive, and can write Python programs to analyze the text-file
dumps later. For example, write out the dict size, and for those slots that aren't empty write the index and key hash
code (special-casing dummies).

BTW, pystone makes no attempt to model data usage: it's an artificial benchmark from the Dhrystone line, pasting
together senseless code contrived to recreate the aggregate operation-count statistics obtained from traces of "real
programs". In pystone's case, it's a recoding in Python of a recoding in C of a synthetic Ada program that sought to
reproduce the relative operation counts across traced runs of some low-level integer systems programs. Its relation to
"typical Python programs" is thus non-existent: it makes no use of classes except to fake a C struct, makes heavy use
of module globals, and doesn't even use Python's "for" loop except to simulate Ada's integer-counter loop. It makes no
explicit use of dicts at all, and uses lists only to fake fixed-size C arrays; etc. The best that can be said is that
what it measures *can* be measured <0.9 wink -- and that's useful, provided you don't read much into it>.


Isaac To Kar Keung

May 25, 2001, 11:51:58 PM
to
>>>>> "Tim" == Tim Peters <tim...@home.com> writes:

>> The problem is that whatever the mapping, there is a possible case in
>> which copying and reallocation is done basically everytime.

Tim> We could, e.g., always round the list size up to the next higher
Tim> power of 2. That way the capacity is implicit. We're unlikely to,
Tim> though.

No, this does not help the situation when the size of the list goes
repeatedly below and above the threshold.

Tim> Except that it rarely happens at all -- don't overblow this.

Yes, it rarely happens. But is the solution really so difficult that we
have any excuse for not doing it right?

Tim> The third possiblity is the one you're not ready for yet: Python is
Tim> neither a C library nor an OS, neither is it obligated to provide
Tim> theoretically optimal performance in every conceivable case.

I see this as a fundamental design problem: it ties the conceptual list to
physical memory allocation. The user should have their own right to say
that the list is to step between 397 and 398 elements (or whatever number
they have in mind), and the performance of the library shouldn't be in
completely different orders (e.g., from constant time, linear space to
linear time, quadratic space) in such cases. This would make Python
completely useless.

Tim> Python lists are wonderfully zippy and compact as-is, and doing
Tim> anything that hurts either of those would be a net loss for the
Tim> vast majority of Python users.

The answer is far from as difficult as you've implied. Lists normally
occupy some hundreds of bytes, and adding 4 bytes there is not that much
for anybody. Extra space allocation to the list itself only happens on
PyMem_REALLOC, not if you have initially created the list at the right size.
If we ever start doing realloc, it is likely that we will keep reallocating
anyway. And we could have a compilation flag that activates a structure
that doesn't do geometric scaling on its own. I don't see how this can
possibly "hurt the vast majority of Python users".

Regards,
Isaac.

Tim Peters

May 26, 2001, 1:01:36 AM
to pytho...@python.org
[Isaac To Kar Keung]
> ...
> Yes, it rarely happens. But is the solution really so difficult
> that we have any excuse for not doing it right?

You need to define "the problem" first. I'm really not clear on what it is
you think needs to be solved! If you define it carefully, then I wager
you'll discover, e.g., that Bruce Dawson's complaint cannot be solved on
Win9x even if you use an explicit capacity field and do growth by doubling
(he doesn't realize that yet either). It can be *delayed*, but Win9x VM
management is such that address space will still get fragmented beyond hope
unless Python also takes over allocating heaps. This gets absurd.

> ...


> I see this as a fundamental design problem: it ties the conceptual list
> to physical memory allocation.

It's a design question, and Guido picked a particular answer 10 years ago;
in practice it works fine. There would be much more sympathy for a list
implementation that was friendlier to insert and remove at both ends,
because people actually bump into that now (it's rare, but still much more
frequent than gripes about append behavior).

> The user should have their own right to say that the list is to
> step between 397 and 398 elements (or whatever number they have in mind),

Says who? I feel no need for this. If you think it's a need, write an
extension module implementing it and then sell that on real-life merit. If
people find it gives them a real advantage, they'll agitate to move it into
the core.

> and the performance of the library shouldn't be in completely different
> orders (e.g., from constant time, linear space to linear time,
> quadratic space) in such cases. This would make Python completely
> useless.

"If we don't do X, Python would be completely useless." Python doesn't do
X. Yet neither is it completely useless (according to me). I don't know,
but suspect there may be a lapse in this argument <wink>.

> ...


> The answer is far from the difficulty you've implied.

That's because you haven't faced what it takes to worm around all provokable
problems on Windows. Don't argue about it: write some code and *try* it,
then come back when "it's solved".

I'm not arguing about the extra space again, because it's pure waste on
Linux, doesn't actually solve the problem on Win9x, and Guido rejected the
idea last time this went around anyway.

it's-not-like-there-aren't-useful-things-that-need-doing-too-ly y'rs - tim


Isaac To Kar Keung

May 26, 2001, 3:09:34 AM
to
>>>>> "Tim" == Tim Peters <tim...@home.com> writes:

>> ... But there is another solution that does not need the 4-bytes
>> space overhead. The C library should know the size of allocation
>> anyway, so we can actually reuse that information.

Tim> Also not a chance, but it becomes possible if we enable Vladimir's
Tim> PyMalloc and fiddle it accordingly.

Forgotten in the last post...

Yes, I agree that using our own malloc can be a clean solution.

Regards,
Isaac.

Andrew Dalke

May 26, 2001, 3:19:12 AM
to
Isaac To Kar Keung wrote:
>I see this as a fundamental design problem: it ties the conceptual list to
>physical memory allocation. The user should have their own right to say
>that the list is to step between 397 and 398 elements (or whatever number
>they have in mind), and the performance of the library shouldn't be in
>completely different orders (e.g., from constant time, linear space to
>linear time, quadratic space) in such cases. This would make Python
>completely useless.

Since Python does this and people use it, that means Python cannot
be "completely useless."

It also isn't a fundamental design problem. If you want to create
your own list which supports scaling growth, you can write your
own list-like class, even in Python.

For example, look at last year's version of this thread titled
"The REALLY bad thing about Python lists.." where I expand on
a comment from Moshe and sketch an implementation of a ScalableList
which roughly doubles the list size when the list is full.
It has a linear performance hit because of function call overhead,
but if the lists are long enough to incur O(n**2) expansion
penalties, the function call overhead won't be a problem.
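
The flavor of the idea is something like this (a fresh sketch of the
same technique, not the code from that thread):

    # Keep the real elements in an over-allocated plain list and track the
    # logical length; append does a real resize only when spare room runs out.

    class ScalableList:
        def __init__(self):
            self.data = [None] * 8          # physical storage, over-allocated
            self.size = 0                   # logical number of elements

        def append(self, item):
            if self.size == len(self.data):
                # full: roughly double the physical storage in one step, so
                # copying happens only at doubling points (amortized O(1))
                self.data = self.data + [None] * len(self.data)
            self.data[self.size] = item
            self.size = self.size + 1

        def __len__(self):
            return self.size

        def __getitem__(self, i):
            if not 0 <= i < self.size:
                raise IndexError(i)
            return self.data[i]

        def tolist(self):
            return self.data[:self.size]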

Andrew
da...@acm.org
P.S.
Why does the python-list archive at
http://mail.python.org/pipermail/python-list/
skip 2000-March through to 2000-September, and have astonishingly
few messages for February and October?

Isaac To Kar Keung

May 26, 2001, 3:30:36 AM
to
>>>>> "Tim" == Tim Peters <tim...@home.com> writes:

Tim> It can be *delayed*, but Win9x VM management is such that address
Tim> space will still get fragmented beyond hope unless Python also
Tim> takes over allocating heaps. This gets absurd.

Disclaimer: I don't run a WinXX OS, so I've got no way to test whatever I
say here.

I'm actually wondering how it can be this bad in WinXX. If it can make
more than, say, 50% of the space unusable due to fragmentation, then I agree
that it is so hopelessly broken that Python shouldn't deal with it at all.
Yes, at all. Python, as an application, doesn't have the VM tools to do this
efficiently.

Regards,
Isaac.

Tim Peters

May 26, 2001, 4:48:43 AM
to pytho...@python.org
[Tim]

> It can be *delayed*, but Win9x VM management is such that address
> space will still get fragmented beyond hope unless Python also
> takes over allocating heaps. This gets absurd.

[Isaac To Kar Keung]


> Disclaimer: I don't run a WinXX OS, so I've got no way to test whatever
> I say here.
>
> I'm actually wondering how it can be this bad in WinXX.

Bruce explained it in detail a few msgs back:

http://groups.google.com/groups?hl=en&lr=&safe=off&
ic=1&th=8fb1dd7dabf75024,41&seekm=
3B0DED72.41317BCE%40accessone.com#p

(Yikes! There's gotta be a better way to get a Google URL than that!)

> If it can make more than, say, 50% of the space unusable due to
> fragmentation,

No, most of the memory remains available, but spread out over hundreds of
system-allocated heaps that the OS and/or C runtime seem unable to
recombine, and there's no *contiguous* chunk of address space remaining big
enough to satisfy the request. The entire VM address space gets turned into
Swiss cheese. Bruce seemed to think that array.array('c').append behavior
was better, but that's only because the limit on his test was too small (the
array version adds 1 byte per element, the list version 4, and that's why
the list version fails sooner). No matter what growth strategy you use
(heck, overallocate by a factor of 10), this still bites: every time
growing the array ends up allocating a new system heap, the heap it's copied
from appears to be lost forever. This isn't a guess -- I wrote the code,
tried it, and watched the heap trace go to hell.

> then I agree that it is so hopelessly broken that Python shouldn't
> deal with it at all. Yes, at all. Python, as an application, doesn't
> have the VM tools to do this efficiently.

It could, but, again, it's absurd: Python could do its own heap mgmt on
Win32, using the Win32 API heap mgmt functions instead of MS's libc's.
They're adequate to the task. But I'm not adding gobs of delicate
platform-specific code to worm around a problem nobody has <0.9 wink>.

In the meantime, I did check in a change (for 2.2) that continues to do mild
over-allocation, but on a sliding scale that doesn't think the world ends at
500 elements. This speeds real-life appends, but mostly because it gets rid
of the current integer multiplication and division in favor of simple bit
shifts. For Bruce's kind of test, it delays the point at which
one-at-a-time list.append(x) exhausts Win9x VM from about 2 million elements
to about 35 million. I have higher hopes for NT/Win2K, but haven't yet
tested it there.
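
In rough Python terms, the idea is something like the following -- a
paraphrase only, not the checked-in C; see listobject.c in CVS for the
real thing:

    # Round up using nothing but shifts, with a pad that grows along with
    # the list, so over-allocation stays mild but never freezes at a fixed
    # step the way the old 500-element cutoff did.

    def roundupsize_sliding(n):
        nbits = 3
        n2 = n >> 8
        while n2:                 # bigger lists get a coarser rounding unit
            n2 = n2 >> 3
            nbits = nbits + 3
        # e.g. in this sketch: multiples of 8 below 256 elements, of 64
        # below 2048, of 512 below 16384, and so on -- the spare space
        # stays a modest fraction of n at every size
        return ((n >> nbits) + 1) << nbits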


Courageous

May 26, 2001, 5:06:59 AM
to

>It's a design question, and Guido picked a particular answer 10 years ago;
>in practice it works fine. There would be much more sympathy for a list
>implementation that was friendlier to insert and remove at both ends,

I have this implementation at http://www.sourceforge/projects/pythonic.
In cvs, checkout adt. You want the "dlist" (double ended list) type. It's
a vector which uses prealloc at the head AND the tail. This implements
much of the basic python list's capability, albeit it's never been intended
as a patch replacement for listobject.c.

>Says who? I feel no need for this. If you think it's a need, write an
>extension module implementing it and then sell that on real-life merit. If
>people find it gives them a real advantage, they'll agitate to move it into
>the core.

mylist = []1000

This would be a cool construct.

C..

Alex Martelli

May 26, 2001, 5:23:08 AM
to
"Tim Peters" <tim...@home.com> wrote in message
news:mailman.990866978...@python.org...
...

> > then I agree that it is so hopelessly broken that Python shouldn't
> > deal with it at all. Yes, at all. Python, as an application, doesn't
> > have the VM tools to do this efficiently.
>
> It could, but, again, it's absurd: Python could do its own heap mgmt on
> Win32, using the Win32 API heap mgmt functions instead of MS's libc's.
> They're adequate to the task. But I'm not adding gobs of delicate
> platform-specific code to worm around a problem nobody has <0.9 wink>.

As I recall from back when I was focusing on heap fragmentation
issues for our apps, it's not libc's fault (not in VC++6.0 at least,
but I think I recall 5.0's was decent already) -- recoding at a lower
level left gobs of problems in Win95 (and I believe Win98 did not
change much there). Using NT4 made the problems go away (even
with libc), so that's what we advised to our customers, and it happens
to be MS's party line too -- "forget Win9x, that's just for games, NT
is the OS for professional use". Maybe XP, allegedly based on the
NT/2000 underlying kernel technology, WILL solve things...


Alex

Isaac To

May 26, 2001, 6:21:27 AM
to
>>>>> "Tim" == Tim Peters <tim...@home.com> writes:

Tim> Bruce explained it in detail a few msgs back:

Tim> http://groups.google.com/groups?hl=en&lr=&safe=off&
Tim> ic=1&th=8fb1dd7dabf75024,41&seekm=
Tim> 3B0DED72.41317BCE%40accessone.com#p

(My last disclaimer still holds.)

I do read other people's messages. What I don't understand is exactly what
Bruce didn't: why in the hell that "old heap" is not freed. For this to
happen, there must be some "small chunks of memory" allocated for Python in
every 4M stretch, and that is ridiculous if all we do is append things to a
list, without creating a single extra variable. There is another way this
can happen without the "small chunks of memory": the malloc subsystem
neglected to combine small contiguous chunks of freed memory. "Perhaps it
is really so horribly broken."

Regards,
Isaac.

Tim Peters

May 26, 2001, 6:17:11 PM
to pytho...@python.org
[Tim]

> Bruce explained it in detail a few msgs back:
>
> http://groups.google.com/groups?hl=en&lr=&safe=off&
> ic=1&th=8fb1dd7dabf75024,41&seekm=
> 3B0DED72.41317BCE%40accessone.com#p

[Isaac]


> I do read messages of others. What I don't understand is exactly
> what Bruce didn't: why in the hell that "old heap" is not freed. For
> this to happen, there must be some "small chunks of memory" allocated

> for Python every other 4M space ...

Na, here's a straight C program that does nothing but reallocs:

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>

int
main(void)
{
    size_t CHUNK = 100;
    size_t size = CHUNK;
    char* p = (char*)malloc(CHUNK);

    for (;;) {
        size += CHUNK;
        p = (char*)realloc(p, size);
        if (!p) {
            fprintf(stderr, "out of memory, at %u\n", size);
            exit(1);
        }
    }
}

Compiled in release mode under MSVC 6, and run on a Win98SE box with 256MB
RAM, it died with

out of memory, at 5906500

Doubling the size each time works much better -- the first time. But when I
put that in a loop, the OS froze solid at about the sixth iteration, and a
reboot was needed. That may be coincidence, but I don't care enough to try
it again.

If you can't let this go, I suggest asking Microsoft very nicely to let you
look at the Windows source code <wink>.

In the meantime, the change I checked in lets Python get away with growing a
list to about 140MB on Win98SE before realloc() craps out.


Martijn Faassen

May 26, 2001, 7:08:32 PM
to
Tim Peters <tim...@home.com> wrote:
[snip]

>> Do you have actual numbers here?

> Every you time you run a Python program, it emails detailed runtime
> statistics to Guido via a secret backdoor in the
> code. And that's the real reason list.append() eventually slows down

> but only on systems like Windows with poor email
> performance <wink>.

No PSU secrets revealed here, folks, just move along. How come Tim
gets to reveal these things when I can't? I keep being cut off,
for instance if I try to reveal that

Darn. I'll try another secret: there are four official bots, the timbot, the
effbot, the martellibot and the

Hmpf. Anyway, this reminds me of Bill Gates' quantum optronic network
built into all windows versions. At nights when nobody is there in the
office, Bill uses that network to switch on all Windows boxes to use
them in a massive cluster just to finish the Windows XP compiles in time.

This explains why hackers commonly use alternative operating systems;
they're the ones that tend to be up late at nights and they don't like
yet another mysterious Windows behavior. It's spooky, all those
boxes spinning up in the dark, status lights blinking on, and then the
sounds of their harddrives crunching away into the night...

good-thing-I-can-still-reveal-minor-secrets-ly yours,

Martijn
--
History of the 20th Century: WW1, WW2, WW3?
No, WWW -- Could we be going in the right direction?

Tim Peters

May 28, 2001, 1:29:42 AM
to pytho...@python.org
[Courageous]
> ...

> I have this implementation at http://www.sourceforge/projects/pythonic.

Better make that

http://www.sf.net/projects/pythonic

("sf" works the same as "sourceforge", but leaving off the ".net" makes the
link dead)

> In cvs, checkout adt. You want the "dlist" (double ended list) type.
> It's a vector which uses prealloc at the head AND the tail. This
> implements much of the basic python list's capability, albeit it's
> never been intended as a patch replacement for listobject.c.

And fun, wasn't it? Extending Python in C is a hoot! I can tell you
enjoyed yourself <wink>.

To answer a question in the code, memcpy() and memmove() are both std C
functions, and you definitely want memmove() whenever the source and
destination slices overlap; memcpy() explicitly refuses to define what
happens then, to make life easier for optimized implementations using
buffering tricks to speed the copy; memmove() guarantees to do the right
thing.

Here's a dirty trick: things like reverse() (which you did implement) and
sort() (which you didn't) can't change the size or address of the data
vector. So you don't have to implement those yourself: instead you can
create a temporary list object, set its ob_item pointer to *your* data
pointer, fill in its ob_size field with your size, and call the
corresponding Python list operation (PyList_Reverse(), PyList_Sort(), etc).
When those return, your "items" pointer will point to the dlist guts
appropriately shuffled. Then set the list object's ob_item member to NULL
and decref the list object (setting ob_item to NULL prevents list dealloc
from crawling over the ob_item vector and decref'ing *its* contents too).

That's cheating, of course, but this has been stable for years, and you
already have access to the list object's defn. by virtue of #include'ing
Python.h (note that you didn't have to #include listobject.h too -- Python.h
already does that). Of course if that breaks some day, tough luck -- but it
probably won't, and the sort code in particular is a major undertaking to
reproduce.

> mylist = []1000
>
> This would be a cool construct.

Yuck.

perhaps-for-sufficiently-uncool-values-of-cool-ly y'rs - tim


Helen Dawson

May 30, 2001, 2:01:05 AM
to
Tim Peters wrote:

> [Isaac To Kar Keung]
> > ...
> > Yes, it rarely happens. But is the solution really so difficult
> > that we have any excuse not doing it right?
>
> You need to define "the problem" first. I'm really not clear on what it is
> you think needs to be solved! If you define it carefully, then I wager
> you'll discover, e.g., that Bruce Dawson's complaint cannot be solved on
> Win9x even if you use an explicit capacity field and do growth by doubling
> (he doesn't realize that yet either). It can be *delayed*, but Win9x VM
> management is such that address space will still get fragmented beyond hope
> unless Python also takes over allocating heaps. This gets absurd.

If you use the list size doubling technique (no, I haven't tried it with Python,
but I'm assuming you haven't either, so my guess about what will happen is
as valid as yours :-) then your list will grow within a single
heap until it reaches the default size of a heap - typically 4 Mbytes. This is
fundamentally unchanged from the current behaviour.

Then, instead of allocating a heap that is 4 Mbytes + 64K, an 8 Mbyte heap
will be allocated, because Python will ask for roughly 8 Mbytes of memory.

Then Python will ask for 16 Mbytes, then 32, then 64, then 128, then 256,
and then 512 Mbytes of memory. Each time, Win9x will happily allocate
a heap of the requested size. When Python asks for a 1024 Mbyte heap
then it will probably fail, because the address space will almost certainly
be too fragmented. It may even fail on the 512 Mbyte allocation. However,
if it makes it to the 256 Mbyte allocation, which seems very likely,
then it will have succeeded in making a list that is approximately 32 times
longer than when it previously failed. On some machines, it might even
run out of memory before it runs out of address space.

If you define the problem as being able to use all 2 Gbytes of address space
for a single list then no, doubling the list size doesn't solve it. If you
define the problem as being able to use lists that are much less constrained,
then the list doubling method works famously. It also should substantially
improve the performance on WinNT and Win2K, where the behaviour is quadratic.

As well as allowing much larger lists to be created much more quickly (no
O(n^2) behaviour) it also leaves memory much less fragmented. Address
space will be allocated to seven or eight heaps, instead of 500. Those
heaps will be available for allocations of any size up to 128 Mbytes or so
(I believe they will be available - I have not verified that).


As to whether the problem is real or not, remember that this thread started
with my experience with solving a *real* problem with strings, which
performed badly in the way I was using them. I figured out why that was
so and switched to lists. Now I have found out that my program will
behave badly with lists also, if I happen to try processing a 'large' text
file - where large is defined as a Mbyte or two (performance degrades
to 25% of array speed after one million appends on WinNT, and
runs out of address space after 1.5 million appends on Win9x).

Now obviously I can fix this problem - switch from list to array, or to
cStringIO. I'm glad that Python gives me these options. However after
years spent lecturing my coworkers on the importance of algorithm
analysis it's surprising to find what seems to be unnecessary quadratic
behavior in Python. Also, many people will not realize when they need
to make the switch.

As Guido must know (from the statistics that are e-mailed to him :-) there
are a lot of programs out there encountering the edges of these conditions
that are running slowly without their authors realizing why. Or, programs
that are written to process small files, and occasionally are used on large
ones - to bad effect. Or, programs that are written on Unix and then
run on NT or Win2K or WinXP or Win9x or some other platform whose
allocator doesn't hide the quadratic behaviour of lists.

Tim Peters wrote (in another message):

> [Bruce Dawson]
> > After discoverting that lists actually are O(n^2),
>
> "Actually" is an odd word for something that took two days to provoke after
> giving up the first time then being assured it was possible to provoke it.
> Is it "actually" the case on Linux too? As has been said, the behavior is
> platform dependent.

It didn't take two days to provoke the problem on Win9x - it took about
five minutes, using the text processing script I had written, on a slightly
larger (double real size) text file. I then simplified things for posting.

I certainly agree that the behavior is platform dependent - that is the whole
crux of the problem (if it is a problem). The Unix allocators happen to be
very good at dealing with this sort of memory allocation pattern, therefore
they *hide* the quadratic behavior. The only sin of the WinNT allocator is
that it fails to paper over quadratic behavior in Python. If the list allocation
scheme grew exponentially (which I now understand is not as trivial to
add to Python as I had thought) then the behavior *wouldn't* be (as)
platform dependent, and scripts would be more portable.


Now, I will try to resist the urge to comment further until I have more
concrete evidence that the problem is real, or a solution, or both. I suspect
that the problem is more serious than the experts believe, but I can't
prove it.

Tim Peters wrote, in yet another message:

> In the meantime, I did check in a change (for 2.2) that continues to do mild
> over-allocation, but on a sliding scale that doesn't think the world ends at
> 500 elements. This speeds real-life appends, but mostly because it gets rid
> of the current integer multiplication and division in favor of simple bit
> shifts. For Bruce's kind of test, it delays the point at which
> one-at-a-time list.append(x) exhausts Win9x VM from about 2 million elements
> to about 35 million. I have higher hopes for NT/Win2K, but haven't yet
> tested it there.

Cool! I just saw this one. That will definitely help. A factor of 16 improvement
in maximum size (and similar improvement in performance) will make any
remaining issues increasingly theoretical. Can you post your new code snippet?

I guess my only (slightly wrinkled nose?) comment would be that it seems like
Python lists are acquiring an exponential growth factor (which is a good thing,
it solves the quadratic performance) the Rube Goldberg way - one special case
at a time.

If I have a better solution I'll post it.

Bruce/Helen Dawson


Helen Dawson

May 30, 2001, 2:54:01 AM
to
Tim Peters wrote:

> [Tim]


> In the meantime, I did check in a change (for 2.2) that continues to do mild
> over-allocation, but on a sliding scale that doesn't think the world ends at
> 500 elements. This speeds real-life appends, but mostly because it gets rid
> of the current integer multiplication and division in favor of simple bit
> shifts.

For what it's worth - this might actually be one point in MS's favour. If I
remember correctly the divides in the roundup() routine are all dividing
by integer constants. VC++ deals with those very nicely. In optimized
builds (optimized for speed anyway) it is impossible to get it to emit a
divide instruction - it always replaces it with bit shifts. It does a
particularly efficient job if the number you are dividing is unsigned. It
doesn't matter what number you divide by - it knows how to simulate
dividing by every integer I've ever tested with.

So, you may have improved on a Win32-only problem, and also improved
on a non-Win32-only problem. The symmetry appeals to me.

Bruce/Helen Dawson


Tim Peters

May 30, 2001, 2:38:09 AM
to pytho...@python.org
[Bruce Dawson]
> ...

> Now obviously I can fix this problem - switch from list to array,

No, array doesn't even do as much as lists used to do to avoid quadratic
behavior. Your test case simply didn't try arrays big enough for this to
show up. The difference is that array.array('c').append(ch) adds one byte
at a time but list.append(x) adds 4 bytes at a time (on a 32-bit box), so
you need 4x the # of array elements before seeing the same bad Windows
behaviors.

> ...

[Tim]


>> In the meantime, I did check in a change (for 2.2) that continues
>> to do mild over-allocation, but on a sliding scale that doesn't
>> think the world ends at 500 elements. This speeds real-life
>> appends, but mostly because it gets rid of the current integer
>> multiplication and division in favor of simple bit shifts. For
>> Bruce's kind of test, it delays the point at which one-at-a-time
>> list.append(x) exhausts Win9x VM from about 2 million elements
>> to about 35 million. I have higher hopes for NT/Win2K, but haven't
>> yet tested it there.

I've tested it now, and on Win2K it remains flat and fast until the list
exhausts physical RAM, at which point there are long hiccups at resize
points. Linux was always well-behaved, and still is, and the hiccups are
much shorter there.

[Bruce]


> Cool! I just saw this one. That will definitely help. A factor of
> 16 improvement

Maybe. As I mentioned in another msg, when I did a straight C realloc
double-the-size test (no Python) and put it in a loop, the end result was
that Win98SE froze on about the 6th iteration. If you haven't noticed yet
in other contexts, take the hint <wink>: Win9x is not a robust platform.

> in maximum size (and similar improvement in performance) will make any
> remaining issues increasingly theoretical. Can you post your new
> code snippet?

It's checked in. Go to

http://cvs.sf.net/cgi-bin/viewcvs.cgi/python/python/dist/src/

and navigate to Objects/listobject.c, revision 2.93.

> I guess my only (slightly wrinkled nose?) comment would be that
> it seems like Python lists are acquiring an exponential growth
> factor

But a very mild one. It had a very mild one before too, but stopped
increasing it after 500 elements. Now there's no bound.

> (which is a good thing, it solves the quadratic performance) the
> Rube Goldberg way - one special case at a time.

Not sure I follow. list.append() specifically was "the problem". What
other behavior are you trying to address?

> If I have a better solution I'll post it.

Sure, but next time please try it first <wink>.


Tim Peters

May 30, 2001, 4:57:57 AM
to pytho...@python.org
[Bruce Dawson]

> For what it's worth - this might actually be one point in MS's
> favour. If I remember correctly the divides in the roundup()
> routine are all dividing by integer constants.

They used to, yes. Not anymore.

> VC++ deals with those very nicely. In optimized builds (optimized
> for speed anyway) it is impossible to get it to emit a divide
> instruction - it always replaces it with bit shifts.

Not quite. This is well-known technology in the compiler biz, and shifts
only suffice for very special divisors. Here's what it generates for an int
divide by 100 (with numerator in ecx), which is more typical:

    mov eax, 1374389535
    imul ecx
    mov eax, edx
    sar eax, 5
    mov ecx, eax
    shr ecx, 31
    add eax, ecx

The expensive part is the IMUL, and the cost of that varies widely across
Pentium flavors. It's certainly faster than a division, but still much
costlier than a simple shift.

> It does a particularly efficient job if the number you are dividing
> is unsigned.
>
> It doesn't matter what number you divide by - it knows how to simulate
> dividing by every integer I've ever tested with.

If you're curious about how this works, search Google for "Integer Division
Using Reciprocals" (1991), by Robert Alverson. That even covers how to do
it with multiplies and shifts if you *don't* know the divisor at
compile-time (and, e.g., the Itanium has no division instruction, so on that
chip these tricks are essential).

For a known divisor c, in effect you treat x/c as x*(2**i/c)/2**i for a
suitable i, compute 2**i/c at compile-time, multiply x by that at runtime,
"/2**i" is a right shift, and so multiply and a right shift gets an
excellent guess. Then you carefully adjust for all possible endcase errors.
It's not simple, but it does work <wink>.
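
You can check the /100 sequence above from Python, at least for
non-negative x (the constant and the shift count below are read straight
off the assembly):

    MAGIC = 1374389535          # == (2**37 + 99) / 100, i.e. ceil(2**37/100)
    SHIFT = 32 + 5              # imul keeps the high 32 bits, then sar eax, 5

    def div100(x):
        # exact for every non-negative 32-bit x; the trailing
        # "shr ecx, 31 / add" in the asm patches up negative x
        return (x * MAGIC) >> SHIFT

    for x in (0, 1, 99, 100, 101, 12345, 2**31 - 1):
        assert div100(x) == x / 100
    print "multiply-and-shift agrees with integer division"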

> So, you may have improved on a Win32-only problem, and also improved
> on a non-Win32-only problem also. The symmetry appeals to me.

Yup. I was much keener to get rid of the division than to do the resizing
bit, in fact.


Greg Ewing

May 31, 2001, 1:39:15 AM
to
Helen Dawson wrote:
>
> Address
> space will be allocated to seven or eight heaps, instead of 500. Those
> heaps will be available for allocations of any size up to 128 Mbytes or so

If allocating one of these heaps is just an allocation of
address space - without any actual memory being allocated
within the heap until needed - then why doesn't the C
library just allocate one great big heap at the outset?

--
Greg Ewing, Computer Science Dept, University of Canterbury,
Christchurch, New Zealand
To get my email address, please visit my web page:
http://www.cosc.canterbury.ac.nz/~greg

Helen Dawson

May 31, 2001, 2:30:58 AM
to
Tim Peters wrote:

> [Bruce Dawson]
> > ...
> > Now obviously I can fix this problem - switch from list to array,
>
> No, array doesn't even do as much as lists used to do to avoid quadratic
> behavior. Your test case simply didn't try arrays big enough for this to

Darn.

> Maybe. As I mentioned in another msg, when I did a straight C realloc
> double-the-size test (no Python) and put it in a loop, the end result was
> that Win98SE froze on about the 6th iteration. If you haven't noticed yet
> in other contexts, take the hint <wink>: Win9x is not a robust platform.

Oooohhhh yeah. I've managed to avoid doing any significant development
on Win9x - I've run NT variants for seven years - but my code still needs
to run on Win9x. Such a pity.

As further proof of how bizarre the Win9x behavior is, I hypothesized that
if I went:

range(4000000)

in the PythonWin interactive window before running my script, I might
get better performance. Indeed I do. Win9x will then run my script to
completion, and *much* faster also.

The range(4000000) statement creates a list with four million elements
(obviously) which forces a large heap to be created. Since the list is
immediately destroyed, the heap is empty and available. Then, all
subsequent growing of my other lists run quite nicely because the
allocations all fit in the large heap, so realloc() works without any
copying.

A truly gross hack, mentioned purely for amusement purposes.


Carlos Ribeiro

May 31, 2001, 6:58:30 PM
to pytho...@python.org

Just an idea - isn't a paged array better suited for this kind of stuff? It
would scale pretty well. I do remember that Unix has a similar structure
for disk block allocation. For small lists a single lookup is needed; for
large ones a few extra lookups could be compensated by avoiding the
realloc() performance hit.
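
Something along these lines, say -- a toy sketch with a made-up page
size, not a proposal for listobject.c:

    # Elements live in fixed-size pages, so growth never touches elements
    # already sitting in full pages -- any copying is confined to the
    # bounded final page.

    PAGE = 1024

    class PagedList:
        def __init__(self):
            self.pages = [[]]               # only the last page ever grows

        def append(self, item):
            last = self.pages[-1]
            if len(last) == PAGE:
                last = []
                self.pages.append(last)
            last.append(item)

        def __len__(self):
            return (len(self.pages) - 1) * PAGE + len(self.pages[-1])

        def __getitem__(self, i):
            # one extra lookup compared to a flat vector, as noted above
            if not 0 <= i < len(self):
                raise IndexError(i)
            return self.pages[i / PAGE][i % PAGE]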

Of course, we assume that the problem is related to the list structure
itself. The list elements are still subject to their own allocation
policies. Also bear in mind that operations such as insertion and deletion
could take a potentially bigger performance hit. However, I feel that such
operations should not be an issue anyway, at least for large arrays; the
only way we can remotely hope to perform them efficiently (for large
lists) is using linked lists, or other more complex data structures.

Just my $.02 worth.


Carlos Ribeiro

Tim Peters

May 31, 2001, 10:32:34 PM
to pytho...@python.org
[Carlos Ribeiro]

> Just an idea - isn't a paged array better suited for this kind of
> stuff?

For someone who wants to append millions of things one at a time, sure --
*anything* is better than a contiguous vector then. If somebody wants that
enough, they can write an extension to provide it, but Python lists are
contiguous vectors and will stay that way -- Guido had more than enough of
slowing down every case to cater to unreasonable cases when he worked on
ABC. As a result, Python's builtin types strive to be just as dumb as
possible <0.9 wink>.


Carlos Ribeiro

Jun 1, 2001, 7:21:54 AM
to Tim Peters, pytho...@python.org
At 22:32 31/05/01 -0400, Tim Peters wrote:
>... Guido had more than enough of

>slowing down every case to cater to unreasonable cases when he worked on
>ABC. As a result, Python's builtin types strive to be just as dumb as
>possible <0.9 wink>.

I can see the wisdom of Guido on this. Taken separately, each little such
"optimization" slows down the most common case very little. Looked at in
isolation, each one seems reasonable. But when you get hundreds of such
optimizations, the resulting code is bloated and *much* slower than the
simpler one would ever be.

Anyway, (as said before) Python's nature makes it easy to solve this by means
of extensions. A different list implementation could be done and exposed
using some well known interfaces. Maybe it proves to be faster for every
case; if it does, nobody would mind patching The Source with it, don't
you think?

Who's volunteering? :-)


Carlos Ribeiro

Panu A Kalliokoski

Jun 1, 2001, 8:14:39 AM
to
Courageous <jkra...@san.rr.com> wrote:

> The correct thing to do if performance becomes important
> to you in this context is to convert the string to a list,
> manipulate it as a list, and then convert it back to a string.

Interesting that nobody mentioned this: because strings are immutable, a
string produced by concatenation could be represented by a tree node of
the two substrings instead of a copy of the original character arrays.
This works because the substrings are guaranteed to remain the same.
String handling would become more complex, so this might not be worth
doing, but in the example case it would keep the performance O(n) for the
test.
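
A bare-bones sketch of that representation (just enough to show the
idea; real rope implementations do much more):

    # "+" builds a two-child tree node in O(1); characters are copied out
    # only once, when the final string is actually needed.

    class Rope:
        def __init__(self, text="", left=None, right=None):
            self.text, self.left, self.right = text, left, right

        def __add__(self, other):
            if not isinstance(other, Rope):
                other = Rope(str(other))
            return Rope(left=self, right=other)     # no copying here

        def flatten(self):
            # one linear pass collects the leaves, left to right
            parts, stack = [], [self]
            while stack:
                node = stack.pop()
                if node.left is None and node.right is None:
                    parts.append(node.text)
                else:
                    stack.append(node.right)        # popped after left
                    stack.append(node.left)
            return "".join(parts)

    s = Rope()
    for i in range(2000):
        s = s + " "                 # constant time per concatenation
    assert len(s.flatten()) == 2000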

Panu Kalliokoski

Helen Dawson

Jun 5, 2001, 1:59:22 AM
to
Just for those who are interested in the original Win9x
behaviour and want to see what it does to the memory map, I have
posted a text memory map of the Win98 address space of Python.exe
after running my append script. This is the script that kept appending
characters to a Python list until Win98 ran out of address space,
but not memory.

http://www.cygnus-software.com/papers/pythonmemorymap.txt

You can see the heaps, starting at 4088K and getting bigger by 4K
until memory is filled. Completely understanding the memory map
requires understanding MMUs and the Win32 allocator system, but the
rough idea should be clear. Reserved means that address space is
reserved but not memory. Committed means that memory is reserved
also.

Address space is allocated in 64K chunks, which is why there are
reserved blocks that are 4K to 60K at the end of heaps.

"Amiguous Ambiguous Unkown" means that that block has multiple
"states" and "protection" flags and that the type of the memory is not
known. But trust me, the 4 Mbyte+ chunks are mostly heaps.

Bruce/Helen Dawson

0 new messages