Performance of lists vs. list comprehensions

Gerald Britton

unread,

Jan 19, 2010, 10:30:21 AM1/19/10

to pytho...@python.org

Yesterday I stumbled across some old code in a project I was working
on. It does something like this:

mystring = '\n'.join( [ line for line in lines if <some conditions
depending on line> ] )

where "lines" is a simple list of strings. I realized that the code
had been written before you could put a list comprehension as an
argument to join(). I figured that I could drop the '[' and ']' and
leave the rest as a list comprehension since that works with current
Python (and the project's base is 2.5). So I rewrote the original
statement like this:

mystring = '\n'.join( line for line in lines if <some conditions
depending on line> )

It works as expected. Then I got curious as to how it performs. I
was surprised to learn that the rewritten expression runs more than
twice as _slow_. e.g.:

>>> l
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

>>> Timer("' '.join([x for x in l])", 'l = map(str,range(10))').timeit()
2.9967339038848877

>>> Timer("' '.join(x for x in l)", 'l = map(str,range(10))').timeit()
7.2045478820800781

Notice that I dropped the condition testing that was in my original
code. I just wanted to see the effect of two different expressions.

I thought that maybe there was some lower bound on the number of the
items in the list or list comprehension beyond which the comprehension
would prove more efficient. There doesn't appear to be one. I scaled
the length of the input list up to 1 million items and got more or
less the same relative performance.

Now I'm really curious and I'd like to know:

1. Can anyone else confirm this observation?

2. Why should the "pure" list comprehension be slower than the same
comprehension enclosed in '[...]' ?

--
Gerald Britton

Alf P. Steinbach

unread,

Jan 19, 2010, 10:57:39 AM1/19/10

to

* Gerald Britton:

5.8625191190500345

>>> Timer("' '.join(x for x in l)", 'l = map(str,range(10))').timeit()

12.093135300715574
>>> _

> 2. Why should the "pure" list comprehension be slower than the same
> comprehension enclosed in '[...]' ?

Regarding (2) the unparenthesized expression in join is *not* a list
comprehension but a generator expression.

And as such it involves join calling next() on the generator object repeatedly,
with each next() call involving a light-weight context shift.

In addition the docs mumble something about "lazy" evaluation, and that may also
contribute to the overhead.

I think that in contrast, the interpreter can evaluate a list comprehension, [x
for x in blah], directly without any context shifting, just by transforming it
to equivalent code and putting the target expressions innermost there.

And so the main factor causing a slowdown for a list comprehension would, I
think, be paging and such if the list it produced was Really Huge, while for the
generator there's no memory issue but rather much calling & context shifting.

Cheers & hth.,

- Alf

Gerald Britton

unread,

Jan 19, 2010, 11:10:43 AM1/19/10

to Alf P. Steinbach, pytho...@python.org

Thanks! Good explanation.

> --
> http://mail.python.org/mailman/listinfo/python-list
>

--
Gerald Britton

Gerald Britton

unread,

Jan 19, 2010, 11:26:43 AM1/19/10

to Stephen Hansen, pytho...@python.org

Interestingly, I scaled it up to a million list items with more or
less the same results. It's helpful to see that your results are
different. That leads me to suspect that mine are somehow related to
my own environment.

Still I think the key is the overhead in calling next() for each item
in the generator expression. That in itself probably accounts for the
differences since function calls are somewhat expensive IIRC.

On Tue, Jan 19, 2010 at 11:18 AM, Stephen Hansen <apt.s...@gmail.com> wrote:
> On Tue, Jan 19, 2010 at 7:30 AM, Gerald Britton <gerald....@gmail.com>
> wrote:
> [snip]

>>
>> mystring = '\n'.join( line for line in lines if <some conditions
>> depending on line> )
>

> Note, this is not a list comprehension, but a generator comprehension.
> A list comprehension is used to build, in one sweep, a list and return it.
> A generator comprehension is used to build an generator that can be iterated
> over to produce a sequence of items.
> I think you're seeing not performance of the expression, but the function
> call overhead which generators include. A generator requires one to call its
> next method to get each item: a list comprehension is just syntactical sugar
> for a for loop.
> As to which is faster, I think it depends. Your test-case is using *really*
> small ranges-- just ten items. In this case, just doing a simple loop to
> build a list and then pass it through join is probably faster. If you're
> using a much larger list though, the characteristics of the problem may
> change, where the lazy evaluation of a generator expression may be more
> desirable.
> A list comprehension includes a waste of memory, too: you have to build up a
> complete list before you return it, and if you have a lot of lines? That can
> be a problem.
> As you can see, the performance characteristics between the two narrow
> considerably if you compare a larger sample:
> >>> Timer("' '.join([str(x) for x in l])", 'l = xrange(100)').timeit()
> 50.092024087905884
>>>> Timer("' '.join(str(x) for x in l)", 'l = xrange(100)').timeit()
> 54.591049909591675
> --S

--
Gerald Britton

Wolfram Hinderer

unread,

Jan 19, 2010, 2:29:15 PM1/19/10

to

On 19 Jan., 16:30, Gerald Britton <gerald.brit...@gmail.com> wrote:
> >>> Timer("' '.join([x for x in l])", 'l = map(str,range(10))').timeit()
>
> 2.9967339038848877
>
> >>> Timer("' '.join(x for x in l)", 'l = map(str,range(10))').timeit()
>
> 7.2045478820800781

[...]

> 2. Why should the "pure" list comprehension be slower than the same
> comprehension enclosed in '[...]' ?

Others have already commented on "generator expression" vs. "list
comprehension". I'll try to shed some light on the cause of the
slowness.

For me it's

>>> Timer("' '.join([x for x in l])", 'l = map(str,range(10))').timeit()

0.813948839866498

>>> Timer("' '.join(x for x in l)", 'l = map(str,range(10))').timeit()

2.226825476422391

But wait! I'm on Python 3.1 and the setup statement has to be changed
to make this test meaningful.
>>> Timer("' '.join([x for x in l])", 'l = list(map(str,range(10)))').timeit()
2.5788493369966545
>>> Timer("' '.join(x for x in l)", 'l = list(map(str,range(10)))').timeit()
3.7431774848480472

Much smaller factor now.
But wait! If we want to test list comprehension against generator
comprehension, we should try a function that just consumes the
iterable.

>>> setup = """l = list(map(str,range(10)))
... def f(a):
... for i in a: pass
... """
>>> Timer("f([x for x in l])", setup).timeit()
3.288511528699928
>>> Timer("f(x for x in l)", setup).timeit()
2.410873798206012

Oops! Iteration over generator expressions is not inherently more
expension than iteration over list comprehensions. But certainly
building a list from a generator expression is more expensive than a
list comprehension?

>>> Timer("[x for x in l]", 'l = list(map(str,range(10)))').timeit()
2.088602950933364
>>> Timer("list(x for x in l)", 'l = list(map(str,range(10)))').timeit()
3.691566805277944

Yes, list building from a generator expression *is* expensive. And
join has to do it, because it has to iterate twice over the iterable
passed in: once for calculating the memory needed for the joined
string, and once more to actually do the join (this is implementation
dependent, of course). If the iterable is a list already, the list
building is not needed.

Gerald Britton

unread,

Jan 19, 2010, 3:06:13 PM1/19/10

to Wolfram Hinderer, pytho...@python.org

[snip]

>
> Yes, list building from a generator expression *is* expensive. And
> join has to do it, because it has to iterate twice over the iterable
> passed in: once for calculating the memory needed for the joined
> string, and once more to actually do the join (this is implementation
> dependent, of course). If the iterable is a list already, the list
> building is not needed.

if join has to iterate twice over the iterable, how does this work?

$ python3.1
Python 3.1.1+ (r311:74480, Nov 2 2009, 14:49:22)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> l = map(str, (x for x in range(10) if int(x)%2))
>>> '.'.join(l)
'1.3.5.7.9'
>>>

If join had to iterate twice over l, it would be consumed on the first
pass. If it works as you say then join would have to copy the
iterable on the first pass, effectively turning it into a list.
Though I didn't read through it, I would suppose that join could use a
dynamic-table approach to hold the result, starting with some
guesstimate then expanding the result buffer if and when needed.

Arnaud Delobelle

unread,

Jan 19, 2010, 4:01:31 PM1/19/10

to

Gerald Britton <gerald....@gmail.com> writes:

Looking at the source (py3k):

PyObject *
PyUnicode_Join(PyObject *separator, PyObject *seq)
{
[skip declarations]

fseq = PySequence_Fast(seq, "");
if (fseq == NULL) {
return NULL;
}

[code that works out the length of the joined string then allocates
memory, then fills it]
}

Where PySequence_Fast(seq, "") returns seq if seq is already a tuple or
a list and otherwise returns a new tuple built from the elements of seq.

So no, it doesn't guess the size of the joined string and yes, it
iterates twice over the "sequence" (I would have thought it should be
called an iterable) by copying it into a tuple.

--
Arnaud

Wolfram Hinderer

unread,

Jan 19, 2010, 4:15:14 PM1/19/10

to

On 19 Jan., 21:06, Gerald Britton <gerald.brit...@gmail.com> wrote:
> [snip]
>
>
>
> > Yes, list building from a generator expression *is* expensive. And
> > join has to do it, because it has to iterate twice over the iterable
> > passed in: once for calculating the memory needed for the joined
> > string, and once more to actually do the join (this is implementation
> > dependent, of course). If the iterable is a list already, the list
> > building is not needed.
>
> if join has to iterate twice over the iterable, how does this work?
>
> $ python3.1
> Python 3.1.1+ (r311:74480, Nov 2 2009, 14:49:22)
> [GCC 4.4.1] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>
> >>> l = map(str, (x for x in range(10) if int(x)%2))
> >>> '.'.join(l)
> '1.3.5.7.9'
>
> If join had to iterate twice over l, it would be consumed on the first
> pass.

Yes. (Coincidentally, l is consumed in the first execution of the Timer
()-statement, which is why I had to add the call to list. Not to
mention the xrange example of Stephen. But all this is not linked to
what join does internally.)

> If it works as you say then join would have to copy the
> iterable on the first pass, effectively turning it into a list.

Yes, that's what I'm saying above in the first two lines of what you
quoted.
I should have added somehting like "AFAIK CPython does it that way".

> Though I didn't read through it, I would suppose that join could use a
> dynamic-table approach to hold the result, starting with some
> guesstimate then expanding the result buffer if and when needed.

Probably. But that's not what happens. Try
"".join("" for x in range(10**10)
and watch join eating memory.

Gerald Britton

unread,

Jan 19, 2010, 4:20:42 PM1/19/10

to Arnaud Delobelle, pytho...@python.org

That's surprising. I wouldn't implement it that way at all. I'd use a
dynamically-expanding buffer as I suggested. That way you get a
single pass and don't have to calculate anything before you begin. In
the best case, you'd use half the memory (the result just fits in the
buffer after its last expansion and no saved tuple). In the worst
case, the memory use is about the same (you just expanded the buffer
using a 2x expansion rule then you hit the last item).

Still I suppose the author thought of that approach and rejected it
for reasons I can't yet see.

Raymond Hettinger

unread,

Jan 19, 2010, 4:54:02 PM1/19/10

to

[Wolfram Hinderer]

> Yes, list building from a generator expression *is* expensive. And
> join has to do it, because it has to iterate twice over the iterable
> passed in: once for calculating the memory needed for the joined
> string, and once more to actually do the join (this is implementation
> dependent, of course). If the iterable is a list already, the list
> building is not needed.

Good analysis. That is exactly what is going on under the hood.

Raymond

Arnaud Delobelle

unread,

Jan 19, 2010, 4:55:49 PM1/19/10

to

Gerald Britton <gerald....@gmail.com> writes:

> That's surprising. I wouldn't implement it that way at all. I'd use a
> dynamically-expanding buffer as I suggested. That way you get a
> single pass and don't have to calculate anything before you begin. In
> the best case, you'd use half the memory (the result just fits in the
> buffer after its last expansion and no saved tuple). In the worst
> case, the memory use is about the same (you just expanded the buffer
> using a 2x expansion rule then you hit the last item).
>
> Still I suppose the author thought of that approach and rejected it
> for reasons I can't yet see.

I don't know the reasons, but I'm guessing they could be historic.
Before Python had iterators, str.join would mostly have been only given lists
and tuples as arguments, in which case the current approach seems to be
the most appropriate. Later, when things like generator functions and
generator expressions were introduced, perhaps str.join wasn't optimized
to accomodate them.

--
Arnaud

Steven D'Aprano

unread,

Jan 19, 2010, 8:52:41 PM1/19/10

to

On Tue, 19 Jan 2010 11:26:43 -0500, Gerald Britton wrote:

> Interestingly, I scaled it up to a million list items with more or less
> the same results.

A million items is not a lot of data. Depending on the size of each
object, that might be as little as 4 MB of data:

>>> L = ['' for _ in xrange(10**6)]
>>> sys.getsizeof(L)
4348732

Try generating a billion items, or even a hundred million, and see how
you go.

This is a good lesson in the dangers of premature optimization. I can't
think how many times I've written code using a generator expression
passed to join, thinking that would surely be faster than using a list
comprehension ("save building a temporary list first").

--
Steven

Steven D'Aprano

unread,

Jan 19, 2010, 9:52:16 PM1/19/10

to

On Tue, 19 Jan 2010 16:20:42 -0500, Gerald Britton wrote:

> That's surprising. I wouldn't implement it that way at all. I'd use a
> dynamically-expanding buffer as I suggested. That way you get a single
> pass and don't have to calculate anything before you begin. In the best
> case, you'd use half the memory (the result just fits in the buffer
> after its last expansion and no saved tuple). In the worst case, the
> memory use is about the same (you just expanded the buffer using a 2x
> expansion rule then you hit the last item).

In the worst case, you waste 50% of the memory allocated. And because
strings are immutable (unlike lists and dicts, which also use this
approach), you can never use that memory until the string is garbage
collected.

In the current approach, join produces a temporary sequence, but it
doesn't last very long. With your suggested approach, you could end up
with a large number of long-lasting strings up to twice the size
necessary. Since join is particularly useful for building large strings,
this could be a significant memory pessimation.

The obvious fix is for join to shrink the buffer once it has finished
building the string, but the success of that may be operating system
dependent. I don't know -- it sounds like a recipe for memory
fragmentation to me. And even if it can be done, reclaiming the memory
will take time, potentially more time than the current approach.

Still, it's probably worth trying, and seeing if it speeds join up.

--
Steven

Alf P. Steinbach

unread,

Jan 19, 2010, 11:25:22 PM1/19/10

to

* Steven D'Aprano:

> On Tue, 19 Jan 2010 16:20:42 -0500, Gerald Britton wrote:
>
>> That's surprising. I wouldn't implement it that way at all. I'd use a
>> dynamically-expanding buffer as I suggested. That way you get a single
>> pass and don't have to calculate anything before you begin. In the best
>> case, you'd use half the memory (the result just fits in the buffer
>> after its last expansion and no saved tuple). In the worst case, the
>> memory use is about the same (you just expanded the buffer using a 2x
>> expansion rule then you hit the last item).
>
> In the worst case, you waste 50% of the memory allocated.

Yes. That is a good argument for not doing the expanding buffer thing. But such
buffers may be generally present anyway, resulting from optimization of "+".

Using CPython 2.6.4 in Windows XP:

>>> def elapsed_time_for( f, n_calls ):
... return timeit.timeit( f, number = n_calls )
...
>>> def appender( n ):
... def makestr( n = n ):
... s = ""
... for i in xrange( n ):
... s = s + "A"
... return s
... return makestr
...
>>> appender( 1000 )() == 1000*"A"
True
>>>
>>> for i in xrange( 10 ):
... print( elapsed_time_for( appender( 10000*(i+1) ), 100 ) )
...
0.782596670811
1.37728454314
2.10189898437
2.76442173517
3.34536707878
4.08251830889
4.79620119317
5.42201844089
6.12892811796
6.84236460221
>>> _

Here the linear increase of times indicate that the "+" is being optimized using
an expanding buffer for the string. If only the minimal space was allocated each
time then one would expect O(n^2) behavior instead of the apparently O(n) above.
Example of that O(n^2) behavior given below.

>>> def exact_appender( n ):
... def makestr( n = n ):
... s = ""
... for i in xrange( n ):
... new_s = s + "A"
... s = new_s
... return s
... return makestr
...
>>> exact_appender( 1000 )() == 1000*"A"
True
>>> for i in xrange( 10 ):
... print( elapsed_time_for( exact_appender( 10000*(i+1) ), 100 ) )
...
3.28094241027
9.30584501661
19.5319170453
33.6563767183
52.3327800042
66.5475022663
84.8809736992
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "<stdin>", line 2, in elapsed_time_for
File "C:\Program Files\cpython\python26\lib\timeit.py", line 227, in timeit
return Timer(stmt, setup, timer).timeit(number)
File "C:\Program Files\cpython\python26\lib\timeit.py", line 193, in timeit
timing = self.inner(it, self.timer)
File "C:\Program Files\cpython\python26\lib\timeit.py", line 99, in inner
_func()
File "<stdin>", line 5, in makestr
KeyboardInterrupt
>>> _

So, given that apparently the simple '+' in the first example is optimized using
an expanding buffer, which then hangs around, it's not clear to me that the
space optimization in 'join' really helps. It may be (but isn't necessarily)
like shoveling snow in a snowstorm. Then the effort/cost could be for naught.

> And because
> strings are immutable (unlike lists and dicts, which also use this
> approach), you can never use that memory until the string is garbage
> collected.

I think that the simple '+', with the apparent optimization shown in the first
example above, can use that space. I know for a fact that when you control a
string implementation then it can do that (since I've implemented that). But I
don't know for a fact that it's practical to do so in Python. In order to use
the space the implementation must know that there's only one reference to the
string. And I don't know whether that information is readily available in a
CPython implementation (say), although I suspect that it is.

Cheers,

- Alf

Steven D'Aprano

unread,

Jan 20, 2010, 1:12:18 AM1/20/10

to

On Wed, 20 Jan 2010 05:25:22 +0100, Alf P. Steinbach wrote:

> * Steven D'Aprano:
>> On Tue, 19 Jan 2010 16:20:42 -0500, Gerald Britton wrote:
>>
>>> That's surprising. I wouldn't implement it that way at all. I'd use a
>>> dynamically-expanding buffer as I suggested. That way you get a
>>> single pass and don't have to calculate anything before you begin. In
>>> the best case, you'd use half the memory (the result just fits in the
>>> buffer after its last expansion and no saved tuple). In the worst
>>> case, the memory use is about the same (you just expanded the buffer
>>> using a 2x expansion rule then you hit the last item).
>>
>> In the worst case, you waste 50% of the memory allocated.
>
> Yes. That is a good argument for not doing the expanding buffer thing.
> But such buffers may be generally present anyway, resulting from
> optimization of "+".

As near as I can determine, the CPython optimization you are referring to
doesn't use the "double the buffer when needed" technique. It operates on
a completely different strategy. As near as I can tell (as a non-C
speaker), it re-sizes the string in place to the size actually needed,
thus reducing the amount of copying needed.

The optimization patch is here:

http://bugs.python.org/issue980695

and some history (including Guido's opposition to the patch) here:

http://mail.python.org/pipermail/python-dev/2004-August/046686.html

Nevertheless, the patch is now part of CPython since 2.4, but must be
considered an implementation-specific optimization. It doesn't apply to
Jython, and likely other implementations.

> Using CPython 2.6.4 in Windows XP:

[snip time measurements]

> Here the linear increase of times indicate that the "+" is being
> optimized using an expanding buffer for the string.

Be careful of jumping to the conclusion from timings that a certain
algorithm is used. All you can really tell is that, whatever the
algorithm is, it has such-and-such big-oh behaviour.

> If only the minimal
> space was allocated each time then one would expect O(n^2) behavior
> instead of the apparently O(n) above.

I don't follow that reasoning.

> Example of that O(n^2) behavior given below.

The example shown demonstrates that the + optimization only applies in
certain specific cases. In fact, it only applies to appending:

s += t
s = s + t

but not prepending or concatenation with multiple strings:

s = t + s
s += t + u

However, keep in mind that the CPython keyhole optimizer will take
something like this:

s += "x" + "y"

and turn it into this:

s += "xy"

which the concatenation optimization does apply to. Optimizations make it
hard to reason about the behaviour of algorithms!

--
Steven

Stefan Behnel

unread,

Jan 20, 2010, 2:38:12 AM1/20/10

to

Steven D'Aprano, 20.01.2010 07:12:

> On Wed, 20 Jan 2010 05:25:22 +0100, Alf P. Steinbach wrote:
>> That is a good argument for not doing the expanding buffer thing.
>> But such buffers may be generally present anyway, resulting from
>> optimization of "+".
>
> As near as I can determine, the CPython optimization you are referring to
> doesn't use the "double the buffer when needed" technique. It operates on
> a completely different strategy. As near as I can tell (as a non-C
> speaker), it re-sizes the string in place to the size actually needed,
> thus reducing the amount of copying needed.
>

>> Using CPython 2.6.4 in Windows XP:
> [snip time measurements]
>> Here the linear increase of times indicate that the "+" is being
>> optimized using an expanding buffer for the string.
>
> Be careful of jumping to the conclusion from timings that a certain
> algorithm is used. All you can really tell is that, whatever the
> algorithm is, it has such-and-such big-oh behaviour.

Which is particularly tricky here since the algorithms depends more on the
OS than on the code in CPython. The better timings come from the fact that
the OS does *not* need to copy the buffer on each iteration, but does
smarter things when asked to enlarge the buffer. If you ran the benchmark
on an OS that *did* copy the buffer each time, the runtime would really be
quadratic.

BTW, I think it would actually be worth trying to apply the same approach
to str.join() if the argument is not a sequence (obviously followed by a
benchmark on different platforms).

Stefan

Alf P. Steinbach

unread,

Jan 20, 2010, 3:21:30 AM1/20/10

to

Provided that the CPython code is that code then you're right, it only calls
PyObject_REALLOC, depending on that operation having (amortized) constant time.

And PyObject_REALLOC can in principle achieve that by relying on paging, e.g.
using the OS' virtual memory allocator, instead of doubling and copying.

However, I'm baffled by the code I find via Google Code search; there
PyObject_REALLOC simply calls malloc and copies, which if that were the code
used in CPython 2.4 and greater, and if the '+' code is the code that you refer
to above, would produce O(n^2) time for the first '+' example.

>> Using CPython 2.6.4 in Windows XP:
> [snip time measurements]
>> Here the linear increase of times indicate that the "+" is being
>> optimized using an expanding buffer for the string.
>
> Be careful of jumping to the conclusion from timings that a certain
> algorithm is used. All you can really tell is that, whatever the
> algorithm is, it has such-and-such big-oh behaviour.

Well, as for intended meaning you're right about that, it needs not be a
doubling buffer.

>> If only the minimal
>> space was allocated each time then one would expect O(n^2) behavior
>> instead of the apparently O(n) above.
>
> I don't follow that reasoning.

The reasoning assumes independent allocations rather than reallocation
(extension of existing allocation).

Then quadratic time for the example follows from sum(range(n)) = (n^2-n)/2 of
the sizes of the data copied.

But with PyObject_REALLOC as essentially constant time also 'join' could take
advantage of this. :-)

Cheers,

- Alf

Dave Angel

unread,

Jan 20, 2010, 6:05:50 AM1/20/10

to Steven D'Aprano, pytho...@python.org

Steven D'Aprano wrote:
> A million items is not a lot of data. Depending on the size of each
> object, that might be as little as 4 MB of data:
>
>
>>>> L = ['' for _ in xrange(10**6)]
>>>> sys.getsizeof(L)
>>>>
> 4348732
>

> <snip>
Note that this sys.getsizeof() is only saying the size of the list, not
the size of the objects referenced in the list. So that number doesn't
depend on "the size of the object."

DaveA