How to pop a list item in Cython?


Simon King

unread,
Oct 19, 2011, 10:26:37 AM10/19/11
to sage-support
Hi!

I found that L.pop(0) (where L is a list) tends to be relatively slow
in Cython - even if L is cdefined as list.

Google pointed me to some discussion about a Cython optimisation for
L.pop(0), but it is not clear to me whether the optimisation actually
made it into Sage. Can you tell me the status?

Can you show me a way to be faster than what Cython makes of L.pop(0)?
What I found (see below) is faster on small lists, but not so good on
longer lists. Also it doesn't change the list in place.

Best regards,
Simon


%cython
include "python.pxi"
cpdef f(list l):
    return l.pop(0), l
cpdef g(list l):
    return l[0], PyList_GetSlice(l, 1, PyList_GET_SIZE(l))

g is faster than f for lists of size 60, at least on sagenb. I am
aware that g does not change l in place - actually I would prefer if
it did.

Simon King

unread,
Oct 19, 2011, 1:49:56 PM10/19/11
to sage-support
It might be a good idea to use collections.deque, not lists. First
tests seem to indicate that their popleft() is faster than a list's
pop(0).
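The difference can be sketched in plain Python (an illustrative comparison, not a benchmark):

```python
from collections import deque

# Sketch: deque.popleft() removes the first element without shifting
# the rest, while list.pop(0) must move every remaining item.
L = list(range(5))
D = deque(L)

first_from_list = L.pop(0)      # O(n): shifts the remaining pointers
first_from_deque = D.popleft()  # O(1): just advances the head
```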

Alexander Juarez

unread,
Oct 19, 2011, 1:58:24 PM10/19/11
to sage-s...@googlegroups.com
I think I found collections.deque() to be faster than list.pop(),
which would make sense if the list is implemented as a linked list
object and the collections type uses an array implementation.




Simon King

unread,
Oct 19, 2011, 3:42:44 PM10/19/11
to sage-support
Hi Alexander and all,

First of all, sorry for posting the same message twice - Google
Groups reported an error when I first tried to post.

On 19 Okt., 19:58, Alexander Juarez <coua...@gmail.com> wrote:
> I think I found collections.deque() to be faster than list.pop(),
> which would make sense if the list is implemented as a linked list
> object and the collections type uses an array implementation.

In contrast to my first tests, it now seems to me that the
L[0], PyList_GetSlice(L,1,PyList_GET_SIZE(L))
idiom is faster than deque in my applications. Recall that this was
the fastest replacement for pop(0), if the lists are not too long
(which will probably be the case in my applications).

After all, "collections" is Python, not Cython.

Nevertheless, I'd still appreciate learning a Cython replacement for
pop(0) that is fast on long lists as well.

I found this:
http://www.mail-archive.com/cytho...@codespeak.net/msg07518.html
but Sturla Molden's trick didn't work for me: Cython complained about
an incomplete type.

Best regards,
Simon

Robert Bradshaw

unread,
Oct 20, 2011, 12:59:53 AM10/20/11
to sage-s...@googlegroups.com
On Wed, Oct 19, 2011 at 12:42 PM, Simon King <simon...@uni-jena.de> wrote:
> Hi Alexander and all,
>
> First of all, sorry for posting the same message twice - Google
> Groups reported an error when I first tried to post.
>
> On 19 Okt., 19:58, Alexander Juarez <coua...@gmail.com> wrote:
>> I think I found collections.deque() to be faster than list.pop(),
>> which would make sense if the list is implemented as a linked list
>> object and the collections type uses an array implementation.
>
> In contrast to my first tests, it now seems to me that the
>  L[0], PyList_GetSlice(L,1,PyList_GET_SIZE(L))
> idiom is faster than deque in my applications. Recall that this was
> the fastest replacement for pop(0), if the lists are not too long
> (which will probably be the case in my applications).

I have no idea why L.pop(0) is not optimized, but L.pop(<int>0) is
(and the latter should be very fast). Clearly a bug in Cython, but
use the latter.

> After all, "collections" is Python, not Cython.
>
> Nevertheless, I'd still appreciate to learn a Cython replacement for
> pop(0) that is fast on long lists as well.

There isn't one as pop(0) on a list requires re-copying the entire
contents of the list. You could use numpy arrays, where slices such as
L[1:] are views. A cython-implemented linked list may perform well
here as well.
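For illustration, the linked-list idea can be sketched in plain Python
(`LinkedQueue` and `_Node` are hypothetical names, not an existing API; a
Cython version would use cdef classes for the nodes):

```python
class _Node:
    __slots__ = ("value", "next")
    def __init__(self, value):
        self.value = value
        self.next = None

class LinkedQueue:
    """Minimal singly linked queue: O(1) push at the tail, O(1) pop at the head."""
    def __init__(self):
        self.head = None
        self.tail = None
    def push(self, value):
        node = _Node(value)
        if self.tail is None:
            self.head = node   # queue was empty
        else:
            self.tail.next = node
        self.tail = node
    def pop(self):
        if self.head is None:
            raise IndexError("pop from empty queue")
        node = self.head
        self.head = node.next
        if self.head is None:
            self.tail = None   # queue became empty
        return node.value

q = LinkedQueue()
for k in range(3):
    q.push(k)
front = q.pop()  # items come out in insertion order
```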

> I found this:
>     http://www.mail-archive.com/cytho...@codespeak.net/msg07518.html
> but Sturla Molden's trick didn't work for me: Cython complained about
> an incomplete type.
>
> Best regards,
> Simon
>

Simon King

unread,
Oct 20, 2011, 2:52:17 AM10/20/11
to sage-support
Hi Robert,

On 20 Okt., 06:59, Robert Bradshaw <rober...@math.washington.edu>
wrote:
> I have no idea why L.pop(0) is not optimized, but L.pop(<int>0) is
> (and the latter should be very fast). Clearly a bug in Cython, but
> use the latter.

Surprise: Even though the annotated code for L.pop(<int>0) is less
yellow than for L.pop(0), it is *slower* by about 30%!

In flask.sagenb, I get:
{{{
%cython
def f(list L):
    L.append(1)
    return L.pop(0), L
def g(list L):
    L.append(1)
    return L.pop(<int>0), L
def h(list L):
    L.append(1)
    return L.pop(), L
}}}
{{{
L = range(1000)
}}}

timeit("f(L)") --> 625 loops, best of 3: 701 ns per loop
timeit("g(L)") --> 625 loops, best of 3: 971 ns per loop
timeit("h(L)") --> 625 loops, best of 3: 131 ns per loop

Since L.pop() (popping the last item) is so much faster than L.pop(0),
I could perhaps just reverse my lists.
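The pattern would be a sketch like this (one up-front reversal, then
only cheap pops from the end):

```python
# "Reverse once, then pop from the end" pattern:
L = list(range(5))
L.reverse()      # one O(n) pass up front ...
first = L.pop()  # ... after which each pop() is O(1)
second = L.pop()
```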

> > Nevertheless, I'd still appreciate to learn a Cython replacement for
> > pop(0) that is fast on long lists as well.
>
> There isn't one as pop(0) on a list requires re-copying the entire
> contents of the list.

Really? I'm surprised that an in-place change requires copying.

> You could use numpy arrays, where slices such as
> L[1:] are views. A cython-implemented linked list may perform well
> here as well.

I've never used numpy before, but I'll give it a try.

Thank you very much,
Simon

Simon King

unread,
Oct 20, 2011, 6:46:41 AM10/20/11
to sage-support
Hi Robert,

On 20 Okt., 06:59, Robert Bradshaw <rober...@math.washington.edu>
wrote:
> There isn't one as pop(0) on a list requires re-copying the entire
> contents of the list. You could use numpy arrays, where slices such as
> L[1:] are views. A cython-implemented linked list may perform well
> here as well.

Some tests in pure Python seem to indicate that numpy arrays are
slower than lists.
In particular, I need containment tests and pop (or slices), and the
items are types:

sage: import numpy
sage: from numpy import array
sage: A = CommutativeAlgebras(QQ); B = HopfAlgebras(QQ); C = Fields(); D = FiniteEnumeratedSets()
sage: L1=A.parent_class.mro()
sage: L2=B.parent_class.mro()
sage: L3=C.parent_class.mro()
sage: L4=D.parent_class.mro()
sage: A1 = array(L1)
sage: A2 = array(L2)
sage: A3 = array(L3)
sage: A4 = array(L4)
sage: c = Algebras(QQ).parent_class
sage: timeit("c in L2")
625 loops, best of 3: 344 ns per loop
sage: timeit("c in A2")
625 loops, best of 3: 14 µs per loop
sage: timeit("c in L4")
625 loops, best of 3: 555 ns per loop
sage: timeit("c in A4")
625 loops, best of 3: 12.9 µs per loop
sage: timeit("L4[1:]")
625 loops, best of 3: 835 ns per loop
sage: timeit("A4[1:]")
625 loops, best of 3: 1.07 µs per loop

Will the picture be different when cimporting numpy.ndarray in Cython?

Best regards,
Simon

Robert Bradshaw

unread,
Oct 20, 2011, 7:38:39 AM10/20/11
to sage-s...@googlegroups.com
On Wed, Oct 19, 2011 at 11:52 PM, Simon King <simon...@uni-jena.de> wrote:
> Hi Robert,
>
> On 20 Okt., 06:59, Robert Bradshaw <rober...@math.washington.edu>
> wrote:
>> I have no idea why L.pop(0) is not optimized, but L.pop(<int>0) is
>> (and the latter should be very fast). Clearly a bug in Cython, but
>> use the latter.
>
> Surprise: Even though the annotated code for L.pop(<int>0) is less
> yellow than for L.pop(0), it is *slower* by about 30%!
>
> In flask.sagenb, I get:
> {{{
> %cython
> def f(list L):
>    L.append(1)
>    return L.pop(0),L
> def g(list L):
>    L.append(1)
>    return L.pop(<int>0),L
> def h(list L):
>    L.append(1)
>    return L.pop(),L
> }}}
> {{{
> L = range(1000)
> }}}
>
> timeit("f(L)")  --> 625 loops, best of 3: 701 ns per loop
> timeit("g(L)") --> 625 loops, best of 3: 971 ns per loop
> timeit("h(L)") --> 625 loops, best of 3: 131 ns per loop

Very interesting--looks like the crossover point is around 200
elements. Looks like Python's native pop uses memmove rather than a
loop to shift the array, which could account for the difference.

> Since L.pop() (popping the last item) is so much faster than L.pop(0),
> I could perhaps just reverse my lists.

Inserting into the 0th element will have the same issues as popping it.

>> > Nevertheless, I'd still appreciate to learn a Cython replacement for
>> > pop(0) that is fast on long lists as well.
>>
>> There isn't one as pop(0) on a list requires re-copying the entire
>> contents of the list.
>
> Really? I'm surprised that an in-place change requires copying.

It's an array, not a linked list. The k-th item of a list occupies a
fixed location in memory relative to the head of the object, so
removing an item requires shifting everything. (Not the data, just the
pointers.)
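The shifting can be mimicked in plain Python (a sketch only;
`pop0_by_shifting` is a hypothetical helper - CPython does the same
thing in C with memmove):

```python
def pop0_by_shifting(L):
    # Emulate list.pop(0): save the first pointer, shift the remaining
    # n-1 pointers one slot to the left, then shrink the list by one.
    first = L[0]
    for i in range(1, len(L)):
        L[i - 1] = L[i]
    del L[-1]
    return first

L = [10, 20, 30]
first = pop0_by_shifting(L)
```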

>> You could use numpy arrays, where slices such as
>> L[1:] are views. A cython-implemented linked list may perform well
>> here as well.
>
> I've  never used numpy before, but I'll give it a try.
>

> 625 loops, best of 3: 1.07 µs per loop

As numpy slices are views, they are a bit more expensive but require
O(1) time, whereas slicing a list is O(n) time.

sage: from numpy import array

sage: for k in [2..5]:
...     L = range(10^k)
...     N = array(L)
...     timeit("L[1:]")
...     timeit("N[1:]")
625 loops, best of 3: 656 ns per loop
625 loops, best of 3: 576 ns per loop
625 loops, best of 3: 3.42 µs per loop
625 loops, best of 3: 627 ns per loop
625 loops, best of 3: 45.5 µs per loop
625 loops, best of 3: 578 ns per loop
625 loops, best of 3: 959 µs per loop
625 loops, best of 3: 630 ns per loop

Looks like your arrays are pretty small though.

> Will the picture be different when cimporting numpy.ndarray in Cython?

Likely not. The primary benefits (100x or more speedups) are when
indexing and using native C types. Why not write your own simple
queue?

cdef class Queue:
    cdef list elements
    cdef int start
    cdef int end
    cdef int capacity
    def __init__(self, initial_capacity=5):
        self.elements = [None] * initial_capacity
        self.start = 0
        self.end = 0
        self.capacity = initial_capacity
    cpdef push(self, o):
        self.elements[self.end] = o
        self.end = (self.end + 1) % self.capacity
        if self.end == self.start:
            self.elements = self.elements[self.start:] + \
                self.elements[:self.end] + [None] * self.capacity
            self.start = 0
            self.end = self.capacity
            self.capacity *= 2
    cpdef pop(self):
        if self.start == self.end:
            raise ValueError, "Empty queue"
        try:
            return self.elements[self.start]
        finally:
            self.elements[self.start] = None
            self.start = (self.start + 1) % self.capacity

def f(Queue q):
    q.push(1)
    return q.pop(), q

q = Queue()
for k in range(1000):
    q.push(k)
timeit("f(q)")

625 loops, best of 3: 147 ns per loop


- Robert

Simon King

unread,
Oct 20, 2011, 7:55:08 AM10/20/11
to sage-support
Hi Robert,

On 20 Okt., 13:38, Robert Bradshaw <rober...@math.washington.edu>
wrote:
> > Since L.pop() (popping the last item) is so much faster than L.pop(0),
> > I could perhaps just reverse my lists.
>
> Inserting into the 0th element will have the same issues as popping it.

Yes. But the point is that I am popping from (several) input lists,
while I am only appending to the list that will eventually be
returned. I can reverse the input lists (profiting from fast L.pop())
and don't need to reverse the output list.

> It's an array, not a linked list. The k-th item of a list occupies a
> fixed location in memory relative to the head of the object, so
> removing an item requires shifting everything...

... except when removing the last item (which explains why L.pop() is
fast), OR THE FIRST. When removing the first item, you don't need to
shift the other items towards the head; you could instead move the
head towards the other items.
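This "move the head" idea can be sketched in plain Python (`HeadList`
is an illustrative wrapper, not something CPython's list exposes):

```python
class HeadList:
    """'Pop' the front by advancing a head index; items are never moved.
    Trade-off: the underlying list does not shrink until discarded."""
    def __init__(self, items):
        self.items = list(items)
        self.head = 0
    def pop_front(self):
        if self.head >= len(self.items):
            raise IndexError("pop from empty HeadList")
        value = self.items[self.head]
        self.items[self.head] = None  # drop the reference
        self.head += 1
        return value

h = HeadList([7, 8, 9])
front = h.pop_front()  # no shifting happened
```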

> Likely not. The primary benefits (100x or more speedups) are when
> indexing and using native C types. Why not write your own simple
> queue?

Would be worth a try. Thank you for the code!

Best regards,
Simon

Nils Bruin

unread,
Oct 20, 2011, 12:45:16 PM10/20/11
to sage-support
On Oct 19, 12:42 pm, Simon King <simon.k...@uni-jena.de> wrote:
> In contrast to my first tests, it now seems to me that the
>   L[0], PyList_GetSlice(L,1,PyList_GET_SIZE(L))
> idiom is faster than deque in my applications. Recall that this was
> the fastest replacement for pop(0), if the lists are not too long
> (which will probably be the case in my applications).
>
> After all, "collections" is Python, not Cython.

It sounds like you are in a sufficiently low range that a worse-order
algorithm with lower overhead wins. However, in other situations:
"collections" is Python, but the actual implementations are in
_collections, which is an extension module. By writing the right
header files, you should be able to call the routines in there with
the same low overhead as PyList_* etc. (provided those are not
macros). It's conceivable you could still write better code for
special situations, but otherwise I'd expect that interfacing with
_collections.deque directly should provide you with a good optimized
implementation (either via a doubly linked list or a circular buffer;
I don't know which they use).

Access to lists is always going to be a little faster for the same
reason why "pop" has worse complexity: one less level of indirection.

Robert Bradshaw

unread,
Oct 20, 2011, 6:02:37 PM10/20/11
to sage-s...@googlegroups.com
On Thu, Oct 20, 2011 at 4:55 AM, Simon King <simon...@uni-jena.de> wrote:
> Hi Robert,
>
> On 20 Okt., 13:38, Robert Bradshaw <rober...@math.washington.edu>
> wrote:
>> > Since L.pop() (popping the last item) is so much faster than L.pop(0),
>> > I could perhaps just reverse my lists.
>>
>> Inserting into the 0th element will have the same issues as popping it.
>
> Yes. But the point is that I am popping from (several) input lists,
> while I am only appending to the list that will eventually be
> returned. I can reverse the input lists (profiting from fast L.pop())
> and don't need to reverse the output list.

Ah, in that case it would make total sense. Your example feels a lot
like the category pushout stuff :).

>> It's an array, not a linked list. The k-th item of a list occupies a
>> fixed location in memory relative to the head of the object, so
>> removing an item requires shifting everything...
>
> ... except when removing the last item (which explains why L.pop() is
> fast), OR THE FIRST. When removing the first item, you don't need to
> shift the other items towards the head; you could instead move the
> head towards the other items.

That depends on how easily you can shift the head; it's an
implementation detail in Python (and most other languages that I'm
aware of) that you can't. Of course C lets you do this, but that's
because it doesn't really have arrays, just pointers (and you still
usually have to keep the "original" head around to free it). Of course
being able to shift the head as well as the tail is what I
implemented.

>> Likely not. The primary benefits (100x or more speedups) are when
>> indexing and using native C types. Why not write your own simple
>> queue?
>
> Would be worth a try. Thank you for the code!
>
> Best regards,
> Simon
>

Simon King

unread,
Oct 21, 2011, 1:44:55 AM10/21/11
to sage-support
Hi Robert,

On 21 Okt., 00:02, Robert Bradshaw <rober...@math.washington.edu>
wrote:
> > Yes. But the point is that I am popping from (several) input lists,
> > while I am only appending to the list that will eventually be
> > returned. I can reverse the input lists (profiting from fast L.pop())
> > and don't need to reverse the output list.
>
> Ah, in that case it would make total sense. Your example feels a lot
> like the category pushout stuff :).

It is eventually supposed to be in sage.categories, but it is about
all_super_categories, not about pushout.

The idea is: If C is a category then C.parent_class.mro() should be
exactly parallel to C.all_super_categories() - that's a matter of
consistency. But C.all_super_categories() is currently not using the
underlying algorithm of mro (which is called the C3 algorithm, if I
am not mistaken).

I would love to simply import the function used by Python to
determine the __mro__ of a type, but apparently it can't just be
imported, and it is restricted to lists of types (while I'd use it on
lists of categories).

So, I'm implementing C3 myself. Presumably it will only be applied to
small examples (say, the input is at most 4 lists of size at most 60). But
in that range, it should be as fast as possible: The stuff in elliptic
curves tends to create polynomial rings over thousands of different
finite fields. Polynomial rings involve the creation of a category of
algebras over their base ring (by #9138); #11900 and #11935 cope with
the resulting speed regression.

Thus, each example of C.all_super_categories() is small, but a single
elliptic curve computation could easily involve thousands of examples.
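For reference, the C3 merge Simon describes can be sketched in a few
lines of plain Python (`c3_merge` is a hypothetical name and this is
the textbook algorithm, not Sage's actual implementation): repeatedly
take the first head that does not occur in the tail of any remaining
list.

```python
def c3_merge(seqs):
    """Merge linearizations via the C3 algorithm (as used for Python's MRO)."""
    seqs = [list(s) for s in seqs if s]
    result = []
    while seqs:
        # Find a head that is in no tail (a valid next element).
        for seq in seqs:
            head = seq[0]
            if not any(head in s[1:] for s in seqs):
                break
        else:
            raise ValueError("inconsistent hierarchy: no C3 linearization")
        result.append(head)
        # Remove the chosen head from all lists and drop emptied lists.
        seqs = [[x for x in s if x != head] for s in seqs]
        seqs = [s for s in seqs if s]
    return result

# Example: diamond O <- A, O <- B, and C inheriting from (A, B):
linearization = ['C'] + c3_merge([['A', 'O'], ['B', 'O'], ['A', 'B']])
# linearization == ['C', 'A', 'B', 'O'], matching Python's class MRO
```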

Cheers,
Simon

Nils Bruin

unread,
Oct 21, 2011, 3:56:42 AM10/21/11
to sage-support
On Oct 20, 10:44 pm, Simon King <simon.k...@uni-jena.de> wrote:

> Thus, each example of C.all_super_categories() is small, but a single
> elliptic curve computation could easily involve thousands of examples.

Have you checked that the elliptic curve computations *use*
all_super_categories? I understand that it's there for generic
reasons, but if actual code shows that people want to be able to do a
little bit of work with polynomials over thousands of different finite
fields, perhaps there should be a lighter-weight option available?
Reducing overhead is a lofty goal, but it can't beat removing it (or
not introducing it in the first place). Can this attribute be lazy?

The whole paradigm of "parents are heavy, elements are light" is only
effective if people create relatively few parents and then do a lot
with elements of those. It fits with many computer algebra
computations, but if in your observation the creation of the parents
is a significant part of the runtime for some elliptic curve
computations then perhaps a model with different trade-offs would be
more appropriate.

Simon King

unread,
Oct 21, 2011, 5:34:34 AM10/21/11
to sage-support
Hi Nils,

On 21 Okt., 09:56, Nils Bruin <nbr...@sfu.ca> wrote:
> Have you checked that the elliptic curve computations *use*
> all_super_categories?
> ...
> Reducing overhead is a lofty goal, but it can't beat removing it (or
> not introducing it in the first place). Can this attribute be lazy?

Sorry, I have to name a few technical details.
In short: with #9138, they *do* use all_super_categories
implicitly, but with #11900 they don't.

Namely, by #9138, GF(3)['x'] is initialised as an object in "Join of
Category of euclidean domains and Category of commutative algebras
over Finite Field of size 3". Hence, the join category must be created
and its parent_class determined.

Computing a join of two categories using Category.join([...]) involves
testing whether one category is a sub-category of another. And that
involves calling all_super_categories.

Part of #11900 is to avoid these implications: One knows the result of
the join, hence, one can directly construct a JoinCategory, which does
*not* involve the subcategory tests that are part of
Category.join([...]).

With #9138 plus #11900, GF(3)['x'] is still initialised as an object
in "Join of Category of euclidean domains and Category of commutative
algebras over Finite Field of size 3", but the overhead becomes so
small that it is hardly noticeable in the elliptic curve computations.

Concerning laziness: Another part of #11900 is to postpone the
category initialisation of matrix algebras. MatrixSpace(GF(p),n) is an
algebra, and by #9138 it would be initialised as such. However, the
elliptic curve code (and I guess most other code as well) uses the
matrix space just as a container, so that the category initialisation
is totally useless. The solution proposed at #11900 is to *not*
initialise the category during creation of a matrix space, but to add
a method to initialise the category later.

And if, in addition, one makes the parent_class of a category
independent of the category's base ring, which is taken care of at
#11935, then the elliptic curve computations actually become *faster*
than they were prior to #9138.

Hence, with #9138+#11900+#11935, one has category initialisation and
(at least for the tests we considered) a speed-up!

Back to the subject of this thread: Even if the use of
all_super_categories can often be avoided, I think it is still a good
idea to make it as fast as possible, in the range I mentioned: Using
the C3 algorithm on few lists of length not more than 60.

Best regards,
Simon