Large C Arrays to Python lists


Eric Frederich

Feb 21, 2012, 3:11:42 PM2/21/12
to cython...@googlegroups.com
Is there an efficient way to convert a large C array to a Python list?
Right now my bindings use list comprehension.
I know how long the array is.

    py_list = [a[i] for i in xrange(n)]

This must be doing a bunch of memory allocations, right?
Hopefully it's not doing n of them.

I tried using PyList_New and PyList_SetItem but got back garbage.
Is there something wrong with my cython code here?...

# === Cython Code ===

from cpython cimport PyObject, PyList_New, PyList_SetItem
def hello_n_times(int n):
    o = PyList_New(n)
    for i in xrange(n):
        s = 'hi there %d' % i
        print s
        PyList_SetItem(o, i, s)
    return o

# === resulting output ===

hi there 0
hi there 1
hi there 2
hi there 3
hi there 4
['Py_Repr', 'stdout', 'Py_Repr', 'stdout', 'Py_Repr']


Thanks,
~Eric

Robert Bradshaw

Feb 21, 2012, 3:24:47 PM2/21/12
to cython...@googlegroups.com
On Tue, Feb 21, 2012 at 12:11 PM, Eric Frederich
<eric.fr...@gmail.com> wrote:
> Is there an efficient way to convert a large C array to a Python list?
> Right now my bindings use list comprehension.
> I know how long the array is.
>
>     py_list = [a[i] for i in xrange(n)]
>
> This must be doing a bunch of memory allocations right?
> Hopefully its not doing (n) of them.

This uses PyList_Append under the hood, which has amortized O(1)
complexity (you can think of it as doubling the size of the list
every time it runs out of room, though the actual formula is slightly
more complicated than that). Usually this is fast enough.

> I tried using PyList_New and PyList_SetItem but got back garbage.
> Is there something wrong with my cython code here?...
>
> # === Cython Code ===
>
> from cpython cimport PyObject, PyList_New, PyList_SetItem
> def hello_n_times(int n):
>     o = PyList_New(n)
>     for i in xrange(n):
>         s = 'hi there %d' % i
>         print s
>         PyList_SetItem(o, i, s)
>     return o
>
> # === resulting output ===
>
> hi there 0
> hi there 1
> hi there 2
> hi there 3
> hi there 4
> ['Py_Repr', 'stdout', 'Py_Repr', 'stdout', 'Py_Repr']

You need to manually incref s. However, unless you need this speed
(profile it), the list comprehension should be nearly as fast and much
cleaner. I've often considered doing this optimization when the size
of the resulting list is known ahead of time (though on this note
PyList_Append will shrink a list that is too large, so just
initializing the list to a larger capacity is insufficient).
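Applied to the snippet above, Robert's fix is a one-line Py_INCREF before each PyList_SetItem. A sketch (not the original poster's final code):

```cython
# PyList_SetItem *steals* a reference to its argument; Cython's own
# temporary owns one reference that the loop then drops, so without a
# manual incref the list ends up holding freed objects (hence the garbage).
from cpython cimport Py_INCREF, PyList_New, PyList_SetItem

def hello_n_times(int n):
    o = PyList_New(n)
    for i in xrange(n):
        s = 'hi there %d' % i
        Py_INCREF(s)            # compensate for the stolen reference
        PyList_SetItem(o, i, s)
    return o
```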

- Robert

Eric Frederich

Feb 21, 2012, 4:21:45 PM2/21/12
to cython...@googlegroups.com
Thanks for your quick reply.
I definitely don't want to optimize if I don't need to.  I probably should have tested first.
My bindings connect to a server which sometimes returns arrays of hundreds of thousands to millions of unsigned ints.

O(1) didn't sound right to me.
I remember being taught in CS classes to double the size (or at least grow it based on the current size) so that you get at most log2(n) allocations.
I looked at Objects/listobject.c in Python's source, and indeed the formula is complicated / weird.
It basically increases the size by about an eighth of the current size.
The source doesn't claim O(1); it claims linear behavior, which would be O(n).

In any case it isn't doing anything boneheaded so I guess I should just leave my list comprehension alone.

Thanks again,
~Eric

Robert Bradshaw

Feb 21, 2012, 5:32:16 PM2/21/12
to cython...@googlegroups.com
On Tue, Feb 21, 2012 at 1:21 PM, Eric Frederich
<eric.fr...@gmail.com> wrote:
> Thanks for your quick reply.
> I definitely don't want to optimize if I don't need to.  I probably should
> have tested first.
> My bindings connect to a server which sometimes returns arrays hundreds of
> thousands to millions of unsigned ints.
>
> O(1) didn't sound right to me.
> I remember being taught in CS classes to double the size (or at least do
> something based on the size) and you'll get max log2(n) allocations.

Yep.

> I looked at the Objects/listobject.c file in Python's source and indeed the
> formula is complicated / weird.
> It basically increases the size by an eighth of the original size.
> The source doesn't claim O(1) it claims to have linear behavior which would
> be O(n).

It's the amortized cost that's O(1). Though an individual insert might
be O(n), these are exponentially rare, so the average insert is O(n) /
n = O(1). (The actual cost is c for some constant c that depends on the
resizing parameter (as well as the expense of allocation, copying, how
likely it is that realloc doesn't require a copy, etc.) It's a balance
between reallocation frequency and memory waste.)
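The growth rule being discussed can be sketched in a few lines of (modern) Python. The formula below mirrors the over-allocation step in Objects/listobject.c, and the counts show why the amortized cost per append is constant:

```python
# Simulate CPython's list over-allocation: on overflow the new capacity is
# newsize + newsize/8 + (3 if newsize < 9 else 6). Individual appends that
# trigger a realloc copy the whole list, but they become exponentially rare.
def count_reallocs(n):
    capacity = 0
    reallocs = 0
    copies = 0
    for size in range(1, n + 1):        # append n items one at a time
        if size > capacity:
            capacity = size + (size >> 3) + (3 if size < 9 else 6)
            reallocs += 1
            copies += size - 1          # existing elements moved to the new block
    return reallocs, copies

# For a million appends, reallocations number on the order of a hundred
# (not a million), and total copying stays linear in n: constant amortized
# cost per append.
reallocs, copies = count_reallocs(10**6)
```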

> In any case it isn't doing anything boneheaded so I guess I should just
> leave my list comprehension alone.

Yep, that's what I'd do. You might also want to look into using NumPy
or memory views to provide Python-level access to the underlying data
(avoiding creating a Python object for each element in your list.)

Henry Gomersall

Feb 21, 2012, 7:06:05 PM2/21/12
to cython...@googlegroups.com
On Tue, 2012-02-21 at 14:32 -0800, Robert Bradshaw wrote:
>
> Yep, that's what I'd do. You might also want to look into using NumPy
> or memory views to provide Python-level access to the underlying data
> (avoiding creating a Python object for each element in your list.)

yes, do this!

I was discussing only this morning how it wasn't immediately obvious to
me that Numpy is fantastic for a big set of things beyond numerical
computation. Basically, anywhere you need to deal with a block of
memory in Python in a clean, Pythonic way, Numpy's your friend. It's
just wonderful, and I really can't stress that enough!

cheers,

hen

Eric Frederich

Feb 22, 2012, 10:47:00 AM2/22/12
to cython...@googlegroups.com
Do I need to install cython bindings for NumPy?
How else would numpy be able to get a C pointer to my array?
If it is simple could you provide an example?
Most of the arrays returned from the library are unsigned int arrays.

Thanks,
~Eric

Henry Gomersall

Feb 22, 2012, 11:06:29 AM2/22/12
to cython...@googlegroups.com
On Wed, 2012-02-22 at 10:47 -0500, Eric Frederich wrote:
> Do I need to install cython bindings for NumPy?
> How else would numpy be able to get a C pointer to my array?
> If it is simple could you provide an example?
> Most of the arrays returned from the library are unsigned int arrays.

How is your array being filled? Do you invoke the data getter (written
in C) from within Numpy?

The easiest way to do this is to call your C code with an array that is
already being managed by Numpy. So you create the Numpy array, then you
call your C code, passing it a pointer to the Numpy array, which it then
fills.

Another option is to do a memcpy from the C array into a preallocated
Numpy array within Cython. This obviously has the overhead of a copy
(albeit a very fast, low-level copy).
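That second option can be sketched in plain Python, with a ctypes array standing in for the library-allocated C buffer (the buffer contents and sizes here are made up for illustration, not from Eric's bindings):

```python
import ctypes
import numpy as np

# Stand-in for a C-allocated array of unsigned ints; in real bindings this
# memory would come from the third-party library and would be released
# afterwards with the library's own free function.
n = 5
c_buf = (ctypes.c_uint * n)(*range(10, 10 + n))

# Preallocate a NumPy array of matching dtype and do one low-level copy.
arr = np.empty(n, dtype=np.uintc)
ctypes.memmove(arr.ctypes.data, ctypes.addressof(c_buf), arr.nbytes)
# arr now owns an independent copy, so the C buffer can be freed safely.
```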

Finally, if the memory is necessarily being allocated within the C code
and you don't want to do a copy, then you're outside my experience, but
I understand it's possible. https://gist.github.com/1253698 seems to be
a neat example (though I defer to someone more knowledgeable on this
one).

If you want an example for either of the first two options, I can do
that.

cheers,

hen

Eric Frederich

Feb 22, 2012, 11:47:32 AM2/22/12
to cython...@googlegroups.com
Henry,

Great reply... all options seem feasible.
My library that I'm creating bindings for allocates the array and expects me to free it.
So what I have been doing in Cython is calling the C function, using list comprehension to convert the result into a Python list, and then freeing the C array.
With NumPy I would like to use either your option 1 or 3 and never explicitly free the returned array but let garbage collection do its thing.

I'll have to experiment, but I may have to go with your 2nd option since this 3rd party library tells me to use their own MEM_free function.
At least with NumPy, instead of list comprehension, I will 1) do at most one extra memory allocation, and 2) not create a Python object for every element of the array.

I was already in the process of implementing the last example on github (almost verbatim).
I got stuck because I missed the line...
    np.PyArray_UpdateFlags(ndarray, ndarray.flags.num | np.NPY_OWNDATA)

Not having that line had weird results.
I could run from the command line...
    python -c "from ericbindings import *; print gimmie_an_array(5)"
... and it seemed to work and printed my array.
But then when I went to use it interactively I got a segfault.

Adding that UpdateFlags call helped.
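For reference, the pattern in that gist boils down to something like the following sketch (function and variable names are illustrative; note that NPY_OWNDATA makes NumPy free the buffer with its own deallocator, which is why a library-specific MEM_free still needs a different approach):

```cython
# Sketch: hand NumPy ownership of a malloc'ed buffer without copying.
import numpy as np
cimport numpy as np
from libc.stdlib cimport malloc

np.import_array()

def gimmie_an_array(int n):
    cdef unsigned int *data = <unsigned int *> malloc(n * sizeof(unsigned int))
    cdef np.npy_intp shape[1]
    shape[0] = n
    for i in range(n):
        data[i] = i
    ndarray = np.PyArray_SimpleNewFromData(1, shape, np.NPY_UINT, <void *> data)
    # Without OWNDATA the array merely borrows the pointer and nothing frees
    # it; with it, NumPy frees the data when the array is garbage collected.
    np.PyArray_UpdateFlags(ndarray, ndarray.flags.num | np.NPY_OWNDATA)
    return ndarray
```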

I think I'm all set now.

Thanks all,
~Eric

mark florisson

Feb 22, 2012, 11:54:54 AM2/22/12
to cython...@googlegroups.com

You could also go through the buffer interface using memoryviews:
https://sage.math.washington.edu:8091/hudson/job/cython-docs/doclinks/1/src/userguide

Henry Gomersall

Feb 22, 2012, 11:55:38 AM2/22/12
to cython...@googlegroups.com
On Wed, 2012-02-22 at 11:47 -0500, Eric Frederich wrote:
> My library that I'm creating bindings for allocates the array and
> expects me to free it.
> So what I have been doing in Cython is calling the C function, using
> list comprehension to convert it into a Python list and then free'ing
> the C array.
> With NumPy I would like to use either your option 1 or 3 and never
> explicitly free the returned array but let garbage collection do its
> thing.
>
Yeah, it sounds like the best option is letting Numpy take control of
the memory, as described in that example. But, as I said, I have little
experience of that way of working, so please be wary of my advice!

>
> Adding that UpdateFlags call helped.
>
> I think I'm all set now.

Great!

hen

Henry Gomersall

Feb 22, 2012, 11:58:25 AM2/22/12
to cython...@googlegroups.com
On Wed, 2012-02-22 at 16:54 +0000, mark florisson wrote:
> You could also go through the buffer interface using memoryviews:
> https://sage.math.washington.edu:8091/hudson/job/cython-docs/doclinks/1/src/userguide

Neat!

Eric Frederich

Feb 22, 2012, 12:22:47 PM2/22/12
to cython...@googlegroups.com
I did some timing tests and found the NumPy solution to be about 10 times faster than list comprehension.
I'm pretty sure this has to do not with the number of allocations but with the number of Python objects being created.
Here gimmie_an_array_slow uses list comprehension and gimmie_an_array uses NumPy.

Thanks for all the help.
After I look at buffer / memory views, I will integrate one of these solutions into my bindings replacing the current list comprehension.

numpy [~/dev/cython_test] > for i in {1..5}; do printf "from ericbindings import *\nfor i in xrange(2000):\n    x = gimmie_an_array_slow(10000)" | /usr/bin/time -f "%E" python; done
0:01.16
0:01.12
0:01.14
0:01.11
0:01.14
numpy [~/dev/cython_test] > for i in {1..5}; do printf "from ericbindings import *\nfor i in xrange(2000):\n    x = gimmie_an_array(10000)" | /usr/bin/time -f "%E" python; done
0:00.16
0:00.16
0:00.16
0:00.17
0:00.17
numpy [~/dev/cython_test] > for i in {1..5}; do printf "from ericbindings import *\nfor i in xrange(2000):\n    x = gimmie_an_array_slow(20000)" | /usr/bin/time -f "%E" python; done
0:02.16
0:02.22
0:02.15
0:02.20
0:02.14
numpy [~/dev/cython_test] > for i in {1..5}; do printf "from ericbindings import *\nfor i in xrange(2000):\n    x = gimmie_an_array(20000)" | /usr/bin/time -f "%E" python; done
0:00.26
0:00.25
0:00.26
0:00.25
0:00.25
numpy [~/dev/cython_test] > for i in {1..5}; do printf "from ericbindings import *\nfor i in xrange(4000):\n    x = gimmie_an_array_slow(20000)" | /usr/bin/time -f "%E" python; done
0:04.23
0:04.33
0:04.32
0:04.29
0:04.29
numpy [~/dev/cython_test] > for i in {1..5}; do printf "from ericbindings import *\nfor i in xrange(4000):\n    x = gimmie_an_array(20000)" | /usr/bin/time -f "%E" python; done
0:00.43
0:00.43
0:00.44
0:00.43
0:00.43

Eric Frederich

Feb 22, 2012, 12:44:08 PM2/22/12
to cython...@googlegroups.com
I have 0.15.1 and it doesn't seem to have these memoryviews.
What is the deal with them?  Are they coming in a future release?

mark florisson

Feb 22, 2012, 12:59:33 PM2/22/12
to cython...@googlegroups.com
On 22 February 2012 17:44, Eric Frederich <eric.fr...@gmail.com> wrote:
> I have 0.15.1 and it doesn't seem to have these memoryviews.
> What is the deal with them?  Are they coming in a future release?
>

Yes, you can clone Cython from github or wait for the
soon-to-be-released Cython 0.16.

Eric Frederich

Feb 23, 2012, 9:52:25 AM2/23/12
to cython...@googlegroups.com
Mark,

Thanks a bunch; cython.array and memoryviews look like the ticket.

I don't think this 3rd party library I'm using is doing anything crazy in their implementation of MEM_free.
Even though I'd probably be fine just using NumPy, cython.array keeps me compliant because I can override the callback for freeing.
Does NumPy provide a way to override the free function?
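The cython.array route with a custom deallocator might look roughly like this (a sketch against the 0.16-era API; the header name "thelib.h" and the MEM_free signature are assumptions standing in for the third-party library):

```cython
from cython cimport view

# Hypothetical extern block standing in for the library's actual header.
cdef extern from "thelib.h":
    void MEM_free(void *ptr)

cdef object wrap_uint_buffer(unsigned int *data, Py_ssize_t n):
    # Build a cython.view.array around the existing buffer without
    # allocating, then override the deallocation callback so the
    # library's MEM_free (not libc free) runs when the array dies.
    cdef view.array arr = view.array(
        shape=(n,), itemsize=sizeof(unsigned int), format="I",
        mode="c", allocate_buffer=False)
    arr.data = <char *> data
    arr.callback_free_data = MEM_free
    return arr
```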

Also, this adds no additional dependencies since my bindings use Cython anyway.

Looking forward to 0.16

Thanks everyone,
~Eric