Cython array creation is horribly slow


Ian Bell

unread,
Nov 10, 2012, 1:21:59 PM11/10/12
to cython...@googlegroups.com
I have been playing with memory views and cython arrays.  I am not sure what exactly is going wrong, but it seems like there is something not right about how cython allocates memory for cython arrays (as in http://docs.cython.org/src/userguide/memoryviews.html).  Creating memory views is also horribly slow.  Or maybe I have done something braindead in my code, but I don't think so.  The docs for cython arrays are pretty much non-existent.  Where are they in the source tree?  I couldn't find the cython.view module anywhere.

I ran a quick check to see what is going on with array creation in cython.  I have attached the pyx and the setup file used to generate this data, and copied them at the end of the email too.  FWIW, I'm on Windows 7, Python 2.7, Visual Studio 2008, with a fresh cython from GIT head.  The array creation time for cython arrays seems to be constant, independent of the length of the array?  To be fair, the numpy example isn't exactly the same, but still, 10 times slower in cython than pure c is not too good.  2 us per array creation for a 10-element array is very slow, a speed that is just not cutting it for my severely speed-constrained application.  In fact, what I was trying to do was extend the code that was started in https://groups.google.com/forum/?fromgroups=#!topic/cython-users/PpIokkZVOrA to use the native cython arrays, in order to avoid the hacking in the array example provided there.

My idea is to write my own 1-D floating-point array extension type from the ground up, with a focus on speed for the element-wise operations.  I'll still provide the same interface by implementing most or all of the special methods.

Incidentally, it does not seem to be possible to inherit from the array.array class in 0.17.1.

Here is the summary (averaged elapsed time per call for one million calls):
Elapsed time to make array with length of 1 is 2.40862075722 us
Elapsed time to make array with length of 10 is 2.36220109038 us
Elapsed time to make array with length of 100 is 2.37456325685 us
Elapsed time to make array with length of 1000 is 2.37144104029 us
Elapsed time to make array with length of 10000 is 2.72447256951 us
Elapsed time to make numpy array with length of 1 is 0.806911490145 us
Elapsed time to make numpy array with length of 10 is 0.869661726899 us
Elapsed time to make numpy array with length of 100 is 0.926880930014 us
Elapsed time to make numpy array with length of 1000 is 1.52742669012 us
Elapsed time to make numpy array with length of 10000 is 7.3662581456 us
Elapsed time with raw c-allocation with length of 1 is 0.195063960543 us
Elapsed time with raw c-allocation with length of 10 is 0.196661594872 us
Elapsed time with raw c-allocation with length of 100 is 0.204236565547 us
Elapsed time with raw c-allocation with length of 1000 is 0.211111834023 us
Elapsed time with raw c-allocation with length of 10000 is 0.363121851293 us

mv_test.pyx:
------------------------------------------------
from cython.view cimport array as cvarray
import time

import numpy as np
cimport numpy as np

from libc.stdlib cimport malloc, free

cdef long N = 1000000

cdef double* ptr

for L in [1,10,100,1000,10000]:
    t1 = time.clock()
    for i in range(N):
        a = cvarray((L,),sizeof(double),'d')
    t2 = time.clock()
    print 'Elapsed time to make array with length of ' +str(L)+' is '+str((t2-t1)/N*1e6)+' us'
   
for L in [1,10,100,1000,10000]:
    t1 = time.clock()
    for i in range(N):
        a = np.arange(L)
    t2 = time.clock()
    print 'Elapsed time to make numpy array with length of ' +str(L)+' is '+str((t2-t1)/N*1e6)+' us'
   
for L in [1,10,100,1000,10000]:
    t1 = time.clock()
    for i in range(N):
        ptr = <double*> malloc(sizeof(double) * L)
        free(ptr)
    t2 = time.clock()
    print 'Elapsed time with raw c-allocation with length of ' +str(L)+' is '+str((t2-t1)/N*1e6)+' us'


setup.py
---------------------------
from distutils.core import setup
from Cython.Build import cythonize
import numpy

import Cython

#This will generate HTML to show where there are still pythonic bits hiding out
Cython.Compiler.Options.annotate = True

setup(
    name = "My hello app",
    ext_modules = cythonize('mv_test.pyx'), # accepts a glob pattern
    include_dirs = [numpy.get_include()]
)

Ian Bell

unread,
Nov 10, 2012, 2:08:51 PM11/10/12
to cython...@googlegroups.com
FWIW, same behavior on Ubuntu as well.

Robert Bradshaw

unread,
Nov 10, 2012, 4:59:17 PM11/10/12
to cython...@googlegroups.com
Certainly creating an array is doing a lot more than a malloc, so if
all you want is a 1-dimensional contiguous non-Python-visible list of
c values, you're not going to beat that. I'm really not sure why
there's that much overhead though, I'd hoped it would be better for
this case.

Ian Bell

unread,
Nov 10, 2012, 5:32:39 PM11/10/12
to cython...@googlegroups.com
On Sat, Nov 10, 2012 at 10:59 PM, Robert Bradshaw <robe...@gmail.com> wrote:
Certainly creating an array is doing a lot more than a malloc, so if
all you want is a 1-dimensional contiguous non-Python-visible list of
c values, you're not going to beat that. I'm really not sure why
there's that much overhead though, I'd hoped it would be better for
this case.

Undoubtedly there is some stuff happening behind the scenes, but 10 times the allocation time of a malloc? And still 3 times slower than a numpy array, which isn't exactly the fastest for allocation..

I could live with malloc but the manual memory management is not very fun.  If I implement __dealloc__ could I just do the free() there?  Would that work? 

 

Robert Bradshaw

unread,
Nov 10, 2012, 5:46:56 PM11/10/12
to cython...@googlegroups.com
On Sat, Nov 10, 2012 at 2:32 PM, Ian Bell <ian.h...@gmail.com> wrote:
>
>
> On Sat, Nov 10, 2012 at 10:59 PM, Robert Bradshaw <robe...@gmail.com>
> wrote:
>>
>> Certainly creating an array is doing a lot more than a malloc, so if
>> all you want is a 1-dimensional contiguous non-Python-visible list of
>> c values, you're not going to beat that. I'm really not sure why
>> there's that much overhead though, I'd hoped it would be better for
>> this case.
>
> Undoubtedly there is some stuff happening behind the scenes, but 10 times
> the allocation time of a malloc? And still 3 times slower than a numpy
> array, which isn't exactly the fastest for allocation..

Yes, I agree it's pretty dismal. We should be able to at least match numpy.

> I could live with malloc but the manual memory management is not very fun.
> If I implement __dealloc__ could I just do the free() there? Would that
> work?

Yes, if the lifetime of the memory can be tied to the lifetime of an
object, this works well. (I do that a lot.)
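
For reference, the pattern looks roughly like this (a minimal sketch;
the class name and constructor are just for illustration):

from libc.stdlib cimport malloc, free

cdef class DoubleBuffer:
    # Owns a raw C buffer; the memory lives exactly as long as the object.
    cdef double* data
    cdef Py_ssize_t length

    def __cinit__(self, Py_ssize_t length):
        self.length = length
        self.data = <double*> malloc(length * sizeof(double))
        if self.data == NULL:
            raise MemoryError()

    def __dealloc__(self):
        # free(NULL) is a no-op, so this is safe even if __cinit__ failed.
        free(self.data)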

Robert Bradshaw

unread,
Nov 10, 2012, 5:52:54 PM11/10/12
to cython...@googlegroups.com
On Sat, Nov 10, 2012 at 2:46 PM, Robert Bradshaw <robe...@gmail.com> wrote:
> On Sat, Nov 10, 2012 at 2:32 PM, Ian Bell <ian.h...@gmail.com> wrote:
>>
>>
>> On Sat, Nov 10, 2012 at 10:59 PM, Robert Bradshaw <robe...@gmail.com>
>> wrote:
>>>
>>> Certainly creating an array is doing a lot more than a malloc, so if
>>> all you want is a 1-dimensional contiguous non-Python-visible list of
>>> c values, you're not going to beat that. I'm really not sure why
>>> there's that much overhead though, I'd hoped it would be better for
>>> this case.
>>
>> Undoubtedly there is some stuff happening behind the scenes, but 10 times
>> the allocation time of a malloc? And still 3 times slower than a numpy
>> array, which isn't exactly the fastest for allocation..
>
> Yes, I agree it's pretty dismal. We should be able to at least match numpy.

http://trac.cython.org/cython_trac/ticket/794

If you really care about speed, for the time being you're probably not
going to beat a custom extension class with a raw pointer in it. Of
course this is assuming you need to allocate and delete a lot of
arrays quickly.

Ian Bell

unread,
Nov 10, 2012, 6:06:49 PM11/10/12
to cython...@googlegroups.com
On Sat, Nov 10, 2012 at 11:52 PM, Robert Bradshaw <robe...@gmail.com> wrote:
On Sat, Nov 10, 2012 at 2:46 PM, Robert Bradshaw <robe...@gmail.com> wrote:
> On Sat, Nov 10, 2012 at 2:32 PM, Ian Bell <ian.h...@gmail.com> wrote:
>>
>>
>> On Sat, Nov 10, 2012 at 10:59 PM, Robert Bradshaw <robe...@gmail.com>
>> wrote:
>>>
>>> Certainly creating an array is doing a lot more than a malloc, so if
>>> all you want is a 1-dimensional contiguous non-Python-visible list of
>>> c values, you're not going to beat that. I'm really not sure why
>>> there's that much overhead though, I'd hoped it would be better for
>>> this case.
>>
>> Undoubtedly there is some stuff happening behind the scenes, but 10 times
>> the allocation time of a malloc? And still 3 times slower than a numpy
>> array, which isn't exactly the fastest for allocation..
>
> Yes, I agree it's pretty dismal. We should be able to at least match numpy.

http://trac.cython.org/cython_trac/ticket/794

If you really care about speed, for the time being you're probably not
going to beat a custom extension class with a raw pointer in it. Of
course this is assuming you need to allocate and delete a lot of
arrays quickly.

>> I could live with malloc but the manual memory management is not very fun.
>> If I implement __dealloc__ could I just do the free() there?  Would that
>> work?
>
> Yes, if the lifetime of the memory can be tied to the lifetime of an
> object, this works well. (I do that a lot.)

This is exactly my use case.  Think many millions of creations, copies, and element-wise operations.  I think I will soon have something that will work; it might be useful to add into the cython code somehow.  I imagine other people have similar needs for a stripped-down, super-fast extension type like the one I am building right now.
 

mark florisson

unread,
Nov 10, 2012, 7:28:21 PM11/10/12
to cython...@googlegroups.com
On 10 November 2012 18:21, Ian Bell <ian.h...@gmail.com> wrote:
> I have been playing with memory views and cython arrays. I am not sure what
> exactly is going wrong, but it seems like there is something not right about
> how cython allocates memory for cython arrays (as in
> http://docs.cython.org/src/userguide/memoryviews.html). Creating of memory
> views are also horribly slow. Or I have done something braindead in my code
> but I don't think so. The docs for cython arrays are pretty much
> non-existent. Where are they in the source tree? I couldn't find the
> cython.view module anywhere

cython.view is a fake module. The cython array implementation and
other things can be found in Cython/Utility/MemoryView.pyx.

The reason it is slow is because no one ever cared to optimize it. It
allocates shape and strides independently on the heap. So it contains
at least 3 malloc calls, and a bunch of other code.

If you want to make it really fast, implement it with a free list,
only malloc the data and make sure the code in the constructor is
native where possible. No one is actively working on it, but it's just
Cython code, so patches would be appreciated and merged.
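
To illustrate the free-list idea, here is a toy sketch (not the
MemoryView.pyx internals): cache the most recently freed block and
hand it back when a block of the same size is requested, falling
back to malloc/free otherwise. A real free list would cache more
than one block.

from libc.stdlib cimport malloc, free

cdef double* _cached_block = NULL
cdef Py_ssize_t _cached_size = 0

cdef double* checkout(Py_ssize_t n):
    # Reuse the cached block if it matches the requested size.
    global _cached_block
    cdef double* block
    if _cached_block != NULL and _cached_size == n:
        block = _cached_block
        _cached_block = NULL
        return block
    return <double*> malloc(n * sizeof(double))

cdef void checkin(double* block, Py_ssize_t n):
    # Keep the block around for reuse instead of freeing it right away.
    global _cached_block, _cached_size
    if _cached_block == NULL:
        _cached_block = block
        _cached_size = n
    else:
        free(block)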

mark florisson

unread,
Nov 10, 2012, 7:31:28 PM11/10/12
to cython...@googlegroups.com
On 11 November 2012 00:28, mark florisson <markflo...@gmail.com> wrote:
> On 10 November 2012 18:21, Ian Bell <ian.h...@gmail.com> wrote:
>> I have been playing with memory views and cython arrays. I am not sure what
>> exactly is going wrong, but it seems like there is something not right about
>> how cython allocates memory for cython arrays (as in
>> http://docs.cython.org/src/userguide/memoryviews.html). Creating of memory
>> views are also horribly slow. Or I have done something braindead in my code
>> but I don't think so. The docs for cython arrays are pretty much
>> non-existent. Where are they in the source tree? I couldn't find the
>> cython.view module anywhere
>
> cython.view is a fake module. The cython array implementation and
> other things can be found in Cython/Utility/MemoryView.pyx.
>
> The reason it is slow is because no one ever cared to optimize it. It
> allocates shape and strides independently on the heap. So it contains
> at least 3 malloc calls, and a bunch of other code.
>
> If you want to make it really fast, implement it with a free list,
> only malloc the data and make sure the code in the constructor is
> native where possible. No one is actively working on it, but it's just
> Cython code, so patches would be appreciated and merged.

At the time, I considered whether cython.array should even be
documented; the interface is rather horrid. What you'd really want is
variable-sized memoryviews and C arrays, like 'cdef double[:M, :N]
myarray', with a check to see whether they should be stack- or
heap-allocated (depending also on how they are used).

Andreas van Cranenburgh

unread,
Nov 16, 2012, 9:31:47 AM11/16/12
to cython...@googlegroups.com
That's not an efficient way to create a cpython.array; you want to use clone(). I also suspect that your idea of wrapping a 1-D C array in an object mostly amounts to using cpython.array without the memoryview interface. If you leave out the memoryview interface, you can use the cpython.array pointer directly (so no bounds check):

from cpython.array cimport array, clone
cdef array uintarray = array("I", [])
[...]
def foo():
    cdef array counts = clone(uintarray, 10, False)
    counts.data.as_uint[1] = 2
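
For the double-typed arrays in the original benchmark, the same pattern looks roughly like this (a minimal sketch; the template variable and function name are just for illustration):

from cpython.array cimport array, clone

cdef array _dbl_template = array('d')   # reusable template, typecode 'd' = C double

cdef array make_doubles(Py_ssize_t n, double fill):
    # clone() allocates a new array of length n with the template's typecode;
    # the last argument controls whether the memory is zeroed.
    cdef array a = clone(_dbl_template, n, False)
    cdef Py_ssize_t i
    for i in range(n):
        a.data.as_doubles[i] = fill   # raw pointer access, no bounds checking
    return a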

As for the documentation, look at the docstrings in Cython/Includes/cpython/array.pxd
and the examples in tests/run/pyarray.pyx (paths relative to the directory containing the Cython sources).

Cheers, Andreas

Andreas van Cranenburgh

unread,
Nov 16, 2012, 4:11:44 PM11/16/12
to cython...@googlegroups.com
I changed your example code to follow my recommendations, to see how much overhead Python arrays have:

from cpython.array cimport array, clone
cdef array ar, template = array('d')
for L in [1,10,100,1000,10000]:
    t1 = time.clock()
    for i in range(N):
        ar = clone(template, L, False)
    t2 = time.clock()
    print 'Elapsed time to make array with length of ' +str(L)+' is '+str((t2-t1)/N*1e6)+' us'

And I get the following results:

Elapsed time to make array with length of 1 is 0.46 us
Elapsed time to make array with length of 10 is 0.47 us
Elapsed time to make array with length of 100 is 0.52 us
Elapsed time to make array with length of 1000 is 0.45 us
Elapsed time to make array with length of 10000 is 0.51 us
[... snip numpy much slower...]
Elapsed time with raw c-allocation with length of 1 is 0.39 us
Elapsed time with raw c-allocation with length of 10 is 0.37 us
Elapsed time with raw c-allocation with length of 100 is 0.42 us
Elapsed time with raw c-allocation with length of 1000 is 0.38 us
Elapsed time with raw c-allocation with length of 10000 is 0.5 us


So, in conclusion, there's barely any overhead with cpython.array, and it has the advantage of being part of the standard library and of taking memory management out of your hands.

Andreas van Cranenburgh

unread,
Nov 17, 2012, 8:03:16 AM11/17/12
to cython...@googlegroups.com
I just discovered that, confusingly, cython.array and cpython.array are actually two different modules.
This doesn't change my benchmarks or my conclusion recommending the latter for this case, but perhaps, to avoid confusion, it would be better to just remove cython.array, seeing as it provides a subset of cpython.array (which has an optional memoryview interface) and has the performance limitations noted in this thread.

mark florisson

unread,
Nov 17, 2012, 8:50:57 AM11/17/12
to cython...@googlegroups.com
I'd suggest using NumPy and not allocating small arrays repeatedly.
But cython.view.array does not provide a subset of the functionality:
it supports multi-dimensional and structured arrays ('support' is a
big word, though; it accepts them and allows you to obtain memoryviews
from them). I agree it's not very useful, though; it was never really
meant to be used by end users.

Andreas van Cranenburgh

unread,
Nov 17, 2012, 9:50:02 AM11/17/12
to cython...@googlegroups.com
The overhead of cpython.array over malloc is marginal, while NumPy is more than twice as slow as cpython.array, so the latter is clearly preferred in the 1-dimensional case, especially when the arrays should be lightweight (in this benchmark I used np.empty, so it's uninitialized memory, as with malloc). You can argue that one should avoid allocating many small arrays, but if the problem requires that and you try to allocate one large array instead, you'll end up re-implementing a sort of malloc or arena to manage things, which is not worthwhile in most cases.

[ array creation benchmark script: https://gist.github.com/4096484 ]
Elapsed time to make arrays with lengths [1,10,100,1000,10000]
numpy: 1.09 1.09 1.29 1.35 1.35 us
raw c-allocation: 0.34 0.34 0.46 0.44 0.38 us
array: 0.41 0.4 0.44 0.44 0.45 us
array with memview: [...out of memory...]

That is, if you don't use cpython.array's memoryview support, because when that's enabled there appears to be a huge memory leak (the computer grinds to a halt due to low memory). I've already established that the __releasebuffer__ function in array.pxd is called correctly, so it may be a general bug related to buffers/memory views.

yarde...@gmail.com

unread,
Mar 1, 2014, 3:33:24 PM3/1/14
to cython...@googlegroups.com
Is it possible to use the 'clone' trick to initialize a 2d array, as a substitute for this?

cdef double[:,:] foo = np.empty((N,M),dtype=float)

thanks.

Zaur Shibzukhov

unread,
Apr 16, 2015, 2:58:01 PM4/16/15
to cython...@googlegroups.com


On Sunday, March 2, 2014 at 0:33:24 UTC+4, yarde...@gmail.com wrote:
Is it possible to use the 'clone' trick to initialize a 2d array, as substitute for this?

cdef double[:,:] foo = np.empty((N,M),dtype=float)


It's now possible with arraybuffer.

1. Define an arraybuffer description so it can be reused for creating array buffers.
2. Create the arraybuffer object -- actually a buffer object for typed memoryviews and multidimensional arrays, provided they support the buffer protocol.

from arraybuffer cimport arraybuffer, arraybuffer_descr

cdef arraybuffer_descr ab_descr = arraybuffer_descr((N,M), 'd')
cdef double[:,:] arr = arraybuffer(ab_descr)


Hai Nguyen

unread,
Apr 16, 2015, 5:25:59 PM4/16/15
to cython-users
how's about this?

from cpython.array cimport array as pyarray
from cpython cimport array

cdef int w=100, h=100
cdef pyarray arr0 = array.clone(pyarray('d', []), w*h, zero=True)
cdef double[:] myview_0 = arr0
cdef double[:, :] myview_1 = <double[:w, :h]> &myview_0[0]

myview_1[3, 99] = 100.
assert myview_0[399] == arr0[399] == myview_1[3, 99] == 100.

Hai 


Adri

unread,
Jan 29, 2023, 8:24:10 AM1/29/23
to cython-users
We are developing a project where we create a lot of typed memory views as numpy arrays (https://github.com/rht/mesa_perf), and the creation of memory views as np.ndarray has a big impact, because it involves lots of interaction with Python, so that for small sizes it eats up much of the speed-up. I found that there are some "tricks" to improve the initialization step with a python array and the clone function (found on https://stackoverflow.com/questions/18462785/what-is-the-recommended-way-of-allocating-memory-for-a-typed-memory-view), so for numeric 1d arrays we can do it that way, but we are stuck on generalizing it for numeric 2+d arrays. Is there a way to do that? If not, what would be a good way to manually create 2+d memory views (possibly even managing python objects) in a safe way? I tried the suggested
cdef double[:, :] myview_1 = <double[:w, :h]> &myview_0[0]
but it doesn't seem to work in my case (some segfault happens).

Prakhar Goel

unread,
Jan 29, 2023, 4:07:48 PM1/29/23
to cython...@googlegroups.com
Adri,

I took a quick (5min) look at the code in your GH link but it's hard
to understand what you're trying to do without more
context/documentation. My general advice would be to take a step back.
Creating small numpy arrays will never be fast. You can create the 2d
array by allocating the underlying store and wrapping it using
functions like PyArray_SimpleNewFromData (or better would be to create
a properly sized array and then fill it in-place). But it doesn't
really buy you that much. If you're going to extract the individual
1-d arrays, you'll end up allocating the individual 1-d arrays anyway
and likely get very little performance gain.
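
For what it's worth, a rough sketch of that wrapping approach (the
caller still owns the buffer, has to keep it alive for as long as
the ndarray is in use, and must free it afterwards; the function
name is made up):

cimport numpy as cnp

cnp.import_array()   # required before using the NumPy C API

cdef object wrap_2d(double* data, cnp.npy_intp rows, cnp.npy_intp cols):
    cdef cnp.npy_intp dims[2]
    dims[0] = rows
    dims[1] = cols
    # Creates an ndarray that views the existing buffer; no copy is made.
    return cnp.PyArray_SimpleNewFromData(2, dims, cnp.NPY_DOUBLE, <void*> data)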

Ultimately, there's no substitute for thinking about your problem from
first principles and then figuring out how to fit that within the
constraints of the machine, language, etc... What fundamental
computation are you trying to do and why does it require creating all
these little arrays? What interface/information do the users of your
library actually need?
--
________________________
Warm Regards
Prakhar Goel

D Woods

unread,
Jan 29, 2023, 4:49:24 PM1/29/23
to cython-users
One bit of advice based on a quick skim of your repository (not related to your core question):

A 1D memoryview of type `object` is basically the same structure in memory as a Python list, but probably less efficient, because it's a little more complicated to set up and validate. Cython deals with arguments typed as `list` pretty well, so if you're not dealing with fixed numeric values I'd just use lists instead. (Lists can't be 2D, or sliced and strided, but you don't look to be using either of those.)
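
As a tiny sketch of what that looks like in practice (assuming the list holds Python floats; the function name is made up):

def total(list values):
    # Typing the argument as `list` lets Cython index it with the fast
    # C-level list API instead of generic object indexing.
    cdef Py_ssize_t i, n = len(values)
    cdef double s = 0.0
    for i in range(n):
        s += values[i]   # each element is coerced to a C double
    return s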

Adri

unread,
Jan 30, 2023, 3:04:42 AM1/30/23
to cython-users
Oh, thank you very much, guys, for taking the time to look at the repo! Using lists does indeed seem to be the best way forward to me. If you have time, I would really appreciate it if you took a look at the Grid.pyx (+pxd) file with the newly implemented grid_only_list class, considering the content of the message I posted in the other discussion, https://groups.google.com/g/cython-users/c/TC9Ktl0Ryf4 , which I quote:

> thanks for all your help, really appreciated! the repo of the project is here: https://github.com/rht/mesa-perf, the library when completed should become part of the back-
> end of the pure python library https://github.com/projectmesa/mesa which is an agent based modeling library in pure python and we are constrained to keep the
> interface as it is, so many times the user wants lists in output or gives lists in input. We have two aims: 1. Improve the performance of the main library 2. Create some
> sort of interface for the Cython back-end so that the user could improve the performance even more if he wants to do that. The main aim is 1. but the conversions when > we have small inputs list are gone in the creation of the memory views (I posted a question on stack overflow on the matter:https://stackoverflow.com/questions
> so that maybe going fully with lists (I see big speed-up sometimes even with them) and drop 2. can be an option. If you have any suggestion, I will be really grateful!