cython slower than numpy and numba on trivial array loops?


Antonio Suriano

Nov 30, 2015, 3:25:01 PM
to cython-users
Hi

I am trying some trivial loops, like copying, summing, and dot products.

For example:

cimport cython
cimport numpy as np
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def copia(np.ndarray[np.float64_t] a, np.ndarray[np.float64_t] b):
    cdef int i, N = a.shape[0]
    for i in prange(N, nogil=True):
        b[i] = a[i]
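
(Build note: prange only actually runs in parallel when the extension is compiled and linked with OpenMP; a minimal setup.py sketch, where the module and file names are assumptions:)

# setup.py -- sketch only; "copyloops" and the .pyx file name are assumptions
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "copyloops",
    ["copyloops.pyx"],
    include_dirs=[np.get_include()],   # needed for cimport numpy
    extra_compile_args=["-fopenmp"],   # gcc/mingw; MSVC would use /openmp
    extra_link_args=["-fopenmp"],
)

setup(ext_modules=cythonize([ext]))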

I have to use threads to match np.copyto and numba's naive jitting of the same loop.

On a single thread, at every size, this runs at half the speed of numpy and numba.


What is wrong in my code?
   

Sturla Molden

Dec 1, 2015, 3:27:55 AM
to cython...@googlegroups.com
On 30/11/15 20:18, Antonio Suriano wrote:

> What is wrong in my code?

You are probably trying to beat the speed of a memcpy, which is
implemented in the operating system. That is silly business. Do you have
any idea how many work hours have been put into optimizing that
particular piece of code?

Also, you did not specify the number of dimensions and whether the array
is contiguous.

Another thing is you should use typed memoryviews instead of the old
numpy syntax (though it will not make your code faster).

As for the dot product, you are competing against the BLAS subroutines
*GEMM, *GEMV, and *DOT. Chances are you are competing against
performance code written by engineers at Intel, Apple or AMD, depending
on which BLAS implementation you use (that is, MKL, Accelerate
Framework, or ACML, respectively). And if you are not, you are likely
trying to beat the performance of the most optimized kernels in ATLAS or
OpenBLAS. You will not succeed, so spend your time on something less
futile instead.


Sturla

Antonio Suriano

Dec 1, 2015, 4:10:32 AM
to cython-users, sturla...@gmail.com
1-dimensional contiguous arrays of different sizes.
I wrote the function with np.ndarray[np.float64_t] because double[:] was a little bit slower.

The point is this:

why does the LLVMLite JIT (numba) succeed in matching np.copyto and memcpy, and Cython does not?

Anyhow, if you write a memcpy wrapper in Cython, or use a[:] = b (which probably does the same thing), you don't reach numba and np.copyto.

So is this intrinsic to Cython functions? (call overhead)

Robert Bradshaw

Dec 1, 2015, 4:43:34 AM
to cython...@googlegroups.com
On Tue, Dec 1, 2015 at 1:10 AM, Antonio Suriano
<antonio...@gmail.com> wrote:
> 1-dimensional contiguous arrays of different sizes.
> I wrote the function with np.ndarray[np.float64_t] because double[:] was a
> little bit slower.
>
> The point is this:
>
> why does the LLVMLite JIT (numba) succeed in matching np.copyto and memcpy,
> and Cython does not?
>
> Anyhow, if you write a memcpy wrapper in Cython, or use a[:] = b (which
> probably does the same thing), you don't reach numba and np.copyto.

Cython's buffer/memoryview optimization only applies to direct
indexing, and as mentioned a naive C loop (which this compiles down
to) is not going to beat memcpy or an optimized BLAS, so you're asking
"Why isn't Cython faster than C?"

> So is this intrinsic to Cython functions? (call overhead)

There is non-trivial "call" overhead in acquiring the buffer. How big
are your arrays? Do the timings scale proportionally with the size of
the array? As you noticed a significant difference between the old
buffer syntax and memoryviews, I'd guess your arrays are small enough
that this dominates.

Also, Cython doesn't know your arrays are contiguous; declaring them
as np.ndarray[np.float64_t, mode='c'] could help.
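
(A quick sketch of what that declaration looks like, plus the typed-memoryview equivalent double[::1]; the function names are only illustrative:)

cimport cython
cimport numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def copia_buf(np.ndarray[np.float64_t, mode='c'] a,
              np.ndarray[np.float64_t, mode='c'] b):
    # old buffer syntax, declared C-contiguous
    cdef Py_ssize_t i
    for i in range(a.shape[0]):
        b[i] = a[i]

@cython.boundscheck(False)
@cython.wraparound(False)
def copia_mv(double[::1] a, double[::1] b):
    # typed memoryview; ::1 declares the last axis contiguous
    cdef Py_ssize_t i
    for i in range(a.shape[0]):
        b[i] = a[i]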


Antonio Suriano

Dec 1, 2015, 4:50:17 AM
to cython-users


On Tuesday, December 1, 2015 at 10:43:34 AM UTC+1, Robert Bradshaw wrote:


Cython's buffer/memoryview optimization only applies to direct
indexing, and as mentioned a naive C loop (which this compiles down
to) is not going to beat memcpy or an optimized BLAS, so you're asking
"Why isn't Cython faster than C?"

Not faster, but as fast, when possible ....
 

> So is this intrinsic to Cython functions? (call overhead)

There is non-trivial "call" overhead in acquiring the buffer. How big
are your arrays? Do the timings scale proportionally with the size of
the array? As you noticed a significant difference between the old
buffer syntax and memoryviews, I'd guess your arrays are small enough
that this dominates.

Also, Cython doesn't know your arrays are contiguous; declaring them
as np.ndarray[np.float64_t, mode='c'] could help.


Well I am a bit confused here: aren't 1d arrays always contiguous?
 

I tried arrays from 1 element to 1,000,000 elements. The difference scales linearly with size (a constant 2x factor). For large arrays, using 4/8 threads reduces the difference.

The simple code
import numba

@numba.jit
def copia(a, b):
    for i in range(len(a)):
        a[i] = b[i]

is as fast as np.copyto.

Could it be some simd/sse3 vectorization stuff?

Sturla Molden

Dec 1, 2015, 5:18:55 AM
to cython...@googlegroups.com
On 01/12/15 10:10, Antonio Suriano wrote:

> why does the LLVMLite JIT (numba) succeed in matching np.copyto and memcpy,
> and Cython does not?

Numba is a JIT compiler and can see things at run-time that Cython
cannot. Cython is a static compiler.

np.copyto makes a call to PyArray_AssignArray, which in turn calls
PyArray_GetDTypeTransferFunction, which finally returns a function that
calls memmove if the two arrays are both contiguous.

You can test this in your own Cython code for comparison:


from libc.string cimport memmove

def copia(double[::1] a, double[::1] b):
    cdef Py_ssize_t Na = a.shape[0]
    cdef Py_ssize_t Nb = b.shape[0]
    if Na != Nb:
        raise ValueError("length of A and B must match")
    memmove(<void*> &b[0], <void*> &a[0], Na * sizeof(double))


Sturla

Sturla Molden

Dec 1, 2015, 5:25:39 AM
to cython...@googlegroups.com
On 01/12/15 10:50, Antonio Suriano wrote:

> The simple code
>
> @numba.jit
> def copia(a, b):
>     for i in range(len(a)):
>         a[i] = b[i]
>
> is as fast as np.copyto.
>
> Could it be some simd/sse3 vectorization stuff?

No.

And I hope you realize the code is not even equivalent, as your version
will cause havoc when the memory regions overlap.

Sturla

Antonio Suriano

Dec 1, 2015, 6:32:09 AM
to cython-users, sturla...@gmail.com
Very interesting version.

For large arrays your function matches np.copyto, but for small 1-d arrays (about 100 elements) it is 5 times slower than the naive array loop (which is itself 1.8x slower than np.copyto and the numba-jitted version).

Sturla Molden

Dec 1, 2015, 7:56:06 AM
to cython...@googlegroups.com
On 01/12/15 12:32, Antonio Suriano wrote:

> Very interesting version.
>
> For large arrays your function matches np.copyto, but for small 1-d arrays
> (about 100 elements) it is 5 times slower than the naive array loop (which is
> itself 1.8x slower than np.copyto and the numba-jitted version).

That is because it uses the buffer protocol (typed memoryviews) instead
of the NumPy C API. The buffer protocol incurs a constant but
substantial overhead on buffer acquisition, which in this case happens
when the function is called, because its arguments are typed memoryviews.
If I had used the NumPy C API the performance would have been equal, but
the amount of boilerplate code would have obfuscated the important part.

You cannot move memory faster than memcpy (unique buffers) or memmove
(possibly overlapping buffers), so you can just as well stop trying. All
the differences you see are differences in the boilerplate overhead.
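
(A rough timing sketch of that effect; the module name "copytest" is an assumption, and copia is the memmove-based function above:)

import numpy as np
import timeit

from copytest import copia  # memmove-based function above; module name is an assumption

# The ratio should be well above 1 for small arrays, where buffer acquisition
# dominates, and close to 1 for large arrays, where the memmove itself dominates.
for n in (100, 1000000):
    a = np.arange(n, dtype=np.float64)
    b = np.empty(n)
    t_cy = timeit.timeit(lambda: copia(a, b), number=10000)
    t_np = timeit.timeit(lambda: np.copyto(b, a), number=10000)
    print(n, t_cy / t_np)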

Sturla


Antonio Suriano

Dec 1, 2015, 8:23:31 AM
to cython-users, sturla...@gmail.com
I am satisfied.

Of course I was not interested in writing my own copy function.
I was just wondering whether I was using the tools or the compiler flags badly, or whatever, with these toy examples.

It is OK for me to call np.copyto or np.dot in my real code.
Using Cython I managed to shrink some running time from 8 seconds to 45 ms, and that is really impressive.

Moreover, I did some testing with compilers. Unfortunately, for a set of reasons, I have to use 32-bit open source compilers on Windows.

My choice for production has been mingw gcc 4.7.1 (I overwrote the Anaconda version because it lacks OpenMP).

I had a very bad experience with 5.1.0 TDM-GCC because it has some issues with objects embedded in classes (random segmentation faults in complex objects).
At home I also tried 5.2.0 mingw64 (32-bit generation) with simple classes (and negligible performance gain), with no segfault problems.

Antonio Suriano

Dec 1, 2015, 8:37:20 AM
to cython-users

You cannot move memory faster than memcpy (unique buffers) or memmove
(possibly overlapping buffers) so you can just as well stop trying. All
the differences you see are differences in the boilerplate overhead.

Sturla



By the way, in my jobs I have always used high-level languages like MPI Fortran 95, abstract C++ (design patterns and awful template metaprogramming stuff), Matlab, VBA, and I never got my hands dirty on the metal.

Just out of intellectual curiosity, what kind of book would you recommend to learn C programming?
When I was a student the reference was the tome written in the seventies with the famous big blue C on a white cover: K&R. Is it still valid?


Sturla Molden

Dec 1, 2015, 8:43:43 AM
to cython...@googlegroups.com
On 01/12/15 14:23, Antonio Suriano wrote:

> Moreover I did some testing with compilers. Unfortunately for a set of
> reasons I have to use 32 bit open source compilers on windows.
>
> My choice for production has been mingw gcc 4.7.1 (I overwrote the
> anaconda version because it misses openmp).
>
> I had a very bad experience with 5.1.0 TDM-GCC because it has some
> issues with objects embedded in classes (random segmentation faults in
> complex objects).
> At home I tried also 5.2.0 mingw64 (32 bit generation) with simple
> classes (and negligible performance gain) with no seg faults problems.

You can expect random segfaults with any of these 32-bit compilers,
because from GCC 4.6 onwards they assume the incoming stack is aligned
to 16 bytes, whereas 32-bit MSVC only guarantees 4-byte alignment.
GCC 4.4 was still compatible with MSVC.

Combine this with 32-bit Python compiled with MSVC and it can blow up in
your face at any moment. We therefore need to use a special build of
MinGW which is modified to align the stack like MSVC does.

On 64-bit systems both MinGW and MSVC align the stack at 16 bytes, so
there is no such problem there.

For 32-bit Python you either need to use this:

https://github.com/numpy/numpy/wiki/Mingw-static-toolchain

or a MinGW based on GCC 4.4.



Sturla

Antonio Suriano

Dec 1, 2015, 9:01:40 AM
to cython-users

You can expect random segfaults with any of these 32-bit compilers,
because from GCC 4.6 onwards they assume the incoming stack is aligned
to 16 bytes, whereas 32-bit MSVC only guarantees 4-byte alignment.
GCC 4.4 was still compatible with MSVC.

Combine this with 32-bit Python compiled with MSVC and it can blow up in
your face at any moment. We therefore need to use a special build of
MinGW which is modified to align the stack like MSVC does.

On 64-bit systems both MinGW and MSVC align the stack at 16 bytes, so
there is no such problem there.

For 32-bit Python you either need to use this:

https://github.com/numpy/numpy/wiki/Mingw-static-toolchain

or a MinGW based on GCC 4.4.


Very good to know. So far I have never had any issue with 4.7 (it is shipped with Anaconda, by the way) and Python 3.5. The only things I had to do were to patch the cygwin.py distutils configuration file and to generate my own libpython35.a and libmsvcp140.a with dlltool.

I am not going to change it because it works now but I'll remember your suggestion in case something happens.

Sturla Molden

Dec 1, 2015, 9:05:44 AM
to cython...@googlegroups.com
On 01/12/15 14:37, Antonio Suriano wrote:

> Just out of intellectual curiosity, what kind of book would you recommend
> to learn C programming?
> When I was a student the reference was the tome written in the seventies
> with the famous big blue C on a white cover: K&R. Is it still valid?


Yes, K&R is good for ANSI C (C89), but it does not cover the later ISO
standards (C99 and C11). C has not changed as much as C++, though. We are
still using ANSI C for NumPy.

There are also "C: A Reference Manual" by Samuel P. Harbison and Guy L.
Steele (Prentice Hall) and "C in a Nutshell" from O'Reilly. Both of them
are quite good and also cover ISO C.


Sturla


Sturla Molden

Dec 2, 2015, 1:45:39 AM
to cython...@googlegroups.com
On 01/12/15 14:37, Antonio Suriano wrote:

> By the way, in my jobs I always used high level languages like MPI
> Fortran 95, abstract c++ (design pattern and awful template
> metaprogramming stuff), Matlab, VBA and I never got my hands dirty on
> the metal.

CPUs have evolved to run Java, .NET and C++ efficiently. So today we
typically see that the ranking of performance is

C++ > C > Fortran

This is the opposite of what was the case 20 years ago. If performance
is what you are after, C++ seems the safest bet.

C++ can also be more productive than C, particularly if you use the STL.
The problem with C++ is that substandard programmers can cause huge
problems if you have them on your team. They can do less harm with C, or
at least it will be easier to weed them out. C also has a much lower rate
of gotchas and WTFs. Unless you use setjmp/longjmp, the order of program
flow will be what you see on your screen.

This is a good summary:

http://250bpm.com/blog:4
http://250bpm.com/blog:8


Sturla




Antonio Suriano

Dec 2, 2015, 3:03:57 AM
to cython-users, sturla...@gmail.com


On Wednesday, December 2, 2015 at 7:45:39 AM UTC+1, Sturla Molden wrote:
On 01/12/15 14:37, Antonio Suriano wrote:

> By the way, in my jobs I always used high level languages like MPI
> Fortran 95, abstract c++ (design pattern and awful template
> metaprogramming stuff), Matlab, VBA and I never  got my hands dirty on
> the metal.

CPUs have evolved to run Java, .NET and C++ efficiently. So today we
typically see that the ranking of performance is

     C++ > C > Fortran


Well, in the far future I bet that Rust, Scala and Julia (or equivalent evolutions) will replace present-day languages.

I am not a fan of OOP. It creates much more complexity and many more problems than necessary.

Antonio Suriano

Dec 2, 2015, 1:13:19 PM
to cython-users, sturla...@gmail.com

You can expect random segfaults with any of these 32-bit compilers,
because from GCC 4.6 onwards they assume the incoming stack is aligned
to 16 bytes, whereas 32-bit MSVC only guarantees 4-byte alignment.
GCC 4.4 was still compatible with MSVC.


The trick with TDM-GCC 5.1 and mingw64 5.2 was to add the flag -mincoming-stack-boundary=2 in distutils, and then everything is fine even with complex objects.
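
(For illustration, a minimal setup.py sketch showing where such a flag could go; the module and file names are assumptions, and the flag only makes sense for 32-bit MinGW GCC >= 4.6 building against an MSVC-compiled Python:)

# setup.py -- sketch; "copytest" and the .pyx file name are assumptions
from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    "copytest",
    ["copytest.pyx"],
    # Tell GCC to assume only 4-byte (2**2) incoming stack alignment, as MSVC does:
    extra_compile_args=["-mincoming-stack-boundary=2"],
)

setup(ext_modules=cythonize([ext]))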

Sturla Molden

Dec 2, 2015, 2:18:15 PM
to cython...@googlegroups.com
On 02/12/15 19:13, Antonio Suriano wrote:

> The trick with TDM-GCC 5.1 and mingw64 5.2 was to add the flag
> -mincoming-stack-boundary=2 in distutils, and then everything is fine even
> with complex objects.

Only use this when compiling for 32-bit Python.


