[newb] poor numpy performance

Neal Becker

unread,

Jul 17, 2012, 11:15:10 AM7/17/12

to cython...@googlegroups.com

I tried benchmarking a cython version of a simple function, and the cython
version is 3x slower than the python.

It's a simple function to do spreading of a sequence.

----- python version
def pyspread (x, z, s):
return np.repeat (x, s) * z

---- cython version
import numpy as np
cimport numpy as np
cimport cython # boundscheck decorator

@cython.boundscheck(False)
@cython.wraparound(False)
def spread_cmplx (np.complex[:] x, np.complex[:] z, int s):
cdef np.ndarray w = np.zeros([z.shape[0]], dtype=np.complex)
cdef int i
cdef int L = z.shape[0]
for i in range (L):
w[i] = x[i/s] * z[i]
return w

Bradley Froehle

unread,

Jul 17, 2012, 11:55:32 AM7/17/12

to cython...@googlegroups.com

In this case you might be best off with the "python" version of the code. It is simple, concise, almost all of the processing will be done at the C level by NumPy.

As to your Cython code, you can use the `-a` flag to produce an annotated listing of the source (e.g., `cython -a spread.pyx`) which can help you determine what is slow. In this case, it looks like we are having to call a function for each `w[i] = ...` line.

Try providing more information in the `cdef ndarray w` line. For example, `cdef ndarray[np.complex, ndim=1, mode="c"] = ...`, that should speed things up.

-Brad

Bradley Froehle

unread,

Jul 17, 2012, 11:59:46 AM7/17/12

to cython...@googlegroups.com

In addition, you'll probably want to use `@cython.cdivision(True)` (and check that `s > 0`).

Neal Becker

unread,

Jul 17, 2012, 12:04:23 PM7/17/12

to cython...@googlegroups.com

Thanks for the suggestion, I now have:

----------------

import numpy as np
cimport numpy as np
cimport cython # boundscheck decorator

@cython.boundscheck(False)
@cython.wraparound(False)
def spread_cmplx (np.complex[:] x, np.complex[:] z, int s):

cdef np.ndarray[np.complex, ndim=1, mode="c"] w = np.zeros([z.shape[0]],

dtype=np.complex)
cdef int i
cdef int L = z.shape[0]
for i in range (L):
w[i] = x[i/s] * z[i]
return w

-------------
But this gives runtime error:

File "spread.pyx", line 10, in spread.spread_cmplx (spread.cpp:1786)
cdef np.ndarray[np.complex, ndim=1, mode="c"] w = np.zeros([z.shape[0]],
dtype=np.complex)
ValueError: Buffer dtype mismatch, expected 'complex object' but got 'complex
double'

Robert Bradshaw

unread,

Jul 17, 2012, 1:04:08 PM7/17/12

to cython...@googlegroups.com

On Tue, Jul 17, 2012 at 9:04 AM, Neal Becker <ndbe...@gmail.com> wrote:
> Thanks for the suggestion, I now have:
>
> ----------------
> import numpy as np
> cimport numpy as np
> cimport cython # boundscheck decorator
>
> @cython.boundscheck(False)
> @cython.wraparound(False)
> def spread_cmplx (np.complex[:] x, np.complex[:] z, int s):
> cdef np.ndarray[np.complex, ndim=1, mode="c"] w = np.zeros([z.shape[0]],
> dtype=np.complex)
> cdef int i
> cdef int L = z.shape[0]
> for i in range (L):
> w[i] = x[i/s] * z[i]
> return w
> -------------
> But this gives runtime error:
>
> File "spread.pyx", line 10, in spread.spread_cmplx (spread.cpp:1786)
> cdef np.ndarray[np.complex, ndim=1, mode="c"] w = np.zeros([z.shape[0]],
> dtype=np.complex)
> ValueError: Buffer dtype mismatch, expected 'complex object' but got 'complex
> double'

Try np.ndarray[double complex, ndim=1, mode="c"]. Also, no need to
initialize it to zeros.

Interestingly, I was unable to get your 3x slowdown, though it was
slightly slower (which is worrisome for the efficiency of np.repeat(),
'cause I'd echo Bradley's sentiment that the Python one should drop
down to fast enough pure C). You can also be faster by avoiding the
division (division is quite expensive, even if you turn on cdivision.

- Robert

import numpy as np
s = 5; n = 100
x = np.arange(n, dtype=np.complex)
z = np.arange(s * n, dtype=np.complex)

timeit("pyspread(x, z, s)")
timeit("spread_cmplx(x, z, s)")
timeit("spread_cmplx_typed(x, z, s)")
timeit("spread_cmplx_no_div(x, z, s)")

625 loops, best of 3: 41.1 µs per loop
625 loops, best of 3: 41.4 µs per loop
625 loops, best of 3: 8.72 µs per loop
625 loops, best of 3: 6.19 µs per loop

---------

def pyspread (x, z, s):
return np.repeat (x, s) * z

import numpy as np
cimport numpy as np
cimport cython # boundscheck decorator

@cython.boundscheck(False)
@cython.wraparound(False)

def spread_cmplx (np.ndarray[double complex] x, np.ndarray[double
complex] z, int s):

cdef np.ndarray w = np.zeros([z.shape[0]], dtype=np.complex)

cdef int ix

cdef int L = z.shape[0]

for ix in range (L):
w[ix] = x[ix/s] * z[ix]
return w

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
def spread_cmplx_typed (np.ndarray[double complex] x,
np.ndarray[double complex] z, int s):
cdef np.ndarray[double complex, mode="c"] w =
np.ndarray([z.shape[0]], dtype=np.complex)
cdef int ix

cdef int L = z.shape[0]

for ix in range (L):
w[ix] = x[ix/s] * z[ix]
return w

@cython.boundscheck(False)
@cython.wraparound(False)
def spread_cmplx_no_div (np.ndarray[double complex] x,
np.ndarray[double complex] z, int s):
cdef np.ndarray[double complex, mode="c"] w =
np.ndarray([z.shape[0]], dtype=np.complex)
cdef int i, j, ix

cdef int L = z.shape[0]

for i in range (x.shape[0]):
for j in range(s):
ix = i*s + j
w[ix] = x[i] * z[ix]
return w

Neal Becker

unread,

Jul 17, 2012, 1:26:54 PM7/17/12

to cython...@googlegroups.com

Thanks. Now I'm getting:

python test_spread.py
4.79088401794 << numpy
4.05955004692 << cython

But,

python test_spread.py
4.7059340477 << numpy
2.57622599602 << boost::python + boost::ndarray

One thing I wonder about,

I've always been told that indexed addressing is to be avoided. The
boost::python code uses iterators instead of indexing. Perhaps that accounts
for the difference?

OTOH, in my own recent (but very limited) benchmarking, I haven't seen
meaningful speed differences between index and iterator addressing.

Of course, numpy, cython, and engineering education in general teaches indexing.
And code using indexing is often easier to understand.

Any thoughts?

Dag Sverre Seljebotn

unread,

Jul 17, 2012, 1:59:12 PM7/17/12

to cython...@googlegroups.com

My bet is currently on different C++ compiler flags.

Did you post your C++ code anywhere?

Indexing is not supposed to be slower than iterators in most situations;
any advice you've heard is almost certainly about how C++ purists like
their C++ code to be, i.e. just based on aesthetics and not some
fundamental difference.

(OK, I guess theoretically you save one CPU register which can enter in
some situations, but that's certainly not the cause of such a
differencee...)

Dag

Robert Bradshaw

unread,

Jul 17, 2012, 2:14:14 PM7/17/12

to cython...@googlegroups.com

On Tue, Jul 17, 2012 at 10:26 AM, Neal Becker <ndbe...@gmail.com> wrote:
> Thanks. Now I'm getting:
>
> python test_spread.py
> 4.79088401794 << numpy
> 4.05955004692 << cython

Post code? If by "numpy" you mean the pure Python version, I was
seeing a 7x speedup with Cython. Also, how big are your arrays? Numpy
has a fairly high per-array overhead which may account for the
difference.

Neal Becker

unread,

Jul 17, 2012, 2:26:48 PM7/17/12

to cython...@googlegroups.com

This is test code for comparing numpy version vs. c code, where c code is either
the cython version, or the one using boost::python (depending which .so I build)

--- test code (approximately)

import numpy as np
import spread

def pyspread (x, z, s):
return np.repeat (x, s) * z

def pydespread (x, z, s):
return (x * z).reshape (len(x)/s,s).sum (axis=1)/s

from timeit import timeit

def test (fnc):
x = np.zeros (100000/4, dtype=complex)
s = 4
z = np.zeros (100000, dtype=complex)

w = fnc (x, z, s)

def test2 (fnc):
x = np.zeros (100000, dtype=complex)
s = 4
z = np.zeros (100000, dtype=complex)

w = fnc (x, z, s)

print timeit ('test(pyspread)', 'from __main__ import test,pyspread; import
numpy as np;', number=1000)
print timeit ('test(spread_cmplx)', 'from __main__ import test; from spread
import spread_cmplx;import numpy as np;', number=1000)

-----------------------

Python code:
shown above

-------------------------
Cython code:

import numpy as np
cimport numpy as np
cimport cython # boundscheck decorator

@cython.boundscheck(False)
@cython.wraparound(False)

def spread_cmplx (double complex[:] x, double complex[:] z, int s):
cdef np.ndarray[double complex, ndim=1, mode="c"] w=

np.ndarray([z.shape[0]], dtype=np.complex)
cdef int ix

cdef int i
cdef int L = z.shape[0]
for i in range (L):
w[i] = x[i/s] * z[i]
return w

---------------------
boost::python c++ code

I'm only showing a piece, because you can't build it without various other
libraries, but hopefully gives the idea:

template<typename out_t, typename in_t, typename sin_t, typename spread_t>
inline out_t spread (in_t const& in, sin_t const& sin, spread_t const& spread,
int spreading) {
if (boost::size (in) * spreading != boost::size (sin))
throw std::runtime_error ("spread size mismatch");

out_t out (boost::size (in) * spreading);

typename boost::range_const_iterator<in_t>::type i = boost::begin (in);
typename boost::range_const_iterator<in_t>::type e = boost::end (in);
typename boost::range_const_iterator<sin_t>::type s = boost::begin (sin);
typename boost::range_iterator<out_t>::type o = boost::begin (out);

for (; i != e; ++i) {
for (int c = 0; c < spreading; ++c, ++o, ++s) {
*o = *i * spread (*s);
}
}

return out;
------------------------------------

Dag Sverre Seljebotn

unread,

Jul 17, 2012, 2:33:05 PM7/17/12

to cython...@googlegroups.com

On 07/17/2012 08:26 PM, Neal Becker wrote:
> This is test code for comparing numpy version vs. c code, where c code is either
> the cython version, or the one using boost::python (depending which .so I build)

Did you read Robert Bradshaw's post at all? It doesn't seem like it from
the test code you posted.

a) using @cdivision(True) (or use unsigned integers); this is likely
the cause of the speed difference between C++ and Cython, because Cython
uses Python-style division of negative numbers which is more costly

b) Robert's spread_cmplx_no_div code which avoids integer division and
should be much faster than the C++ code posted below (which doesn't
include the spread() function, so I have to guess what it is doing).

Dag

Bradley Froehle

unread,

Jul 17, 2012, 2:38:29 PM7/17/12

to cython...@googlegroups.com

Your speed up from the boost code is due to the reorganization of the loops to avoid any divisions. Try benchmarking:

@cython.boundscheck(False)

@cython.wraparound(False)

def cyspread (np.complex[::1] x, np.complex[::1] z, int s):

cdef np.ndarray[double complex, ndim=1, mode="c"] w = \

np.zeros(len(z), dtype=np.complex)

cdef int i, j

for j in range(len(z)/s):

for i in range(j*s, (j+1)*s):

w[i] = x[j] * z[i]

return w

Robert Bradshaw

unread,

Jul 17, 2012, 2:43:59 PM7/17/12

to cython...@googlegroups.com

You're timing the creation of x and z, which (given that the code
you're trying to test is O(n) as well) will obscure the difference in
timings the code you actually care about. And you're still doing
division, so it's not even comparing equivalent code.

What is test2? Is this needed for your boost "version" (that isn't
really computing the same thing?)

Are you sure you're even calculating the same thing here? What is spread?

Neal Becker

unread,

Jul 17, 2012, 3:18:37 PM7/17/12

to cython...@googlegroups.com

I confirm that the integer division makes a big diff.

I moved the array creation functions out of the test loop as well.

31.5656440258 << numpy
20.485667944 << cython with division + cdivision(False)
15.0117650032 << cython with division + cdivision(True)
8.26358604431 << cython no division
6.03341507912 << c++

Well, at least it's a lot closer.

mark florisson

unread,

Jul 17, 2012, 3:23:56 PM7/17/12

to cython...@googlegroups.com

It's still unpacking a bunch of buffers for every test call, which has
nontrivial overhead (depending on your data size, this may or may not
make a difference).

Neal Becker

unread,

Jul 17, 2012, 3:32:53 PM7/17/12

to cython...@googlegroups.com

The test code looks like this:
spread2 is the cython module
spread is the c++ module

import numpy as np

def pyspread (x, z, s):
return np.repeat (x, s) * z

from timeit import timeit

x = np.zeros (100000/4, dtype=complex)

z = np.zeros (100000, dtype=complex)

def test (fnc):
s = 4

w = fnc (x, z, s)

print timeit ('test(pyspread)', 'from __main__ import test,pyspread,x,z; import
numpy as np;', number=10000)
print timeit ('test(spread_cmplx)', 'from __main__ import test,x,z; from spread2
import spread_cmplx;import numpy as np;', number=10000)
print timeit ('test(spread_cmplx_no_div)', 'from __main__ import test,x,z; from
spread2 import spread_cmplx_no_div;import numpy as np;', number=10000)
print timeit ('test(spread_cmplx)', 'from __main__ import test,x,z; from spread
import spread_cmplx;import numpy as np;', number=10000)

Robert Bradshaw

unread,

Jul 17, 2012, 5:05:37 PM7/17/12

to cython...@googlegroups.com

True. Do you have a pointer to your final "cython no division" code?

- Robert

Neal Becker

unread,

Jul 18, 2012, 6:54:06 AM7/18/12

to cython...@googlegroups.com

cython no division is your version:

-------------
def spread_cmplx_no_div (double complex[:] x, double complex[:] z, int s):
cdef np.ndarray[double complex, mode="c"] w = np.ndarray([z.shape[0]],
dtype=np.complex)
cdef int i, j, ix

cdef int L = z.shape[0]

for i in range (x.shape[0]):
for j in range(s):
ix = i*s + j
w[ix] = x[i] * z[ix]
return w

-----------------

I changed the arguments from np.ndarray[double complex] to the newer double
complex[:] just to verify there was no difference in speed between these two,
and there wasn't.

I'm surprised about the division. As a long-time EE, and having worked closely
with hardware engineers, I know they try to avoid division - but I had assumed
that on modern processors they had thrown sufficient resources at it so that
integer division would be single cycle. I guess I was wrong.

Dag Sverre Seljebotn

unread,

Jul 18, 2012, 7:13:35 AM7/18/12

to cython...@googlegroups.com

On 07/18/2012 12:54 PM, Neal Becker wrote:
> I'm surprised about the division. As a long-time EE, and having worked closely
> with hardware engineers, I know they try to avoid division - but I had assumed
> that on modern processors they had thrown sufficient resources at it so that
> integer division would be single cycle. I guess I was wrong.

Definitely this hasn't changed. Division is rarely used, and you can
always find other uses for the transistors needed. No matter how many
transistors you pack, we'd all rather have a few kb more cache, or more
floating point units, than fast division.

Under item 4) here you can find throughput and latency for DIV on
various platforms:

http://www.agner.org/optimize/

It's very slow.

Dag

Robert Bradshaw

unread,

Jul 18, 2012, 12:57:18 PM7/18/12

to cython...@googlegroups.com

I was just using this because I was using a Sage notebook that has an
older release of Cython.

I was able to get a modest speedup by manually pulling things out of the loop

@cython.boundscheck(False)
@cython.wraparound(False)
def spread_cmplx_no_div2 (np.ndarray[double complex, mode="c"] x,
np.ndarray[double complex, mode="c"] z, int s):

cdef np.ndarray[double complex, mode="c"] w =
np.ndarray([z.shape[0]], dtype=np.complex)
cdef int i, j, ix
cdef int L = z.shape[0]

cdef double complex xi

for i in range (x.shape[0]):

xi = x[i]
for ix in range(i*s, i*s + s):
w[ix] = xi * z[ix]
return w

which may bring it closer to the C++ version (in form and performance)
I am surprised the C compiler doesn't do this itself.

> I'm surprised about the division. As a long-time EE, and having worked closely
> with hardware engineers, I know they try to avoid division - but I had assumed
> that on modern processors they had thrown sufficient resources at it so that
> integer division would be single cycle. I guess I was wrong.

Yep, it's an expensive operation.

- Robert

Reply all

Reply to author

Forward