Trying to speed up numpy array loop, getting a ValueError

312 views
Skip to first unread message

tastygroove

unread,
Jun 3, 2011, 5:43:22 AM6/3/11
to cython-users
Hello,

I am new to Cython, so I apologize in advance if I am doing something
dumb.

I am writing some imaging software, right now in pure Python, that
includes several loops over (large) numpy arrays. These operations are
painfully slow in pure Python, so I thought that I would use Cython to
speed things up. Some of the functions are stupid simple, so this
should be a breeze. No luck there so far. Here is my problem.

I have a simple function that takes two real values from a 4D array
and stores them as real and imaginary parts of a complex valued 3D
array. I just have to loop over the input array and store the values
in a new complex valued array. Simple. I have re-written the
function in Cython, following the "Working with NumPy" tutorial.

First I simply typedefed all of the variables and that produced a ~30%
speed increase on the benchmarks I had done. Not good enough. So I
continued to further improve using "Efficient indexing" and this is
where I ran into trouble. I have attached my Cython code at the
bottom.

I am using Cython 1.3 that is in the Ubuntu 11.04 repositories. I
compile using the following commands:

> cython get_f_cube.pyx
> gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I/usr/include/python2.7 -o get_f_cube.so get_f_cube.c

I get the following warnings during compilation:

/usr/include/python2.7/numpy/__multiarray_api.h:1187:1: warning:
‘_import_array’ defined but not used
/usr/include/python2.7/numpy/__ufunc_api.h:196:1: warning:
‘_import_umath’ defined but not used

When I import into python and try to use the function, I get a
ValueError:

In [1]: import get_f_cube as gfc
...
In [6]: test = gfc.get_f_cube(model)
---------------------------------------------------------------------------
ValueError Traceback (most recent call
last)

/home/mrbell/Work/code/mockobs/<ipython console> in <module>()

/home/mrbell/Work/code/mockobs/get_f_cube.so in get_f_cube.get_f_cube
(get_f_cube.c:796)()
9 @cython.wraparound(False)
10
---> 11 def get_f_cube(np.ndarray[fdtype_t, ndim=4] model):
12 """ The model has stokes I,Q,U, but I want complex P"""
13

ValueError: Big-endian buffer not supported on little-endian compiler


I am stumped. Does anyone have any ideas? Thanks! (See code below)


________________________________________

import numpy as np
cimport numpy as np
cimport cython

ctypedef np.complex64_t cdtype_t
ctypedef np.float64_t fdtype_t

@cython.boundscheck(False)
@cython.wraparound(False)

def get_f_cube(np.ndarray[fdtype_t, ndim=4] model):

if model is None:
raise ValueError("Input array can not be None")

cdef int nphi = model.shape[1]
cdef int ndec = model.shape[2]
cdef int nra = model.shape[3]
cdef np.ndarray[cdtype_t, ndim=3] fcube \
= np.zeros([nphi, ndec, nra], dtype=np.complex64)
cdef Py_ssize_t i,j,k

for i in range(nphi):
for j in range(ndec):
for k in range(nra):
fcube[i,j,k] = np.complex(model[1,i,j,k],
model[2,i,j,k])

return fcube
____________________________________

Robert Bradshaw

unread,
Jun 3, 2011, 6:39:07 AM6/3/11
to cython...@googlegroups.com
On Fri, Jun 3, 2011 at 2:43 AM, tastygroove <bel...@gmail.com> wrote:
> Hello,
>
> I am new to Cython, so I apologize in advance if I am doing something
> dumb.
>
> I am writing some imaging software, right now in pure Python, that
> includes several loops over (large) numpy arrays. These operations are
> painfully slow in pure Python, so I thought that I would use Cython to
> speed things up.  Some of the functions are stupid simple, so this
> should be a breeze.  No luck there so far.  Here is my problem.
>
> I have a simple function that takes two real values from a 4D array
> and stores them as real and imaginary parts of a complex valued 3D
> array.  I just have to loop over the input array and store the values
> in a new complex valued array.  Simple.  I have re-written the
> function in Cython, following the "Working with NumPy" tutorial.
>
> First I simply typedefed all of the variables and that produced a ~30%
> speed increase on the benchmarks I had done.  Not good enough.  So I
> continued to further improve using "Efficient indexing" and this is
> where I ran into trouble.  I have attached my Cython code at the
> bottom.

Try using "cython -a" to get an annotated html of your file, e.g.
http://sage.math.washington.edu/home/robertwb/cython/get_f_cube.html
You can see that your inner loop is very yellow. Something like
http://sage.math.washington.edu/home/robertwb/cython/get_f_cube2.html
will be much better. (As an aside, we really should optimize
complex(...)). Note that np.complex64_t is actually "float complex,"
i.e. a pair of 32-bit floats.

Where are you getting your data from? You need to re-encode your data
into your machine's native format first, I bet there's a numpy command
for that.

- Robert

tastygroove

unread,
Jun 3, 2011, 10:29:56 AM6/3/11
to cython-users
On Jun 3, 12:39 pm, Robert Bradshaw <rober...@math.washington.edu>
wrote:
> On Fri, Jun 3, 2011 at 2:43 AM, tastygroove <bel...@gmail.com> wrote:
> > Hello,
>
> > I am new to Cython, so I apologize in advance if I am doing something
> > dumb.
>
> > I am writing some imaging software, right now in pure Python, that
> > includes several loops over (large) numpy arrays. These operations are
> > painfully slow in pure Python, so I thought that I would use Cython to
> > speed things up.  Some of the functions are stupid simple, so this
> > should be a breeze.  No luck there so far.  Here is my problem.
>
> > I have a simple function that takes two real values from a 4D array
> > and stores them as real and imaginary parts of a complex valued 3D
> > array.  I just have to loop over the input array and store the values
> > in a new complex valued array.  Simple.  I have re-written the
> > function in Cython, following the "Working with NumPy" tutorial.
>
> > First I simply typedefed all of the variables and that produced a ~30%
> > speed increase on the benchmarks I had done.  Not good enough.  So I
> > continued to further improve using "Efficient indexing" and this is
> > where I ran into trouble.  I have attached my Cython code at the
> > bottom.
>
> Try using "cython -a" to get an annotated html of your file, e.g.http://sage.math.washington.edu/home/robertwb/cython/get_f_cube.html
> You can see that your inner loop is very yellow. Something likehttp://sage.math.washington.edu/home/robertwb/cython/get_f_cube2.html
> will be much better. (As an aside, we really should optimize
> complex(...)). Note that np.complex64_t is actually "float complex,"
> i.e. a pair of 32-bit floats.
>

Thanks for the tip! It appears that the yellow indicates that the
conversion to c required a lot of code, yes? I suppose that would
somehow mean that the yellow lines are inefficiently implemented? Is
there anything more subtle going on here? Sorry for the noob
questions, but I'm a noob.
Success! There is, in fact, a numpy command for that. If I first use
the numpy array method newbyteorder() before passing the data to the
function it works. So it appears that the data was in the wrong byte
order although I don't know why. I need to get to the bottom of why
the byte order is wrong though, because it will be a real pain to have
to check all the time whether my data is Big or Little endian.

I get my data from a FITS file using pyfits. The data had been
written into said FITS file using my own code that produces a NumPy
hypercube and sticks it into a FITS file, again using pyfits. I would
naively assume that the data should therefore be in my machine's
native format already. Is this possibly a problem with the way that I
compiled by .pyx file?

Regarding the optimization... I now get a ~100X improvement over pure
python (0.01 sec vs 1 sec to process a 100^3 array). Just what I was
hoping for.

Sturla Molden

unread,
Jun 3, 2011, 10:40:28 AM6/3/11
to cython...@googlegroups.com
Den 03.06.2011 16:29, skrev tastygroove:
>
> Regarding the optimization... I now get a ~100X improvement over pure
> python (0.01 sec vs 1 sec to process a 100^3 array). Just what I was
> hoping for.
>

You were hoping for saving 99 milliseconds?

Sturla

Sturla Molden

unread,
Jun 3, 2011, 10:43:34 AM6/3/11
to cython...@googlegroups.com
Den 03.06.2011 16:40, skrev Sturla Molden:
>
> You were hoping for saving 99 milliseconds?

990 of course, still a lot of milliseconds!

With a modern processor, that's a couple of billion CPU cycles. Astonishing.

Almost enough time to swallow a sip of tea.


Sturla

tastygroove

unread,
Jun 3, 2011, 10:57:53 AM6/3/11
to cython-users
No, I wasn't hoping to save 99 ms, but to speed up my code by a factor
of 100. The 100^3 array was a small, test data set. The arrays that
I will be working with in reality are quite a bit bigger than this.
This kind of improvement will save me serious time when processing a
1000^3 array for example.

Mike

Robert Bradshaw

unread,
Jun 3, 2011, 3:50:08 PM6/3/11
to cython...@googlegroups.com

The darker the yellow, the more calls to the Python-C API. It's a very
imprecise metric, but can be helpful for spotting where you may be
spending some time or insufficiently-typing stuff, especially if a
line of, e.g. all arithmetic in an inner loop is yellow. In this case
you were creating Python objects for the two doubles, making a Python
function call to create an np.complex object, then unpacking that
complex object into a C complex object. (You can click on the line in
question to see the corresponding C code.)

Of course, sometimes lots of yellow is just fine (e.g. in non-critical
parts of your code like the Python-C boundaries. It just means Cython
is doing a lot of work for you :).

No idea why it's this byte order, but I bet a call to newbyteorder is
free if it's already correct, so you might as well throw it in there
all the time.

> I get my data from a FITS file using pyfits.  The data had been
> written into said FITS file using my own code that produces a NumPy
> hypercube and sticks it into a FITS file, again using pyfits.  I would
> naively assume that the data should therefore be in my machine's
> native format already.  Is this possibly a problem with the way that I
> compiled by .pyx file?

No, this wouldn't impact it, it's probably something in pyfits (and
possibly configurable).

> Regarding the optimization... I now get a ~100X improvement over pure
> python (0.01 sec vs 1 sec to process a 100^3 array).  Just what I was
> hoping for.

Excellent.

- Robert

Jon Olav Vik

unread,
Jun 4, 2011, 3:22:28 AM6/4/11
to cython-users
On Jun 3, 11:43 am, tastygroove <bel...@gmail.com> wrote:
> I am new to Cython, so I apologize in advance if I am doing something
> dumb.

Conversely, my apologies if I completely misunderstood your problem
8-)

> I am writing some imaging software, right now in pure Python, that
> includes several loops over (large) numpy arrays. These operations are
> painfully slow in pure Python, so I thought that I would use Cython to
> speed things up.  Some of the functions are stupid simple, so this
> should be a breeze.  No luck there so far.  Here is my problem.
>
> I have a simple function that takes two real values from a 4D array
> and stores them as real and imaginary parts of a complex valued 3D
> array.  I just have to loop over the input array and store the values
> in a new complex valued array.  Simple.  I have re-written the
> function in Cython, following the "Working with NumPy" tutorial.

If you're really going to be doing simple elementwise operations,
maybe simple Numpy operations will suffice? The following seems
equivalent to your code:

def f(model):
fcube = np.zeros(model.shape[1:], dtype=np.complex64)
fcube.real = model[1]
fcube.imag = model[2]
return fcube

In [11]: arrtest.f(np.arange(72.).reshape(3,2,3,4))
Out[11]:
array([[[ 24.+48.j, 25.+49.j, 26.+50.j, 27.+51.j],
[ 28.+52.j, 29.+53.j, 30.+54.j, 31.+55.j],
[ 32.+56.j, 33.+57.j, 34.+58.j, 35.+59.j]],

[[ 36.+60.j, 37.+61.j, 38.+62.j, 39.+63.j],
[ 40.+64.j, 41.+65.j, 42.+66.j, 43.+67.j],
[ 44.+68.j, 45.+69.j, 46.+70.j, 47.+71.j]]],
dtype=complex64)

In [12]: arrtest.get_f_cube(np.arange(72.).reshape(3,2,3,4))
Out[12]:
array([[[ 24.+48.j, 25.+49.j, 26.+50.j, 27.+51.j],
[ 28.+52.j, 29.+53.j, 30.+54.j, 31.+55.j],
[ 32.+56.j, 33.+57.j, 34.+58.j, 35.+59.j]],

[[ 36.+60.j, 37.+61.j, 38.+62.j, 39.+63.j],
[ 40.+64.j, 41.+65.j, 42.+66.j, 43.+67.j],
[ 44.+68.j, 45.+69.j, 46.+70.j, 47.+71.j]]],
dtype=complex64)

In [21]: a = np.arange(72.0).reshape(3,2,3,4)

In [22]: timeit arrtest.get_f_cube(a)
100000 loops, best of 3: 11.2 us per loop

In [23]: timeit arrtest.f(a)
100000 loops, best of 3: 6.38 us per loop

In [18]: a = np.arange(7200000.0).reshape(3,2,3,400000)

In [19]: timeit arrtest.get_f_cube(a)
1 loops, best of 3: 791 ms per loop

In [20]: timeit arrtest.f(a)
10 loops, best of 3: 31.1 ms per loop


You can see that the relative performance of Numpy is better with
large arrays. This is compiled from your code without any improvements
suggested in other posts.

I may not fully understand what your code is supposed to do -- but you
do realize that Python indexing is zero-based? Your code seems to
ignore the entire model[0].

Hope this helps,
Jon Olav

Robert Bradshaw

unread,
Jun 4, 2011, 3:55:19 AM6/4/11
to cython...@googlegroups.com

I expected the modified version to be exactly the same speed as NumPy,
but surprisingly NumPy vectorized operations are significantly slower
than the naive Cython loops here:

In [4]: a = np.arange(7200000.0).reshape(3,2,3,400000)

In [5]: timeit f(a)
10 loops, best of 3: 26.8 ms per loop

In [6]: timeit get_f_cube(a) # v2
100 loops, best of 3: 18.6 ms per loop

- Robert

Yury V. Zaytsev

unread,
Jun 7, 2011, 3:44:44 AM6/7/11
to cython...@googlegroups.com
On Fri, 2011-06-03 at 16:43 +0200, Sturla Molden wrote:

> 990 of course, still a lot of milliseconds!
> With a modern processor, that's a couple of billion CPU cycles. Astonishing.
> Almost enough time to swallow a sip of tea.

Your irony is hardly appropriate.

I used Cython to gain a 60x speed improvement which on small datasets
could very well translate into 1 second or even less, but if you
multiply this by 100 000 optimization steps per each of 1000
concurrently running processes, this has already saved me at least 3
years of number crunching.

--
Sincerely yours,
Yury V. Zaytsev


Reply all
Reply to author
Forward
0 new messages