Try using "cython -a" to get an annotated HTML version of your file, e.g.
http://sage.math.washington.edu/home/robertwb/cython/get_f_cube.html
You can see that your inner loop is very yellow. Something like
http://sage.math.washington.edu/home/robertwb/cython/get_f_cube2.html
will be much better. (As an aside, we really should optimize
complex(...)). Note that np.complex64_t is actually "float complex,"
i.e. a pair of 32-bit floats.
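To illustrate the layout point above, here is a minimal NumPy sketch. The `pairs` array is made-up illustration data, not anything from the original code; it only shows that a complex64 value occupies the same 8 bytes as two float32 values.

```python
import numpy as np

# complex64 stores its real and imaginary parts as two 32-bit floats.
assert np.dtype(np.complex64).itemsize == 2 * np.dtype(np.float32).itemsize

# Hypothetical data: interleaved (real, imag) pairs stored as float32.
pairs = np.array([[1.0, 2.0], [3.0, -4.0]], dtype=np.float32)

# Reinterpret each (real, imag) pair as one complex64 value, without copying.
z = pairs.view(np.complex64).reshape(-1)
```

Viewing the buffer this way avoids building a Python complex object per element, which is the expensive step in the original inner loop.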
Where are you getting your data from? You need to re-encode your data
into your machine's native format first; I bet there's a NumPy command
for that.
- Robert
You were hoping to save 99 milliseconds?
Sturla
990 of course, still a lot of milliseconds!
With a modern processor, that's a couple of billion CPU cycles. Astonishing.
Almost enough time to swallow a sip of tea.
Sturla
The darker the yellow, the more calls to the Python-C API. It's a very
imprecise metric, but it can help you spot where you may be spending
time or where variables are insufficiently typed, especially if a line
that is, e.g., all arithmetic in an inner loop is yellow. In this case
you were creating Python objects for the two doubles, making a Python
function call to create an np.complex object, then unpacking that
complex object into a C complex object. (You can click on the line in
question to see the corresponding C code.)
Of course, sometimes lots of yellow is just fine (e.g. in non-critical
parts of your code like the Python-C boundaries). It just means Cython
is doing a lot of work for you :).
No idea why it's this byte order, but I bet a call to newbyteorder is
free if it's already correct, so you might as well throw it in there
all the time.
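A minimal sketch of such a conversion, assuming only NumPy. The helper name `to_native` is hypothetical, and the byte-swapped test array stands in for data read from a file. One likely explanation for the byte order: the FITS standard stores data big-endian, so arrays read on a little-endian machine would come back non-native.

```python
import numpy as np

def to_native(arr):
    """Return arr in the machine's native byte order (hypothetical helper)."""
    if arr.dtype.isnative:
        # Already native: hand the same array back, no copy made.
        return arr
    # "=" requests native byte order; astype converts the underlying bytes.
    return arr.astype(arr.dtype.newbyteorder("="))

# Build a deliberately non-native array ("S" swaps the byte order),
# standing in for data loaded from a file.
swapped_dtype = np.dtype(np.float64).newbyteorder("S")
data = np.arange(6, dtype=np.float64).astype(swapped_dtype)

native = to_native(data)
```

Because the check is cheap and the already-native case returns the input unchanged, it costs essentially nothing to call this before every typed Cython function.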
> I get my data from a FITS file using pyfits. The data had been
> written into said FITS file using my own code that produces a NumPy
> hypercube and sticks it into a FITS file, again using pyfits. I would
> naively assume that the data should therefore be in my machine's
> native format already. Is this possibly a problem with the way that I
> compiled my .pyx file?
No, this wouldn't impact it, it's probably something in pyfits (and
possibly configurable).
> Regarding the optimization... I now get a ~100X improvement over pure
> python (0.01 sec vs 1 sec to process a 100^3 array). Just what I was
> hoping for.
Excellent.
- Robert
I expected the modified version to be exactly the same speed as NumPy,
but surprisingly NumPy vectorized operations are significantly slower
than the naive Cython loops here:
In [4]: a = np.arange(7200000.0).reshape(3,2,3,400000)
In [5]: timeit f(a)
10 loops, best of 3: 26.8 ms per loop
In [6]: timeit get_f_cube(a) # v2
100 loops, best of 3: 18.6 ms per loop
- Robert
> 990 of course, still a lot of milliseconds!
> With a modern processor, that's a couple of billion CPU cycles. Astonishing.
> Almost enough time to swallow a sip of tea.
Your irony is hardly appropriate.
I used Cython to gain a 60x speed improvement which on small datasets
could very well translate into 1 second or even less, but if you
multiply this by 100 000 optimization steps for each of 1000
concurrently running processes, this has already saved me at least 3
years of number crunching.
--
Sincerely yours,
Yury V. Zaytsev
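The arithmetic behind the "3 years" figure can be checked quickly. This sketch assumes roughly one second saved per optimization step, as the message suggests, and counts cumulative compute time summed over all processes rather than wall-clock time.

```python
# Assumption: about 1 second saved per optimization step.
seconds_saved_per_step = 1.0
steps_per_process = 100_000
processes = 1_000

# Total compute time saved, summed over every process.
total_seconds = seconds_saved_per_step * steps_per_process * processes

# Convert to years using a 365.25-day year.
years = total_seconds / (365.25 * 24 * 3600)
print(f"{years:.2f} years")
```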