Yet another Numba optimization question


Juan Nunez-Iglesias

Apr 7, 2017, 3:34:20 AM
to numba...@continuum.io
Hi all,

I’ve been trying to implement an n-dimensional image moments calculation with Numba. Currently in scikit-image, image moments are computed only in 2D, as follows:


cpdef moments_central(image_t image, double cr, double cc, Py_ssize_t order):
    cdef Py_ssize_t p, q, r, c
    cdef double[:, ::1] mu = np.zeros((order + 1, order + 1), dtype=np.double)
    cdef double val, dr, dc, dcp, drq
    for r in range(image.shape[0]):
        dr = r - cr
        for c in range(image.shape[1]):
            dc = c - cc
            val = image[r, c]
            dcp = 1
            for p in range(order + 1):
                drq = 1
                for q in range(order + 1):
                    mu[p, q] += val * drq * dcp
                    drq *= dr
                dcp *= dc
    return np.asarray(mu)

My Numba implementation is:


import numpy as np
import numba

def moments_central_nb(image, center=None, order=3):
    mu = np.zeros((order + 1,) * image.ndim)
    if center is None:
        center = np.mean(np.column_stack(np.nonzero(image)), axis=0)
    _moments_central_nb_loop(image, center, mu)
    return mu


@numba.njit
def _moments_central_nb_loop(image, center, mu):
    for coord, val in np.ndenumerate(image):
        if val > 0:
            for powers in np.ndindex(mu.shape):
                curval = val
                # multiply in (coordinate - center) ** power for each axis
                for c, ctr, p in zip(coord, center, powers):
                    dc = c - ctr
                    cr = 1
                    for pcount in range(p):
                        cr *= dc
                    curval *= cr
                mu[powers] += curval

If you want test code:

import numpy as np
from skimage import data
from skimage.measure import moments_central

horse = data.horse().astype(float)
center = np.mean(np.column_stack(np.nonzero(horse)), axis=0)
_ = moments_central_nb(horse, center=center)  # warm up the JIT
%timeit moments_central_nb(horse, center=center, order=3)
%timeit moments_central(horse, *center, 3)

The problem is that Numba is currently about 10x slower than Cython on this image. I see that Cython avoids a few computations, but not 10x as many computations!

As always, suggestions on obvious optimizations are welcome, as well as on the optimization process.

Thanks,

Juan.

Stuart Berg

Apr 8, 2017, 8:32:22 AM
to numba...@continuum.io
Hi Juan,

I don't know how to get a 10x speedup on this function, but I tinkered
with it for a bit, so I'll report my findings.

- A pure numpy solution is ~3.8x faster than your posted numba code,
at least for the small example image you suggested. (But it is much
less straightforward to read.)

- If I eliminate zip() from your numba solution, I see a ~5.2x speedup.

- I tried computing 'cr' with np.power() instead of an explicit loop,
but that made it ~3x SLOWER.

- Just out of curiosity, I tried inverting the nesting order of the
outer two loops. As expected, that slowed things down, though only slightly.

So, the best optimization I could find here is to avoid zip(), but
that still doesn't achieve cython speeds.

Best,
Stuart

PS -- For reference, my (barely-tested) numpy implementation is below:

import numpy as np

def moments_central_numpy(image, center=None, order=3):
    mu = np.zeros((order + 1,) * image.ndim)
    if center is None:
        center = np.mean(np.column_stack(np.nonzero(image)), axis=0)

    # pre-compute centralized coordinates along each axis
    offsets = []
    for d in range(image.ndim):
        axis_offset = np.arange(image.shape[d]) - center[d]
        axis_offset.shape = (1,)*d + (image.shape[d],) + (1,)*(image.ndim-d-1)
        offsets.append(axis_offset)

    # pre-compute (c - ctr)^p for each axis and each p
    offset_powers = {}
    for d in range(image.ndim):
        for p in range(mu.shape[d]):
            offset_powers[(d, p)] = np.power(offsets[d], p)

    prods = np.zeros_like(image)
    for powers in np.ndindex(mu.shape):
        prods[:] = image
        for d, p in enumerate(powers):
            prods[:] *= offset_powers[(d, p)]
        mu[powers] = prods.sum()

    return mu
> --
> You received this message because you are subscribed to the Google Groups
> "Numba Public Discussion - Public" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to numba-users...@continuum.io.
> To post to this group, send email to numba...@continuum.io.
> To view this discussion on the web visit
> https://groups.google.com/a/continuum.io/d/msgid/numba-users/d6acd47e-6406-4112-8ad1-c8f380f154d2%40Spark.
> For more options, visit https://groups.google.com/a/continuum.io/d/optout.

Juan Nunez-Iglesias

Apr 9, 2017, 9:31:22 PM
to numba...@continuum.io
Hi Stuart!

Thanks for the tinkering! Actually I posted a numpy-focused version of the question to the scikit-image list, and Jaime Fernandez del Rio posted an incredible response! See it here:

It's actually faster than the original Cython code in some cases. But, as you say, quite hard to read. So I would still be interested in a Numba version.

Regarding your attempts, when you say "eliminate zip", do you mean replacing it with "for i in range(n_items): c, ctr, p = coord[i], center[i], powers[i]"? That seems like a pretty obvious optimization for the compiler to make? Although I guess with zip it needs to worry about unequal-length lists...

Finally, I'll keep asking this question periodically on the list: any tips on optimizing Numba performance in general? For example, the effect of removing zip is surprising to me. Something akin to Cython's html output would be extremely handy! But I don't understand JIT internals well enough to know whether that would even be possible.

Thanks everyone!

Juan.

Stanley Seibert

Apr 10, 2017, 9:35:44 AM
to Numba Public Discussion - Public
One thing to keep in mind is that Numba's internal optimizations are fairly limited. Most of what Numba does is translate a type-annotated version of your function directly to LLVM's internal representation and then let the LLVM optimizers do all the work. This can produce some pretty amazing results (Antoine discovered that a for loop computing an arithmetic series is converted to the closed-form solution!), but also some non-obvious performance bugs when LLVM's optimization passes are confused by the IR that Numba emits. We periodically need to tweak the format of the LLVM IR we generate to better match the patterns that LLVM's optimizers look for, usually in non-obvious ways. It is quite likely that the way we translate the zip() function is impeding the optimizer.
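To illustrate the arithmetic-series point, a loop like the one below is a case where LLVM can eliminate the loop entirely (a sketch with a made-up function name; the fallback lets it run without Numba):

```python
try:
    from numba import njit
except ImportError:  # run as plain Python if Numba is unavailable
    njit = lambda f: f

@njit
def arith_series(n):
    # LLVM's optimizer can recognize this induction pattern and replace
    # the whole loop with the closed-form expression n * (n - 1) // 2
    total = 0
    for i in range(n):
        total += i
    return total
```

When that transformation fires, the jitted function runs in constant time regardless of n.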

If either you or Stuart can open an issue with a small example that shows the zip() performance discrepancy, we can take a look to see what is happening.



Stuart Berg

Apr 10, 2017, 1:02:51 PM
to numba...@continuum.io
>when you say "eliminate zip", do you mean replacing it with "for i in range(n_items): c, ctr, p = coord[i], center[i], powers[i]"?

Yes, exactly.

>If either you or Stuart can open an issue with a small example that shows the zip() performance discrepancy, we can take a look to see what is happening.

OK, I tried to come up with a simpler example function that exhibits
the performance penalty of using zip() instead of range(), but was
unsuccessful. So I guess the problem isn't merely zip() itself, but some
subtle characteristic of the function that prevents easy optimization.
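For reference, the kind of minimal comparison I tried looks roughly like this (a sketch with hypothetical names, not my exact test; the fallback lets it run without Numba). Both variants compute the same inner product, and in isolation I couldn't measure a meaningful gap between them:

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # plain-Python fallback if Numba is absent
    njit = lambda f: f

@njit
def dot_zip(a, b):
    # inner product written with zip()
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

@njit
def dot_range(a, b):
    # the same inner product via explicit indexing
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total
```

Timing each with %timeit on a large array is the comparison; as I say, the simple case shows no penalty, so the effect must depend on the surrounding loop structure.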

Juan Nunez-Iglesias

Apr 11, 2017, 3:45:33 AM
to numba...@continuum.io
Hi Stuart, Stanley,

I've raised the issue here:

Reading that Stuart couldn't reproduce it with a more minimal example, I realised that the loop in question is over the dimensions of the image, i.e. we are zipping together sequences of length 2. It's easy to imagine some small overhead of zip over range that only becomes significant at such tiny argument sizes. If so, feel free to close the issue.