Unreliable median performance with Numba


Rutger Kassies

Jul 12, 2017, 5:39:39 AM
to Numba Public Discussion - Public
Dear list,

While trying to implement a Numba version of "repeated medians regression" (a form of robust linear regression), I noticed some performance issues when using np.median in Numba njitted functions.

I have been trying to pinpoint the exact issue. So far, a minimal example showing at least part of it is demonstrated in this notebook:
https://gist.github.com/RutgerK/57297c37cd6a62349e789bc23c8f04ec

The example is an extreme case where the median is calculated over an array containing all identical values. Based on my real data, I suspect it also slows down when some of the data is identical but not all of it, although I haven't been able to capture this in a nice reproducible example.

The timings in the notebook were done with some other processes running as well, so they are not very accurate. But I ran it at least ten times and consistently get a performance difference of >150x between random data and identical data. Pure NumPy is not affected, which suggests it should be possible to get similar performance for both cases. Does this have something to do with a difference between the sorting algorithms in NumPy and Numba?
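
For reference, a minimal sketch in the spirit of the notebook (simplified, with hypothetical names; the actual timings are in the gist):

import numpy as np
from numba import njit

@njit
def nb_median(a):
    return np.median(a)

random_data = np.random.random(100000)
identical_data = np.ones(100000)

nb_median(random_data)  # call once up front so compilation is not timed

# In the notebook:
# %timeit np.median(random_data)      # pure NumPy: similar for both inputs
# %timeit np.median(identical_data)
# %timeit nb_median(random_data)      # Numba: fast
# %timeit nb_median(identical_data)   # Numba: >150x slower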

I also created a median function myself using Numba; it's a lot slower than Numba's np.median on random data, but still much faster than Numba's on identical data.
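
Roughly along these lines (a simplified sketch of the idea, assuming a 1-D float array, not the exact code):

import numpy as np
from numba import njit

@njit
def my_median(a):
    # Full sort, then take the middle element (or the mean of the two middle ones).
    b = np.sort(a)
    n = b.shape[0]
    mid = n // 2
    if n % 2 == 1:
        return 1.0 * b[mid]
    return 0.5 * (b[mid - 1] + b[mid])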

Is there a way to get consistent Numba performance in all cases?

If this is an actual issue I will file one on GitHub, but I'm not sure whether it's by design or an unwanted regression.



Regards,
Rutger

Peter Zsohar

Jul 12, 2017, 8:43:35 AM
to numba...@continuum.io
Could it be the performance of sorting behind all this?

The documentation clearly gives you a warning: "Sorting may be slightly slower than Numpy’s implementation."


Rutger Kassies

Jul 12, 2017, 8:57:05 AM
to Numba Public Discussion - Public
Hey,

I don't think so. That probably explains why my custom median function is a few times slower than pure NumPy, and I would hardly even have noticed it if that were all there was to it. But Numba's median function (using np.median) is over 150 times slower; calling that 'slightly' would require a very liberal definition. :)

If it were solely due to sorting, my custom median function (in Numba) should suffer as much as Numba's own, since it uses np.sort().

Regards,
Rutger

Antoine Pitrou

Jul 12, 2017, 9:04:12 AM
to numba...@continuum.io
On Wed, 12 Jul 2017 14:43:34 +0200, Peter Zsohar <peter....@gmail.com> wrote:
> Could it be the performance of sorting behind all this?

No, it's clearly a bug in Numba's implementation of np.median().

Numba's quicksort uses a partitioning function that has been carefully
tuned (and measured) to make it well-behaved with uniform arrays:
https://github.com/numba/numba/blob/master/numba/targets/quicksort.py#L85

Numba's median uses a slightly different implementation of that
partitioning, which is probably responsible for the performance
degradation (it probably degrades to O(n**2)):
https://github.com/numba/numba/blob/master/numba/targets/arraymath.py#L450
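
To illustrate in general terms (a simplified sketch, not the actual Numba
code): a partition that sends elements equal to the pivot all to one side
shrinks the search range by only one element per pass on constant input,
while a three-way partition that groups equal elements around the pivot
finishes in a single pass.

import numpy as np

def select_two_way(a, k):
    # Lomuto-style quickselect: elements equal to the pivot end up on one
    # side, so a constant array gives maximally unbalanced splits and a
    # quadratic number of comparisons.
    lo, hi = 0, len(a) - 1
    while lo < hi:
        pivot = a[hi]
        i = lo
        for j in range(lo, hi):
            if a[j] < pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi] = a[hi], a[i]
        if k < i:
            hi = i - 1
        elif k > i:
            lo = i + 1
        else:
            return a[k]
    return a[k]

def select_three_way(a, k):
    # Three-way (Dutch national flag) quickselect: equal elements are
    # grouped around the pivot, so a constant array is handled in one pass.
    lo, hi = 0, len(a) - 1
    while lo < hi:
        pivot = a[(lo + hi) // 2]
        lt, i, gt = lo, lo, hi
        while i <= gt:
            if a[i] < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            elif a[i] > pivot:
                a[i], a[gt] = a[gt], a[i]
                gt -= 1
            else:
                i += 1
        if k < lt:
            hi = lt - 1
        elif k > gt:
            lo = gt + 1
        else:
            return a[k]
    return a[k]

data = np.ones(2000)
print(select_two_way(data.copy(), len(data) // 2))    # quadratic on constant input
print(select_three_way(data.copy(), len(data) // 2))  # single pass on constant input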

An issue should be reported.

Regards

Antoine.


Rutger Kassies

Jul 12, 2017, 11:27:03 AM
to Numba Public Discussion - Public, soli...@pitrou.net
Thanks for your explanation, I created an issue on GitHub:
https://github.com/numba/numba/issues/2458

Regards,
Rutger

Antoine Pitrou

Jul 12, 2017, 12:12:12 PM
to numba...@continuum.io
On Wed, 12 Jul 2017 08:27:02 -0700 (PDT), Rutger Kassies <kas...@gmail.com> wrote:
> Thanks for your explanation, i created an issue at Github:
> https://github.com/numba/numba/issues/2458

And, as an anecdote, this is all my fault :-)

I first wrote the quicksort code, ensuring it does well on
"pathological" input. Later I wrote the median code, even copying over
the quicksort code's comments about pathological input, but I changed
the implementation slightly and did not bother to test whether the
comments still held true... which you discovered they didn't!

Regards

Antoine.

Antoine Pitrou

Jul 12, 2017, 12:21:22 PM
to numba...@continuum.io

One last note:

We do have benchmarks for np.median():
https://github.com/numba/numba-benchmark/blob/master/benchmarks/bench_sorting.py#L99

(including a benchmark with many duplicates which may reasonably
showcase the issue at hand)

But for some reason they don't show up here:
http://numba.pydata.org/numba-benchmark/

:-)

Regards

Antoine.


stuart

Jul 19, 2017, 9:14:22 AM
to Numba Public Discussion - Public, soli...@pitrou.net
This is fixed in: https://github.com/numba/numba/pull/2470

Thanks for reporting.

Antoine, thanks for the notes.