Dear list,
While trying to implement a Numba version of "repeated medians regression" (a form of robust linear regression), I noticed some performance issues when using np.median in Numba njitted functions.
I have been trying to pinpoint the exact cause. So far, a minimal example demonstrating at least part of the issue is in this notebook:
https://gist.github.com/RutgerK/57297c37cd6a62349e789bc23c8f04ec

The example is an extreme case in which the median is calculated over an array containing all identical values. Based on my real data, I suspect it also slows down when some, but not all, of the values are identical, although I haven't been able to capture that in a nice reproducible example.
The timings in the notebook were done with some other processes running as well, so they are not very accurate. But I ran them at least ten times and consistently see a performance difference of >150x between random data and identical data. Pure NumPy is not affected, which suggests it should be possible to get similar performance for both cases. Could this have something to do with a difference between the sorting algorithms used by NumPy and Numba?
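For reference, the setup in the notebook is roughly along these lines (a minimal sketch; the actual gist wraps np.median in an @numba.njit function and times both calls with %timeit, which I've left out here):

```python
import numpy as np

n = 100_000
rng = np.random.default_rng(0)

random_data = rng.random(n)        # all values distinct
identical_data = np.full(n, 0.5)   # all values identical

# In the notebook, a Numba-jitted wrapper around np.median is timed
# on both arrays. Pure NumPy handles either input at comparable speed,
# but the jitted version is >150x slower on the identical-value array.
print(np.median(random_data))
print(np.median(identical_data))   # 0.5
```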
I also wrote a median function myself in Numba; it's a lot slower than Numba's np.median on random data, but still much faster than Numba's np.median on identical data.
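For illustration, a hand-written median can be sketched like this (a hypothetical plain-NumPy version, not my actual code; in the Numba benchmark the function would be decorated with @numba.njit):

```python
import numpy as np

def manual_median(a):
    # Selection-based median via np.partition (linear time on average),
    # which avoids fully sorting the array.
    n = a.shape[0]
    mid = n // 2
    if n % 2:  # odd length: single middle element
        return np.partition(a, mid)[mid]
    # even length: average the two middle elements
    part = np.partition(a, [mid - 1, mid])
    return 0.5 * (part[mid - 1] + part[mid])

print(manual_median(np.array([3.0, 1.0, 2.0])))        # 2.0
print(manual_median(np.array([1.0, 2.0, 3.0, 4.0])))   # 2.5
```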
Is there a way to get consistent Numba performance in all cases?
If this is an actual issue, I will file one on GitHub, but I'm not sure whether it's by design or an unwanted regression.
Regards,
Rutger