>> a = np.arange(10).astype('float64')
>> a[3] = np.nan
>> roly.doubleheap3.move_median(a, 5)
array([ nan, nan, nan, nan, 2., 4., 5., 6., 6., 7.])
>> roly.doubleheap2.move_median(a, 5)
array([ nan, nan, nan, nan, 1., 2., 4., 5., 6., 7.])
>> roly.doubleheap.move_median(a, 5)
array([ nan, nan, nan, nan, 2., 5., 6., 7., 7., 7.])
David
Let's say the window currently contains no NaNs and the next value is
a NaN. Then the old value leaves (if it is not a NaN which it isn't in
our example) but no new value enters. The heap that contained the old
exiting value would shrink by one. So then there'd have to be some
sort of rebalancing or else one heap could end up with, say, 1 element
and the other 10. In that case we'd no longer be finding the median.
If the window contains all NaN, then the median would be NaN.
Sounds difficult.
David
That would be very nice.
BTW, I'll start to take a look into porting DH3 to bottleneck. You
don't expect the public functions to change signatures do you?
A separate issue to move_nanmedian is what the behavior of move_median
should be with NaNs. I think the ideal behavior would be to return NaN
if there are any NaNs in the window. That would mean keeping count of
NaNs in mm_update and mm_insert_init and returning NaN in
mm_get_median if the count is greater than zero.
I might see how much that slows things down. Or I might just go with a
warning in the docstring so I can get bottleneck 0.5 released.
On a different sub-topic
U have rankdata in bn
Have u considered percentileofscore and scoreofpercentile ? Scipy implementations don't consider nans and am sure r much slower than a cython impl (and also missing the hint that an array is already sorted)
- eg if I repeatedly need to call percentileofscore would be nice to have an option (is_sorted = False), which could avoid the resorting
Just a thought
Jeff
I can be reached on my cell 917-971-6387
> Have u considered percentileofscore and scoreofpercentile ? Scipy implementations don't consider nans and am sure r
> much slower than a cython impl (and also missing the hint that an array is already sorted)
> - eg if I repeatedly need to call percentileofscore would be nice to have an option (is_sorted = False), which could avoid
> the resorting
It would be fast for when you only input one value because then I
could do a partial sort instead of a full sort. However if you want to
find a lot of percentiles, like 20, 40, 60, 80, then it is probably
faster to do a full sort of the input array once. That makes the
implementation more complex.
Another difficultly is interpolation since the value is likely to fall
between two values.
Numpy has np.percentile which takes multiple inputs and only sorts
once but does not handle NaNs. Bottleneck has partsort which is a fast
way to find the Nth smallest element of an array but also does not
handle NaNs.
In short, I think scoreatpercentile and percentileofscore (and perhaps
nanscoreatpercentile and nanpercentileofscore) would make good
additions to bottleneck. I'd be willing to help anyone who would like
to implement but I won't have a chance to do it myself for a while.