roly: moving median with NaNs

43 views
Skip to first unread message

Keith Goodman

unread,
Jun 3, 2011, 4:06:47 PM6/3/11
to bottl...@googlegroups.com
I'd like to one day have a move_nanmedian function that can handle
NaNs in the input array. For now I'll just note that all three double
heap implementations of the moving median give different results when
the input contains NaNs:

>> a = np.arange(10).astype('float64')
>> a[3] = np.nan
>> roly.doubleheap3.move_median(a, 5)
array([ nan, nan, nan, nan, 2., 4., 5., 6., 6., 7.])
>> roly.doubleheap2.move_median(a, 5)
array([ nan, nan, nan, nan, 1., 2., 4., 5., 6., 7.])
>> roly.doubleheap.move_median(a, 5)
array([ nan, nan, nan, nan, 2., 5., 6., 7., 7., 7.])

J. David Lee

unread,
Jun 3, 2011, 4:09:17 PM6/3/11
to bottl...@googlegroups.com
What is the appropriate action when there is a nan?

David

Keith Goodman

unread,
Jun 3, 2011, 4:19:40 PM6/3/11
to bottl...@googlegroups.com

Let's say the window currently contains no NaNs and the next value is
a NaN. Then the old value leaves (if it is not a NaN which it isn't in
our example) but no new value enters. The heap that contained the old
exiting value would shrink by one. So then there'd have to be some
sort of rebalancing or else one heap could end up with, say, 1 element
and the other 10. In that case we'd no longer be finding the median.

If the window contains all NaN, then the median would be NaN.

Sounds difficult.

J. David Lee

unread,
Jun 3, 2011, 4:24:53 PM6/3/11
to bottl...@googlegroups.com
Maybe not as difficult as you'd think. The code currently keeps track of
the size of each heap, so it's easy to see which heap to remove a node
from at any time. You could set the value of the nan-node to a min or
max value to cause the node to propagate to the bottom of the correct
heap before you shrunk it.

David

Keith Goodman

unread,
Jun 3, 2011, 4:34:15 PM6/3/11
to bottl...@googlegroups.com

That would be very nice.

BTW, I'll start to take a look into porting DH3 to bottleneck. You
don't expect the public functions to change signatures do you?

Keith Goodman

unread,
Jun 4, 2011, 12:17:15 PM6/4/11
to bottl...@googlegroups.com

A separate issue to move_nanmedian is what the behavior of move_median
should be with NaNs. I think the ideal behavior would be to return NaN
if there are any NaNs in the window. That would mean keeping count of
NaNs in mm_update and mm_insert_init and returning NaN in
mm_get_median if the count is greater than zero.

I might see how much that slows things down. Or I might just go with a
warning in the docstring so I can get bottleneck 0.5 released.

Jeff Reback

unread,
Jun 4, 2011, 1:06:36 PM6/4/11
to bottl...@googlegroups.com
Keith

On a different sub-topic

U have rankdata in bn

Have u considered percentileofscore and scoreofpercentile ? Scipy implementations don't consider nans and am sure r much slower than a cython impl (and also missing the hint that an array is already sorted)
- eg if I repeatedly need to call percentileofscore would be nice to have an option (is_sorted = False), which could avoid the resorting

Just a thought

Jeff

I can be reached on my cell 917-971-6387

Keith Goodman

unread,
Jun 4, 2011, 1:24:50 PM6/4/11
to bottl...@googlegroups.com
On Sat, Jun 4, 2011 at 10:06 AM, Jeff Reback <jeffr...@gmail.com> wrote:

> Have u considered percentileofscore and scoreofpercentile ? Scipy implementations don't consider nans and am sure r
> much slower than a cython impl (and also missing the hint that an array is already sorted)
> - eg if I repeatedly need to call percentileofscore would be nice to have an option (is_sorted = False), which could avoid
> the resorting

It would be fast for when you only input one value because then I
could do a partial sort instead of a full sort. However if you want to
find a lot of percentiles, like 20, 40, 60, 80, then it is probably
faster to do a full sort of the input array once. That makes the
implementation more complex.

Another difficultly is interpolation since the value is likely to fall
between two values.

Numpy has np.percentile which takes multiple inputs and only sorts
once but does not handle NaNs. Bottleneck has partsort which is a fast
way to find the Nth smallest element of an array but also does not
handle NaNs.

In short, I think scoreatpercentile and percentileofscore (and perhaps
nanscoreatpercentile and nanpercentileofscore) would make good
additions to bottleneck. I'd be willing to help anyone who would like
to implement but I won't have a chance to do it myself for a while.

Reply all
Reply to author
Forward
0 new messages