[SciPy-User] scipy.stats.nanmedian

67 views
Skip to first unread message

Keith Goodman

unread,
Jan 21, 2010, 8:54:28 PM1/21/10
to SciPy Users List
I noticed a couple of issues with nanmedian in scipy.stats:

>> from scipy.stats import nanmedian
>> nanmedian(1)
---------------------------------------------------------------------------
ValueError: axis must be less than arr.ndim; axis=0, rank=0.
>> nanmedian(True)
---------------------------------------------------------------------------
ValueError: axis must be less than arr.ndim; axis=0, rank=0.
>> nanmedian(np.array(1))
---------------------------------------------------------------------------
ValueError: axis must be less than arr.ndim; axis=0, rank=0.
>> nanmedian(np.array([1, 2, 3]))
array(2.0)

Changing the function from the original:

def nanmedian(x, axis=0):
x, axis = _chk_asarray(x,axis)
x = x.copy()
return np.apply_along_axis(_nanmedian,axis,x)

to this (I know, it is not pretty):

def nanmedian(x, axis=0):
if np.isscalar(x):
return float(x)
x, axis = _chk_asarray(x, axis)
if x.ndim == 0:
return float(x.tolist())
x = x.copy()
x = np.apply_along_axis(_nanmedian, axis, x)
if x.ndim == 0:
x = float(x.tolist())
return x

gives the expected results:

>> nanmedian(1)
1.0
>> nanmedian(True)
1.0
>> nanmedian(np.array(1))
1.0
>> nanmedian(np.array([1, 2, 3]))
2.0

which agree with numpy:

>> np.median(1)
1.0
>> np.median(True)
1.0
>> np.median(np.array(1))
1.0
>> np.median(np.array([1, 2, 3]))
2.0

I'm keeping a local copy of the changes I made for my own package. But
it would be nice (for me) if this was fixed upstream. Are the changes
above good enough for scipy?

(Another difference from np.median that I noticed is that the default
axis for np.median is None and for scipy.stats.nanmean it is 0. But
perhaps it is too late to change that.)
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user

josef...@gmail.com

unread,
Jan 21, 2010, 9:15:59 PM1/21/10
to SciPy Users List

Can you open a ticket, so that I don't forget to look at it?

I will need to play with it. There are some things that I don't
understand right away.
What's the difference between isscalar and ndim=0 ?
Why do you have the tolist()

Thanks,
Josef

Keith Goodman

unread,
Jan 21, 2010, 9:28:47 PM1/21/10
to SciPy Users List
On Thu, Jan 21, 2010 at 6:15 PM, <josef...@gmail.com> wrote:
> Can you open a ticket, so that I don't forget to look at it?

Sure.

> I will need to play with it. There are some things that I don't
> understand right away.
> What's the difference between isscalar and ndim=0 ?

A scalar doesn't have a ndim method. But now I see that there is a
ndim function. I'll use that instead.

> Why do you have the tolist()

That's the only was I was able to figure out how to pull 1.0 out of
np.array(1.0). Is there a better way?

>> np.array(1.0).tolist()
1.0

Pierre GM

unread,
Jan 21, 2010, 9:41:37 PM1/21/10
to SciPy Users List
On Jan 21, 2010, at 9:28 PM, Keith Goodman wrote:
> On Thu, Jan 21, 2010 at 6:15 PM, <josef...@gmail.com> wrote:
>> Can you open a ticket, so that I don't forget to look at it?
>
> Sure.
>
>> I will need to play with it. There are some things that I don't
>> understand right away.
>> What's the difference between isscalar and ndim=0 ?
>
> A scalar doesn't have a ndim method. But now I see that there is a
> ndim function. I'll use that instead.
>
>> Why do you have the tolist()
>
> That's the only was I was able to figure out how to pull 1.0 out of
> np.array(1.0). Is there a better way?


.item()

josef...@gmail.com

unread,
Jan 21, 2010, 9:42:14 PM1/21/10
to SciPy Users List
On Thu, Jan 21, 2010 at 9:28 PM, Keith Goodman <kwgo...@gmail.com> wrote:
> On Thu, Jan 21, 2010 at 6:15 PM,  <josef...@gmail.com> wrote:
>> Can you open a ticket, so that I don't forget to look at it?
>
> Sure.
>
>> I will need to play with it. There are some things that I don't
>> understand right away.
>> What's the difference between isscalar and ndim=0 ?
>
> A scalar doesn't have a ndim method. But now I see that there is a
> ndim function. I'll use that instead.
>
>> Why do you have the tolist()
>
> That's the only was I was able to figure out how to pull 1.0 out of
> np.array(1.0). Is there a better way?
>
>>> np.array(1.0).tolist()
>   1.0

>>> np.array(1.0)
array(1.0)
>>> np.array(1.0)[()]
1.0

Josef

Keith Goodman

unread,
Jan 21, 2010, 10:01:17 PM1/21/10
to SciPy Users List
On Thu, Jan 21, 2010 at 6:41 PM, Pierre GM <pgmde...@gmail.com> wrote:
> On Jan 21, 2010, at 9:28 PM, Keith Goodman wrote:
>> That's the only was I was able to figure out how to pull 1.0 out of
>> np.array(1.0). Is there a better way?
>
>
> .item()

Thanks. item() looks better than tolist().

I simplified the function:

def nanmedian(x, axis=0):
x, axis = _chk_asarray(x,axis)

if x.ndim == 0:
return float(x.item())


x = x.copy()
x = np.apply_along_axis(_nanmedian,axis,x)
if x.ndim == 0:

x = float(x.item())
return x

and opened a ticket:

http://projects.scipy.org/scipy/ticket/1098

josef...@gmail.com

unread,
Jan 21, 2010, 11:18:31 PM1/21/10
to SciPy Users List
On Thu, Jan 21, 2010 at 10:01 PM, Keith Goodman <kwgo...@gmail.com> wrote:
> On Thu, Jan 21, 2010 at 6:41 PM, Pierre GM <pgmde...@gmail.com> wrote:
>> On Jan 21, 2010, at 9:28 PM, Keith Goodman wrote:
>>> That's the only was I was able to figure out how to pull 1.0 out of
>>> np.array(1.0). Is there a better way?
>>
>>
>> .item()
>
> Thanks. item() looks better than tolist().
>
> I simplified the function:
>
> def nanmedian(x, axis=0):
>    x, axis = _chk_asarray(x,axis)
>    if x.ndim == 0:
>        return float(x.item())
>    x = x.copy()
>    x = np.apply_along_axis(_nanmedian,axis,x)
>    if x.ndim == 0:
>        x = float(x.item())
>    return x
>
> and opened a ticket:
>
> http://projects.scipy.org/scipy/ticket/1098


How about getting rid of apply_along_axis? see attachment

I don't know whether or how much faster it is, but there is a ticket
that the current version is slow.
No hidden bug or corner case guarantee yet.


Josef

try_nanmedian_0.py

Bruce Southey

unread,
Jan 22, 2010, 10:58:07 AM1/22/10
to scipy...@scipy.org
_______________________________________________ SciPy-User mailing list SciPy...@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-user
Personally, I think using masked arrays is a far better solution than the various nan methods in stats.py. Is the call to _chk_asarray that much different than the call to ma? Both require conversion or checking if the input is a np array.

As stated in the documentation, _nanmedian only works on 1d arrays. So any 'axis' argument without changing the main function is perhaps a hack at best.

Is it possible to adapt Sturla's version?
http://projects.scipy.org/numpy/ticket/1213
I do not know the algorithm to suggest anything but perhaps the select method could be adapted to handle nan.

Bruce


Keith Goodman

unread,
Jan 22, 2010, 11:09:17 AM1/22/10
to SciPy Users List

It is faster. But here is one case it does not handle:

>> nanmedian([1, 2])
array([ 1.5])
>> np.median([1, 2])
1.5

I'm sure it could be fixed. But having to fix it, and the fact that it
is a larger change, decreases the likelihood that it will make it into
the next version of scipy. One option is to make the small bug fix I
suggested (ticket #1098) and add the corresponding unit tests. Then we
can take our time to design a better version of nanmedian.

josef...@gmail.com

unread,
Jan 22, 2010, 11:14:56 AM1/22/10
to SciPy Users List

I like arrays with nans better than masked arrays, and I checked,
np.ma.median also uses apply_along_axis and I wanted to see if I can
get a vectorized version.

>
> As stated in the documentation, _nanmedian only works on 1d arrays. So any
> 'axis' argument without changing the main function is perhaps a hack at
> best.

_nanmedian is an internal function, stats.nanmedian and my rewritten
version are supposed to handle any dimension and any axis.

>
> Is it possible to adapt Sturla's version?
> http://projects.scipy.org/numpy/ticket/1213
> I do not know the algorithm to suggest anything but perhaps the select
> method could be adapted to handle nan.

I guess Sturla's version is a lot better, but not my kind of fish,
it's more for the algorithm and c experts. I will gladly use it once
it or something similar is in numpy.

Josef

>
> Bruce

Keith Goodman

unread,
Jan 22, 2010, 11:17:48 AM1/22/10
to SciPy Users List
On Fri, Jan 22, 2010 at 7:58 AM, Bruce Southey <bsou...@gmail.com> wrote:
> Is it possible to adapt Sturla's version?
> http://projects.scipy.org/numpy/ticket/1213
> I do not know the algorithm to suggest anything but perhaps the select
> method could be adapted to handle nan.

I recently needed to calculate the median in an inner loop. It would
have been nice to have a median function that doesn't do a full sort.
I wanted to compile Sturla's version, but I didn't even know which of
the attachments to download. I've never compiled a cython function.

josef...@gmail.com

unread,
Jan 22, 2010, 11:46:10 AM1/22/10
to SciPy Users List

I didn't see the difference to np.median for this case, I think I was
taking the shape answer from the other thread on the return of splines
and interpolation.

If I change the last 3 lines to
if nanmed.size == 1:
return nanmed.item()
return nanmed

then I get agreement with numpy for the following test cases

print nanmedian(1), np.median(1)
print nanmedian(np.array(1)), np.median(1)
print nanmedian(np.array([1])), np.median(np.array([1]))
print nanmedian(np.array([[1]])), np.median(np.array([[1]]))
print nanmedian(np.array([1,2])), np.median(np.array([1,2]))
print nanmedian(np.array([[1,2]])), np.median(np.array([[1,2]]),axis=0)
print nanmedian([1]), np.median([1])
print nanmedian([[1]]), np.median([[1]])
print nanmedian([1,2]), np.median([1,2])
print nanmedian([[1,2]]), np.median([[1,2]],axis=0)
print nanmedian([1j,2]), np.median([1j,2])

Am I still missing any cases?

The vectorized version should be faster for this case
http://projects.scipy.org/scipy/ticket/740
but maybe not for long and narrow arrays.

Josef

Keith Goodman

unread,
Jan 22, 2010, 11:52:50 AM1/22/10
to SciPy Users List

Here is an odd one:

>> nanmedian(True)
1.0
>> nanmedian([True])
0.5 # <--- strange

>> np.median(True)
1.0
>> np.median([True])
1.0

Keith Goodman

unread,
Jan 22, 2010, 11:58:21 AM1/22/10
to SciPy Users List

Another one:

>> x = np.random.randn(3,4,5)
>> nanmedian(x)
ValueError: shape mismatch: objects cannot be broadcast to a single shape

If anything we should add a full set of unit tests for nanmedian. One
reason why the current unit tests did not catch the problem I ran into
is that

>> np.array(2.0) == 2.0
True

So nanmedian was returning np.array(2.0) and np.median was returning
2.0 which when compared passed the unit test.

josef...@gmail.com

unread,
Jan 22, 2010, 12:03:26 PM1/22/10
to SciPy Users List

definitely weird

>>> (np.array(True)+np.array(True))/2.
0.5
>>> np.array([True, True]).sum()
2
>>> np.array([True, True]).mean()
1.0

I assumed mean (is used by np.ma.median) is the same as adding and dividing by 2

Josef

josef...@gmail.com

unread,
Jan 22, 2010, 4:08:23 PM1/22/10
to SciPy Users List

It got a bit ugly, too many shapes for a single number, and tricky
axis handling.
Your idea of making just the smallish changes looks more attractive now.

and almost correct

>>> np.median(np.array([[[1]]]),axis=0).shape
(1, 1)
>>> nanmedian(np.array([[[1]]]),axis=0).shape
(1, 1)
>>> np.median(np.array([[[1]]]),axis=-1).shape
(1, 1)
>>> nanmedian(np.array([[[1]]]),axis=-1).shape
(1, 1)

but there slipped a python in

>>> nanmedian(np.array([[[1]]]),axis=None).__class__
<type 'float'>
>>> np.median(np.array([[[1]]]),axis=None).__class__
<type 'numpy.float64'>


.item() returns a python number not a numpy number

>>> np.array([[[1]]]).item().__class__
<type 'int'>
>>> np.array([[[1]]]).flat[0].__class__
<type 'numpy.int32'>


I also didn't know this:

>>> None < 0
True

Josef

try_nanmedian_1.py

Keith Goodman

unread,
Jan 22, 2010, 4:44:56 PM1/22/10
to SciPy Users List
On Fri, Jan 22, 2010 at 1:08 PM, <josef...@gmail.com> wrote:
> .item() returns a python number not a numpy number
>
>>>> np.array([[[1]]]).item().__class__
> <type 'int'>
>>>> np.array([[[1]]]).flat[0].__class__
> <type 'numpy.int32'>

Good catch. Thanks. I'll update my local copy of nanmedian.

Keith Goodman

unread,
Jan 22, 2010, 4:52:04 PM1/22/10
to SciPy Users List
On Fri, Jan 22, 2010 at 1:44 PM, Keith Goodman <kwgo...@gmail.com> wrote:
> On Fri, Jan 22, 2010 at 1:08 PM,  <josef...@gmail.com> wrote:
>> .item() returns a python number not a numpy number
>>
>>>>> np.array([[[1]]]).item().__class__
>> <type 'int'>
>>>>> np.array([[[1]]]).flat[0].__class__
>> <type 'numpy.int32'>
>
> Good catch. Thanks. I'll update my local copy of nanmedian.

Looks like we don't even need .tolist(), .item(), or [()]. This should
do the trick:

>> np.float64(np.array(1))
1.0
>> type(np.float64(np.array(1)))
<type 'numpy.float64'>

And what if the input is float32? Well, numpy turns that into float64
so nothing to worry about:

>> x = np.array(1, dtype=np.float32)
>> m = np.median(x)
>> type(m)
<type 'numpy.float64'>

josef...@gmail.com

unread,
Jan 22, 2010, 4:55:03 PM1/22/10
to SciPy Users List
On Fri, Jan 22, 2010 at 4:52 PM, Keith Goodman <kwgo...@gmail.com> wrote:
> On Fri, Jan 22, 2010 at 1:44 PM, Keith Goodman <kwgo...@gmail.com> wrote:
>> On Fri, Jan 22, 2010 at 1:08 PM,  <josef...@gmail.com> wrote:
>>> .item() returns a python number not a numpy number
>>>
>>>>>> np.array([[[1]]]).item().__class__
>>> <type 'int'>
>>>>>> np.array([[[1]]]).flat[0].__class__
>>> <type 'numpy.int32'>
>>
>> Good catch. Thanks. I'll update my local copy of nanmedian.
>
> Looks like we don't even need .tolist(), .item(), or [()]. This should
> do the trick:
>
>>> np.float64(np.array(1))
>   1.0
>>> type(np.float64(np.array(1)))
>   <type 'numpy.float64'>
>
> And what if the input is float32? Well, numpy turns that into float64
> so nothing to worry about:
>
>>> x = np.array(1, dtype=np.float32)
>>> m = np.median(x)
>>> type(m)
>   <type 'numpy.float64'>

but it breaks complex
>>> np.float64(np.array(1.j))
0.0

I did one more shape correction in my version

Josef

Keith Goodman

unread,
Jan 22, 2010, 5:19:10 PM1/22/10
to SciPy Users List

Crap. So many corner cases.

After collecting all the input cases (all without NaNs) we could
extend your automated test by adding an outer loop over [nanmean,
nanmedian, nanstd] and make sure it gives the same results as the
numpy versions. It might be good to check for dtype too since 1 ==
1.0.

Reply all
Reply to author
Forward
0 new messages