>> from scipy.stats import nanmedian
>> nanmedian(1)
---------------------------------------------------------------------------
ValueError: axis must be less than arr.ndim; axis=0, rank=0.
>> nanmedian(True)
---------------------------------------------------------------------------
ValueError: axis must be less than arr.ndim; axis=0, rank=0.
>> nanmedian(np.array(1))
---------------------------------------------------------------------------
ValueError: axis must be less than arr.ndim; axis=0, rank=0.
>> nanmedian(np.array([1, 2, 3]))
array(2.0)
Changing the function from the original:
def nanmedian(x, axis=0):
x, axis = _chk_asarray(x,axis)
x = x.copy()
return np.apply_along_axis(_nanmedian,axis,x)
to this (I know, it is not pretty):
def nanmedian(x, axis=0):
if np.isscalar(x):
return float(x)
x, axis = _chk_asarray(x, axis)
if x.ndim == 0:
return float(x.tolist())
x = x.copy()
x = np.apply_along_axis(_nanmedian, axis, x)
if x.ndim == 0:
x = float(x.tolist())
return x
gives the expected results:
>> nanmedian(1)
1.0
>> nanmedian(True)
1.0
>> nanmedian(np.array(1))
1.0
>> nanmedian(np.array([1, 2, 3]))
2.0
which agree with numpy:
>> np.median(1)
1.0
>> np.median(True)
1.0
>> np.median(np.array(1))
1.0
>> np.median(np.array([1, 2, 3]))
2.0
I'm keeping a local copy of the changes I made for my own package. But
it would be nice (for me) if this was fixed upstream. Are the changes
above good enough for scipy?
(Another difference from np.median that I noticed is that the default
axis for np.median is None and for scipy.stats.nanmean it is 0. But
perhaps it is too late to change that.)
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user
Can you open a ticket, so that I don't forget to look at it?
I will need to play with it. There are some things that I don't
understand right away.
What's the difference between isscalar and ndim=0 ?
Why do you have the tolist()
Thanks,
Josef
Sure.
> I will need to play with it. There are some things that I don't
> understand right away.
> What's the difference between isscalar and ndim=0 ?
A scalar doesn't have a ndim method. But now I see that there is a
ndim function. I'll use that instead.
> Why do you have the tolist()
That's the only was I was able to figure out how to pull 1.0 out of
np.array(1.0). Is there a better way?
>> np.array(1.0).tolist()
1.0
.item()
>>> np.array(1.0)
array(1.0)
>>> np.array(1.0)[()]
1.0
Josef
Thanks. item() looks better than tolist().
I simplified the function:
def nanmedian(x, axis=0):
x, axis = _chk_asarray(x,axis)
if x.ndim == 0:
return float(x.item())
x = x.copy()
x = np.apply_along_axis(_nanmedian,axis,x)
if x.ndim == 0:
x = float(x.item())
return x
and opened a ticket:
http://projects.scipy.org/scipy/ticket/1098
How about getting rid of apply_along_axis? see attachment
I don't know whether or how much faster it is, but there is a ticket
that the current version is slow.
No hidden bug or corner case guarantee yet.
Josef
Personally, I think using masked arrays is a far better solution than the various nan methods in stats.py. Is the call to _chk_asarray that much different than the call to ma? Both require conversion or checking if the input is a np array._______________________________________________ SciPy-User mailing list SciPy...@scipy.org http://mail.scipy.org/mailman/listinfo/scipy-user
It is faster. But here is one case it does not handle:
>> nanmedian([1, 2])
array([ 1.5])
>> np.median([1, 2])
1.5
I'm sure it could be fixed. But having to fix it, and the fact that it
is a larger change, decreases the likelihood that it will make it into
the next version of scipy. One option is to make the small bug fix I
suggested (ticket #1098) and add the corresponding unit tests. Then we
can take our time to design a better version of nanmedian.
I like arrays with nans better than masked arrays, and I checked,
np.ma.median also uses apply_along_axis and I wanted to see if I can
get a vectorized version.
>
> As stated in the documentation, _nanmedian only works on 1d arrays. So any
> 'axis' argument without changing the main function is perhaps a hack at
> best.
_nanmedian is an internal function, stats.nanmedian and my rewritten
version are supposed to handle any dimension and any axis.
>
> Is it possible to adapt Sturla's version?
> http://projects.scipy.org/numpy/ticket/1213
> I do not know the algorithm to suggest anything but perhaps the select
> method could be adapted to handle nan.
I guess Sturla's version is a lot better, but not my kind of fish,
it's more for the algorithm and c experts. I will gladly use it once
it or something similar is in numpy.
Josef
>
> Bruce
I recently needed to calculate the median in an inner loop. It would
have been nice to have a median function that doesn't do a full sort.
I wanted to compile Sturla's version, but I didn't even know which of
the attachments to download. I've never compiled a cython function.
I didn't see the difference to np.median for this case, I think I was
taking the shape answer from the other thread on the return of splines
and interpolation.
If I change the last 3 lines to
if nanmed.size == 1:
return nanmed.item()
return nanmed
then I get agreement with numpy for the following test cases
print nanmedian(1), np.median(1)
print nanmedian(np.array(1)), np.median(1)
print nanmedian(np.array([1])), np.median(np.array([1]))
print nanmedian(np.array([[1]])), np.median(np.array([[1]]))
print nanmedian(np.array([1,2])), np.median(np.array([1,2]))
print nanmedian(np.array([[1,2]])), np.median(np.array([[1,2]]),axis=0)
print nanmedian([1]), np.median([1])
print nanmedian([[1]]), np.median([[1]])
print nanmedian([1,2]), np.median([1,2])
print nanmedian([[1,2]]), np.median([[1,2]],axis=0)
print nanmedian([1j,2]), np.median([1j,2])
Am I still missing any cases?
The vectorized version should be faster for this case
http://projects.scipy.org/scipy/ticket/740
but maybe not for long and narrow arrays.
Josef
Here is an odd one:
>> nanmedian(True)
1.0
>> nanmedian([True])
0.5 # <--- strange
>> np.median(True)
1.0
>> np.median([True])
1.0
Another one:
>> x = np.random.randn(3,4,5)
>> nanmedian(x)
ValueError: shape mismatch: objects cannot be broadcast to a single shape
If anything we should add a full set of unit tests for nanmedian. One
reason why the current unit tests did not catch the problem I ran into
is that
>> np.array(2.0) == 2.0
True
So nanmedian was returning np.array(2.0) and np.median was returning
2.0 which when compared passed the unit test.
definitely weird
>>> (np.array(True)+np.array(True))/2.
0.5
>>> np.array([True, True]).sum()
2
>>> np.array([True, True]).mean()
1.0
I assumed mean (is used by np.ma.median) is the same as adding and dividing by 2
Josef
It got a bit ugly, too many shapes for a single number, and tricky
axis handling.
Your idea of making just the smallish changes looks more attractive now.
and almost correct
>>> np.median(np.array([[[1]]]),axis=0).shape
(1, 1)
>>> nanmedian(np.array([[[1]]]),axis=0).shape
(1, 1)
>>> np.median(np.array([[[1]]]),axis=-1).shape
(1, 1)
>>> nanmedian(np.array([[[1]]]),axis=-1).shape
(1, 1)
but there slipped a python in
>>> nanmedian(np.array([[[1]]]),axis=None).__class__
<type 'float'>
>>> np.median(np.array([[[1]]]),axis=None).__class__
<type 'numpy.float64'>
.item() returns a python number not a numpy number
>>> np.array([[[1]]]).item().__class__
<type 'int'>
>>> np.array([[[1]]]).flat[0].__class__
<type 'numpy.int32'>
I also didn't know this:
>>> None < 0
True
Josef
Good catch. Thanks. I'll update my local copy of nanmedian.
Looks like we don't even need .tolist(), .item(), or [()]. This should
do the trick:
>> np.float64(np.array(1))
1.0
>> type(np.float64(np.array(1)))
<type 'numpy.float64'>
And what if the input is float32? Well, numpy turns that into float64
so nothing to worry about:
>> x = np.array(1, dtype=np.float32)
>> m = np.median(x)
>> type(m)
<type 'numpy.float64'>
but it breaks complex
>>> np.float64(np.array(1.j))
0.0
I did one more shape correction in my version
Josef
Crap. So many corner cases.
After collecting all the input cases (all without NaNs) we could
extend your automated test by adding an outer loop over [nanmean,
nanmedian, nanstd] and make sure it gives the same results as the
numpy versions. It might be good to check for dtype too since 1 ==
1.0.