groupby().apply(scipy.stats.scoreatpercentile)

andreash

Oct 11, 2011, 9:46:45 AM
to pystatsmodels
Hi there,

The groupby().apply() method doesn't seem to work with the
scipy.stats.scoreatpercentile function (see the stack trace below).
With other stats functions, like scipy.stats.gmean(), everything works
fine.

grouped_by_lat['local time'] has dtype np.float.

Any idea what could be going wrong?

Thanks for your help,
Andreas.

In [82]: grouped_by_lat['local time'].apply(scipy.stats.scoreatpercentile, per=.05)
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
/home/hilboll/Desktop/<ipython-input-82-eecaac43d6ae> in <module>()
----> 1 grouped_by_lat['local time microseconds'].apply(scipy.stats.scoreatpercentile, per=.05)

/home/hilboll/.virtualenvs/pydoas/lib/python2.7/site-packages/pandas/core/groupby.pyc in apply(self, func, *args, **kwargs)
    268         applied : type depending on grouped object and function
    269         """
--> 270         return self._python_apply_general(func, *args, **kwargs)
    271
    272     def aggregate(self, func, *args, **kwargs):

/home/hilboll/.virtualenvs/pydoas/lib/python2.7/site-packages/pandas/core/groupby.pyc in _python_apply_general(self, func, *args, **kwargs)
    412         for key, group in self:
    413             group.name = key
--> 414             res = func(group, *args, **kwargs)
    415             if not _is_indexed_like(res, group):
    416                 not_indexed_same = True

/home/hilboll/lib/epd-7.1-1-x86_64/lib/python2.7/site-packages/scipy/stats/stats.pyc in scoreatpercentile(a, per, limit)
   1367         return values[idx]
   1368     else:
-> 1369         return _interpolate(values[int(idx)], values[int(idx) + 1], idx % 1)
   1370
   1371

/home/hilboll/.virtualenvs/pydoas/lib/python2.7/site-packages/pandas/core/series.pyc in __getitem__(self, key)
    237         values = self.values
    238         try:
--> 239             return values[self.index.get_loc(key)]
    240         except KeyError, e1:
    241             try:

/home/hilboll/.virtualenvs/pydoas/lib/python2.7/site-packages/pandas/core/index.pyc in get_loc(self, key)
    340         loc : int
    341         """
--> 342         self._verify_integrity()
    343         return self.indexMap[key]
    344

/home/hilboll/.virtualenvs/pydoas/lib/python2.7/site-packages/pandas/core/index.pyc in _verify_integrity(self)
    100
    101     def _verify_integrity(self):
--> 102         if len(self.indexMap) < len(self):
    103             raise Exception('Index cannot contain duplicate values!')
    104

/home/hilboll/.virtualenvs/pydoas/lib/python2.7/site-packages/pandas/core/index.pyc in indexMap(self)
     85         if self._indexMap is None:
     86             self._indexMap = lib.map_indices_object(self)
---> 87             self._verify_integrity()
     88
     89         return self._indexMap

/home/hilboll/.virtualenvs/pydoas/lib/python2.7/site-packages/pandas/core/index.pyc in _verify_integrity(self)
    101     def _verify_integrity(self):
    102         if len(self.indexMap) < len(self):
--> 103             raise Exception('Index cannot contain duplicate values!')
    104
    105     def __iter__(self):

Exception: Index cannot contain duplicate values!

josef...@gmail.com

Oct 11, 2011, 10:04:03 AM
to pystat...@googlegroups.com
On Tue, Oct 11, 2011 at 9:46 AM, andreash <hil...@gmail.com> wrote:
> Hi there,
>
> the groupby().apply() method doesn't seem to work on the
> scipy.stats.scoreatpercentile function (see stacktrace below). On
> other stats functions, like scipy.stats.gmean(), everything works
> fine.

I just checked:
In scipy 0.9, scipy.stats.scoreatpercentile doesn't cast its input to
an array with np.asarray. That's a bug in scipy.stats.

>>> type(weeklymax)
<class 'pandas.core.series.TimeSeries'>
>>> type(np.sort(weeklymax))
<class 'pandas.core.series.TimeSeries'>

>>> type(np.sort(np.ma.arange(5)))
<class 'numpy.ma.core.MaskedArray'>

I guess the earlier assumption was that np.sort(some subclass) returns
a plain ndarray, but np.sort evidently preserves the type of an
ndarray subclass.
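
To make the point above concrete, here is a minimal sketch (not from the original post) using a masked array, since the pandas and scipy versions in this thread are long outdated. The variable names are made up for illustration. It only shows that np.sort keeps the subclass while np.asarray strips it, which is the cast Josef says is missing.

```python
import numpy as np

# A masked array is an ndarray subclass, like the 2011-era pandas Series.
masked = np.ma.arange(5)

# np.sort preserves the subclass of its input.
sorted_masked = np.sort(masked)

# np.asarray returns a plain ndarray, so later integer indexing is positional.
plain = np.asarray(sorted_masked)

print(type(sorted_masked).__name__)
print(type(plain).__name__)
```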

If you are on numpy >= 1.6, there is a similar new function in numpy
that should work. (Skipper used it, but I'm on numpy 1.5)
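
The message does not name the numpy function. Assuming it is np.percentile (an assumption, not something stated in the thread), a group-wise call might look like the sketch below. The frame and column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two groups with a numeric column.
df = pd.DataFrame({
    "g": ["a", "a", "a", "b", "b"],
    "x": [1.0, 2.0, 3.0, 10.0, 20.0],
})

# np.percentile takes the percentile on a 0-100 scale.
medians = df.groupby("g")["x"].apply(lambda x: np.percentile(x, 50))
print(medians)
```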

Josef

josef...@gmail.com

Oct 11, 2011, 10:10:11 AM
to pystat...@googlegroups.com
On Tue, Oct 11, 2011 at 10:04 AM, <josef...@gmail.com> wrote:
> On Tue, Oct 11, 2011 at 9:46 AM, andreash <hil...@gmail.com> wrote:
>> Hi there,
>>
>> the groupby().apply() method doesn't seem to work on the
>> scipy.stats.scoreatpercentile function (see stacktrace below). On
>> other stats functions, like scipy.stats.gmean(), everything works
>> fine.
>
> I just checked:
> in scipy 0.9 scipy.stats.scoreatpercentile doesn't contain a cast to
> array, np.asarray, that's a bug in scipy.stats.
>
> >>> type(weeklymax)
> <class 'pandas.core.series.TimeSeries'>
> >>> type(np.sort(weeklymax))
> <class 'pandas.core.series.TimeSeries'>
>
> >>> type(np.sort(np.ma.arange(5)))
> <class 'numpy.ma.core.MaskedArray'>
>
> I guess, before the assumption was that np.sort(some subclass) returns
> an ndarray, but obviously np.sort doesn't change the type of a array
> subtype.

Although it still works on a TimeSeries:

>>> type(weeklymax)
<class 'pandas.core.series.TimeSeries'>
>>> stats.scoreatpercentile(weeklymax, per=50)
0.020191766815031542

Josef

andreash

Oct 11, 2011, 10:32:09 AM
to pystatsmodels
Josef,

Thanks! I now use something like

    return Series({'q05': scipy.stats.scoreatpercentile(np.array(times), 5)})

in the function I pass to groupby().apply(), and everything works just
fine.
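
For readers who want to try this workaround, here is a self-contained sketch. The data, the column values, and the helper name q05 are invented for illustration, and it assumes scipy.stats.scoreatpercentile is still available in the installed scipy. np.asarray is used instead of np.array to avoid an unnecessary copy.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: 21 times at latitude 0 and 3 times at latitude 1.
df = pd.DataFrame({
    "lat": [0] * 21 + [1] * 3,
    "local time": list(np.arange(21.0)) + [10.0, 20.0, 30.0],
})

def q05(times):
    # Cast to a plain ndarray before handing the group to scipy,
    # then wrap the scalar result in a labelled Series.
    values = np.asarray(times)
    return pd.Series({"q05": stats.scoreatpercentile(values, 5)})

result = df.groupby("lat")["local time"].apply(q05)
print(result)
```

The result is indexed by both the group key and the 'q05' label.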

> I just checked:
> in scipy 0.9 scipy.stats.scoreatpercentile doesn't contain a cast to
> array, np.asarray, that's a bug in scipy.stats.

Should I file that one, or have you already done that?

By the way, I was a bit surprised that scipy.stats.scoreatpercentile's
per argument expects a percentage in [0, 100] rather than a fraction
in [0, 1]. Maybe it would be nice to add this to the docs.
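
The scale difference is easy to trip over, as the per=.05 call in the first message shows. The sketch below illustrates it with np.percentile and np.quantile, which are stand-ins chosen for illustration and are not mentioned in the thread. The variable names are made up.

```python
import numpy as np
import pandas as pd

x = np.arange(101.0)  # 0.0, 1.0, ..., 100.0

# Percent scale: 5 means the 5th percentile.
p5_percent = np.percentile(x, 5)

# Fraction scale: 0.05 means the same point.
p5_fraction = np.quantile(x, 0.05)
p5_pandas = pd.Series(x).quantile(0.05)

# Passing a fraction to a percent-scale function asks for the
# 0.05th percentile, which is almost the minimum.
mistaken = np.percentile(x, 0.05)

print(p5_percent, p5_fraction, p5_pandas, mistaken)
```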

Cheers,
Andreas.

josef...@gmail.com

Oct 11, 2011, 10:53:12 AM
to pystat...@googlegroups.com
On Tue, Oct 11, 2011 at 10:32 AM, andreash <hil...@gmail.com> wrote:
> Josef,
>
> thanks! I now use something like
>
>   return Series({'q05':scipy.stats.scoreatpercentile(np.array(times), 5),})
>
> in the function I pass to groupby().apply(), and everything works just
> fine.
>
>> I just checked:
>> in scipy 0.9 scipy.stats.scoreatpercentile doesn't contain a cast to
>> array, np.asarray, that's a bug in scipy.stats.
>
> Should I file that one, or have you already done that?

Please file the ticket; I haven't done it yet.

Adding np.asarray is a bug fix, but there are also enhancements that
could be made. Skipper and I worked on them in statsmodels, but that
work still needs to be reviewed, and I think the options are not
settled yet.

>
> btw, I was a bit surprised that scipy.stats.scoreatpercentile's per
> argument actually expects the percent value [0,100] instead of a value
> [0,1]. Maybe it would be nice to add this to the docs.

I was surprised also, but since the function uses percentile in the
name and in the documentation for the per argument, it shouldn't be
too surprising, unless you expect a `quantile` function as I did
initially.

The example in the docstring also uses 50 not 0.5.

Josef

>
> Cheers,
> Andreas.

josef...@gmail.com

Oct 11, 2011, 10:54:44 AM
to pystat...@googlegroups.com
On Tue, Oct 11, 2011 at 10:53 AM, <josef...@gmail.com> wrote:
> On Tue, Oct 11, 2011 at 10:32 AM, andreash <hil...@gmail.com> wrote:
>> Josef,
>>
>> thanks! I now use something like
>>
>>   return Series({'q05':scipy.stats.scoreatpercentile(np.array(times), 5),})

forgot to add:

Wes,
Does np.asarray on a pandas object make a copy or a view?

Josef

Wes McKinney

Oct 11, 2011, 11:53:37 AM
to pystat...@googlegroups.com
On Tue, Oct 11, 2011 at 10:54 AM, <josef...@gmail.com> wrote:
> On Tue, Oct 11, 2011 at 10:53 AM,  <josef...@gmail.com> wrote:
>> On Tue, Oct 11, 2011 at 10:32 AM, andreash <hil...@gmail.com> wrote:
>>> Josef,
>>>
>>> thanks! I now use something like
>>>
>>>   return Series({'q05':scipy.stats.scoreatpercentile(np.array(times), 5),})
>
> forgot to add:
>
> Wes,
> Does np.asarray on a pandas object make a copy or a view?
>
> Josef

It returns a view.
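
A quick way to check this on a current installation is sketched below. It assumes a Series with a plain float dtype. Other dtypes (object or extension types) may behave differently, and recent pandas versions may return a read-only view, so the sketch only inspects memory sharing and does not write to the array.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

# Convert to a plain ndarray without requesting a copy.
arr = np.asarray(s)

# True when both arrays are backed by the same buffer.
shares = np.shares_memory(arr, s.to_numpy())
print(shares)
```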

BTW, you should use the quantile method on Series or DataFrame, so you can do:

grouped_by_lat['local time'].quantile(0.05)

This just uses scoreatpercentile under the hood, but it is much more concise and avoids the bug.
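
A runnable sketch of the suggestion above, with invented data. Note that quantile takes a fraction in [0, 1], unlike scoreatpercentile. How current pandas computes the quantile internally is not covered here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the poster's data.
df = pd.DataFrame({
    "lat": [0] * 21 + [1] * 3,
    "local time": list(np.arange(21.0)) + [10.0, 20.0, 30.0],
})
grouped_by_lat = df.groupby("lat")

# One 5% quantile per latitude group.
q05 = grouped_by_lat["local time"].quantile(0.05)
print(q05)
```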

josef...@gmail.com

Oct 11, 2011, 12:11:06 PM
to pystat...@googlegroups.com
On Tue, Oct 11, 2011 at 11:53 AM, Wes McKinney <wesm...@gmail.com> wrote:
> On Tue, Oct 11, 2011 at 10:54 AM,  <josef...@gmail.com> wrote:
>> On Tue, Oct 11, 2011 at 10:53 AM,  <josef...@gmail.com> wrote:
>>> On Tue, Oct 11, 2011 at 10:32 AM, andreash <hil...@gmail.com> wrote:
>>>> Josef,
>>>>
>>>> thanks! I now use something like
>>>>
>>>>   return Series({'q05':scipy.stats.scoreatpercentile(np.array(times), 5),})
>>
>> forgot to add:
>>
>> Wes,
>> Does np.asarray on a pandas object make a copy or a view?
>>
>> Josef
>
> It returns a view.
>
> BTW you should use the quantile function on Series or DataFrame, so you can do:
>
> grouped_by_lat['local time'].quantile(0.05)
>
> this just uses scoreatpercentile under the hood but much more concise and no bug

But if you hide or avoid the bug, we don't get the report :)

Thanks,
Josef

Wes McKinney

Oct 11, 2011, 4:17:28 PM
to pystat...@googlegroups.com

I think scoreatpercentile also takes a number between 0 and 100 which
has always struck me as a little odd (or at least inconsistent with
other quantile functions).
