decile analysis in pandas


John Hunter

Feb 7, 2012, 10:20:32 PM
to pystat...@googlegroups.com
I have a dataframe and I would like to see the series summarized by the decile ranking of one of the variables.  In the example below, I do this manually by computing the decile rank with the kludgy argsort-of-argsort trick, and then use groupby.  I'm guessing there is a built-in "right way" to do a percentile analysis in pandas, but it is escaping me.

In [59]: X = np.random.randn(1000,4)

In [60]: df = pandas.DataFrame(X, columns=['A', 'B', 'C', 'D'])

In [61]: df['decile'] = (df['A'].argsort().argsort()/float(len(df))*10).astype(int)

In [62]: print df[:10]
        A        B       C        D  decile
0 -0.3964  0.69569  0.8716 -1.43534       3
1 -0.4580  0.93329  1.2022  0.07479       3
2  1.0979  0.09120 -0.6873 -0.83076       8
3  1.0404 -0.11422 -0.9206  1.26041       8
4 -0.1441 -0.43059  1.0272  0.75833       4
5 -0.5834 -0.24975  1.1138 -0.10635       2
6 -1.2045 -0.07822  0.7174  0.59708       1
7 -0.9144  0.88238  0.4237  1.02930       1
8  0.6109 -1.23409  0.8883 -1.74292       7
9  1.1194 -0.22935  1.9032 -1.48211       8

In [64]: df.groupby(df.decile).mean()
Out[64]:
             A        B         C          D  decile
decile
0      -1.7870 -0.01020 -0.007341  0.0848120       0
1      -1.0538  0.04770 -0.021771  0.0005092       1
2      -0.7119  0.06395  0.001817  0.0157715       2
3      -0.4040  0.08031  0.052442 -0.0021550       3
4      -0.1159  0.05916  0.070572 -0.0061261       4
5       0.1507  0.11948  0.137877 -0.0221219       5
6       0.4164  0.02087 -0.033935  0.0973658       6
7       0.7162 -0.11327 -0.043447 -0.0523838       7
8       1.0959 -0.08120  0.039738  0.1278259       8
9       1.8438  0.03043  0.174767  0.0645459       9
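For anyone reading this thread later: the built-in John was looking for arrived as pandas.qcut, which bins a series by its quantiles directly; a minimal sketch, assuming a reasonably recent pandas:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(1000, 4), columns=["A", "B", "C", "D"])

# labels=False returns integer bin ids 0..9 instead of interval labels
df["decile"] = pd.qcut(df["A"], 10, labels=False)

# one row of means per decile of A
summary = df.groupby("decile").mean()
```

Unlike the rank arithmetic, qcut also accepts explicit quantile edges (e.g. [0, .1, .5, 1]) and can return the bin intervals themselves.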

John Hunter

Feb 7, 2012, 10:59:30 PM
to pystat...@googlegroups.com
On Tue, Feb 7, 2012 at 9:20 PM, John Hunter <jdh...@gmail.com> wrote:
> I have a dataframe and I would like to see the series summarized by the decile ranking of one of the variables.  In the example below, I do this manually by computing the decile rank with the kludgy argsort-of-argsort trick, and then use groupby.  I'm guessing there is a built-in "right way" to do a percentile analysis in pandas, but it is escaping me.


OK, made some progress on my own.  Discovered "rank":

In [92]: df['decile'] = (df.A.rank()/float(len(df))*10.).astype(int)

In [93]: print df.groupby(df.decile).mean()

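One wrinkle with the rank() version: Series.rank is 1-based, so the maximum of A maps to exactly 10.0 and astype(int) puts it in a lone eleventh bin. Shifting the ranks to start at 0 keeps everything in deciles 0-9; a sketch:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(1000, 4), columns=["A", "B", "C", "D"])

# rank() is 1-based: the max of A ranks 1000, and 1000/1000*10 == 10,
# so the top row falls into a stray bin 10 alongside deciles 0-9
naive = (df["A"].rank() / float(len(df)) * 10).astype(int)

# shifting to 0-based ranks keeps everything in bins 0..9,
# with exactly 100 rows per bin (method="first" breaks any ties)
decile = ((df["A"].rank(method="first") - 1) / len(df) * 10).astype(int)
```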

Wes McKinney

Feb 7, 2012, 11:35:42 PM
to pystat...@googlegroups.com

I think that's about as good as you can do-- the result of rank is
always float64, at least.

- Wes

John Hunter

Feb 8, 2012, 9:25:59 AM
to pystat...@googlegroups.com
On Tue, Feb 7, 2012 at 10:35 PM, Wes McKinney <wesm...@gmail.com> wrote:
>> OK, made some progress on my own.  Discovered "rank"
>>
>> In [92]: df['decile'] = (df.A.rank()/float(len(df))*10.).astype(int)
>>
>> In [93]: print df.groupby(df.decile).mean()
>
> I think that's about as good as you can do-- the result of rank is
> always float64, at least.


What do you think about a kwarg to rank, like "levels=None"?  If None, it does the default, which is the 1..n rank.  If levels=k, then it does 1..k, e.g. 1..10.  Then you could do

  df.groupby(df.A.rank(levels=10)).mean() 

which is a hell of a lot prettier than (df.A.rank()/float(len(df))*10.).astype(int).

Actually, what I am after is a bit more than this.  I have some tile summary code that I use all the time that takes as input x and y and returns a record array of the tile label, the min and max for x, and descriptive statistics for y in each tile.  It looks like this in use:

In [14]: rt = nansafe.tile_summary(df.A, df.B, levels=np.arange(10, 101, 10))

In [15]: print mlab.rec2txt(rt)
label        xmin     xmax   qcount    qmean   qmedian     qmin    qmax   qcimin   qcimax
tile 0     -3.466   -1.259      100   -0.067    -0.105   -2.424   2.308   -0.165    0.030
tile 1     -1.258   -0.821      100    0.031     0.014   -3.017   2.016   -0.067    0.129
tile 2     -0.820   -0.533      100   -0.013     0.024   -2.924   2.476   -0.117    0.091
tile 3     -0.531   -0.258      100    0.190     0.236   -2.449   2.454    0.091    0.289
tile 4     -0.252    0.016      100    0.008    -0.063   -1.717   1.789   -0.073    0.090
tile 5      0.025    0.240      100    0.014    -0.018   -2.116   2.513   -0.081    0.109
tile 6      0.243    0.494      100    0.050     0.006   -1.957   1.618   -0.030    0.130
tile 7      0.494    0.820      100    0.170    -0.011   -1.924   2.327    0.082    0.257
tile 8      0.826    1.244      100   -0.036    -0.131   -2.317   2.360   -0.137    0.065
tile 9      1.246    3.055      100   -0.153    -0.123   -3.225   1.845   -0.249   -0.058
tile ALL   -3.466    3.055     1000    0.019    -0.020   -3.225   2.513   -0.011    0.049


and the output y columns are customizable:

Definition: nansafe.tile_summary(x, y, stats=(<function count at 0x98cebc4>, <function mean at 0x98ce994>, <function median at 0x98ce9cc>, <function min at 0x98ceca4>, <function max at 0x98cecdc>, <function cimin at 0x98cec34>, <function cimax at 0x98cec6c>), levels=(20.0, 40.0, 60.0, 80.0, 100.0), breakpoints=None, tilelabel='tile', names=None, catd=None, verbose_output=False, force_categorical=False)
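nansafe is John's private code, but the same tile summary falls out of qcut plus groupby/agg; a rough sketch of the idea (the column names and the fixed stats list here are illustrative, not the original's signature):

```python
import numpy as np
import pandas as pd

def tile_summary(x, y, ntiles=10):
    # Illustrative stand-in for nansafe.tile_summary, not the original:
    # bucket x into equal-count quantile tiles, then report x's range
    # and summary statistics of y within each tile
    tiles = pd.qcut(x.rank(method="first"), ntiles, labels=False)
    out = (
        pd.DataFrame({"x": x, "y": y})
        .groupby(tiles)
        .agg(
            xmin=("x", "min"), xmax=("x", "max"),
            qcount=("y", "count"), qmean=("y", "mean"),
            qmedian=("y", "median"), qmin=("y", "min"), qmax=("y", "max"),
        )
    )
    out.index = ["tile %d" % i for i in range(ntiles)]
    return out

np.random.seed(0)
df = pd.DataFrame(np.random.randn(1000, 2), columns=["A", "B"])
rt = tile_summary(df["A"], df["B"])
```

Ranking x before qcut guarantees equal-sized tiles even when x has ties, which mirrors the qcount column of 100s in the output above.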

Do you think something like this has a place in pandas?

JDH

Wes McKinney

Feb 8, 2012, 3:26:44 PM
to pystat...@googlegroups.com

That is cool. Any kind of summary statistics / slicing-and-dicing has
a place in pandas. For example I've planned to add some related things
like R has (https://github.com/wesm/pandas/issues/415) for
binning/histogramming.

- Wes
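For the record, the binning/histogramming work Wes mentions landed as pandas.cut (fixed-width bins) and pandas.qcut (quantile bins); a quick sketch of both:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.random.randn(1000))

# cut: ten equal-width bins spanning the range of x;
# value_counts of the result is a histogram
hist = pd.cut(x, bins=10).value_counts().sort_index()

# qcut: ten equal-count (quantile) bins -- each holds ~100 values here
qhist = pd.qcut(x, 10).value_counts().sort_index()
```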
