decile analysis in pandas


John Hunter

Feb 7, 2012, 10:20:32 PM
to pystat...@googlegroups.com
I have a dataframe and I would like to see the series summarized by the decile ranking of one of the variables.  In the example below, I do this manually by computing the decile rank with the kludgy argsort-of-argsort trick, and then use groupby.  I'm guessing there is a built-in "right way" to do a percentile analysis in pandas, but it is escaping me.

In [59]: X = np.random.randn(1000,4)

In [60]: df = pandas.DataFrame(X, columns=['A', 'B', 'C', 'D'])

In [61]: df['decile'] = (df['A'].argsort().argsort()/float(len(df))*10).astype(int)

In [62]: print df[:10]
        A        B       C        D  decile
0 -0.3964  0.69569  0.8716 -1.43534       3
1 -0.4580  0.93329  1.2022  0.07479       3
2  1.0979  0.09120 -0.6873 -0.83076       8
3  1.0404 -0.11422 -0.9206  1.26041       8
4 -0.1441 -0.43059  1.0272  0.75833       4
5 -0.5834 -0.24975  1.1138 -0.10635       2
6 -1.2045 -0.07822  0.7174  0.59708       1
7 -0.9144  0.88238  0.4237  1.02930       1
8  0.6109 -1.23409  0.8883 -1.74292       7
9  1.1194 -0.22935  1.9032 -1.48211       8

In [64]: df.groupby(df.decile).mean()
Out[64]:
             A        B         C          D  decile
decile
0      -1.7870 -0.01020 -0.007341  0.0848120       0
1      -1.0538  0.04770 -0.021771  0.0005092       1
2      -0.7119  0.06395  0.001817  0.0157715       2
3      -0.4040  0.08031  0.052442 -0.0021550       3
4      -0.1159  0.05916  0.070572 -0.0061261       4
5       0.1507  0.11948  0.137877 -0.0221219       5
6       0.4164  0.02087 -0.033935  0.0973658       6
7       0.7162 -0.11327 -0.043447 -0.0523838       7
8       1.0959 -0.08120  0.039738  0.1278259       8
9       1.8438  0.03043  0.174767  0.0645459       9
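For anyone reading this thread later: the built-in John was looking for arrived as pandas.qcut, which bins a series by its quantiles directly; a minimal sketch, assuming a reasonably recent pandas:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(1000, 4), columns=["A", "B", "C", "D"])

# labels=False returns integer bin ids 0..9 instead of interval labels
df["decile"] = pd.qcut(df["A"], 10, labels=False)

# one row of means per decile of A
summary = df.groupby("decile").mean()
```

Unlike the rank arithmetic, qcut also accepts explicit quantile edges (e.g. [0, .1, .5, 1]) and can return the bin intervals themselves.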

John Hunter

Feb 7, 2012, 10:59:30 PM
to pystat...@googlegroups.com
On Tue, Feb 7, 2012 at 9:20 PM, John Hunter <jdh...@gmail.com> wrote:
> I have a dataframe and I would like to see the series summarized by the decile ranking of one of the variables.  In the example below, I do this manually by computing the decile rank with the kludgy argsort-of-argsort trick, and then use groupby.  I'm guessing there is a built-in "right way" to do a percentile analysis in pandas, but it is escaping me.


OK, made some progress on my own.  Discovered "rank":

In [92]: df['decile'] = (df.A.rank()/float(len(df))*10.).astype(int)

In [93]: print df.groupby(df.decile).mean()

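One wrinkle with the rank() version: Series.rank is 1-based, so the maximum of A maps to exactly 10.0 and astype(int) puts it in a lone eleventh bin. Shifting the ranks to start at 0 keeps everything in deciles 0-9; a sketch:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(1000, 4), columns=["A", "B", "C", "D"])

# rank() is 1-based: the max of A ranks 1000, and 1000/1000*10 == 10,
# so the top row falls into a stray bin 10 alongside deciles 0-9
naive = (df["A"].rank() / float(len(df)) * 10).astype(int)

# shifting to 0-based ranks keeps everything in bins 0..9,
# with exactly 100 rows per bin (method="first" breaks any ties)
decile = ((df["A"].rank(method="first") - 1) / len(df) * 10).astype(int)
```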

Wes McKinney

Feb 7, 2012, 11:35:42 PM
to pystat...@googlegroups.com

I think that's about as good as you can do-- the result of rank is
always float64, at least.

- Wes

John Hunter

Feb 8, 2012, 9:25:59 AM
to pystat...@googlegroups.com
On Tue, Feb 7, 2012 at 10:35 PM, Wes McKinney <wesm...@gmail.com> wrote:
>> OK, made some progress on my own.  Discovered "rank"
>>
>> In [92]: df['decile'] = (df.A.rank()/float(len(df))*10.).astype(int)
>>
>> In [93]: print df.groupby(df.decile).mean()
>
> I think that's about as good as you can do-- the result of rank is
> always float64, at least.


What do you think about a kwarg to rank, like "levels=None"?  If None, it does the default, which is the 1..n rank.  If levels=k, then it does 1..k, e.g. 1..10.  Then you could do

  df.groupby(df.A.rank(levels=10)).mean() 

which is a hell of a lot prettier than (df.A.rank()/float(len(df))*10.).astype(int).

Actually, what I am after is a bit more than this.  I have some tile summary code that I use all the time that takes as input x and y and returns a record array of the tile label, the min and max for x, and descriptive statistics for y in each tile.  It looks like this in use:

In [14]: rt = nansafe.tile_summary(df.A, df.B, levels=np.arange(10, 101, 10))

In [15]: print mlab.rec2txt(rt)
label        xmin     xmax   qcount    qmean   qmedian     qmin    qmax   qcimin   qcimax
tile 0     -3.466   -1.259      100   -0.067    -0.105   -2.424   2.308   -0.165    0.030
tile 1     -1.258   -0.821      100    0.031     0.014   -3.017   2.016   -0.067    0.129
tile 2     -0.820   -0.533      100   -0.013     0.024   -2.924   2.476   -0.117    0.091
tile 3     -0.531   -0.258      100    0.190     0.236   -2.449   2.454    0.091    0.289
tile 4     -0.252    0.016      100    0.008    -0.063   -1.717   1.789   -0.073    0.090
tile 5      0.025    0.240      100    0.014    -0.018   -2.116   2.513   -0.081    0.109
tile 6      0.243    0.494      100    0.050     0.006   -1.957   1.618   -0.030    0.130
tile 7      0.494    0.820      100    0.170    -0.011   -1.924   2.327    0.082    0.257
tile 8      0.826    1.244      100   -0.036    -0.131   -2.317   2.360   -0.137    0.065
tile 9      1.246    3.055      100   -0.153    -0.123   -3.225   1.845   -0.249   -0.058
tile ALL   -3.466    3.055     1000    0.019    -0.020   -3.225   2.513   -0.011    0.049


and the output y columns are customizable:

Definition: nansafe.tile_summary(x, y, stats=(<function count at 0x98cebc4>, <function mean at 0x98ce994>, <function median at 0x98ce9cc>, <function min at 0x98ceca4>, <function max at 0x98cecdc>, <function cimin at 0x98cec34>, <function cimax at 0x98cec6c>), levels=(20.0, 40.0, 60.0, 80.0, 100.0), breakpoints=None, tilelabel='tile', names=None, catd=None, verbose_output=False, force_categorical=False)
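nansafe is John's private code, but the same tile summary falls out of qcut plus groupby/agg; a rough sketch of the idea (the column names and the fixed stats list here are illustrative, not the original's signature):

```python
import numpy as np
import pandas as pd

def tile_summary(x, y, ntiles=10):
    # Illustrative stand-in for nansafe.tile_summary, not the original:
    # bucket x into equal-count quantile tiles, then report x's range
    # and summary statistics of y within each tile
    tiles = pd.qcut(x.rank(method="first"), ntiles, labels=False)
    out = (
        pd.DataFrame({"x": x, "y": y})
        .groupby(tiles)
        .agg(
            xmin=("x", "min"), xmax=("x", "max"),
            qcount=("y", "count"), qmean=("y", "mean"),
            qmedian=("y", "median"), qmin=("y", "min"), qmax=("y", "max"),
        )
    )
    out.index = ["tile %d" % i for i in range(ntiles)]
    return out

np.random.seed(0)
df = pd.DataFrame(np.random.randn(1000, 2), columns=["A", "B"])
rt = tile_summary(df["A"], df["B"])
```

Ranking x before qcut guarantees equal-sized tiles even when x has ties, which mirrors the qcount column of 100s in the output above.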

Do you think something like this has a place in pandas?

JDH

Wes McKinney

Feb 8, 2012, 3:26:44 PM
to pystat...@googlegroups.com

That is cool. Any kind of summary statistics / slicing-and-dicing has
a place in pandas. For example I've planned to add some related things
like R has (https://github.com/wesm/pandas/issues/415) for
binning/histogramming.

- Wes
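For the record, the binning/histogramming work Wes mentions landed as pandas.cut (fixed-width bins) and pandas.qcut (quantile bins); a quick sketch of both:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.random.randn(1000))

# cut: ten equal-width bins spanning the range of x;
# value_counts of the result is a histogram
hist = pd.cut(x, bins=10).value_counts().sort_index()

# qcut: ten equal-count (quantile) bins -- each holds ~100 values here
qhist = pd.qcut(x, 10).value_counts().sort_index()
```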
