Pandas row binning


Adam Hughes

Oct 1, 2012, 6:29:17 PM
to pyd...@googlegroups.com
Hi guys,

I am interested in doing row binning in an efficient manner.

I have a dataframe, say 100 rows by 20 columns.  I'd like to reduce this to a 10 by 20 dataframe, where the new row values are the average, sum or other mapped operation that the user passes in.  I'm having trouble finding the best way to do this, without resorting to row-by-row iterations.

I've been trying to follow the API here, but I know that it is not completely up to date.  How would you guys make a function that takes in a bin number and does this type of operation?

I thought the resample method might be what I wanted, but don't understand the "rule" argument.

PS, I'm avoiding histograms because my arrays are intentionally out of order, and I need to preserve that ordering.

Thanks.

Wouter Overmeire

Oct 2, 2012, 7:13:27 AM
to pyd...@googlegroups.com



The "rule" argument indicates the bin size.
For example, 'D' for daily bins, '12H' for 12-hour bins, 'M' for monthly bins, etc.

In [73]: index = pandas.date_range('2012-10-10', periods=100, freq='H') # 100 hourly spaced timestamps

In [74]: import random

In [75]: index = random.sample(index, 100) # index is unsorted now

In [76]: df = pandas.DataFrame(np.random.randn(100,4), index, list('ABCD'))

In [77]: df.resample(rule='D', how='mean')
Out[77]:
                   A         B         C         D
2012-10-10  0.190634 -0.010331 -0.533944  2.019264
2012-10-11 -0.042690 -0.025202  0.340528 -0.337808
2012-10-12  0.046622  0.136284 -0.292848  0.261602
2012-10-13  0.283887 -0.084499 -0.468463  0.104838
2012-10-14 -0.343621 -0.031466 -0.166999  0.215456
2012-10-15 -0.702906  1.069791  0.464696 -0.049558

In [78]: df.resample(rule='12H', how='mean')
Out[78]:
                            A         B         C         D
2012-10-10 00:00:00  0.190634 -0.010331 -0.533944  2.019264
2012-10-10 12:00:00  0.033740  0.386743  0.553311 -0.144823
2012-10-11 00:00:00 -0.119121 -0.437148  0.127744 -0.530792
2012-10-11 12:00:00 -0.150271  0.404516  0.034835  0.475245
2012-10-12 00:00:00  0.243515 -0.131948 -0.620531  0.047958
2012-10-12 12:00:00  0.153480 -0.061286 -0.546608 -0.074879
2012-10-13 00:00:00  0.414294 -0.107713 -0.390319  0.284555
2012-10-13 12:00:00 -0.188024 -0.298300  0.105004  0.405880
2012-10-14 00:00:00 -0.499219  0.235367 -0.439002  0.025032
2012-10-14 12:00:00 -0.702906  1.069791  0.464696 -0.049558

In [79]: df.resample(rule='M', how='mean')
Out[79]:
                   A         B         C         D
2012-10-31 -0.032573  0.030818 -0.132466  0.077287


Adam Hughes

Oct 2, 2012, 6:08:48 PM
to pyd...@googlegroups.com
Thanks Wouter.  So this operation is only valid for TimeSeries?




Adam Hughes

Oct 2, 2012, 6:39:39 PM
to pyd...@googlegroups.com
This seems like a common operation, e.g. row binning and column binning.  I was actually surprised that I couldn't turn up much about this in numpy searches, which always lead back to histogramming.  That is not ideal if you need to preserve ordering (i.e., each column of the data frame would need its own, non-sequential set of bins), so I'm going to write a quick program to do this.  If anyone is interested, I'll post it here later.

The operation would simply reshape the array first, and then do a vectorized average.  For example, if I had 100 rows by 100 columns and wanted to bin the rows by 5, I merely reshape into 20 separate 5x100 arrays:

100 x 100   --->  20 (5x100)

Then I do a vectorized average on each of the 5x100 arrays and pop the results into the dataframe.  This should be the fastest way to do it, no?  I'll try to be clever about handling non-perfect division.
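For what it's worth, a minimal numpy sketch of this reshape-and-average idea (assuming the row count divides evenly by the bin width):

```python
import numpy as np

a = np.random.randn(100, 100)   # 100 rows x 100 columns
binwidth = 5

# reshape to (n_bins, binwidth, n_cols), then average over the bin axis
n_bins = a.shape[0] // binwidth
binned = a.reshape(n_bins, binwidth, a.shape[1]).mean(axis=1)

print(binned.shape)  # (20, 100)
```

Each output row is the mean of `binwidth` consecutive input rows, with no sorting involved, so ordering is preserved.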

Wouter Overmeire

Oct 3, 2012, 3:02:48 AM
to pyd...@googlegroups.com




Note that groupby can be used to do the "reshaping"; from this it's easy to compute the means and get a new frame.

In the example below (Python 2.7) I just reshape by "each 5 consecutive rows is a group", over which the mean is taken. You can use whatever grouping function to do the reshaping.

In [34]: df = pandas.DataFrame(np.random.randn(100,100))

In [35]: df_resampled = df.groupby(lambda x:x/5).mean()

In [36]: df_resampled
Out[36]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to 19
Columns: 100 entries, 0 to 99
dtypes: float64(100)

Adam Hughes

Oct 3, 2012, 3:19:21 PM
to pyd...@googlegroups.com

Ah this is perfect thanks a lot!


Alvaro Tejero Cantero

Oct 4, 2012, 4:15:43 AM
to pyd...@googlegroups.com
Just a pedantic note (hammering on Wouter's remark "for Python 2.7") to make it more robust: use integer division (// instead of /) so that the grouping lambda survives the transition to Python 3 (or put a from __future__ import division statement at the top of the module).

-á.

Adam Hughes

Oct 4, 2012, 2:12:57 PM
to pyd...@googlegroups.com


So, in my application, I actually am passing floats into the divisor (e.g. x / 4.55) and letting pandas figure out the closest integer by which to split the table.  It seems like pandas will convert to an integer internally, regardless of whether one passes a float or an integer into the divisor.
 



Alvaro Tejero Cantero

Oct 4, 2012, 2:21:04 PM
to pyd...@googlegroups.com
I actually tried it before writing, on Wouter's example, and this was
the result:

In [5]: df_resampled = df.groupby(lambda x:x/5.).mean()
In [6]: df_resampled
Out[6]:
<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 0.0 to 19.8
Columns: 100 entries, 0 to 99
dtypes: float64(100)

Note the index contains 100 entries, not 20. This was on 0.8.1 just now, and on 0.9.rcsomething this morning.


-á.

Wouter Overmeire

Oct 4, 2012, 2:39:38 PM
to pyd...@googlegroups.com



I'm somewhat lost on the point you're trying to make here.

Let's recap with my example (using your excellent suggestion, // instead of /):

In [15]: df = pandas.DataFrame(np.random.randn(100,100))

In [16]: df_resampled = df.groupby(lambda x:x//5).mean()

In [17]: df_resampled
Out[17]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to 19
Columns: 100 entries, 0 to 99
dtypes: float64(100)

df is 100x100, df_resampled is 20x100.

Because df has an integer index equal to range(100), if you do the grouping with lambda x: x / 5., which is int/float division, there will be 100 groups.

Let's compare the first 10 grouping keys for x // 5 and x / 5.

In [18]: [x // 5 for x in range(100)][:10]
Out[18]: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

In [19]: [x / 5. for x in range(100)][:10]
Out[19]: [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8]


Wouter Overmeire

Oct 4, 2012, 2:42:48 PM
to pyd...@googlegroups.com


2012/10/4 Adam Hughes <hughes...@gmail.com>



On Thu, Oct 4, 2012 at 4:15 AM, Alvaro Tejero Cantero <alv...@minin.es> wrote:
just a pedantic note (hammering on Wouter's remark "for Python 2.7")
to make it more robust: use the integer division (// instead of /) so
that the grouping lambda survives the transition to Python 3 (or a
from __future__ import division statement at the top of the module)


So, in my application, I actually am passing floats into the divisor (e.g. x / 4.55) and letting pandas figure out the closest integer by which to split the table.  It seems like pandas will convert to an integer internally, regardless of whether one passes a float or an integer into the divisor.


This is unexpected; I don't think groupby tries to work with the closest integer when the key is a float.

Adam Hughes

Oct 4, 2012, 4:39:54 PM
to pyd...@googlegroups.com
Wow, you guys are both absolutely right.  I would certainly have overlooked this and had a bug in my code.  Thanks a lot for the heads-up.  Apparently I just wasn't paying much attention last night when testing float divisors!

I made two functions out of this that I will post later for completeness.  Thanks again for your help.



 

Adam Hughes

Oct 4, 2012, 5:42:59 PM
to pyd...@googlegroups.com
Here's how I ended up using the info in this thread.  Basically, it's a function that lets the user rebin by rows/columns based on mean, sum or weighting.  Weighting can actually be done intrinsically (the max value in a column becomes weight=1.00).  Strictly speaking, this is not real weighting, but it's useful to my analysis.  The function is redundant because I was lazy and made one function handle both dataframe and series input.

def df_rebin(df, binwidth, axis=0, avg_fcn='mean', weight_max=None):
    ''' Pass in an array; this slices and averages it along some spacing increment (bins).
    axis=0 means averages are computed along rows; axis=1 means averages are computed along columns.
    DataFrame already handles most issues, such as binning being unequal and/or the bin width being
    larger than the actual length of the data along the axis (e.g. binning 100 rows at 200 rows/bin).
    binwidth is the spacing, as in "every X entries, take an average".  A width of 3 means
    every 3 entries, take an average.

    Redundant because if a Series is passed in, the axis keyword causes errors.

    If using avg_fcn='weighted', one can pass an upper limit into the "weight_max" argument
    so that weighting is to a fixed value and not to the max of the dataset.  This is
    useful when comparing datasets objectively.  For DataFrames, weight_max may be a 1-d
    array of the normalization constant for each column in the DataFrame.  If a single
    value is entered, or for a Series, that value will be divided through every row.'''

    if len(df.shape) == 1:
        if avg_fcn.lower() == 'mean':
            dfout = df.groupby(lambda x: x // binwidth).mean()  # // (floor division) is important

        elif avg_fcn.lower() == 'sum':
            dfout = df.groupby(lambda x: x // binwidth).sum()

        ### Rebins according to the sum, then divides through by weight_max (or the max).
        ### If I want a normalized array, can call this with binwidth=1.0
        elif avg_fcn.lower() == 'weighted':
            dfout = df.groupby(lambda x: x // binwidth).sum()
            if weight_max is not None:
                dfout = dfout / weight_max  # true division, not //
            else:
                dfout = dfout / dfout.max()

        else:
            raise NotImplementedError('%s is not a valid key to df_rebin, must '
                                      'be mean, sum or weighted' % avg_fcn)

    elif len(df.shape) == 2:
        if avg_fcn.lower() == 'mean':
            dfout = df.groupby(lambda x: x // binwidth, axis=axis).mean()

        elif avg_fcn.lower() == 'sum':
            dfout = df.groupby(lambda x: x // binwidth, axis=axis).sum()

        ### Rebins according to the sum, then divides columns or rows by their maxes.
        ### If I want a normalized array, can call this with binwidth=1.0
        elif avg_fcn.lower() == 'weighted':
            dfout = df.groupby(lambda x: x // binwidth, axis=axis).sum()
            if weight_max is not None:
                dfout = dfout.apply(lambda x: x / weight_max, axis=axis)
            else:
                dfout = dfout.apply(lambda x: x / x.max(), axis=axis)

        else:
            raise NotImplementedError('%s is not a valid key to df_rebin, must '
                                      'be mean, sum or weighted' % avg_fcn)

    else:
        raise NotImplementedError('df_rebin only works with 1-d or 2-d arrays')

    return dfout
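As a standalone sanity check, the mean path of the function above boils down to this groupby pattern (shapes here assume the bin width divides the row count evenly):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 20))
binwidth = 10

# equivalent of the mean path: floor-divide the integer index
# labels to form consecutive 10-row groups, then average each group
out = df.groupby(lambda x: x // binwidth).mean()

print(out.shape)  # (10, 20)
# the first bin is the plain average of the first 10 rows
print(np.allclose(out.iloc[0], df.iloc[:10].mean()))  # True
```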

Adam Hughes

Oct 8, 2012, 1:38:11 PM
to pyd...@googlegroups.com
Hi Wouter,


Just a quick followup to this thread.  At one point, you showed that by doing integer division, one gets an array of indices.  You said:


Let's compare the first 10 grouping keys for x // 5 and x / 5.

In [18]: [x // 5 for x in range(100)][:10]
Out[18]: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


I was wondering if it is possible to do the same grouping/averaging by passing in such an array of indices.  For example, I do some histogram binning in another program, get back a list of digitized indices, and now want to group my dataframe along these indices.  Is something like this possible?

I tried doing something like:

>>> df=DataFrame(randn(10,10))
>>> dfresamp=df.groupby([0,0,0,0,0,1,1,1,1,1,1]).mean()
>>> df.shape, dfresamp.shape
((10, 10), (10, 8))

It doesn't really work.

Wouter Overmeire

Oct 8, 2012, 2:51:50 PM
to pyd...@googlegroups.com



If you give a list, it should refer to the columns you wish to group on. For your example the columns have the default keys [0, 1, 2, ..., 9], so the input to groupby is not really sensible and will not produce what you need.

This can be easily fixed by making a Series from the array, be sure to give it the same index as the DataFrame.

In [13]: from string import ascii_lowercase, ascii_uppercase

In [14]: df = pd.DataFrame(randn(10,10), columns=list(ascii_uppercase[:10]), index=list(ascii_lowercase[:10]))

In [15]: df.groupby(pd.Series([0,0,0,0,0,1,1,1,1,1], index=df.index)).mean()
Out[15]: 
          A         B         C         D         E         F         G         H         I         J
0 -0.317461 -0.273652 -0.029996  0.402345 -0.454925  0.411764  0.203002 -0.317178  0.036881 -0.042777
1  0.069365  0.400778 -0.270512 -0.485882  0.487505 -0.464670 -0.101800  0.156909  0.371478 -0.234284

In your example, the default index is used.

In [16]: df = pd.DataFrame(randn(10,10))

In [17]: df.groupby(pd.Series([0,0,0,0,0,1,1,1,1,1])).mean()
Out[17]: 
          0         1         2         3         4         5         6         7         8         9
0  0.620309  0.674822 -0.154680 -1.150960  0.092368  0.160989  0.147444  0.111853 -0.084692 -0.556367
1  0.068149 -0.273187  0.388405  0.046407 -0.054020 -0.395190  0.509529  0.095781 -0.152507  0.036615

Another option is to use a function that takes as input an axis label (from the axis being grouped on) and outputs a group key.

In [19]: df.groupby(lambda x: [0,0,0,0,0,1,1,1,1,1][x]).mean()
Out[19]: 
          0         1         2         3         4         5         6         7         8         9
0  0.620309  0.674822 -0.154680 -1.150960  0.092368  0.160989  0.147444  0.111853 -0.084692 -0.556367
1  0.068149 -0.273187  0.388405  0.046407 -0.054020 -0.395190  0.509529  0.095781 -0.152507  0.036615

Adam Hughes

Oct 8, 2012, 2:58:01 PM
to pyd...@googlegroups.com
Thanks.  You sure know your pandas.  You're right, I need labels, and actually am using rows not columns for this particular application.  Let me play with the information you sent me and see if I can understand.

 

Adam Hughes

Oct 8, 2012, 3:38:15 PM
to pyd...@googlegroups.com
Thanks Wouter, these work excellently.  I'm going to try using the lambda method, but I understand this much better now from your explanation on converting to a series first.

Adam Hughes

Oct 8, 2012, 4:03:51 PM
to pyd...@googlegroups.com
If you have time, can you clarify one last aspect of this for me?  You said:

In [19]: df.groupby(lambda x: [0,0,0,0,0,1,1,1,1,1][x]).mean()
Out[19]: 
          0         1         2         3         4         5         6         7         8         9
0  0.620309  0.674822 -0.154680 -1.150960  0.092368  0.160989  0.147444  0.111853 -0.084692 -0.556367
1  0.068149 -0.273187  0.388405  0.046407 -0.054020 -0.395190  0.509529  0.095781 -0.152507  0.036615

If I have row and column labels, this should take those in, no?



In [14]: df = pd.DataFrame(randn(10,10), columns=list(ascii_uppercase[:10]), index=list(ascii_lowercase[:10]))

>>> df.groupby(lambda x: ['a','a','a','a','a','b','b','b','b','b'][x], axis=1).mean()

This is not correct; it gives an error.  Am I misunderstanding?


 

Wouter Overmeire

Oct 8, 2012, 4:23:11 PM
to pyd...@googlegroups.com



If a function is used as the 'by' argument for groupby, it is called for each label of the axis on which groupby runs. The function returns a group key/label; there will be as many groups as there are unique values returned when the function is called on all the axis labels. Your In [19] example ran on axis=0 and got as input 0, 1, 2, 3, 4, ...; it gave back two unique values, 0 and 1, so there are two groups, labeled 0 and 1.

Now for your second example, which runs on axis=1: that axis has labels 'A', 'B', 'C', 'D', 'E', ... and the lambda function gets as input those same values 'A', 'B', 'C', 'D', 'E', ..., which results in a TypeError, since you need an integer to index a list.
I would use a Series here. If you want to use a lambda function, it needs to be changed so that it takes a string as input and outputs whatever group label you need.
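For example, a Series of group keys that shares the frame's index sidesteps the label-type problem entirely (the labels and keys here are just illustrative):

```python
import numpy as np
import pandas as pd
from string import ascii_lowercase

df = pd.DataFrame(np.random.randn(10, 4), index=list(ascii_lowercase[:10]))

# group keys carried by a Series that shares the frame's index,
# so groupby matches rows on labels instead of calling a function
keys = pd.Series(['a'] * 5 + ['b'] * 5, index=df.index)
out = df.groupby(keys).mean()

print(out.shape)         # (2, 4)
print(list(out.index))   # ['a', 'b']
```

The same idea works for columns: build the key Series with the frame's columns as its index.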

Adam Hughes

Oct 8, 2012, 5:24:56 PM
to pyd...@googlegroups.com
Ok thanks for the explanation.  I will use Series.

 
