data.table like group by transform with multiple columns

785 views
Skip to first unread message

Josiah Olson

unread,
Aug 10, 2016, 5:55:43 PM8/10/16
to PyData
I'm trying to do a group by transform in a pandas oneliner where both the group by and transform are functions of multiple columns. Below is the slickest way I've been able to do this in pandas:

import pandas as pd
test = pd.DataFrame({'x': range(10, 20), 'y': range(30, 50, 2), 'z': range(40, 50)})
test = (test.assign(temp = lambda x: (x.x + x.y) % 2)
        .pipe(lambda df: df.groupby('temp')
              .apply(lambda x: pd.Series({'new': np.sum(x.x + x.y)}))
              .merge(df, left_index=True, right_on='temp'))
        .drop('temp', axis=1))
test

And the equivalent in data.table:

library(data.table)
test = data.table(x=seq(10, 19), y=seq(30, 48, 2), z=seq(40, 49))
test[, new := sum(x + y), by=(x + y) %% 2]
     x  y  z new
 1: 10 30 40 260
 2: 11 32 41 275
 3: 12 34 42 260
 4: 13 36 43 275
 5: 14 38 44 260
 6: 15 40 45 275
 7: 16 42 46 260
 8: 17 44 47 275
 9: 18 46 48 260
10: 19 48 49 275

The operation I'm trying to do is trival in pandas if your groupby and apply steps are only a function of one column. Am I missing some obvious pandas functionality? Is this something that could be added to the pandas api easily and if so how?

D. S. McNeil

unread,
Aug 10, 2016, 7:06:23 PM8/10/16
to pyd...@googlegroups.com
I don't quite get the interest in one-lining everything -- sometimes that makes things clearer, and often it doesn't -- but in your particular case something like

In [47]: test["new"] = test.groupby((test.x + test.y) % 2)[["x","y"]].transform(sum).sum(axis=1)

In [48]: test
Out[48]: 
    x   y   z  new
0  10  30  40  260
1  11  32  41  275
2  12  34  42  260
3  13  36  43  275
4  14  38  44  260
5  15  40  45  275
6  16  42  46  260
7  17  44  47  275
8  18  46  48  260
9  19  48  49  275

seems to reproduce the desired target frame.


Doug

Josiah Olson

unread,
Aug 11, 2016, 9:56:55 AM8/11/16
to PyData
Thanks for the reply Doug. I think your method of using two reductions back to back the first reducing by row than reducing by column is very clear. I can imagine more complex functions than sum where the reductions may have an interaction by row and column but in that complex case it's probably best to do it in several steps.

Quick followup. is there any way in transform to have a wildcard transform. For example you can do 
 test.groupby((test.x + test.y) % 2)[["x","y"]].transform({'x': np.sum, 'y': np.mean})

Is there a wildcard where all other remaining columns in the data frame could get some other reductions maybe like so for example:
 test.groupby((test.x + test.y) % 2).transform({'x': np.sum, 'y': np.mean, '*': lambda x: x.first()})
Reply all
Reply to author
Forward
0 new messages