data.table like group by to multi-column transform

26 views
Skip to first unread message

Josiah Olson

unread,
Aug 18, 2016, 10:24:00 AM8/18/16
to PyData
I'm trying to use pandas to do a group-by-transform operation but the normal syntax for this process doesn't appear to work if both you groupby and your transform steps are functions of two or more columns in your data.

I have composed the following example in R to demonstrate what I'm trying to accomplish:
library(data.table)

test = data.table(x=seq(10, 20), y=seq(30, 50, 2), z=seq(40, 50))

test[, new := sum(x + y), by=(x + y) %% 2]

test
     x  y  z new
 1: 10 30 40 260
 2: 11 32 41 275
 3: 12 34 42 260
 4: 13 36 43 275
 5: 14 38 44 260
 6: 15 40 45 275
 7: 16 42 46 260
 8: 17 44 47 275
 9: 18 46 48 260
10: 19 48 49 275

The following is the best way I've been able to do this in pandas without create intermediate columns or copies of my data that I need to go back and delete after my oneliner:

test = pd.DataFrame({'x': range(10, 20), 'y': range(30, 50, 2), 'z': range(40, 50)})
test = (test.assign(temp = lambda x: (x.x + x.y) % 2)
        .pipe(lambda df: df.groupby('temp')
              .apply(lambda x: pd.Series({'new': np.sum(x.x + x.y)}))
              .merge(df, left_index=True, right_on='temp'))
        .drop('temp', axis=1))
test
newxyz
0260103040
2260123442
4260143844
6260164246
8260184648
1275113241
3275133643
5275154045
7275174447
9275194849

Does anyone have a suggestion of a slicker and less verbose way to do this in pandas? I've explored using agg or transform but they only appear to work on one column at a time and I need to have some function that operates on the interaction between columns. Additionally, the example I've shown above in pandas is still not ideal as I have to create the 'temp' column to group by a function of two columns. I'd even be interested to know if anyone thinks there should be additional syntax added to pandas to accommodate this type of operation, or am I just trying to do something that no one ever does? I can say this is a fairly common step in my data processing pipelines.

Joris Van den Bossche

unread,
Aug 18, 2016, 4:34:53 PM8/18/16
to PyData
You can do it like this:

In [19]: test['new'] = test.groupby((test.x + test.y) % 2)[['x', 'y']].transform('sum').sum(axis=1)

In [20]: test
Out[20]:
    x   y   z  new
0  10  30  40  260
1  11  32  41  275
2  12  34  42  260
3  13  36  43  275
4  14  38  44  260
5  15  40  45  275
6  16  42  46  260
7  17  44  47  275
8  18  46  48  260
9  19  48  49  275


Note:
- you can pass a series to groupby, so no need to create the temp column
- I first sum x and y separately (using .transform('sum') instead of .sum()/.agg('sum') to keep the original shape), and then sum the x and y columns.

Regards,
Joris


--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply all
Reply to author
Forward
0 new messages