data.table like group by to multi-column transform

26 views

Skip to first unread message

Josiah Olson

unread,

Aug 18, 2016, 10:24:00 AM8/18/16

to PyData

I'm trying to use pandas to do a group-by-transform operation but the normal syntax for this process doesn't appear to work if both you groupby and your transform steps are functions of two or more columns in your data.

I have composed the following example in R to demonstrate what I'm trying to accomplish:

library(data.table)

test = data.table(x=seq(10, 20), y=seq(30, 50, 2), z=seq(40, 50))

test[, new := sum(x + y), by=(x + y) %% 2]

test
     x  y  z new
 1: 10 30 40 260
 2: 11 32 41 275
 3: 12 34 42 260
 4: 13 36 43 275
 5: 14 38 44 260
 6: 15 40 45 275
 7: 16 42 46 260
 8: 17 44 47 275
 9: 18 46 48 260
10: 19 48 49 275

The following is the best way I've been able to do this in pandas without create intermediate columns or copies of my data that I need to go back and delete after my oneliner:

test = pd.DataFrame({'x': range(10, 20), 'y': range(30, 50, 2), 'z': range(40, 50)})
test = (test.assign(temp = lambda x: (x.x + x.y) % 2)
        .pipe(lambda df: df.groupby('temp')
              .apply(lambda x: pd.Series({'new': np.sum(x.x + x.y)}))
              .merge(df, left_index=True, right_on='temp'))
        .drop('temp', axis=1))
test
new x y z
0 260 10 30 40
2 260 12 34 42
4 260 14 38 44
6 260 16 42 46
8 260 18 46 48
1 275 11 32 41
3 275 13 36 43
5 275 15 40 45
7 275 17 44 47
9 275 19 48 49

	new	x	y	z
0	260	10	30	40
2	260	12	34	42
4	260	14	38	44
6	260	16	42	46
8	260	18	46	48
1	275	11	32	41
3	275	13	36	43
5	275	15	40	45
7	275	17	44	47
9	275	19	48	49

Does anyone have a suggestion of a slicker and less verbose way to do this in pandas? I've explored using agg or transform but they only appear to work on one column at a time and I need to have some function that operates on the interaction between columns. Additionally, the example I've shown above in pandas is still not ideal as I have to create the 'temp' column to group by a function of two columns. I'd even be interested to know if anyone thinks there should be additional syntax added to pandas to accommodate this type of operation, or am I just trying to do something that no one ever does? I can say this is a fairly common step in my data processing pipelines.

Joris Van den Bossche

unread,

Aug 18, 2016, 4:34:53 PM8/18/16

to PyData

You can do it like this:

In [19]: test['new'] = test.groupby((test.x + test.y) % 2)[['x', 'y']].transform('sum').sum(axis=1)

In [20]: test
Out[20]:
x y z new
0 10 30 40 260
1 11 32 41 275
2 12 34 42 260
3 13 36 43 275
4 14 38 44 260
5 15 40 45 275
6 16 42 46 260
7 17 44 47 275
8 18 46 48 260
9 19 48 49 275

Note:

- you can pass a series to groupby, so no need to create the temp column

- I first sum x and y separately (using .transform('sum') instead of .sum()/.agg('sum') to keep the original shape), and then sum the x and y columns.

Regards,

Joris

--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply all

Reply to author

Forward

0 new messages

	new	x	y	z
0	260	10	30	40
2	260	12	34	42
4	260	14	38	44
6	260	16	42	46
8	260	18	46	48
1	275	11	32	41
3	275	13	36	43
5	275	15	40	45
7	275	17	44	47
9	275	19	48	49

	new	x	y	z
0	260	10	30	40
2	260	12	34	42
4	260	14	38	44
6	260	16	42	46
8	260	18	46	48
1	275	11	32	41
3	275	13	36	43
5	275	15	40	45
7	275	17	44	47
9	275	19	48	49

	new	x	y	z
0	260	10	30	40
2	260	12	34	42
4	260	14	38	44
6	260	16	42	46
8	260	18	46	48
1	275	11	32	41
3	275	13	36	43
5	275	15	40	45
7	275	17	44	47
9	275	19	48	49