subsample rows by group with pandas


Uri Laserson

Sep 27, 2011, 9:51:31 PM
to pystatsmodels

Hi all,

I am trying to subsample rows of a DataFrame according to a grouping.
Here is an example. Say I define the following data:

In [189]: df = DataFrame({'group1' : ["a","b","a","a","b","c","c","c","c",
   .....:                             "c","a","a","a","b","b","b","b"],
   .....:                 'group2' : [1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1],
   .....:                 'value' : ["apple","pear","orange","apple",
   .....:                            "banana","durian","lemon","lime",
   .....:                            "raspberry","durian","peach","nectarine",
   .....:                            "banana","lemon","guava","blackberry","grape"]})

If I group by `group1` and `group2`, the number of rows in each group
comes out as follows:

In [190]: df.groupby(['group1','group2'])['value'].agg({'count':len})
Out[190]:
count
a 1 2
2 1
3 2
4 1
b 1 2
2 2
3 1
4 1
c 3 1
4 1
5 2
6 1

(If there is an even more concise way to compute that, please tell.)
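
One more concise candidate, assuming this pandas version has a
GroupBy.size method, would be:

In [191]: df.groupby(['group1','group2']).size()

which should give the same counts as a Series indexed by the group keys.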

I now want to construct a DataFrame that has one randomly selected row
from each group. My proposal is to do it like so:

In [215]: from random import choice
In [216]: grouped = df.groupby(['group1','group2'])
In [217]: subsampled = grouped.apply(lambda x: df.reindex(index=[choice(range(len(x)))]))

In [218]: subsampled.index = range(len(subsampled))
In [219]: subsampled
Out[219]:
group1 group2 value
0 b 2 pear
1 a 1 apple
2 b 2 pear
3 a 1 apple
4 a 1 apple
5 a 1 apple
6 a 1 apple
7 a 1 apple
8 a 1 apple
9 a 1 apple
10 a 1 apple
11 a 1 apple
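
One caveat with the lambda above: as written it reindexes the full df
rather than the group x, which is why the output repeats rows 0 and 1 of
df. A version that actually samples within each group would be something
like:

In [220]: subsampled = grouped.apply(lambda x: x.take([choice(range(len(x)))]))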

Either way, my real data has about 2.5 million rows. If I
do this the dirty way by building my own data structures, I can
complete this operation in a matter of seconds. However, my
implementation above does not finish within 20 minutes. As a side
note, when I tried implementing this in R, I first tried `plyr`, which
also did not finish in a reasonable amount of time; however, a
solution using `data.table` finished very rapidly as well.

I want to love this package, so please help!

Thanks!
Uri

Wes McKinney

Sep 27, 2011, 10:19:59 PM
to pystat...@googlegroups.com

Hello again. Well, you came to the right place. I can get this op down
to about 1.2 seconds on a sample dataset with 2601 unique
group1/group2 pairs; you just have to know a bit of pandas-kata. Don't
worry, it will come in time =) I suspect that the runtime will scale
roughly linearly with the number of unique group pairs.

from pandas import *
import numpy as np
import string

# string.letters is Python 2 (use string.ascii_letters on Python 3);
# 51 letters x 51 integers gives 2601 unique (group1, group2) pairs
g1 = np.array(list(string.letters))[:-1]
g2 = np.arange(51)

df_small = DataFrame({'group1' : ["a","b","a","a","b","c","c","c","c",
                                  "c","a","a","a","b","b","b","b"],
                      'group2' : [1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1],
                      'value' : ["apple","pear","orange","apple",
                                 "banana","durian","lemon","lime",
                                 "raspberry","durian","peach","nectarine",
                                 "banana","lemon","guava","blackberry",
                                 "grape"]})

# blow the toy frame up to about 2 million rows
value = df_small['value'].values.repeat(3)
df = DataFrame({'group1' : g1.repeat(40000),
                'group2' : np.tile(g2, 40000),
                'value' : value.repeat(40000)})

def random_sample():
    grouped = df.groupby(['group1','group2'])['value']
    from random import choice
    # pick one row label at random from each group, then pull those rows
    choose = lambda group: choice(group.index)
    indices = grouped.apply(choose)
    return df.reindex(indices)
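
As a rough check, something like this should give one row per unique pair
(the ~1.2 second figure is what I see on this synthetic frame; timings
will vary):

import time

start = time.time()
result = random_sample()
print(len(result))          # 2601: one row per unique (group1, group2) pair
print(time.time() - start)  # about a second on the synthetic frame above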

Let me know if this works and does what you want.

best,
Wes

Uri Laserson

Sep 28, 2011, 2:06:56 AM
to pystat...@googlegroups.com
Your way is much cleaner/prettier.  However, there are about 500k unique groups in the 2.5M rows.  When I ran your way, and also waited for my proposed way to finish, they both took about 40 min (though yours was slightly faster).  Do you think this is the lower bound, or is there a way to speed it up even more?  Does the implementation scan through the rows once to build the groups, or multiple times?

Uri

...................................................................................
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
lase...@mit.edu

Wes McKinney

Sep 28, 2011, 9:13:58 AM
to pystat...@googlegroups.com

Hey, would you mind bottom replying?

Try this version, should be *way* faster:

def random_sample_v2():
    grouped = df.groupby(['group1','group2'])['value']
    from random import choice
    indices = [choice(v) for k, v in grouped.groups.iteritems()]
    return df.reindex(indices)
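
The win here is that grouped.groups is a dict mapping each group key to
the row labels in that group, so nothing gets dispatched through a
Python-level apply per group. A small sketch of what it holds, using the
toy df_small from before:

grouped_small = df_small.groupby(['group1','group2'])['value']
# keys are (group1, group2) tuples, values are the row labels in that
# group, e.g. ('a', 1) -> [0, 10]
print(grouped_small.groups[('a', 1)])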

Let me know how it goes. What platform are you on and how much RAM do
you have, BTW? You might take a peek at your memory usage and make sure
you aren't paging.

- Wes

Uri Laserson

Sep 28, 2011, 11:46:18 AM
to pystat...@googlegroups.com
> Try this version, should be *way* faster:
>
> def random_sample_v2():
>     grouped = df.groupby(['group1','group2'])['value']
>     from random import choice
>     indices = [choice(v) for k, v in grouped.groups.iteritems()]
>     return df.reindex(indices)
>
> Let me know how it goes. What platform are you on and how much RAM do
> you have, BTW? You might take a peek at your memory usage and make sure
> you aren't paging.
>
> - Wes

Virtually instantaneous :)  Even more excited now.

Regarding the other implementation: I'm on a brand-new Intel processor with 8 GB of RAM on Ubuntu.  The Python process never takes more than about 40% of the RAM.

Thanks for all the help!

Uri

Wes McKinney

Sep 28, 2011, 11:55:17 AM
to pystat...@googlegroups.com

No problem-- you're helping me expand my bag of tricks as well :)

- Wes
