Hello again. Well, you came to the right place. I can get this op down
to about 1.2 seconds on a sample dataset with 2601 unique
group1/group2 pairs; you just have to know a bit of pandas-kata. Don't
worry, it will come in time =) I suspect that the runtime will scale
roughly linearly with the number of unique group pairs.

from pandas import *
import numpy as np
import string

# 51 group1 labels (all ASCII letters but the last) and 51 group2 labels
g1 = np.array(list(string.ascii_letters))[:-1]
g2 = np.arange(51)

df_small = DataFrame({'group1' : ["a","b","a","a","b","c","c","c","c",
                                  "c","a","a","a","b","b","b","b"],
                      'group2' : [1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1],
                      'value' : ["apple","pear","orange","apple",
                                 "banana","durian","lemon","lime",
                                 "raspberry","durian","peach","nectarine",
                                 "banana","lemon","guava","blackberry",
                                 "grape"]})

# blow the 17 sample values up to 51 * 40000 rows
value = df_small['value'].values.repeat(3)
df = DataFrame({'group1' : g1.repeat(40000),
                'group2' : np.tile(g2, 40000),
                'value' : value.repeat(40000)})

def random_sample():
    from random import choice
    grouped = df.groupby(['group1','group2'])['value']
    # pick one row label at random within each group
    choose = lambda group: choice(group.index)
    indices = grouped.apply(choose)
    return df.reindex(indices)

Let me know if this works and does what you want.
best,
Wes
Hey, would you mind bottom replying?
Try this version, should be *way* faster:

def random_sample_v2():
    from random import choice
    grouped = df.groupby(['group1','group2'])['value']
    # grouped.groups maps each (group1, group2) key to its row labels
    indices = [choice(v) for k, v in grouped.groups.items()]
    return df.reindex(indices)

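For a quick sanity check on a frame small enough to eyeball, here is a minimal sketch of the same idea (one random row label per group, taken from the groupby's `groups` mapping). The names `df_demo` and `sample_one_per_group` and the `seed` argument are made up for illustration and are not part of pandas or of the function above; the seed is only there to make the draw repeatable.

```python
import random

import pandas as pd

# Hypothetical toy data: 4 distinct (group1, group2) pairs over 6 rows.
df_demo = pd.DataFrame({
    "group1": ["a", "b", "a", "a", "b", "c"],
    "group2": [1, 2, 1, 3, 2, 3],
    "value": ["apple", "pear", "orange", "lime", "banana", "durian"],
})


def sample_one_per_group(frame, keys, seed=None):
    """Return one randomly chosen row for each distinct combination of keys."""
    rng = random.Random(seed)
    # .groups maps each group key to the row labels belonging to that group.
    groups = frame.groupby(keys).groups
    picked = [rng.choice(list(labels)) for labels in groups.values()]
    return frame.loc[picked]


sample = sample_one_per_group(df_demo, ["group1", "group2"], seed=0)
print(sample)
```

The result should have exactly one row per group1/group2 pair, which is easy to confirm by checking for duplicated key pairs.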
Let me know how it goes. What platform are you on and how much RAM do
you have BTW? You might take a peek at your memory usage and make sure
you aren't paging.
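
One rough way to see how much memory the frame itself takes is `DataFrame.memory_usage`. The sketch below runs on a small made-up frame (`frame` is a stand-in, not the real data); `deep=True` asks pandas to also count the Python string objects rather than just the pointers to them. This only measures the frame, so checking whether the machine is actually paging still needs the OS tools.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same column layout as the example data.
frame = pd.DataFrame({
    "group1": np.repeat(["a", "b", "c"], 1000),
    "group2": np.tile(np.arange(3), 1000),
    "value": np.repeat(["apple", "pear", "durian"], 1000),
})

# Bytes per column (plus the index); deep=True inspects object contents.
per_column = frame.memory_usage(deep=True)
shallow_total = frame.memory_usage(deep=False).sum()
deep_total = per_column.sum()

print(per_column)
print("total MB: %.2f" % (deep_total / 1e6))
```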
- Wes
No problem-- you're helping me expand my bag of tricks as well :)
- Wes