Stratified permutations with pandas

Uri Laserson

Oct 18, 2011, 5:25:04 PM
to pystatsmodels
Hi all,

I am trying to perform stratified permutation sampling. Take the
following data frame:

import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'label': ['a','a','a','a','a','a','a','b','b','b','b','b','b'],
    'value': [1,2,3,4,5,6,7,8,9,10,11,12,13],
    'state': ['n','n','a',np.nan,np.nan,'a','a','n','a','a','n','a','a']})

I want to create a new column called "state_perm", where for each
value of `label`, I permute the values of `state`, but leave the
values in `value` unchanged.

Here is what I would think would work:

grouped = df2.groupby('label')
for idxs in grouped.groups.itervalues():
    df2.ix[idxs]['state_permuted'] = df2['state'][np.random.permutation(idxs)]

Though perhaps this is exactly what is meant by "Setting values on a
mixed-type DataFrame or Panel is supported when using scalar values,
though setting arbitrary vectors is not yet supported".

If this is the case, what is the best method to get the same result in
terms of performance?

Thanks!
Uri

Uri Laserson

Oct 18, 2011, 6:36:29 PM
to pystatsmodels
Of course, please tell me if you have a better strategy for doing the permutation sampling by group.  Thanks!

Uri

...................................................................................
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
lase...@mit.edu

Wes McKinney

Oct 22, 2011, 12:07:45 PM
to pystat...@googlegroups.com

hi Uri,

I would do this personally:

def do_shuffle(arr):
    from random import shuffle
    result = arr.copy().values
    shuffle(result)
    return result

df2['state_perm'] = df2.groupby('label')['state'].transform(do_shuffle)

let me know how that works out

- Wes
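
For readers trying this on a current pandas release, here is a minimal self-contained sketch of the transform pattern suggested above. It assumes Python 3 and a recent pandas. Copying the group into a plain list before calling random.shuffle is a substitution made here (the original copies via .values), and the `composition` helper is a hypothetical name added only to check that each label keeps the same set of states.

```python
import random

import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'label': list('aaaaaaabbbbbb'),
    'value': list(range(1, 14)),
    'state': ['n', 'n', 'a', np.nan, np.nan, 'a', 'a',
              'n', 'a', 'a', 'n', 'a', 'a'],
})


def do_shuffle(group):
    # Copy the group's values into a list so the original column is untouched,
    # then shuffle that copy in place.
    result = list(group)
    random.shuffle(result)
    return result


# transform calls do_shuffle once per label and realigns the results
# with the original row order.
df2['state_perm'] = df2.groupby('label')['state'].transform(do_shuffle)


def composition(frame, column):
    # Hypothetical helper: per-label sorted list of values.
    # NaN is stringified so that it sorts alongside the strings.
    return {label: sorted(map(str, g[column]))
            for label, g in frame.groupby('label')}


same_composition = composition(df2, 'state') == composition(df2, 'state_perm')
print(df2)
print('per-label composition preserved:', same_composition)
```

The `value` column is never touched; only `state` is shuffled, and only within each label.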

Uri Laserson

Nov 4, 2011, 5:20:01 PM
to pystat...@googlegroups.com
So before I saw your response I came up with my own method, which looks like this:

def shuffle_uri(df,grouped):
    perm = np.r_[tuple([np.random.permutation(idxs) for idxs in grouped.groups.itervalues()])]
    df['state_permuted'] = np.asarray(df.ix[perm]['state'])

Each permutation took about 6 seconds using my method.  Using yours, it took about 2.5 seconds.  What are the lessons learned here?

Uri

.......................................................................................
Uri Laserson
Graduate Student | Biomedical Engineering | Church Lab

Harvard-MIT Division of Health Sciences and Technology
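
Note that `.ix` has since been removed from pandas and `dict.itervalues()` exists only in Python 2, so the function above will not run unchanged today. A rough modern sketch of the same idea (build one permutation for the whole frame, then index once) might look like this. The name `shuffle_by_positions`, its parameters, and the use of `groupby(...).indices` with a NumPy `Generator` are assumptions made for illustration, not part of the original post.

```python
import numpy as np
import pandas as pd


def shuffle_by_positions(df, by, column, new_column, rng):
    # groupby(...).indices maps each group key to the integer row
    # positions of that group's rows.
    positions = df.groupby(by).indices
    # Start from the identity permutation over all rows.
    perm = np.arange(len(df))
    for idx in positions.values():
        # Each row of this group takes its value from a randomly
        # chosen row of the same group.
        perm[idx] = rng.permutation(idx)
    # One positional take over the whole column instead of one per group.
    df[new_column] = df[column].to_numpy()[perm]
    return df


df2 = pd.DataFrame({
    'label': list('aaaaaaabbbbbb'),
    'value': list(range(1, 14)),
    'state': ['n', 'n', 'a', np.nan, np.nan, 'a', 'a',
              'n', 'a', 'a', 'n', 'a', 'a'],
})

rng = np.random.default_rng(0)  # the seed is an arbitrary choice
shuffle_by_positions(df2, 'label', 'state', 'state_permuted', rng)
print(df2)
```

Because the permutation is written by position, this version does not rely on the rows being sorted by label.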

Wes McKinney

Nov 19, 2011, 9:43:25 PM
to pystat...@googlegroups.com

I looked at the profile of both versions. With the latest git version
of pandas (where numerous optimizations have been made), the difference
does not seem to be as significant, at least with my test dataset. In
fact, the two take about the same time:


Ordered by: cumulative time

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1    0.002    0.002    1.022    1.022  <string>:1(<module>)
     1    0.032    0.032    1.020    1.020  groupby_sample.py:41(shuffle_uri)
     1    0.000    0.000    0.486    0.486  indexing.py:26(__getitem__)
     1    0.000    0.000    0.486    0.486  indexing.py:129(_getitem_axis)
     1    0.000    0.000    0.486    0.486  indexing.py:158(_getitem_iterable)
     1    0.000    0.000    0.486    0.486  frame.py:1206(reindex)
     1    0.001    0.001    0.486    0.486  frame.py:1247(_reindex_index)
    51    0.484    0.009    0.484    0.009  {method 'permutation' of 'mtrand.RandomState' objects}
     1    0.001    0.001    0.481    0.481  internals.py:602(reindex_axis)
     1    0.000    0.000    0.361    0.361  index.py:484(reindex)
     1    0.000    0.000    0.361    0.361  index.py:717(get_indexer)
     1    0.208    0.208    0.208    0.208  {pandas._tseries.merge_indexer_int64}
     1    0.000    0.000    0.153    0.153  index.py:98(indexMap)

vs. what i suggested:


Ordered by: cumulative time

 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      1    0.004    0.004    0.990    0.990  <string>:1(<module>)
      1    0.011    0.011    0.970    0.970  groupby.py:811(transform)
     51    0.001    0.000    0.677    0.013  groupby_sample.py:35(do_shuffle)
     51    0.599    0.012    0.672    0.013  random.py:276(shuffle)
     51    0.000    0.000    0.228    0.004  index.py:717(get_indexer)
     51    0.000    0.000    0.116    0.002  index.py:98(indexMap)
     51    0.116    0.002    0.116    0.002  {method 'get_mapping' of 'pandas._engines.DictIndexEngine' objects}
     51    0.111    0.002    0.111    0.002  {pandas._tseries.merge_indexer_int64}
1019949    0.073    0.000    0.073    0.000  {method 'random' of '_random.Random' objects}
     51    0.000    0.000    0.025    0.000  fromnumeric.py:346(put)
     51    0.025    0.000    0.025    0.000  {method 'put' of 'numpy.ndarray' objects}
     52    0.000    0.000    0.022    0.000  groupby.py:204(__iter__)
     51    0.000    0.000    0.021    0.000  groupby.py:197(get_group)
So it looks like my do_shuffle function could be improved by using
np.random.permutation, since random.shuffle accounts for about 0.67 s
of the 0.97 s spent in transform. It's not clear to me why doing 51
get_indexer operations is faster than one big one; that is something
to investigate in a bit more detail.

As an aside, I want to make reindexing a lot faster, per
http://wesmckinney.com/blog/?p=345.

- Wes
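
The np.random.permutation idea and the profiling step can be sketched together as follows. The test data here is hypothetical (51 groups to mirror the 51 calls in the profiles; the group size is arbitrary), and the cProfile/pstats calls are just one way to get an "Ordered by: cumulative time" listing, not necessarily how the numbers above were produced. It assumes Python 3 and a recent pandas.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

# Hypothetical test data: 51 groups; the size is chosen only for illustration.
n_groups, rows_per_group = 51, 2000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'label': np.repeat(np.arange(n_groups), rows_per_group),
    'state': rng.choice(['a', 'n'], size=n_groups * rows_per_group),
})


def do_shuffle(group):
    # np.random.permutation returns a shuffled copy of its input,
    # so no explicit copy of the group is needed.
    return np.random.permutation(group.to_numpy())


def run():
    return df.groupby('label')['state'].transform(do_shuffle)


# Profile a single stratified permutation.
profiler = cProfile.Profile()
profiler.enable()
state_perm = run()
profiler.disable()

# Print the 15 most expensive entries, sorted by cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats('cumulative').print_stats(15)
print(stream.getvalue())
```

Timings will differ by machine and by pandas version, so the listing is best read for where the time goes rather than for absolute numbers.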
