hi Uri,
I would do this personally:
def do_shuffle(arr):
    from random import shuffle
    result = arr.copy().values
    shuffle(result)
    return result
df2['state_perm'] = df2.groupby('label')['state'].transform(do_shuffle)
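For reference, a self-contained sketch of the above; the frame df2 and its 'label'/'state' values here are invented toy data standing in for Uri's real dataset (only the column names come from the thread):

```python
import pandas as pd

def do_shuffle(arr):
    from random import shuffle
    # copy so the original column is not mutated, then shuffle in place
    result = arr.copy().values
    shuffle(result)
    return result

# toy stand-in for df2
df2 = pd.DataFrame({
    'label': ['a', 'a', 'a', 'b', 'b'],
    'state': ['NY', 'CA', 'TX', 'WA', 'OR'],
})

# shuffle 'state' independently within each 'label' group
df2['state_perm'] = df2.groupby('label')['state'].transform(do_shuffle)
```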
let me know how that works out
- Wes
I looked at the profiles of both versions. With the latest git version of
pandas (where numerous optimizations have been made), the difference, at
least on my test dataset, no longer seems significant; the two are in
fact about the same:
Ordered by: cumulative time

 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      1    0.002    0.002    1.022    1.022  <string>:1(<module>)
      1    0.032    0.032    1.020    1.020  groupby_sample.py:41(shuffle_uri)
      1    0.000    0.000    0.486    0.486  indexing.py:26(__getitem__)
      1    0.000    0.000    0.486    0.486  indexing.py:129(_getitem_axis)
      1    0.000    0.000    0.486    0.486  indexing.py:158(_getitem_iterable)
      1    0.000    0.000    0.486    0.486  frame.py:1206(reindex)
      1    0.001    0.001    0.486    0.486  frame.py:1247(_reindex_index)
     51    0.484    0.009    0.484    0.009  {method 'permutation' of 'mtrand.RandomState' objects}
      1    0.001    0.001    0.481    0.481  internals.py:602(reindex_axis)
      1    0.000    0.000    0.361    0.361  index.py:484(reindex)
      1    0.000    0.000    0.361    0.361  index.py:717(get_indexer)
      1    0.208    0.208    0.208    0.208  {pandas._tseries.merge_indexer_int64}
      1    0.000    0.000    0.153    0.153  index.py:98(indexMap)
vs. what I suggested:
Ordered by: cumulative time

 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      1    0.004    0.004    0.990    0.990  <string>:1(<module>)
      1    0.011    0.011    0.970    0.970  groupby.py:811(transform)
     51    0.001    0.000    0.677    0.013  groupby_sample.py:35(do_shuffle)
     51    0.599    0.012    0.672    0.013  random.py:276(shuffle)
     51    0.000    0.000    0.228    0.004  index.py:717(get_indexer)
     51    0.000    0.000    0.116    0.002  index.py:98(indexMap)
     51    0.116    0.002    0.116    0.002  {method 'get_mapping' of 'pandas._engines.DictIndexEngine' objects}
     51    0.111    0.002    0.111    0.002  {pandas._tseries.merge_indexer_int64}
1019949    0.073    0.000    0.073    0.000  {method 'random' of '_random.Random' objects}
     51    0.000    0.000    0.025    0.000  fromnumeric.py:346(put)
     51    0.025    0.000    0.025    0.000  {method 'put' of 'numpy.ndarray' objects}
     52    0.000    0.000    0.022    0.000  groupby.py:204(__iter__)
     51    0.000    0.000    0.021    0.000  groupby.py:197(get_group)
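(For reference, listings like the two above come from cProfile with cumulative-time sorting; a minimal sketch of how to produce one, with a made-up workload standing in for the actual shuffle benchmark:)

```python
import cProfile
import io
import pstats

def workload():
    # stand-in for the real groupby/shuffle benchmark
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# print the top entries sorted by cumulative time, like the listings above
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats('cumulative').print_stats(5)
print(buf.getvalue())
```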
So it looks like my do_shuffle function could be improved by using
np.random.permutation. It's not clear to me why doing 51 get_indexer
operations is faster than one big one; that's something to investigate
in a bit more detail.
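A permutation-based rewrite might look like the following (a sketch; do_shuffle_np is my name for it, not from the thread). np.random.permutation builds the shuffled copy in C, rather than random.shuffle's one-swap-per-element Python loop, which is where the 0.599s of tottime above is going:

```python
import numpy as np
import pandas as pd

def do_shuffle_np(arr):
    # np.random.permutation returns a shuffled *copy*;
    # the input group is left untouched
    return np.random.permutation(np.asarray(arr))

# usage mirrors the earlier transform call, with toy data
s = pd.Series(['NY', 'CA', 'TX', 'WA'])
shuffled = do_shuffle_np(s)
```

Dropping this in for do_shuffle in the transform call should shift the time from random.py:276(shuffle) onto the C-level permutation method.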
As an aside, I want to make reindexing a lot faster, per
http://wesmckinney.com/blog/?p=345.
- Wes