Reindex with MultiIndex, unexpected behavior


Lorenzo De Leo

Jun 13, 2012, 5:49:26 AM
to pyd...@googlegroups.com
Hi all,

I'm on pandas 0.7.2 (cannot upgrade to 0.8 yet) and I have a fairly involved use case. I have a DataFrame with a 3-level MultiIndex:

In [1]: import pandas as pd; from numpy.random import randn

In [2]: arrays = [['bar', 'bar', 'bar', 'bar', 'foo', 'foo', 'foo', 'foo'], ['t1', 't1', 't2', 't2', 't3', 't3', 't4', 't4'], [0,1,0,1,0,1,0,1]]

In [3]: tuples = list(zip(*arrays))

In [4]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second', 'third'])

In [5]: df = pd.DataFrame(randn(8,2), index=index)

In [6]: df
Out[6]:
                           0         1
first second third
bar   t1     0     -0.785934  1.118058
             1      0.690336 -0.293743
      t2     0     -0.677024 -0.889262
             1      0.068096 -1.390154
foo   t3     0      1.510566 -0.843813
             1      1.238095  0.791679
      t4     0      0.468408 -1.103306
             1     -0.159713  0.983538

and a Series with a 2-level MultiIndex composed of levels 0 and 2 of the DataFrame's index:

In [7]: arrays = [['bar', 'bar', 'foo', 'foo'], [0,1,0,1]]

In [8]: tuples = list(zip(*arrays))

In [9]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'third'])

In [10]: s = pd.Series(randn(4), index = index)

In [11]: s
Out[11]:
first  third
bar    0       -1.014172
       1       -1.008454
foo    0        1.088346
       1       -0.845830

I would like to reindex the series to the dataframe multiindex. That means in practice adding a level to the series index and replicating the data to fill it up.
After some tweaking I found that this is possible by doing (never mind the swap in levels, I can deal with it later):

In [12]: s.reindex(index=df.reorder_levels([0,2,1]).sort_index().index, method='pad')
Out[12]:
first  third  second
bar    0      t1       -1.014172
              t2       -1.014172
       1      t1       -1.008454
              t2       -1.008454
foo    0      t3        1.088346
              t4        1.088346
       1      t3       -0.845830
              t4       -0.845830

This is the desired result, although it is not clear to me why it works. If I remove method='pad', I get nothing but NaNs:

In [13]: s.reindex(index=df.reorder_levels([0,2,1]).sort_index().index)
Out[13]:
first  third  second
bar    0      t1       NaN
              t2       NaN
       1      t1       NaN
              t2       NaN
foo    0      t3       NaN
              t4       NaN
       1      t3       NaN
              t4       NaN

I tried to follow the code through all the calls, but I lost track at the _merge_indexer function in index.py.
I'm mostly wondering if I'm just getting the correct result only through a glimpse in the code.

I'm also interested in case somebody can come up with a different implementation of the same procedure.

Thanks in advance.

Cheers,

Lorenzo

Lorenzo De Leo

Jun 13, 2012, 5:55:20 AM
to pyd...@googlegroups.com
On Wednesday, June 13, 2012 11:49:26 AM UTC+2, Lorenzo De Leo wrote:

I'm mostly wondering if I'm just getting the correct result only through a glimpse in the code.


Sorry, glitch, not glimpse x(

L

Wouter Overmeire

Jun 13, 2012, 8:48:22 AM
to pyd...@googlegroups.com


2012/6/13 Lorenzo De Leo <lorenz...@gmail.com>

What you get looks normal; this is not a glitch.
The method argument of reindex specifies how to fill the gaps that reindexing creates. Without it, labels that are missing from the original index come back as NaN.
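
To illustrate the point about the method argument on a small case, here is a minimal sketch. The series `toy` is made up for the example and is not the data from this thread. Without method, labels absent from the original index become NaN; with method='pad', each of them takes the value of the closest preceding label.

```python
import pandas as pd

# Hypothetical toy series that only has labels 0 and 2.
toy = pd.Series([1.0, 2.0], index=[0, 2])

# Default (method=None): labels 1 and 3 are absent, so they become NaN.
plain = toy.reindex([0, 1, 2, 3])

# method='pad': each missing label takes the value of the last label before it.
padded = toy.reindex([0, 1, 2, 3], method='pad')

print(plain)
print(padded)
```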

Another way of handling this problem could be as follows:

In [156]: df
Out[156]:

                           0         1
first second third
bar   t1     0     -1.271942  0.696284
             1     -0.171996 -0.595483
      t2     0      0.913158  3.032118
             1     -0.234497 -1.215946
foo   t3     0     -1.355672  1.324525
             1      0.706871 -0.249740
      t4     0      0.363943  0.363174
             1      0.302854  1.165288

In [157]: s
Out[157]:
first  third
bar    0        1.196714
       1        0.364001
foo    0        0.230713
       1       -1.860832

In [158]: pandas.DataFrame.from_items([(second, s) for second in np.unique(df.index.get_level_values('second'))]).stack().reorder_levels([0, 2, 1]).reindex(df.index)
Out[158]:
first  second  third
bar    t1      0        1.196714
               1        0.364001
       t2      0        1.196714
               1        0.364001
foo    t3      0        0.230713
               1       -1.860832
       t4      0        0.230713
               1       -1.860832
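
For readers on recent pandas releases, where DataFrame.from_items is no longer available, the same expansion can be sketched without it. This is only one possible approach, not the method used above: it drops the 'second' level from the target index to get the (first, third) key of every row, looks those keys up in the series, and attaches the full 3-level index to the result. The values 10, 11, 20, 21 and the names `df_index`, `keys` and `expanded` are made up for the example.

```python
import pandas as pd

# Target 3-level index, same layout as df.index in this thread.
df_index = pd.MultiIndex.from_arrays(
    [['bar', 'bar', 'bar', 'bar', 'foo', 'foo', 'foo', 'foo'],
     ['t1', 't1', 't2', 't2', 't3', 't3', 't4', 't4'],
     [0, 1, 0, 1, 0, 1, 0, 1]],
    names=['first', 'second', 'third'])

# 2-level series with illustrative values (assumption, not the thread's data).
s = pd.Series(
    [10.0, 11.0, 20.0, 21.0],
    index=pd.MultiIndex.from_arrays(
        [['bar', 'bar', 'foo', 'foo'], [0, 1, 0, 1]],
        names=['first', 'third']))

# (first, third) key of every target row; duplicates are expected here.
keys = df_index.droplevel('second')

# Exact-match lookup of each key in s, then relabel with the full index.
expanded = pd.Series(s.reindex(keys).to_numpy(), index=df_index)

print(expanded)
```

Because every lookup is an exact match on (first, third), this variant does not depend on the sort order of either index.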

             

Lorenzo De Leo

Jun 13, 2012, 9:24:50 AM
to pyd...@googlegroups.com


On Wednesday, June 13, 2012 2:48:22 PM UTC+2, Wouter Overmeire wrote:


2012/6/13 Lorenzo De Leo

What I'm uncomfortable with is the following.

In [16]: df.reorder_levels([0,2,1]).sort_index().index
Out[16]:
MultiIndex([('bar', 0, 't1'), ('bar', 0, 't2'), ('bar', 1, 't1'),
       ('bar', 1, 't2'), ('foo', 0, 't3'), ('foo', 0, 't4'),
       ('foo', 1, 't3'), ('foo', 1, 't4')], dtype=object)

Correct me if I'm wrong. If I use method=None (the default), reindex looks for the first element of the new index, ('bar', 0, 't1') in this case, and, not finding it in the series index, gives back a NaN. Fine.
But if I use method='pad', how does this comparison work? In other words, should I consider that ('bar', 0, 't1') and ('bar', 0, 't2') come after ('bar', 0) but before ('bar', 1), and so on? (Now that I think of it, it seems quite obvious.)

I guess the whole thing goes wrong if I don't properly sort the index then.

In [19]: s.reindex(index=df.reorder_levels([0,2,1]).index, method='pad')
Out[19]:

first  third  second
bar    0      t1       -1.014172
       1      t1       -1.008454
       0      t2       -1.008454
       1      t2       -1.008454
foo    0      t3        1.088346
       1      t3       -0.845830
       0      t4       -0.845830
       1      t4       -0.845830
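
The two outputs above (sorted and unsorted target) are consistent with a simple mental model, sketched below in plain Python. It is only a model under stated assumptions, not the actual pandas code: it assumes the index entries are compared as ordinary tuples, so that ('bar', 0) < ('bar', 0, 't1') < ('bar', 1), and that padding is done in a single forward pass that never steps back. The function name `pad_forward_walk` is hypothetical.

```python
def pad_forward_walk(keys, values, targets):
    """Pad-fill targets from sorted keys in one forward-only pass."""
    out = []
    pos = -1  # position of the last key known to be <= the current target
    for t in targets:
        # Advance while the next key still does not exceed the target.
        while pos + 1 < len(keys) and keys[pos + 1] <= t:
            pos += 1
        out.append(values[pos] if pos >= 0 else None)
    return out

keys = [('bar', 0), ('bar', 1), ('foo', 0), ('foo', 1)]
values = [-1.014172, -1.008454, 1.088346, -0.845830]

sorted_targets = [('bar', 0, 't1'), ('bar', 0, 't2'), ('bar', 1, 't1'),
                  ('bar', 1, 't2'), ('foo', 0, 't3'), ('foo', 0, 't4'),
                  ('foo', 1, 't3'), ('foo', 1, 't4')]

unsorted_targets = [('bar', 0, 't1'), ('bar', 1, 't1'), ('bar', 0, 't2'),
                    ('bar', 1, 't2'), ('foo', 0, 't3'), ('foo', 1, 't3'),
                    ('foo', 0, 't4'), ('foo', 1, 't4')]

from_sorted = pad_forward_walk(keys, values, sorted_targets)
from_unsorted = pad_forward_walk(keys, values, unsorted_targets)

print(from_sorted)
print(from_unsorted)
```

With the sorted target the walk returns each value twice, as in Out[12]. With the unsorted target the pointer has already moved past ('bar', 0) when ('bar', 0, 't2') arrives, so that entry gets the value of ('bar', 1), as in Out[19].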


 


Thanks, definitely more readable.

L

Wouter Overmeire

Jun 13, 2012, 9:34:07 AM
to pyd...@googlegroups.com
2012/6/13 Lorenzo De Leo <lorenz...@gmail.com>

Correct.
 

I guess the whole thing goes wrong if I don't properly sort the index then.

Indeed, calling sort_index after reorder_levels is critical for the padding/filling of NaN values you want.
 