Applying function to a Pandas Dataframe with a MultiIndex

2,420 views
Skip to first unread message

Aquil H. Abdullah

unread,
May 14, 2013, 2:09:03 PM5/14/13
to pandas
I would like to retrieve the rows of a DataFrame with a MultiIndex index, based on a given criteria.  For, example, if I have a DataFrame that has observations for multiple labels on the same day, I would like to select the row or rows that meet a given criteria.  I the data below I would like to select A12 until November 6th and then use B12 after that. One way to do that is to have a function that compares the index timestamp or some other time stamp to the end_dt and then decide which row to return...

if dt < end_dt:
   return row containing A12 data
if dt > end_dt 
   return row containing B12 data

In [18]: df
Out[18]: 
                              end_dt                           obs1  obs2
date           label                                      
2012-11-02  A12   2012-11-07 00:00:00  110.537881   423
                  B12   2012-12-08 00:00:00  101.048770   373
2012-11-04  A12   2012-11-07 00:00:00    2.827904   813
                  B12   2012-12-08 00:00:00   79.367235   532
2012-11-06  A12   2012-11-07 00:00:00  146.138290   662
                  B12   2012-12-08 00:00:00  169.884978   408
2012-11-08  A12   2012-11-07 00:00:00  164.156614   122
                  B12   2012-12-08 00:00:00  125.115908   504
2012-11-10  B12   2012-12-08 00:00:00   68.988800   153
                  C12   2013-01-09 00:00:00  118.298669   537

Which, I generate through the following code:

import pandas as pd
import numpy as np
from numpy.random import randn as randn
from numpy.random import randint as randint
from datetime import datetime
obs1 = [ob if ob > 0 else ob *-1 for ob in randn(10)*100]
obs2 = [randint(1000) for i in range(10)]
labels = ['A12', 'B12', 'A12', 'B12', 'A12','B12', 'A12','B12', 'B12', 'C12']
dates = [datetime(2012, 11, i) for i in range(1,11)]
dates[0] = dates[1]
dates[2] = dates[3]
dates[4] = dates[5]
dates[6] = dates[7]
dates[8] = dates[9]
m_idx = pd.MultiIndex.from_tuples(zip(dates, labels), names=['date', 'label'])
dt = datetime(2012,11,7)
dt2 = datetime(2012,12, 8)
end_dt = [dt, dt2, dt, dt2, dt, dt2, dt, dt2,dt2, datetime(2013, 1, 9)]
data_dict = {'obs1':obs1, 'obs2':obs2, 'end_dt':end_dt}
df = pd.DataFrame(data_dict, index=m_idx)

Can anyone suggest a way to apply a function to a MultiIndex DataFrame so that will look at the first level of the DataFrame apply a function and then return a row?

--
Aquil H. Abdullah
aquil.a...@gmail.com

Wouter Overmeire

unread,
May 15, 2013, 5:09:52 AM5/15/13
to pystat...@googlegroups.com



2013/5/14 Aquil H. Abdullah <aquil.a...@gmail.com>


The function used in apply get as input a series, which is either a row or column of df (depending on the axis argument used in apply). The name of this series holds the index value and you can use this to filter.

In [21]: df
Out[21]:
                              end_dt        obs1  obs2
date       label
2012-11-02 A12   2012-11-07 00:00:00   61.366226   887
           B12   2012-12-08 00:00:00   91.901519   447
2012-11-04 A12   2012-11-07 00:00:00  116.727974   388
           B12   2012-12-08 00:00:00   36.376575   919
2012-11-06 A12   2012-11-07 00:00:00    2.864715   957
           B12   2012-12-08 00:00:00   32.391975   454
2012-11-08 A12   2012-11-07 00:00:00   68.080480   759
           B12   2012-12-08 00:00:00  175.167616   971
2012-11-10 B12   2012-12-08 00:00:00   24.423740   956
           C12   2013-01-09 00:00:00   35.590169   586

In [22]: def filter(date, label):
    if date <= datetime(2012,11,6):
        return label == 'A12'
    else:
        return label == 'B12'
   ....:

In [23]: def func(s):
    if filter(*s.name):
        s['obs2'] += 10000
    return s
   ....:

In [24]: df.apply(func, axis=1)
Out[24]:
                              end_dt        obs1   obs2
date       label
2012-11-02 A12   2012-11-07 00:00:00   61.366226  10887
           B12   2012-12-08 00:00:00   91.901519    447
2012-11-04 A12   2012-11-07 00:00:00  116.727974  10388
           B12   2012-12-08 00:00:00   36.376575    919
2012-11-06 A12   2012-11-07 00:00:00    2.864715  10957
           B12   2012-12-08 00:00:00   32.391975    454
2012-11-08 A12   2012-11-07 00:00:00   68.080480    759
           B12   2012-12-08 00:00:00  175.167616  10971
2012-11-10 B12   2012-12-08 00:00:00   24.423740  10956
           C12   2013-01-09 00:00:00   35.590169    586

Aquil H. Abdullah

unread,
May 15, 2013, 8:31:52 AM5/15/13
to pandas
Wouter,

Thanks for your reply, this looks like what I need. I will give it a try.
Reply all
Reply to author
Forward
0 new messages