Multilevel indexing in a DataFrame and filling missing days in a date range.


JDennison

Feb 1, 2012, 12:05:41 AM2/1/12
to pystatsmodels
This seems like it should be a simple problem, but I have yet to figure out
the pandas way of doing it. I have a list of payments from my
database for each client. I can get them into a DataFrame like so:

raw = DataFrame(cur.fetchall(), columns=['organization_name', 'date', 'sum'])

where cur is a psycopg2 cursor. That works great. What I need to compute
is the cumulative payments for each client for each month. I have
solved this in R using ddply() and rbind(), but I need to serve the
results from my webserver, so I would rather do it in Python. My data
frame looks like:

organization_name date sum
0 ORG1 2010-07-02 00:00:00 746.03
1 ORG1 2010-07-08 00:00:00 0.00
2 ORG1 2010-07-09 00:00:00 5620.36
3 ORG1 2010-07-12 00:00:00 1592.53
4 ORG1 2010-07-13 00:00:00 1921.05
5 ORG2 2010-07-09 00:00:00 140.00
6 ORG2 2010-07-10 00:00:00 4812.81
7 ORG2 2010-07-22 00:00:00 1017.72
8 ORG2 2010-07-23 00:00:00 1278.99


I need to add a filler row for every missing day (with the value 0).
There appears to be a clever reindexing trick with the time series
asfreq(DateOffset()) method, and I think that, combined with the groupby
commands, would get me most of the way there.

something like:

df2 = df.groupby('date')['sum'].aggregate(cumsum)

But my problem is much simpler than that. In order to reorder the
dataframe with the missing days, I think I have to index this data
frame with a multilevel index on org_name and date. However, all my
attempts have been met with errors and frustration, and I could not
find a working example online. Could anyone give me a push in the right
direction?
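
Roughly, I am imagining something along these lines (an untested
sketch; fill_and_cumsum is just a name I made up, and I am assuming
the date column holds datetimes):

import pandas as pd

# untested: for each org, reindex its payments onto a complete daily
# range, fill the new days with 0, then take the running total
def fill_and_cumsum(group):
    s = group.set_index('date')['sum']
    days = pd.date_range(s.index.min(), s.index.max(), freq='D')
    return s.reindex(days, fill_value=0).cumsum()

df2 = raw.groupby('organization_name').apply(fill_and_cumsum)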

Thanks

John

Wouter Overmeire

Feb 1, 2012, 3:30:56 AM2/1/12
to pystat...@googlegroups.com
          
On Wednesday, February 1, 2012 at 06:05:41 UTC+1, JDennison wrote:

To calculate the sum for each organization over a certain period ...

In [68]: start = datetime(2010, 7, 1)

In [69]: end = datetime(2010, 7, 31)

In [70]: df[(df['date'] >= start) & (df['date'] <= end)].groupby('organization_name')['sum'].agg(np.sum)
Out[70]:
organization_name
ORG1                 9880
ORG2                 7250

If there is no data available over the selected period, this will be an empty Series, or one with no value for a certain organization. So is there really a need to add the missing dates?

To create a MultiIndex:
In [71]: df.set_index(['organization_name', 'date'])
Out[71]:
                                  sum
organization_name date
ORG1              2010-07-02   746.03
                  2010-07-08     0.00
                  2010-07-09  5620.36
                  2010-07-12  1592.53
                  2010-07-13  1921.05
ORG2              2010-07-09   140.00
                  2010-07-10  4812.81
                  2010-07-22  1017.72
                  2010-07-23  1278.99

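And if you do want the missing days filled in, one sketch building on
that MultiIndex (the names wide and daily are just illustrative, and
this assumes the date level is a proper DatetimeIndex):

indexed = df.set_index(['organization_name', 'date'])['sum']
wide = indexed.unstack('organization_name')  # dates as rows, one column per org
daily = wide.asfreq('D').fillna(0)           # insert every missing day as 0
cumulative = daily.cumsum()                  # running total per organization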

Adam Klein

Feb 1, 2012, 12:19:19 PM2/1/12
to pystat...@googlegroups.com
There seem to be two questions here: normalizing the index so it contains all dates (with zeros for sum), and aggregating. For the aggregating, you might want to add a new column that extracts the month (use the dateutil parser if the dates are strings), and then group and aggregate over (organization, month).
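
For instance, a sketch of that (the month column name is arbitrary,
and this assumes date already holds datetimes rather than strings):

# derive a (year, month) key so July 2010 and July 2011 stay separate
df['month'] = df['date'].map(lambda d: (d.year, d.month))
monthly = df.groupby(['organization_name', 'month'])['sum'].sum()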

Do you need to reindex as well?



Wouter Overmeire

Feb 2, 2012, 3:18:51 AM2/2/12
to pystat...@googlegroups.com


2012/2/1 Adam Klein <ad...@lambdafoundry.com>

Looks like DataFrame.groupby() could use a keyfunc argument to do derived grouping. That way there would be no need to create temporary column(s).

for example (assuming here that date holds datetime objects):

df.groupby(['organization_name', 'date'], keyfunc=lambda x, y: (x, y.month)).agg(np.sum)

will generate a monthly total for each org.
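
No such keyfunc argument exists in groupby today, but the same derived
grouping can be written without a temporary column by passing a
computed Series alongside the column name (a sketch, again assuming
date holds datetimes):

# group by organization plus a (year, month) key computed on the fly
monthly = df.groupby(['organization_name',
                      df['date'].map(lambda d: (d.year, d.month))])['sum'].sum()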

Wes McKinney

Feb 4, 2012, 10:05:41 PM2/4/12
to pystat...@googlegroups.com

hi John,

if you're using ddply and rbind, I don't see any reason you can't do
this with groupby and the apply function. Write a function that takes
the piece of the DataFrame for each organization and computes the
result you're looking for, maybe something like:

def get_cumsum(piece):
    # index by date so the time series ops below work
    indexed_piece = piece.set_index('date')
    # insert a row for every missing calendar day
    indexed_piece = indexed_piece.asfreq(DateOffset(1))
    # the inserted days come back as NaN; count them as 0 payments
    return indexed_piece['sum'].fillna(0).cumsum()

then do:

df.groupby('organization_name').apply(get_cumsum)
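
a quick self-contained check of that pattern (toy data, not John's
database; get_cumsum is the function above and needs DateOffset
imported from pandas):

import pandas as pd
from pandas import DateOffset  # used inside get_cumsum

df = pd.DataFrame({
    'organization_name': ['ORG1', 'ORG1', 'ORG2'],
    'date': pd.to_datetime(['2010-07-02', '2010-07-05', '2010-07-03']),
    'sum': [746.03, 100.00, 140.00],
})
print(df.groupby('organization_name').apply(get_cumsum))
# ORG1 gains rows for 2010-07-03 and 2010-07-04, each adding 0
# to the running total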

I think that plyr ddply usage should map very easily onto groupby, but
there may be a bit of finagling involved since time series operations
require that the Series or DataFrame have a datetime index. If you
can't get it to work (sort of shooting from the hip here), please let
me know.

hope this helps,
Wes
