Plot timedelta of groupby object with pandas


Scott Miles

Apr 10, 2013, 1:09:21 PM4/10/13
to pystat...@googlegroups.com
(Sorry if this is a repost, but it doesn't seem like my first message got posted.)

I've been able to create the below plot with datetimes on the x-axis. I'd like to do the same but with elapsed time. I'm not quite sure how to handle timedelta for plotting. 


Here is the data (power outages in LA after Hurricane Isaac): https://github.com/geomando/code_share/blob/master/OutageData_MasterList.csv


Here is my attempt:


outage_data = read_csv('OutageData_MasterList.csv', parse_dates=[[4, 5]])

# calculate elapsed time and create a new df column
start_date_time = outage_data.index[0].to_datetime()
outage_date_time = outage_data.index.to_pydatetime()
outage_data['Elapsed Time'] = outage_date_time - start_date_time

# create groups by parish
outage_data_parish = outage_data.groupby('Parish')

# plot -- clearly matplotlib doesn't like timedelta
for name, group in outage_data_parish:
    plt.plot(group['Elapsed Time'], group['Outage Percent'])


TypeError: float() argument must be a string or a number


Thoughts? Thanks!


Scott

Jeff Reback

Apr 10, 2013, 1:39:46 PM4/10/13
to pystat...@googlegroups.com

See the attached PDF of a notebook. You will need 0.11-dev in order to properly support timedelta64[ns], which is what you get when you subtract two datetimes (0.11 is coming out this week; betas are on the web site).

You have the right idea; you just need to do a bit of conversion. I show it in terms of fractions of days.

Matplotlib doesn't properly deal with timedelta64[ns] (at least my version doesn't): it just treats them as an integer, which is correct (as that is the base dtype) but not useful.
outage_data.pdf
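A minimal sketch of the conversion Jeff describes, on hypothetical outage-style data (the column names here are assumptions, not taken from the CSV). Since timedelta64[ns] is a count of nanoseconds under the hood, dividing by nanoseconds-per-day gives fractional days that matplotlib can plot as plain floats:

```python
import pandas as pd

# hypothetical outage-style data -- column names are assumed, not from the CSV
times = pd.date_range('2012-08-28', periods=5, freq='6h')
df = pd.DataFrame({'Date_Time': times, 'Outage Percent': [80, 70, 50, 30, 10]})

# elapsed time since the first reading -> timedelta64[ns]
elapsed = df['Date_Time'] - df['Date_Time'].iloc[0]

# timedelta64[ns] stores nanoseconds; divide to get fractional days
days = elapsed.astype('int64') / (1e9 * 60 * 60 * 24)

# plt.plot(days, df['Outage Percent'])  # x-axis is now plain float days
```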

Jeff Reback

Apr 29, 2013, 12:43:31 PM4/29/13
to pystat...@googlegroups.com
Nicholaus,

You are exactly right.

Using pandas master and numpy 1.6.2:

In [3]: df = DataFrame(dict(A = Timestamp('20130102'),B=date_range('20130102',periods=5,freq='H')))

In [4]: df
Out[4]: 
                    A                   B
0 2013-01-02 00:00:00 2013-01-02 00:00:00
1 2013-01-02 00:00:00 2013-01-02 01:00:00
2 2013-01-02 00:00:00 2013-01-02 02:00:00
3 2013-01-02 00:00:00 2013-01-02 03:00:00
4 2013-01-02 00:00:00 2013-01-02 04:00:00

In [5]: df['C'] = df['B']-df['A']

In [6]: df
Out[6]: 
                    A                   B        C
0 2013-01-02 00:00:00 2013-01-02 00:00:00 00:00:00
1 2013-01-02 00:00:00 2013-01-02 01:00:00 01:00:00
2 2013-01-02 00:00:00 2013-01-02 02:00:00 02:00:00
3 2013-01-02 00:00:00 2013-01-02 03:00:00 03:00:00
4 2013-01-02 00:00:00 2013-01-02 04:00:00 04:00:00

In [7]: df.dtypes
Out[7]: 
A     datetime64[ns]
B     datetime64[ns]
C    timedelta64[ns]
dtype: object

In [8]: df['C'].apply(lambda x: x.item())
Out[8]: 
0           00:00:00
1   -00:11:34.967296
2   -00:23:09.934592
3   -00:34:44.901888
4    00:25:15.098112
Name: C, dtype: timedelta64[ns]

In [9]: df['C'].apply(lambda x: x.item().total_seconds())
Out[9]: 
0        0
1     3600
2     7200
3    10800
4    14400
Name: C, dtype: float64


Using numpy 1.7.0 (also 32-bit, but that doesn't matter):

In [12]: df['C'].apply(lambda x: x.item().total_seconds())

AttributeError: 'long' object has no attribute 'total_seconds'

Here's the workaround: numpy 1.7 keeps the timedeltas as a plain integer count of nanoseconds, whereas in 1.6.2 .item() returns a 'timedelta'-like object.

In [15]: df['C'].apply(lambda x: x.item()/1e9)
Out[15]: 
0        0
1     3600
2     7200
3    10800
4    14400
Name: C, dtype: float64
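The version difference is visible directly on the numpy scalars (a sketch of the numpy >= 1.7 behavior): at nanosecond resolution, .item() yields a plain Python int of nanoseconds, while at microsecond resolution it yields a datetime.timedelta:

```python
import numpy as np

ns = np.timedelta64(3_600_000_000_000, 'ns')  # one hour, nanosecond resolution
print(type(ns.item()))                        # plain int -- no .total_seconds()
print(ns.item() / 1e9)                        # 3600.0 seconds

us = np.timedelta64(3_600_000_000, 'us')      # one hour, microsecond resolution
print(us.item().total_seconds())              # 3600.0 -- a datetime.timedelta
```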

This, of course, is the reason pandas needs a Timedelta scalar & index: basically to hide the numpy issues.
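For readers landing here from a search: later pandas releases did add exactly that (a Timedelta scalar, TimedeltaIndex, and the .dt accessor), so on a modern pandas the conversion is a one-liner with no .item() dance; a sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'A': pd.Timestamp('20130102'),
    'B': pd.date_range('20130102', periods=5, freq='h'),
})
df['C'] = df['B'] - df['A']               # timedelta64[ns]

seconds = df['C'].dt.total_seconds()      # float64 seconds, version-safe
days = seconds / (60 * 60 * 24)           # fractional days for plotting
```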

On Monday, April 29, 2013 12:15:53 PM UTC-4, Nicholaus Halecky wrote:
Hey all,

Sorry to drop in on the thread out of nowhere, but google sent me here and I thought it wouldn't hurt to ask... So, I tried this suggestion, however, I think that there were some significant changes in the timedelta64 dtype from numpy 1.6 to 1.7 that don't allow this to work.

After downloading the data and attempting to convert using Jeff's suggestion, as:

outage_data['diff'] = outage_data['Date_Time'] - outage_data['Date_Time'][0]
outage_data['ddiff'] = outage_data['diff'].apply(lambda x: float(x.item().total_seconds()) / (3600 * 24))
outage_data.dtypes

I get hit with the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-6d80c1d26eb6> in <module>()
      1 outage_data['diff']=outage_data['Date_Time']-outage_data['Date_Time'][0]
----> 2 outage_data['ddiff'] = outage_data['diff'].apply(lambda x: float(x.item().total_seconds())/(3600*24))
      3 outage_data.dtypes

/Users/nehalecky/Documents/projects/python/pandas/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
   2449         values = lib.map_infer(values, lib.Timestamp)
   2450
-> 2451         mapped = lib.map_infer(values, f, convert=convert_dtype)
   2452         if isinstance(mapped[0], Series):
   2453             from pandas.core.frame import DataFrame

/Users/nehalecky/Documents/projects/python/pandas/pandas/lib.so in pandas.lib.map_infer (pandas/lib.c:42231)()

<ipython-input-4-6d80c1d26eb6> in <lambda>(x)
      1 outage_data['diff']=outage_data['Date_Time']-outage_data['Date_Time'][0]
----> 2 outage_data['ddiff'] = outage_data['diff'].apply(lambda x: float(x.item().total_seconds())/(3600*24))
      3 outage_data.dtypes

AttributeError: 'long' object has no attribute 'total_seconds'

FYI, I am running numpy 1.7.1 and the pandas developer build 0.12.0.dev-5afb0eb.

Any suggestions?
Thanks much, 
Nicholaus