astype: date conversion on dataframe is ignored on assignment if it is already datetime64


Raymond Roberts

Feb 6, 2014, 11:12:14 AM
to pyd...@googlegroups.com
A column of type datetime64 cannot be converted to a new type.

df.date = df.date.astype(datetime.date)
df.date = df.date.astype('O')
df.loc[:, 'date'] = df.date.astype('O') # this just fails completely although it's totally unclear why from the error message.

I've tried many different combinations, including setting the copy argument to True and False. If this is intentional behavior, it is incredibly restrictive.

Jeff

Feb 6, 2014, 11:28:07 AM
to pyd...@googlegroups.com
If you really, really want to do this, here's how:

In [9]: df = DataFrame(dict(date=date_range('20130101',periods=5)))

In [11]: df['date2'] = [ datetime.date(v.year,v.month,v.day) for v in df.date ]

In [12]: df
Out[12]: 
        date       date2
0 2013-01-01  2013-01-01
1 2013-01-02  2013-01-02
2 2013-01-03  2013-01-03
3 2013-01-04  2013-01-04
4 2013-01-05  2013-01-05

[5 rows x 2 columns]

In [13]: df.dtypes
Out[13]: 
date     datetime64[ns]
date2            object
dtype: object

In [14]: df.head()
Out[14]: 
        date       date2
0 2013-01-01  2013-01-01
1 2013-01-02  2013-01-02
2 2013-01-03  2013-01-03
3 2013-01-04  2013-01-04
4 2013-01-05  2013-01-05

[5 rows x 2 columns]

In [15]: df['date2']
Out[15]: 
0    2013-01-01
1    2013-01-02
2    2013-01-03
3    2013-01-04
4    2013-01-05
Name: date2, dtype: object

In [16]: df.ix[0,'date2']
Out[16]: datetime.date(2013, 1, 1)


The reason this is 'restrictive' is that datetime.datetime is a superset of datetime.date, so there is no actual reason to have it,
since it cannot be vectorized and is just plain confusing.

You can also do this with Periods if you want to represent 'dates' with no times.
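
As a sketch of both options, assuming a pandas version that provides the `.dt` accessor (added after this thread was written): `.dt.date` builds the same kind of object column as the list comprehension above, and `.dt.to_period("D")` gives a vectorized representation with no time of day. The column names `date2` and `day` are illustrative.

```python
import datetime

import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2013-01-01", periods=5)})

# Object column holding plain datetime.date values (not vectorized).
df["date2"] = df["date"].dt.date

# Period column: calendar days with no time-of-day component.
df["day"] = df["date"].dt.to_period("D")

print(df.dtypes)
print(type(df.loc[0, "date2"]), df.loc[0, "day"])
```

The Period column keeps fast, array-backed storage, while the `date2` column is a plain object column that pandas handles element by element.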

Raymond Roberts

Feb 6, 2014, 11:35:55 AM
to pyd...@googlegroups.com
Jeff,
Thanks for the reply. The concerning bit to me is that I can't change the type of a column, even to Object.



Raymond Roberts

Feb 6, 2014, 11:44:06 AM
to pyd...@googlegroups.com
Here's an example of what I mean.
(Pdb) type(tmp.date.astype(date).values[0])
<type 'datetime.datetime'>

(Pdb) df2 = pd.DataFrame({'a': [1, 2, 3]})
(Pdb) df2.dtypes
a    int64
dtype: object
(Pdb) df2.a.astype(float)
0    1
1    2
2    3
Name: a, dtype: float64
(Pdb) df2['a'] = df2.a.astype(float)
(Pdb) df2
   a
0  1
1  2
2  3

[3 rows x 1 columns]
(Pdb) df2.dtypes
a    float64
dtype: object

(Pdb) tmp['date'] = tmp.date.astype(date)
(Pdb) tmp.dtypes
date         datetime64[ns]
timeStamp    datetime64[ns]
dtype: object
(Pdb)



Raymond Roberts

Feb 6, 2014, 11:46:12 AM
to pyd...@googlegroups.com
This seems like a bug, since I don't want or need a numpy.datetime64 object and all attempts to change the type fail.



Jeff Reback

Feb 6, 2014, 11:46:45 AM
to pyd...@googlegroups.com
Did you try the solution I gave?

Why exactly do you need datetime.date?
--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Jeff

Feb 6, 2014, 11:48:45 AM
to pyd...@googlegroups.com
This is not a bug, but defined behavior. You can use the solution I gave to assign the datetime.date values as an object dtype, as I indicated.
You cannot astype to these datetime.date types; it's just too expensive performance-wise to check for this, as they are rarely used types.

Raymond Roberts

Feb 6, 2014, 12:05:07 PM
to pyd...@googlegroups.com
OK, so "bug" is probably the wrong word, but I am still confused about why converting the datetime64 column with astype to datetime.datetime or object fails to override the values when assigned. You can see from my example that running astype with datetime.date (which converts to datetime.datetime, which is fine for now) results in the correct output. Now, if this assignment is done:
tmp['date'] = tmp.date.astype(datetime.datetime)

the dtype of the 'date' column remains numpy.datetime64! This is pretty ridiculous, since it's completely independent of the type I want to assign. If I try to jam new information into a column of my DataFrame, it should either do exactly what I want or fail. I don't want it to "work" and then keep the old type.

Raymond Roberts

Feb 6, 2014, 12:09:13 PM
to pyd...@googlegroups.com
Maybe this just comes down to a disagreement between me and the people who wrote this bit of code. I don't think of datetime.datetime objects as convertible to numpy.datetime64 just because the basic ideas are the same and datetime64 can represent the "same" information. For me, numpy.datetime64 is not a superset of datetime.datetime objects in the way int64 is a reasonable superset of int32.

Jeff

Feb 6, 2014, 12:10:50 PM
to pyd...@googlegroups.com
When constructing a Series or inserting it into a DataFrame, if the type is datetime-like it is automatically and forcibly converted to datetime64[ns] without exception, even if you specify an object dtype.
It simply is easier and much more performant to do this.

astype('O') on a datetime64[ns] actually works! But then the reassignment causes a reverse conversion.

You CAN leave it as a Series if you really want to.

You still haven't come back with the reason you want to do this.
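
A minimal sketch of the distinction described above, assuming a current pandas: the object-dtype Series returned by `astype(object)` holds boxed `Timestamp` values and can be kept as a standalone Series. Whether assigning it back into the DataFrame converts it to datetime64 again has varied between pandas versions, so the sketch only prints that result rather than relying on it.

```python
import datetime

import pandas as pd

# A list of datetime.datetime objects is inferred as a datetime64 column.
raw = [datetime.datetime(2013, 1, d) for d in range(1, 4)]
df = pd.DataFrame({"date": raw})
print(df["date"].dtype)

# astype(object) does work: the result is an object Series of Timestamps.
boxed = df["date"].astype(object)
print(boxed.dtype, type(boxed.iloc[0]))

# Reassignment is where the 2014 behavior converted back to datetime64;
# check the dtype on your own version instead of assuming either outcome.
df["date"] = boxed
print(df["date"].dtype)
```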



Dale Jung

Feb 6, 2014, 12:45:06 PM
to pyd...@googlegroups.com
Imagine that there was a Python Integer class. If we ingested a Series of Integer objects into a DataFrame, you'd have a reasonable expectation that we store them as unboxed ints (np.int). This allows us to send the array to some lower-level C function for performance.

If we kept the Integers as a list of objects, doing something like sort() would require that we stay in Python, which would be slow. Or we'd try to be smart and convert the objects into an array of ints (which is essentially what is happening here).

The datetime auto-conversion follows the same logic. Think of np.datetime64 as just an array of ints with datetime metadata. It's storing datetime data in its most concise and performant state. It's not meant to be a substitute for datetime.datetime. You can always box the data into datetime (or use the default Timestamp) for doing non-vectorizable ops.
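
To make the "array of ints with datetime metadata" picture concrete, here is a small NumPy sketch. It assumes nanosecond resolution, as pandas used at the time; the variable names are illustrative.

```python
import datetime

import numpy as np

stamps = np.array(
    ["2013-01-03", "2013-01-01", "2013-01-02"], dtype="datetime64[ns]"
)

# The same buffer reinterpreted as 64-bit integers: nanoseconds since 1970-01-01.
as_ints = stamps.view("int64")
print(as_ints)

# Sorting runs on the unboxed integers, with no Python objects involved.
ordered = np.sort(stamps)

# Box into datetime.datetime only when a non-vectorized operation needs it.
# Microsecond resolution is used because datetime.datetime cannot hold nanoseconds.
boxed = ordered.astype("datetime64[us]").tolist()
print(boxed[0])
```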

For the record, I had legacy code break when the change was first put in. Though that was due to some gaps in the autoboxing to Timestamps. There’s not a feature gap from a datetime object, in fact you can just convert to datetime if need be.
-- 
Dale Jung


Raymond Roberts

Feb 6, 2014, 1:26:04 PM
to pyd...@googlegroups.com
I completely understand where you guys are coming from in wanting that to be fast. However, it's not what I want or expect from tools I'm using in Python. Converting all floats and ints to 64-bit is already frustrating enough when I'm on a 32-bit version of Python; setting my date/datetime objects to a ridiculous numpy object just drives me nuts. This is the sort of type conversion I see happen in Matlab all the time, and it's the source of numerous issues there. Numpy does this in many instances as well, but I would hope it would happen less in pandas, as there is supposed to be a sane layer between me and the lower level.

Additionally, if I put user-defined objects into a DataFrame, you are stuck in Python anyway, right? Why not let the user make that tradeoff rather than forcing it on them? If I want to sort on datetime-like objects, I have to know that they are slow if they aren't datetime64. If I have 5 objects in my date column, I'm not going to worry! I have no problem with an Index being converted, since my expectation is that it will require some adjustments to be 'fast' and fit with the internal consistency of the data structures underlying it.

Here's a problem with Period objects that cropped up seconds after playing with them:
In [5]: d = pd.Period('2012-01-01', 'D')

In [6]: d.freq
Out[6]: 'D'

In [7]: d.day
Out[7]: 1

In [8]: d.year
Out[8]: 2012

In [9]: d.freq = 'Y'

In [10]: d
Out[10]:
C:\Code>

That's a segfault that brought down python. If assigning a value to the freq attribute is not supported then it should be read only.

Jeff

Feb 6, 2014, 1:58:05 PM
to pyd...@googlegroups.com
Robert

You can specify smaller dtypes if you want (e.g. int32/float32).
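
For instance, a short sketch of requesting 32-bit storage explicitly (the column name `a` follows the earlier Pdb example):

```python
import numpy as np
import pandas as pd

# Request 32-bit storage explicitly instead of the 64-bit default.
df = pd.DataFrame({"a": np.array([1, 2, 3], dtype="int32")})
print(df.dtypes)

# astype plus reassignment changes the column dtype for numeric types.
df["a"] = df["a"].astype("float32")
print(df.dtypes)
```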

pandas is fast for a reason: it uses the lower-level primitives. Furthermore, it accepts a lot of insane input! There are tradeoffs to be made; this is one of them. A datetime.date carries a subset of the information in a datetime.datetime (in the standard library, datetime.datetime is actually the subclass of datetime.date), and a datetime.datetime is represented fully by datetime64[ns]. There really is no use case at all for this.
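
The relationship between the two standard-library types can be checked directly. In the standard library it is `datetime.datetime` that subclasses `datetime.date`, and a date can always be widened to a datetime at midnight, which is the sense in which a datetime column can hold the same information.

```python
import datetime

# datetime.datetime derives from datetime.date, not the other way round.
print(issubclass(datetime.datetime, datetime.date))  # True
print(issubclass(datetime.date, datetime.datetime))  # False

# Widening a date to a datetime loses nothing; narrowing drops the time.
d = datetime.date(2013, 1, 1)
dt = datetime.datetime.combine(d, datetime.time())
print(dt, dt.date() == d)
```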

To be honest, if you are putting Python-level objects in a DataFrame, then don't use it; use a dict instead. pandas has a very large audience. It tries to accommodate the largest possible userbase.

It is MUCH more flexible than numpy (don't even get me started on Matlab), while offering a very rich set of operations.

But there are idiosyncrasies and tradeoffs.

You cannot change a Period's frequency after it is set; if that is a problem, feel free to file a bug report.
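
Instead of mutating `freq`, a Period at another frequency can be derived with `Period.asfreq`, which returns a new object and leaves the original untouched. A sketch, using monthly frequency as an example target:

```python
import pandas as pd

d = pd.Period("2012-01-01", freq="D")

# asfreq returns a new Period; d itself keeps its daily frequency.
m = d.asfreq("M")
print(d, m)
```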