Several problems with dataframes containing datetime64 columns


Lorenzo De Leo

Jan 25, 2013, 6:17:41 AM
to pyd...@googlegroups.com
I apologize in advance, it is going to be a long post. I tried to split it into "sub-problems" but they all seem to be related.
I tried to highlight here and there the most relevant part of the code to make it somewhat easier to read.

I found a series of problems handling DataFrames that contain datetime64 columns (not the index!).

The first problem is with construction. I haven't found a recommended way to declare a DataFrame with a datetime64 column (did I miss it?), so I assume there is nothing wrong with the following:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.10.1'

In [3]: pd.DataFrame({'a':[1,2,4,7], 'd':[pd.datetime(2000,1,1) for i in range(4)]}).dtypes
Out[3]:
a             int64
d    datetime64[ns]

In [4]: pd.DataFrame({'a':[1,2,4,7], 'b':[1.2, 2.3, 5.1, 6.3], 'd':[pd.datetime(2000,1,1) for i in range(4)]}).dtypes
Out[4]:
a             int64
b           float64
d    datetime64[ns]

In [5]: pd.DataFrame({'a':[1,2,4,7], 'b':[1.2, 2.3, 5.1, 6.3], 'c':list('abcd'), 'd':[pd.datetime(2000,1,1) for i in range(4)]}).dtypes
Out[5]:
a      int64
b    float64
c     object
d     object

It seems that the dtype of the 'd' column depends on the other columns: specifically, if there is an 'object' column, the datetime column is converted to object as well.
A cleaner (but lengthier) way is to pass a numpy array (imported as np below) with the proper dtype:

In [6]: pd.DataFrame({'c':list('abcd'), 'd':np.array([pd.datetime(2000,1,1) for i in range(4)])}).dtypes
Out[6]:
c    object
d    object

In [7]: pd.DataFrame({'c':list('abcd'), 'd':np.array([pd.datetime(2000,1,1) for i in range(4)], dtype='<M8[ns]')}).dtypes
Out[7]:
c            object
d    datetime64[ns]
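For what it's worth, a sketch of another route that avoids spelling out the numpy dtype string (this assumes pd.to_datetime, which coerces an object column to datetime64[ns]):

```python
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'c': list('abcd'),
                   'd': [datetime(2000, 1, 1)] * 4})
# Coerce the object column to datetime64[ns] after construction,
# regardless of what dtype the constructor inferred.
df['d'] = pd.to_datetime(df['d'])
print(df.dtypes)
```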

Second (and worse) problem: if I add a new column by assigning the datetime column converted to 'object', I get datetime64 back:

In [8]: df = pd.DataFrame({'c':list('abcd'), 'd':[pd.datetime(2000,1,1) for i in range(4)]})

In [9]: df.dtypes
Out[9]:
c    object
d    object

In [10]: df['d'].astype('O').dtype
Out[10]: dtype('O')

In [11]: df['e'] = df['d'].astype('O')

In [12]: df.dtypes
Out[12]:
c            object
d            object
e    datetime64[ns]

Whereas if I assign the same conversion to a standalone Series, the dtype stays 'object':

In [13]: s = df['d'].astype('O')

In [14]: s.dtype
Out[14]: dtype('O')

This means that there is no way (that I could find) to convert a datetime64 column to 'object' inside a DataFrame.
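For reference, a minimal standalone check of this round trip (a sketch against a later pandas release, where assigning an object-cast Series back into the frame preserves the object dtype instead of re-inferring datetime64):

```python
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'d': [datetime(2000, 1, 1)] * 4})
# Cast the datetime column to object and assign it back as a new column;
# on recent pandas the explicit object dtype of the Series is kept.
df['e'] = df['d'].astype(object)
print(df.dtypes)
```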

Third problem: converting the entire dataframe with astype('O') fails if there is a datetime64 column.

In [15]: df = pd.DataFrame({'A':[1,2,4,7], 'B':[1.2, 2.3, 5.1, 6.3], 'C':list('abcd')})

In [16]: df
Out[16]:
   A    B  C
0  1  1.2  a
1  2  2.3  b
2  4  5.1  c
3  7  6.3  d

In [17]: df.dtypes
Out[17]:
A      int64
B    float64
C     object

In [18]: df.astype('O')
Out[18]:
   A    B  C
0  1  1.2  a
1  2  2.3  b
2  4  5.1  c
3  7  6.3  d

In [19]: df.astype('O').dtypes
Out[19]:
A    object
B    object
C    object

In [30]: df = pd.DataFrame({'A':[1,2,4,7],
   ....:                    'B':[1.2, 2.3, 5.1, 6.3],
   ....:                    'C':list('abcd'),
   ....:                    'D':[pd.datetime(2000,1,1) for i in range(4)]})

In [31]: df['E'] = [pd.datetime(2000,1,1) for i in range(4)]

In [32]: df
Out[32]:
   A    B  C                    D                   E
0  1  1.2  a  2000-01-01 00:00:00 2000-01-01 00:00:00
1  2  2.3  b  2000-01-01 00:00:00 2000-01-01 00:00:00
2  4  5.1  c  2000-01-01 00:00:00 2000-01-01 00:00:00
3  7  6.3  d  2000-01-01 00:00:00 2000-01-01 00:00:00

In [33]: df.dtypes
Out[33]:
A             int64
B           float64
C            object
D            object
E    datetime64[ns]

In [34]: df['A'].astype('O')
Out[34]:
0    1
1    2
2    4
3    7
Name: A

In [35]: df['B'].astype('O')
Out[35]:
0    1.2
1    2.3
2    5.1
3    6.3
Name: B

In [36]: df['C'].astype('O')
Out[36]:
0    a
1    b
2    c
3    d
Name: C

In [37]: df['D'].astype('O')
Out[37]:
0    2000-01-01 00:00:00
1    2000-01-01 00:00:00
2    2000-01-01 00:00:00
3    2000-01-01 00:00:00
Name: D

In [38]: df['E'].astype('O')
Out[38]:
0    2000-01-01 00:00:00
1    2000-01-01 00:00:00
2    2000-01-01 00:00:00
3    2000-01-01 00:00:00
Name: E

In [39]: df.astype('O')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

/home/ldeleo/<ipython console> in <module>()

/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/generic.pyc in astype(self, dtype)
    499         casted : type of caller
    500         """
--> 501         return self._constructor(self._data, dtype=dtype)
    502
    503     @property

/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
    381
    382         if isinstance(data, BlockManager):
--> 383             mgr = self._init_mgr(data, index, columns, dtype=dtype, copy=copy)
    384         elif isinstance(data, dict):
    385             mgr = self._init_dict(data, index, columns, dtype=dtype)

/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in _init_mgr(self, mgr, index, columns, dtype, copy)
    464             # avoid copy if we can

    465             if len(mgr.blocks) > 1 or mgr.blocks[0].values.dtype != dtype:
--> 466                 mgr = mgr.astype(dtype)
    467         return mgr
    468

/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/internals.pyc in astype(self, dtype)
    615         new_blocks = []
    616         for block in self.blocks:
--> 617             newb = make_block(com._astype_nansafe(block.values, dtype),
    618                               block.items, block.ref_items)
    619             new_blocks.append(newb)

/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/common.pyc in _astype_nansafe(arr, dtype)
   1039     if issubclass(arr.dtype.type, np.datetime64):
   1040         if dtype == object:
-> 1041             return tslib.ints_to_pydatetime(arr.view(np.int64))
   1042     elif (np.issubdtype(arr.dtype, np.floating) and
   1043           np.issubdtype(dtype, np.integer)):

/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/tslib.so in pandas.tslib.ints_to_pydatetime (pandas/tslib.c:2561)()

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

Last (!!!) problem: if I introduce NaNs, the conversion to object screws up for the datetime64 columns (and not for the others, of course):

In [40]: df.ix[2,1] = np.nan

In [41]: df.ix[1,2] = np.nan

In [42]: df.ix[3,3] = np.nan

In [43]: df.ix[3,4] = np.nan

In [44]: df
Out[44]:
   A    B    C                    D                   E
0  1  1.2    a  2000-01-01 00:00:00 2000-01-01 00:00:00
1  2  2.3  NaN  2000-01-01 00:00:00 2000-01-01 00:00:00
2  4  NaN    c  2000-01-01 00:00:00 2000-01-01 00:00:00
3  7  6.3    d                  NaN                 NaT

In [45]: df.dtypes
Out[45]:
A             int64
B           float64
C            object
D            object
E    datetime64[ns]

In [46]: df['A'].astype('O')
Out[46]:
0    1
1    2
2    4
3    7
Name: A

In [47]: df['B'].astype('O')
Out[47]:
0    1.2
1    2.3
2    NaN
3    6.3
Name: B

In [48]: df['C'].astype('O')
Out[48]:
0      a
1    NaN
2      c
3      d
Name: C

In [49]: df['D'].astype('O')
Out[49]:
0    2000-01-01 00:00:00
1    2000-01-01 00:00:00
2    2000-01-01 00:00:00
3                    NaN
Name: D

In [50]: df['E'].astype('O')
Out[50]:
0           2000-01-01 00:00:00
1           2000-01-01 00:00:00
2           2000-01-01 00:00:00
3    2262-04-10 00:12:43.145224
Name: E

If you think these are bugs, I'm ready to file GitHub issues.
Sorry again for the long post.




Jeff

Jan 25, 2013, 7:54:37 AM
to pyd...@googlegroups.com
This is the whatsnew example for 0.10.1, but it didn't get updated in the docs.
 
0.10.1 makes every attempt to keep datetime-like columns as datetime64[ns], rather than object as previously.
You can use pd.NaT as the 'nan' value (or, starting in 0.10.1, np.nan will convert to it). You can't natively store np.nan
in a datetime64[ns] column: it's a column of np.datetime64 values, while np.nan is a float64. That's why 'object', which
is a more general category, was used before. But all operations had to convert anyhow; it is more intuitive to keep datetime64[ns]
where at all possible (and where you get object it's probably either a bug or a genuinely mixed column).
 
df.astype('O') is almost never necessary; do you have a reason for doing this?
 
your 3rd example down (where there is also a string column in the DataFrame constructor, and the datetime column is converted to object) is a bug, I think
 
 
In [27]: import numpy as np

In [28]: import pandas as pd

In [29]: df = pd.DataFrame(np.random.randn(6,2),pd.date_range('20010102',periods=6),columns=['A','B'])

In [30]: df['timestamp'] = pd.Timestamp('20010103')

In [31]: df
Out[31]:
                   A         B           timestamp
2001-01-02 -1.171874  0.658993 2001-01-03 00:00:00
2001-01-03  0.897805 -0.588730 2001-01-03 00:00:00
2001-01-04  1.188934  0.097223 2001-01-03 00:00:00
2001-01-05 -1.405073  1.517650 2001-01-03 00:00:00
2001-01-06  1.470620 -1.104856 2001-01-03 00:00:00
2001-01-07 -0.313212 -0.805524 2001-01-03 00:00:00

In [32]: df.dtypes
Out[32]:
A                   float64
B                   float64
timestamp    datetime64[ns]

In [33]: # use the traditional nan, which is mapped to NaT internally
 

In [35]: df.ix[2:4,['A','timestamp']] = np.nan

In [36]: df
Out[36]:
                   A         B           timestamp
2001-01-02 -1.171874  0.658993 2001-01-03 00:00:00
2001-01-03  0.897805 -0.588730 2001-01-03 00:00:00
2001-01-04       NaN  0.097223                 NaT
2001-01-05       NaN  1.517650                 NaT
2001-01-06  1.470620 -1.104856 2001-01-03 00:00:00
2001-01-07 -0.313212 -0.805524 2001-01-03 00:00:00

In [37]: df.dtypes
Out[37]:
A                   float64
B                   float64
timestamp    datetime64[ns]

Jeff

Jan 25, 2013, 8:55:44 AM
to pyd...@googlegroups.com
I had most of these fixed already... (this is on the 0.10.2 branch)
 


Lorenzo De Leo

Jan 25, 2013, 9:10:44 AM
to pyd...@googlegroups.com
On Friday, January 25, 2013 1:54:37 PM UTC+1, Jeff wrote:
This is the whatsnew example for 0.10.1, but it didn't get updated in the docs.

0.10.1 makes every attempt to keep datetime-like columns as datetime64[ns], rather than object as previously.
You can use pd.NaT as the 'nan' value (or, starting in 0.10.1, np.nan will convert to it). You can't natively store np.nan
in a datetime64[ns] column: it's a column of np.datetime64 values, while np.nan is a float64. That's why 'object', which
is a more general category, was used before. But all operations had to convert anyhow; it is more intuitive to keep datetime64[ns]
where at all possible (and where you get object it's probably either a bug or a genuinely mixed column).

df.astype('O') is almost never necessary; do you have a reason for doing this?
 

Thanks for the quick reply.
I'm using 0.10.1 and I'm aware that NaN becomes NaT for datetime64. My post is a bit sloppy about this and doesn't explicitly distinguish NaN from NaT, but the problem is not there.
The reason I want an astype('O') version of a DataFrame is that if I try to insert the DataFrame into a SQL database, all the np.nan values become awkward objects that are displayed as zeros but are not really zeros (a "select * from table where x=0" doesn't retrieve those rows, for example).
This is a problem with SQL, numpy and cx_Oracle, so I'm just looking for a way around it.
One possibility is to replace all the NaNs with None (the Python None).
In pandas I cannot just do df.fillna(None), because None is the default for the "value" argument.
The only way I found is to convert the df that contains the NaNs to 'object' and then replace all the NaNs with None:

df_tmp = df.astype('O')
df_tmp[pd.isnull(df_tmp)] = None

This breaks down as soon as there is a datetime64 column, and the gymnastics needed to achieve the same result are quite annoying.
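For the record, here is a sketch of the NaN/NaT-to-None conversion that also copes with datetime64 columns (written against a later pandas API, where DataFrame.notna exists; on older versions pd.notnull plays the same role):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'b': [1.2, np.nan],
    'd': pd.to_datetime(['2000-01-01', None]),  # second entry becomes NaT
})
# Cast every column to object first so None can be stored, then replace
# every missing value (NaN and NaT alike) with the Python None.
cleaned = df.astype(object).where(df.notna(), None)
print(cleaned)
```

The astype(object) step is what makes this safe for datetime64 columns: replacing missing values inside a datetime64 block would just reintroduce NaT.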

Let me try to summarize the issues I mentioned:

1) The constructor transforms the datetime64 column into object if there is another column of type object (a string for example)

2) There is no explicit way to convert (intentionally!) a datetime64 column into object

3) df.astype('O') fails if the dataframe contains a datetime64 column

4) df['a'].astype('O') gives rubbish if 'a' is datetime64 and contains NaT

As you said, 1) looks like a bug.
2) is in my opinion a missing feature. As I said, there are cases in which (unfortunately) you might be forced to convert the datetime64 to object.
3) and 4) also look like a bug to me, but I might be wrong.

Cheers


Lorenzo De Leo

Jan 25, 2013, 9:22:44 AM
to pyd...@googlegroups.com
On Friday, January 25, 2013 2:55:44 PM UTC+1, Jeff wrote:
I had most of these fixed already..... (this on 0.10.2 branch)
 


That was quick! :)
By the way, while we're at it, it seems that the Series constructor uses a different convention from the DataFrame one:

In [17]: s = pd.Series([pd.datetime(2000,1,1) for i in range(4)])

In [21]: s.dtype
Out[21]: dtype('O')

In [22]: df = pd.DataFrame([pd.datetime(2000,1,1) for i in range(4)])

In [24]: df.dtypes
Out[24]: 0    datetime64[ns]

Although maybe this is already addressed in your pull request.
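For anyone reading this later: a quick check that the two constructors agree, assuming a modern pandas release where both infer datetime64[ns] from a plain list of datetimes:

```python
import pandas as pd
from datetime import datetime

values = [datetime(2000, 1, 1)] * 4
s = pd.Series(values)
df = pd.DataFrame(values)
# On recent pandas both constructors infer the same dtype
# from a plain list of datetime objects.
print(s.dtype, df.dtypes[0])
```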
Thanks again!

Jeff

Jan 25, 2013, 9:38:13 AM
to pyd...@googlegroups.com
pretty sure the PR fixes this


Jeff

Jan 25, 2013, 10:08:08 AM
to pyd...@googlegroups.com
this is something the sql converters should do
the reason I started addressing this was to fix the issue when exporting using HDFStore
 
I just updated the PR; I think all of these conversions are fixed by it