Lorenzo De Leo
Jan 25, 2013, 6:17:41 AM
to pyd...@googlegroups.com
I apologize in advance: this is going to be a long post. I tried to split it into "sub-problems", but they all seem to be related.
I tried to highlight the most relevant parts of the code here and there to make it somewhat easier to read.
I found a series of problems when handling dataframes that contain datetime64 columns (as columns, not as the index!).
The first problem is related to the declaration. I haven't found a recommended way to declare a dataframe with a datetime64 column (did I miss it?), so I assume that there is nothing wrong with the following:
In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '0.10.1'
In [3]: pd.DataFrame({'a':[1,2,4,7], 'd':[pd.datetime(2000,1,1) for i in range(4)]}).dtypes
Out[3]:
a int64
d datetime64[ns]
In [4]: pd.DataFrame({'a':[1,2,4,7], 'b':[1.2, 2.3, 5.1, 6.3], 'd':[pd.datetime(2000,1,1) for i in range(4)]}).dtypes
Out[4]:
a int64
b float64
d datetime64[ns]
In [5]: pd.DataFrame({'a':[1,2,4,7], 'b':[1.2, 2.3, 5.1, 6.3], 'c':list('abcd'), 'd':[pd.datetime(2000,1,1) for i in range(4)]}).dtypes
Out[5]:
a int64
b float64
c object
d object
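It seems that the type of the 'd' column depends on the other columns. Specifically, if there is an 'object' column, then the datetime column is converted to object as well.
A cleaner (but lengthier) way is to pass a numpy array with an explicit dtype. Without the dtype (In [6]) the column is still object; with it (In [7]) the column is datetime64: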
In [6]: pd.DataFrame({'c':list('abcd'), 'd':np.array([pd.datetime(2000,1,1) for i in range(4)])}).dtypes
Out[6]:
c object
d object
In [7]: pd.DataFrame({'c':list('abcd'), 'd':np.array([pd.datetime(2000,1,1) for i in range(4)], dtype='<M8[ns]')}).dtypes
Out[7]:
c object
d datetime64[ns]
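Another way to avoid depending on inference might be to build the datetime column as a Series before handing it to the constructor. This is only a minimal sketch: it assumes a pandas version that provides pd.to_datetime and that keeps a Series' dtype when the Series is passed inside a dict, and it has not been checked against 0.10.1. The function name make_frame is made up for illustration.

```python
import datetime

import pandas as pd


def make_frame():
    """Build a frame whose 'd' column is created as datetime64 up front."""
    # Convert the Python datetimes first, so the column does not rely on
    # the DataFrame constructor's type inference (assumption: the Series
    # dtype is preserved by the constructor).
    dates = pd.Series(pd.to_datetime([datetime.datetime(2000, 1, 1)] * 4))
    return pd.DataFrame({'c': list('abcd'), 'd': dates})


df = make_frame()
print(df.dtypes)
```

If the printed dtype of 'd' is still object on a given version, the explicit numpy array from In [7] remains the fallback.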
Second (and worse) problem: if I convert the datetime column to 'object' and assign the result to a new column, the new column comes back as datetime64:
In [8]: df = pd.DataFrame({'c':list('abcd'), 'd':[pd.datetime(2000,1,1) for i in range(4)]})
In [9]: df.dtypes
Out[9]:
c object
d object
In [10]: df['d'].astype('O').dtype
Out[10]: dtype('O')
In [11]: df['e'] = df['d'].astype('O')
In [12]: df.dtypes
Out[12]:
c object
d object
e datetime64[ns]
Whereas if I assign the result to a standalone Series, the dtype stays 'object':
In [13]: s = df['d'].astype('O')
In [14]: s.dtype
Out[14]: dtype('O')
This means that there is no way (that I could find) to convert the type of a datetime64 column to 'object' inside a dataframe.
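To make this easy to re-check on other versions, the comparison can be wrapped in a small diagnostic. This is a sketch, not a fix: the function name object_dtype_survives_assignment is made up, and what it reports for the assigned column depends on the installed pandas version.

```python
import datetime

import pandas as pd


def object_dtype_survives_assignment():
    """Return (standalone_ok, assigned_ok) for the astype(object) round trip."""
    dates = pd.to_datetime([datetime.datetime(2000, 1, 1)] * 4)
    df = pd.DataFrame({'d': dates})

    # Standalone Series: this is the case that behaved as expected above.
    as_object = df['d'].astype(object)
    standalone_ok = bool(as_object.dtype == object)

    # Assigned back into the frame: the case that was re-inferred above.
    df['e'] = as_object
    assigned_ok = bool(df['e'].dtype == object)
    return standalone_ok, assigned_ok


standalone_ok, assigned_ok = object_dtype_survives_assignment()
print('standalone Series keeps object dtype:', standalone_ok)
print('assigned column keeps object dtype:  ', assigned_ok)
```

On a version that shows the behaviour described above, the second line would print False.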
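Third problem: converting the entire dataframe with astype('O') fails if there is a datetime64 column. Without one it works (In [15] to In [19]). With one, every column converts fine on its own (In [34] to In [38]), but converting the whole frame raises a ValueError (In [39]):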
In [15]: df = pd.DataFrame({'A':[1,2,4,7], 'B':[1.2, 2.3, 5.1, 6.3], 'C':list('abcd')})
In [16]: df
Out[16]:
A B C
0 1 1.2 a
1 2 2.3 b
2 4 5.1 c
3 7 6.3 d
In [17]: df.dtypes
Out[17]:
A int64
B float64
C object
In [18]: df.astype('O')
Out[18]:
A B C
0 1 1.2 a
1 2 2.3 b
2 4 5.1 c
3 7 6.3 d
In [19]: df.astype('O').dtypes
Out[19]:
A object
B object
C object
In [30]: df = pd.DataFrame({'A':[1,2,4,7],
....: 'B':[1.2, 2.3, 5.1, 6.3],
....: 'C':list('abcd'),
....: 'D':[pd.datetime(2000,1,1) for i in range(4)]})
In [31]: df['E'] = [pd.datetime(2000,1,1) for i in range(4)]
In [32]: df
Out[32]:
A B C D E
0 1 1.2 a 2000-01-01 00:00:00 2000-01-01 00:00:00
1 2 2.3 b 2000-01-01 00:00:00 2000-01-01 00:00:00
2 4 5.1 c 2000-01-01 00:00:00 2000-01-01 00:00:00
3 7 6.3 d 2000-01-01 00:00:00 2000-01-01 00:00:00
In [33]: df.dtypes
Out[33]:
A int64
B float64
C object
D object
E datetime64[ns]
In [34]: df['A'].astype('O')
Out[34]:
0 1
1 2
2 4
3 7
Name: A
In [35]: df['B'].astype('O')
Out[35]:
0 1.2
1 2.3
2 5.1
3 6.3
Name: B
In [36]: df['C'].astype('O')
Out[36]:
0 a
1 b
2 c
3 d
Name: C
In [37]: df['D'].astype('O')
Out[37]:
0 2000-01-01 00:00:00
1 2000-01-01 00:00:00
2 2000-01-01 00:00:00
3 2000-01-01 00:00:00
Name: D
In [38]: df['E'].astype('O')
Out[38]:
0 2000-01-01 00:00:00
1 2000-01-01 00:00:00
2 2000-01-01 00:00:00
3 2000-01-01 00:00:00
Name: E
In [39]: df.astype('O')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/home/ldeleo/<ipython console> in <module>()
/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/generic.pyc in astype(self, dtype)
499 casted : type of caller
500 """
--> 501 return self._constructor(self._data, dtype=dtype)
502
503 @property
/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
381
382 if isinstance(data, BlockManager):
--> 383 mgr = self._init_mgr(data, index, columns, dtype=dtype, copy=copy)
384 elif isinstance(data, dict):
385 mgr = self._init_dict(data, index, columns, dtype=dtype)
/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in _init_mgr(self, mgr, index, columns, dtype, copy)
464 # avoid copy if we can
465 if len(mgr.blocks) > 1 or mgr.blocks[0].values.dtype != dtype:
--> 466 mgr = mgr.astype(dtype)
467 return mgr
468
/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/internals.pyc in astype(self, dtype)
615 new_blocks = []
616 for block in self.blocks:
--> 617 newb = make_block(com._astype_nansafe(block.values, dtype),
618 block.items, block.ref_items)
619 new_blocks.append(newb)
/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/common.pyc in _astype_nansafe(arr, dtype)
1039 if issubclass(arr.dtype.type, np.datetime64):
1040 if dtype == object:
-> 1041 return tslib.ints_to_pydatetime(arr.view(np.int64))
1042 elif (np.issubdtype(arr.dtype, np.floating) and
1043 np.issubdtype(dtype, np.integer)):
/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/tslib.so in pandas.tslib.ints_to_pydatetime (pandas/tslib.c:2561)()
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
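The traceback suggests that the frame-level path hands a 2-D array to a routine that only accepts 1-D input, while the Series path hands it a 1-D array. The sketch below reproduces that shape mismatch with plain numpy. Both functions are hypothetical stand-ins written for illustration, not pandas internals, and flatten-then-reshape is just one possible way to handle it.

```python
import datetime

import numpy as np


def to_pydatetime_1d(values):
    """Hypothetical converter that, like the failing call, accepts only 1-D input."""
    if values.ndim != 1:
        raise ValueError(
            "Buffer has wrong number of dimensions (expected 1, got {})".format(values.ndim)
        )
    # numpy returns datetime.datetime objects for microsecond resolution.
    return values.astype('M8[us]').astype(object)


def to_pydatetime_nd(values):
    """Flatten, convert with the 1-D routine, then restore the original shape."""
    flat = to_pydatetime_1d(values.ravel())
    return flat.reshape(values.shape)


# One datetime64 column of four rows, stored as a (1, 4) array
# (assumption: this mirrors how the frame holds the column internally).
block_values = np.array([['2000-01-01'] * 4], dtype='M8[ns]')

converted = to_pydatetime_nd(block_values)
print(converted.shape, converted.dtype)
```

Calling to_pydatetime_1d(block_values) directly raises the same kind of ValueError as above, while to_pydatetime_1d(block_values[0]) works, which matches the difference between df.astype('O') and df['E'].astype('O').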
Last (!!!) problem: if I introduce NaNs, the conversion to object goes wrong for the datetime64 column (and not for the others, of course). The NaT in column E comes back as a bogus timestamp in the year 2262:
In [40]: df.ix[2,1] = np.nan
In [41]: df.ix[1,2] = np.nan
In [42]: df.ix[3,3] = np.nan
In [43]: df.ix[3,4] = np.nan
In [44]: df
Out[44]:
A B C D E
0 1 1.2 a 2000-01-01 00:00:00 2000-01-01 00:00:00
1 2 2.3 NaN 2000-01-01 00:00:00 2000-01-01 00:00:00
2 4 NaN c 2000-01-01 00:00:00 2000-01-01 00:00:00
3 7 6.3 d NaN NaT
In [45]: df.dtypes
Out[45]:
A int64
B float64
C object
D object
E datetime64[ns]
In [46]: df['A'].astype('O')
Out[46]:
0 1
1 2
2 4
3 7
Name: A
In [47]: df['B'].astype('O')
Out[47]:
0 1.2
1 2.3
2 NaN
3 6.3
Name: B
In [48]: df['C'].astype('O')
Out[48]:
0 a
1 NaN
2 c
3 d
Name: C
In [49]: df['D'].astype('O')
Out[49]:
0 2000-01-01 00:00:00
1 2000-01-01 00:00:00
2 2000-01-01 00:00:00
3 NaN
Name: D
In [50]: df['E'].astype('O')
Out[50]:
0 2000-01-01 00:00:00
1 2000-01-01 00:00:00
2 2000-01-01 00:00:00
3 2262-04-10 00:12:43.145224
Name: E
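The bogus value in Out[50] would be consistent with the missing-value marker being converted as if it were an ordinary timestamp. numpy stores NaT as the smallest int64, so a converter that does not check for it first will produce some arbitrary date. The sketch below shows the sentinel and a conversion that masks it out. The function name datetime64_to_object is made up, and mapping NaT to None is only one possible choice.

```python
import datetime

import numpy as np


def datetime64_to_object(values):
    """Convert datetime64 values to datetime objects, mapping NaT to None."""
    values = np.asarray(values, dtype='M8[ns]')
    out = np.empty(values.shape, dtype=object)
    missing = np.isnat(values)
    # Convert only the real timestamps; microsecond resolution makes numpy
    # return datetime.datetime objects.
    out[~missing] = values[~missing].astype('M8[us]').astype(object)
    out[missing] = None
    return out


arr = np.array(['2000-01-01', 'NaT'], dtype='M8[ns]')

# NaT is stored as the minimum int64 value.
print(arr.view('i8')[1] == np.iinfo(np.int64).min)

result = datetime64_to_object(arr)
print(result[0] == datetime.datetime(2000, 1, 1), result[1] is None)
```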
If you think these are bugs, I'm ready to file GitHub issues.
Sorry again for the long post.