DataFrame indexing with DatetimeIndex

Teodora Baeva

Sep 16, 2013, 12:36:52 PM
to pyd...@googlegroups.com
Hi,

I'm trying to index a DataFrame with a DatetimeIndex.  I don't want to use .reindex or .loc as those silently append NaNs if an indexing date isn't in the DataFrame and I'd rather have an error to let me know.

With simple indexing I'm getting a KeyError, even though the dates I'm indexing with match the dates in the DataFrame index by construction. Here is a minimal example:

import numpy as np
import pandas as pd

dates = pd.date_range('01-Jan-2014', periods=3, freq='MS')
df = pd.DataFrame(np.ones([3, 10]), index=dates)
df[dates]

KeyError: "['2014-01-01T00:00:00.000000000+0000' '2014-02-01T00:00:00.000000000+0000' '2014-03-01T00:00:00.000000000+0000'] not in index"

Is this a bug? Is it possible to get the behaviour I'm looking for?

Thx,
Teodora

Jeff

Sep 16, 2013, 7:08:24 PM
to pyd...@googlegroups.com
You need to use .loc. You are doing a column selection.
 

In [13]: dates = pd.date_range('01-Jan-2014', periods=4, freq='MS')
In [14]: dates
Out[14]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 00:00:00, ..., 2014-04-01 00:00:00]
Length: 4, Freq: MS, Timezone: None
In [15]: df = pd.DataFrame(np.ones([4, 10]), index=dates)
In [16]: df
Out[16]:
            0  1  2  3  4  5  6  7  8  9
2014-01-01  1  1  1  1  1  1  1  1  1  1
2014-02-01  1  1  1  1  1  1  1  1  1  1
2014-03-01  1  1  1  1  1  1  1  1  1  1
2014-04-01  1  1  1  1  1  1  1  1  1  1
 
# single selection
In [17]: df.loc[dates[0]]
Out[17]:
0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
Name: 2014-01-01 00:00:00, dtype: float64

# slicing
In [18]: df.loc[dates[1]:dates[2]]
Out[18]:
            0  1  2  3  4  5  6  7  8  9
2014-02-01  1  1  1  1  1  1  1  1  1  1
2014-03-01  1  1  1  1  1  1  1  1  1  1
 
# multiple selection
In [20]: df.loc[[dates[1],dates[3]]]
Out[20]:
            0  1  2  3  4  5  6  7  8  9
2014-02-01  1  1  1  1  1  1  1  1  1  1
2014-04-01  1  1  1  1  1  1  1  1  1  1
# via timeseries string selection
In [24]: df.loc['20140301']
Out[24]:
0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
Name: 2014-03-01 00:00:00, dtype: float64
 
In [26]: df.loc['20140301':'20140401']
Out[26]:
            0  1  2  3  4  5  6  7  8  9
2014-03-01  1  1  1  1  1  1  1  1  1  1
2014-04-01  1  1  1  1  1  1  1  1  1  1

Dave Hirschfeld

Sep 18, 2013, 8:38:43 AM
to pyd...@googlegroups.com
I'm also interested in being able to index with dates and have it throw an IndexError if an indexing date isn't in the DatetimeIndex.

I'm not sure why you'd describe it as a column selection?
What I want is to pull out the *rows* matching the index dates; if I wanted to get columns, surely that would be df[:, column_index]?

Numpy has a very long history of the first index argument operating on the rows, the second on the columns, etc., with any unspecified axes implicitly assumed to be `:`. Anything else would be pretty surprising (and undesirable IMHO).

I can't use `.loc` because it has the undesirable "feature" of appending NaNs rather than raising an IndexError - e.g.

In [6]: index_dates = pd.date_range('01-Jan-2014', periods=4, freq='MS')

In [7]: df.loc[index_dates]
Out[7]: 

              0    1    2    3    4    5    6    7    8    9
2014-01-01    1    1    1    1    1    1    1    1    1    1
2014-02-01    1    1    1    1    1    1    1    1    1    1
2014-03-01    1    1    1    1    1    1    1    1    1    1
2014-04-01  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  <----- should raise an IndexError


The behaviour I actually want does happen if you index the columns:

In [8]: df.loc[:, array([1,5,7,12])]
Traceback (most recent call last):

  File "<ipython-input-16-16285534a137>", line 1, in <module>
    df.loc[:, array([1,5,7,12])]

  File "C:\dev\bin\Anaconda\lib\site-packages\pandas\core\indexing.py", line 669, in __getitem__
    return self._getitem_tuple(key)

  File "C:\dev\bin\Anaconda\lib\site-packages\pandas\core\indexing.py", line 261, in _getitem_tuple
    self._has_valid_tuple(tup)

  File "C:\dev\bin\Anaconda\lib\site-packages\pandas\core\indexing.py", line 664, in _has_valid_tuple
    if not self._has_valid_type(k,i):

  File "C:\dev\bin\Anaconda\lib\site-packages\pandas\core\indexing.py", line 728, in _has_valid_type
    raise KeyError("[%s] are not in ALL in the [%s]" % (key,self.obj._get_axis_name(axis)))

KeyError: '[[ 1  5  7 12]] are not in ALL in the [columns]'


..but it's not so nice that the behaviour depends on the axis you index.

In the end my (ugly) work-around is:

In [11]: df.T[index_dates]
Traceback (most recent call last):

  File "<ipython-input-22-0718b74e0634>", line 1, in <module>
    df.T[index_dates]

  File "C:\dev\bin\Anaconda\lib\site-packages\pandas\core\frame.py", line 1921, in __getitem__
    return self._getitem_array(key)

  File "C:\dev\bin\Anaconda\lib\site-packages\pandas\core\frame.py", line 1953, in _getitem_array
    indexer = self.ix._convert_to_indexer(key, axis=1)

  File "C:\dev\bin\Anaconda\lib\site-packages\pandas\core\indexing.py", line 583, in _convert_to_indexer
    raise KeyError('%s not in index' % objarr[mask])

KeyError: "['2014-04-01T01:00:00.000000000+0100'] not in index" <--- desired IndexError


So to summarise my points:

1. Silently appending NaNs when indexing the rows is undesirable IMHO
2. Having the default indexer (i.e. df[col_idx]) operate on columns is undesirable, as it flies in the face of established numpy indexing. The current behaviour could easily be achieved by prepending a `:` (i.e. df[:, col_idx]), but there's no way to index the rows with the current behaviour (without resorting to specialised indexers)
3. Indexing columns throws an IndexError if the indexing value isn't in the index, making row / column indexing behave subtly differently

I guess with #2 it's inherited behaviour? It seems like a poor cousin to `.loc` indexing, with less flexibility and some surprising (IMO) behaviour.

..so if .loc were made to throw if an indexing value wasn't in the specified index for *all* axes, it would solve #1 & #3 for me.
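[Editor's note: the strict row selection Dave is asking for can be sketched as a small wrapper; `strict_loc` is a hypothetical helper, not a pandas API. Later pandas versions (1.0+) did in fact make `.loc` with a list containing missing labels raise a KeyError.]

```python
import numpy as np
import pandas as pd

def strict_loc(df, labels):
    """Select rows by label, raising KeyError if ANY label is absent.

    Hypothetical helper illustrating the behaviour requested above.
    """
    labels = pd.Index(labels)
    missing = labels[~labels.isin(df.index)]  # labels not found in the row index
    if len(missing) > 0:
        raise KeyError("labels not in index: %s" % list(missing))
    return df.loc[labels]

dates = pd.date_range('2014-01-01', periods=3, freq='MS')
df = pd.DataFrame(np.ones([3, 10]), index=dates)

strict_loc(df, dates)  # fine: all labels present
# strict_loc(df, pd.date_range('2014-01-01', periods=4, freq='MS'))
# would raise KeyError, because 2014-04-01 is not in df.index
```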

Thanks,
Dave


Jeff

Sep 18, 2013, 9:03:52 AM
to pyd...@googlegroups.com
[] slicing is provided purely as a convenience for slicing the rows: http://pandas.pydata.org/pandas-docs/dev/indexing.html#slicing-ranges
[] is overloaded quite a lot, but primarily functions as a column selector
 
The loc behavior of not raising an error is exactly what it is supposed to do. Pandas maintains the integrity of what you are trying to do;
this use of loc is equivalent to a reindexing operation.
 
If you are truly interested in having only those dates in the index, then just do the following:
 
In [11]: df.loc[df.index & dates]
Out[11]:
            0  1  2  3  4  5  6  7  8  9
2014-01-01  1  1  1  1  1  1  1  1  1  1
2014-02-01  1  1  1  1  1  1  1  1  1  1
2014-03-01  1  1  1  1  1  1  1  1  1  1
Behaviors are somewhat different from numpy for good reason, e.g. in the case of a DataFrame, column access just
makes sense as the default getitem. If you want numpy exactly, then use numpy.
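[Editor's note: in later pandas versions the `&` operator on Index objects was deprecated as a set operation, so Jeff's `df.index & dates` is better written with the named `.intersection` method; a minimal runnable sketch:]

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2014-01-01', periods=4, freq='MS')
df = pd.DataFrame(np.ones([3, 10]), index=dates[:3])

# set intersection of the two indexes: keeps only dates present in both,
# so the selection silently contracts instead of padding with NaN rows
common = df.index.intersection(dates)
result = df.loc[common]
```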


Dave Hirschfeld

Sep 18, 2013, 4:11:04 PM
to pyd...@googlegroups.com


On Wednesday, September 18, 2013 2:03:52 PM UTC+1, Jeff wrote:
[] slicing is provided purely as a convenience for slicing the rows: http://pandas.pydata.org/pandas-docs/dev/indexing.html#slicing-ranges
[] is overloaded quite a lot, but primarily functions as a column selector
 
The loc behavior of not raising an error is exactly what it is supposed to do. Pandas maintains the integrity of what you are trying to do;
this use of loc is equivalent to a reindexing operation.

I'd argue that it's not maintaining the integrity of what I'm trying to do. I'm trying to index into the rows of my DataFrame, not to *reindex* the rows; if I wanted to reindex, there's the reindex function for that. If I want an indexing operation on the rows of my DataFrame which will throw an error if the index values aren't in the row index, pandas doesn't supply that - you have to transpose, index on the columns and transpose back.
 
 
If you are truly interested in having only those dates in the index, then just do the following:
 
In [11]: df.loc[df.index & dates]
Out[11]:
            0  1  2  3  4  5  6  7  8  9
2014-01-01  1  1  1  1  1  1  1  1  1  1
2014-02-01  1  1  1  1  1  1  1  1  1  1
2014-03-01  1  1  1  1  1  1  1  1  1  1

The point is that I don't want the indexing operation to either silently expand my selection or silently contract it. I want to know (loudly) if my index values aren't a subset of the index itself. If I did want to expand it I could use reindex, if I did want to contract it I could index with the intersection. 

To try and determine if my index dates were a subset of the index, I tried comparing the intersection to the original index dates, but this throws an AttributeError, I assume because the arrays are of different size. I couldn't see any issues regarding this but could open one if it's agreed that it should follow the numpy convention of returning False if the arrays aren't the same size:

In [14]: (index_dates & df.index) == index_dates
Traceback (most recent call last):

  File "<ipython-input-14-e8163146b91d>", line 1, in <module>
    (index_dates & df.index) == index_dates

  File "C:\dev\bin\Anaconda\lib\site-packages\pandas\tseries\index.py", line 79, in wrapper
    return result.view(np.ndarray)

AttributeError: 'NotImplementedType' object has no attribute 'view'


Thanks,
Dave

Jeff

Sep 18, 2013, 4:27:24 PM
to pyd...@googlegroups.com
try this:
 

In [10]: df.index.equals(dates)
Out[10]: False
In [11]: (df.index & dates).equals(dates)
Out[11]: False
 
(== is not supported on indexing objects for a variety of reasons)
 
pandas does have different behavior if a single item is presented versus a list/boolean indexer/slice:

the former will KeyError if it doesn't exist, while the latter do not. This has been the case as far back as I can remember.

You need to check yourself if you want ALL of the objects to be present, which IMHO is actually a special case.
 
you can also do
 
In [13]: df.index.isin(dates)
Out[13]: array([ True,  True,  True], dtype=bool)
In [14]: dates.isin(df.index)
Out[14]: array([ True,  True,  True, False], dtype=bool)
 
and index using that if you want.
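[Editor's note: a runnable sketch of the isin approach Jeff describes; note it silently drops the missing date, so a caller who wants a loud failure still has to check the mask themselves.]

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2014-01-01', periods=4, freq='MS')
df = pd.DataFrame(np.ones([3, 10]), index=dates[:3])

mask = dates.isin(df.index)   # one boolean per requested date
result = df.loc[dates[mask]]  # select only the dates that actually exist

if not mask.all():
    # the caller can choose to raise here, as Dave wants
    pass
```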

Dave Hirschfeld

Sep 18, 2013, 5:28:47 PM
to pyd...@googlegroups.com


Thanks, it looks like the intersection + .equals method is the way to go:

In [15]: dates1 = pd.date_range('01-Jan-2014', '01-Jan-2016', freq='H')[0:-1]
    ...: dates2 = pd.date_range('01-Jan-2014', '01-Jan-2016', freq='D')


In [16]: %timeit (dates1 & dates2).equals(dates2)

1000 loops, best of 3: 604 µs per loop


In [17]: %timeit dates2.isin(dates1).all()

100 loops, best of 3: 4.3 ms per loop


-Dave


 

Jeff Reback

Sep 18, 2013, 5:42:07 PM
to pyd...@googlegroups.com


great :)