DataFrame indexing with DatetimeIndex

Teodora Baeva

Sep 16, 2013, 12:36:52 PM
to pyd...@googlegroups.com
Hi,

I'm trying to index a DataFrame with a DatetimeIndex.  I don't want to use .reindex or .loc as those silently append NaNs if an indexing date isn't in the DataFrame and I'd rather have an error to let me know.

With simple indexing I'm getting a KeyError, even though the dates I'm indexing with match the dates in the DataFrame index by construction. Here is a minimal example:

import numpy as np
import pandas as pd

dates = pd.date_range('01-Jan-2014', periods=3, freq='MS')
df = pd.DataFrame(np.ones([3, 10]), index=dates)
df[dates]

KeyError: "['2014-01-01T00:00:00.000000000+0000' '2014-02-01T00:00:00.000000000+0000' '2014-03-01T00:00:00.000000000+0000'] not in index"

Is this a bug? Is it possible to get the behaviour I'm looking for?

Thx,
Teodora

Jeff

Sep 16, 2013, 7:08:24 PM
to pyd...@googlegroups.com
You need to use .loc. You are doing a column selection.
 

In [13]: dates = pd.date_range('01-Jan-2014', periods=4, freq='MS')
In [14]: dates
Out[14]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 00:00:00, ..., 2014-04-01 00:00:00]
Length: 4, Freq: MS, Timezone: None
In [15]: df = pd.DataFrame(np.ones([4, 10]), index=dates)
In [16]: df
Out[16]:
            0  1  2  3  4  5  6  7  8  9
2014-01-01  1  1  1  1  1  1  1  1  1  1
2014-02-01  1  1  1  1  1  1  1  1  1  1
2014-03-01  1  1  1  1  1  1  1  1  1  1
2014-04-01  1  1  1  1  1  1  1  1  1  1
 
# single selection
In [17]: df.loc[dates[0]]
Out[17]:
0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
Name: 2014-01-01 00:00:00, dtype: float64

# slicing
In [18]: df.loc[dates[1]:dates[2]]
Out[18]:
            0  1  2  3  4  5  6  7  8  9
2014-02-01  1  1  1  1  1  1  1  1  1  1
2014-03-01  1  1  1  1  1  1  1  1  1  1
 
# multiple selection
In [20]: df.loc[[dates[1],dates[3]]]
Out[20]:
            0  1  2  3  4  5  6  7  8  9
2014-02-01  1  1  1  1  1  1  1  1  1  1
2014-04-01  1  1  1  1  1  1  1  1  1  1
# via timeseries string selection
In [24]: df.loc['20140301']
Out[24]:
0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
Name: 2014-03-01 00:00:00, dtype: float64
 
In [26]: df.loc['20140301':'20140401']
Out[26]:
            0  1  2  3  4  5  6  7  8  9
2014-03-01  1  1  1  1  1  1  1  1  1  1
2014-04-01  1  1  1  1  1  1  1  1  1  1

Dave Hirschfeld

Sep 18, 2013, 8:38:43 AM
to pyd...@googlegroups.com
I'm also interested in being able to index with dates and have it throw an IndexError if an indexing date isn't in the DatetimeIndex.

I'm not sure why you'd describe it as a column selection?
What I want is to pull out the *rows* matching the index dates; if I wanted to get columns, surely that would be df[:, column_index]?

Numpy has a very long history of the first index argument operating on the rows, the second on the columns, etc., with any unspecified axes implicitly assumed to be `:`. Anything else would be pretty surprising (and undesirable IMHO).

I can't use `.loc` because it has the undesirable "feature" of appending NaNs rather than raising an IndexError - e.g.

In [6]: index_dates = pd.date_range('01-Jan-2014', periods=4, freq='MS')

In [7]: df.loc[index_dates]
Out[7]: 

              0    1    2    3    4    5    6    7    8    9
2014-01-01    1    1    1    1    1    1    1    1    1    1
2014-02-01    1    1    1    1    1    1    1    1    1    1
2014-03-01    1    1    1    1    1    1    1    1    1    1
2014-04-01  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  <----- should raise an IndexError


The behaviour I actually want does happen if you index the columns:

In [8]: df.loc[:, array([1,5,7,12])]
Traceback (most recent call last):

  File "<ipython-input-16-16285534a137>", line 1, in <module>
    df.loc[:, array([1,5,7,12])]

  File "C:\dev\bin\Anaconda\lib\site-packages\pandas\core\indexing.py", line 669, in __getitem__
    return self._getitem_tuple(key)

  File "C:\dev\bin\Anaconda\lib\site-packages\pandas\core\indexing.py", line 261, in _getitem_tuple
    self._has_valid_tuple(tup)

  File "C:\dev\bin\Anaconda\lib\site-packages\pandas\core\indexing.py", line 664, in _has_valid_tuple
    if not self._has_valid_type(k,i):

  File "C:\dev\bin\Anaconda\lib\site-packages\pandas\core\indexing.py", line 728, in _has_valid_type
    raise KeyError("[%s] are not in ALL in the [%s]" % (key,self.obj._get_axis_name(axis)))

KeyError: '[[ 1  5  7 12]] are not in ALL in the [columns]'


..but it's not so nice that the behaviour depends on the axis you index.

In the end my (ugly) work-around is:

In [11]: df.T[index_dates]
Traceback (most recent call last):

  File "<ipython-input-22-0718b74e0634>", line 1, in <module>
    df.T[index_dates]

  File "C:\dev\bin\Anaconda\lib\site-packages\pandas\core\frame.py", line 1921, in __getitem__
    return self._getitem_array(key)

  File "C:\dev\bin\Anaconda\lib\site-packages\pandas\core\frame.py", line 1953, in _getitem_array
    indexer = self.ix._convert_to_indexer(key, axis=1)

  File "C:\dev\bin\Anaconda\lib\site-packages\pandas\core\indexing.py", line 583, in _convert_to_indexer
    raise KeyError('%s not in index' % objarr[mask])

KeyError: "['2014-04-01T01:00:00.000000000+0100'] not in index" <--- desired IndexError


So to summarise my points:

1. Silently appending NaNs when indexing the rows is undesirable IMHO
2. Having the default indexer (i.e. df[col_idx]) operate on columns is undesirable, as it flies in the face of established numpy indexing. The current behaviour could easily be achieved by prepending a `:` (i.e. df[:, col_idx]), but there's no way to index the rows with the current behaviour (without resorting to specialised indexers)
3. Indexing columns throws an IndexError if the indexing value isn't in the index, making row / column indexing behave subtly differently

I guess with #2 it's inherited behaviour? It seems like a poor cousin to `.loc` indexing, with less flexibility and some surprising (IMO) behaviour.

..so if .loc were made to throw if an indexing value wasn't in the specified index for *all* axes, it would solve #1 & #3 for me.
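[Editor's note: the strict row selection Dave is asking for can be sketched as a small wrapper; `strict_loc` is a hypothetical helper, not a pandas API. Later pandas versions (1.0+) did in fact make `.loc` with a list containing missing labels raise a KeyError.]

```python
import numpy as np
import pandas as pd

def strict_loc(df, labels):
    """Select rows by label, raising KeyError if ANY label is absent.

    Hypothetical helper illustrating the behaviour requested above.
    """
    labels = pd.Index(labels)
    missing = labels[~labels.isin(df.index)]  # labels not found in the row index
    if len(missing) > 0:
        raise KeyError("labels not in index: %s" % list(missing))
    return df.loc[labels]

dates = pd.date_range('2014-01-01', periods=3, freq='MS')
df = pd.DataFrame(np.ones([3, 10]), index=dates)

strict_loc(df, dates)  # fine: all labels present
# strict_loc(df, pd.date_range('2014-01-01', periods=4, freq='MS'))
# would raise KeyError, because 2014-04-01 is not in df.index
```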

Thanks,
Dave


Jeff

Sep 18, 2013, 9:03:52 AM
to pyd...@googlegroups.com
[] slicing is provided purely as a convenience for slicing the rows: http://pandas.pydata.org/pandas-docs/dev/indexing.html#slicing-ranges
[] is overloaded quite a lot, but primarily functions as a column selector
 
The loc behavior of not raising an error is exactly what it is supposed to do. Pandas maintains the integrity of what you are trying to do;
this use of loc is equivalent to a reindexing operation.
 
If you are truly interested in having only those dates in the index, then just do the following:
 
In [11]: df.loc[df.index & dates]
Out[11]:
            0  1  2  3  4  5  6  7  8  9
2014-01-01  1  1  1  1  1  1  1  1  1  1
2014-02-01  1  1  1  1  1  1  1  1  1  1
2014-03-01  1  1  1  1  1  1  1  1  1  1
Behaviors are somewhat different from numpy for good reason, e.g. in the case of a DataFrame, column access just
makes sense as the default getitem. If you want numpy exactly, then use numpy.
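[Editor's note: in later pandas versions the `&` operator on Index objects was deprecated as a set operation, so Jeff's `df.index & dates` is better written with the named `.intersection` method; a minimal runnable sketch:]

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2014-01-01', periods=4, freq='MS')
df = pd.DataFrame(np.ones([3, 10]), index=dates[:3])

# set intersection of the two indexes: keeps only dates present in both,
# so the selection silently contracts instead of padding with NaN rows
common = df.index.intersection(dates)
result = df.loc[common]
```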


Dave Hirschfeld

Sep 18, 2013, 4:11:04 PM
to pyd...@googlegroups.com


On Wednesday, September 18, 2013 2:03:52 PM UTC+1, Jeff wrote:
[] slicing is provided purely as a convenience for slicing the rows: http://pandas.pydata.org/pandas-docs/dev/indexing.html#slicing-ranges
[] is overloaded quite a lot, but primarily functions as a column selector
 
The loc behavior of not raising an error is exactly what it is supposed to do. Pandas maintains the integrity of what you are trying to do;
this use of loc is equivalent to a reindexing operation.

I'd argue that it's not maintaining the integrity of what I'm trying to do. I'm trying to index into the rows of my DataFrame, not to *reindex* the rows; if I wanted to reindex, there's the reindex function for that. If I want an indexing operation on the rows of my DataFrame which will throw an error if the index values aren't in the row index, pandas doesn't supply that - you have to transpose, index on the columns and transpose back.
 
 
If you are truly interested in having only those dates in the index, then just do the following:
 
In [11]: df.loc[df.index & dates]
Out[11]:
            0  1  2  3  4  5  6  7  8  9
2014-01-01  1  1  1  1  1  1  1  1  1  1
2014-02-01  1  1  1  1  1  1  1  1  1  1
2014-03-01  1  1  1  1  1  1  1  1  1  1

The point is that I don't want the indexing operation to either silently expand my selection or silently contract it. I want to know (loudly) if my index values aren't a subset of the index itself. If I did want to expand it I could use reindex, if I did want to contract it I could index with the intersection. 

To try and determine if my index dates were a subset of the index, I tried comparing the intersection to the original index dates, but this throws an AttributeError, I assume because the arrays are of different size. I couldn't see any issues regarding this but could open one if it's agreed that it should follow the numpy convention of returning False if the arrays aren't the same size:

In [14]: (index_dates & df.index) == index_dates
Traceback (most recent call last):

  File "<ipython-input-14-e8163146b91d>", line 1, in <module>
    (index_dates & df.index) == index_dates

  File "C:\dev\bin\Anaconda\lib\site-packages\pandas\tseries\index.py", line 79, in wrapper
    return result.view(np.ndarray)

AttributeError: 'NotImplementedType' object has no attribute 'view'


Thanks,
Dave

Jeff

Sep 18, 2013, 4:27:24 PM
to pyd...@googlegroups.com
try this:
 

In [10]: df.index.equals(dates)
Out[10]: False
In [11]: (df.index & dates).equals(dates)
Out[11]: False
 
(== is not supported on indexing objects for a variety of reasons)
 
pandas does have different behavior if a single item is presented versus a list/boolean indexer/slice:

the former will KeyError if it doesn't exist, while the latter do not. This has been the case as far back as I can remember.

You need to check yourself if you want ALL of the objects to be present, which IMHO is actually a special case.
 
you can also do
 
In [13]: df.index.isin(dates)
Out[13]: array([ True,  True,  True], dtype=bool)
In [14]: dates.isin(df.index)
Out[14]: array([ True,  True,  True, False], dtype=bool)
 
and index using that if you want.
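[Editor's note: a runnable sketch of the isin approach Jeff describes; note it silently drops the missing date, so a caller who wants a loud failure still has to check the mask themselves.]

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2014-01-01', periods=4, freq='MS')
df = pd.DataFrame(np.ones([3, 10]), index=dates[:3])

mask = dates.isin(df.index)   # one boolean per requested date
result = df.loc[dates[mask]]  # select only the dates that actually exist

if not mask.all():
    # the caller can choose to raise here, as Dave wants
    pass
```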

Dave Hirschfeld

Sep 18, 2013, 5:28:47 PM
to pyd...@googlegroups.com


Thanks, it looks like the intersection + .equals method is the way to go:

In [15]: dates1 = pd.date_range('01-Jan-2014', '01-Jan-2016', freq='H')[0:-1]
    ...: dates2 = pd.date_range('01-Jan-2014', '01-Jan-2016', freq='D')


In [16]: %timeit (dates1 & dates2).equals(dates2)

1000 loops, best of 3: 604 µs per loop


In [17]: %timeit dates2.isin(dates1).all()

100 loops, best of 3: 4.3 ms per loop


-Dave


 

Jeff Reback

Sep 18, 2013, 5:42:07 PM
to pyd...@googlegroups.com


great :)