Selecting multiple values from one level of a MultiIndex

Brendan Barnwell

Nov 2, 2013, 8:48:35 PM

to pyd...@googlegroups.com

I have a DataFrame with a MultiIndex. I want to select all the elements where the first level of the MultiIndex is one of a list of specified values. `loc` seems to be the way to do this, but it is giving me strange behavior. This is with the dev version from github. Here's an example DataFrame:

    >>> import pandas
    >>> d = pandas.DataFrame({
    ...     'X': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
    ...     'Y': ['a', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'd']
    ... })
    >>> print(d)
       X  Y
    0  1  a
    1  1  b
    2  1  b
    3  1  c
    4  2  c
    5  2  c
    6  2  d
    7  3  d
    8  3  d
    9  4  d

If I have a single-level index, as above, I can use `loc` with a list to select multiple rows:

    >>> d.loc[[0, 2, 4]]
       X  Y
    0  1  a
    2  1  b
    4  2  c

Now I give it a MultiIndex:

    >>> d.set_index(["X", "Y"], inplace=True, drop=False)
    >>> print(d)
         X  Y
    X Y
    1 a  1  a
      b  1  b
      b  1  b
      c  1  c
    2 c  2  c
      c  2  c
      d  2  d
    3 d  3  d
      d  3  d
    4 d  4  d

Selecting rows by providing both index levels works:

    >>> d.loc[1, 'b']
         X  Y
    X Y
    1 b  1  b
      b  1  b

However, things begin to get strange if I try to pass lists of indices. If I pass the two indices as a list, the rows seem to be missing:

    >>> d.loc[[1, 'b']]
         X    Y
    1  NaN  NaN
    b  NaN  NaN

Notice that the index of the returned DataFrame mixes items from two different levels of the original DataFrame, which is weird. It's not clear what it's doing here; if it's selecting from the first level, it ought to throw an error because 'b' isn't a value in that level.

    >>> d.loc[[1, 2]]
         X  Y
    X Y
    1 b  1  b
      b  1  b

With `d.loc[[1, 2]]`, what it gives me is rows 1 and 2 *by number* from the original data frame, that is, the same as `d.iloc[[1, 2]]`! (I figured this out by trying other values, e.g., `d.loc[[2, 6]]` gives the same as `d.iloc[[2, 6]]`.) This is quite surprising, as the docs state vehemently that `loc` is ONLY for label-based indexing.

Is this a bug? Given a sequence x, how can I index into the DataFrame to get all the rows where the first level of the MultiIndex is any of the values in x?

Jeffrey Tratner

Nov 2, 2013, 8:52:57 PM

to pyd...@googlegroups.com

Need to pass a tuple in the list, i.e.:

df.loc[[(1, 'b')]]

df.loc[1, 'b'] is implicitly a tuple.

Not sure about your later example.
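
A minimal sketch of the list-of-tuples form, rebuilding the frame from the first message. The `pd` alias and the name `picked` are assumptions made for illustration, and the second tuple is just an extra example key. Each tuple is a full (X, Y) label, and a duplicated label returns every matching row:

```python
import pandas as pd

# Same data as the example frame at the top of the thread.
d = pd.DataFrame({
    "X": [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
    "Y": ["a", "b", "b", "c", "c", "c", "d", "d", "d", "d"],
})
d = d.set_index(["X", "Y"], drop=False)

# Each element of the list is one complete (X, Y) key.
# (1, "b") occurs twice in the index, so it contributes two rows.
picked = d.loc[[(1, "b"), (2, "d")]]
print(picked)
```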

Jeff Reback

Nov 2, 2013, 8:56:32 PM

to pyd...@googlegroups.com

Try using xs if you don't want to specify a fully qualified tuple:

http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.xs.html?highlight=xs#pandas.DataFrame.xs

Also, you can pass a list of tuples to loc (as Jeff indicated as well).
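
A short sketch of `xs` with a partial key, again on the frame from the first message. The variable names are assumptions for illustration; `level` and `drop_level` are existing parameters of `DataFrame.xs`. Note that the key here is a single value, not a list:

```python
import pandas as pd

d = pd.DataFrame({
    "X": [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
    "Y": ["a", "b", "b", "c", "c", "c", "d", "d", "d", "d"],
})
d = d.set_index(["X", "Y"], drop=False)

# All rows whose first index level equals 1; that level is dropped from the result.
first_is_1 = d.xs(1, level=0)

# Same rows, but keeping both index levels.
first_is_1_full = d.xs(1, level=0, drop_level=False)

# A partial key on the second level works the same way.
second_is_d = d.xs("d", level=1)

print(first_is_1)
print(second_is_d)
```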

Brendan Barnwell

Nov 2, 2013, 9:23:02 PM

to pyd...@googlegroups.com

Sorry if I'm being dense, but I still don't see how those approaches answer my question, or address the odd behavior.

Given the list [1, 2], how can I get all rows from the dataframe where the FIRST level of the index is one of the values in my list? That is, the union of d.ix[1] and d.ix[2]. xs doesn't seem to accept a list of indices either.

Also, I still don't understand why d.loc[[1, 2]] returns items by numerical index, nor why d.loc[[1, 'b']] returns a DataFrame whose index is a hybrid of the two levels of the source DataFrame.

Jeffrey Tratner

Nov 2, 2013, 9:27:01 PM

to pyd...@googlegroups.com

df.loc doesn't fail if you pass it a list of items and they aren't in the index. `1` and `'b'` aren't in the Index, so it creates a new DataFrame of NaN values. The same thing would happen if you did `df.loc[['silly', 'name']]`.

Jeff Reback

Nov 2, 2013, 9:28:45 PM

to pyd...@googlegroups.com

You can do

df.loc[1:2]

If they are disjoint, then I think at the moment you would have to select them separately and then combine the results; there are just too many parameters to deal with for xs to accept a list, as it needs to handle level and axis as well.

df.loc[[
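
A sketch of both cases, under the assumption that "select separately, then combine" can be done with `xs` plus `pd.concat`. The names `wanted`, `contiguous`, and `combined` are hypothetical, and this is one possible reading of the suggestion rather than the exact code intended above:

```python
import pandas as pd

d = pd.DataFrame({
    "X": [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
    "Y": ["a", "b", "b", "c", "c", "c", "d", "d", "d", "d"],
})
d = d.set_index(["X", "Y"], drop=False)

# Contiguous first-level values: a label slice, inclusive on both ends.
# This relies on the index being sorted, which it is for this frame.
contiguous = d.loc[1:2]

# Disjoint first-level values: select each one separately, then combine.
wanted = [1, 3]
combined = pd.concat(
    [d.xs(key, level=0, drop_level=False) for key in wanted]
)

print(contiguous)
print(combined)
```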

Jeff Reback

Nov 2, 2013, 9:33:14 PM

to pyd...@googlegroups.com

df.loc[[1, 2]] is really pretty ambiguous because it tries the indices on separate axes. It's possible that this is using some of the fallback integer indexing because this is a MultiIndex; I'm not sure.

You need to explicitly use a tuple:

df.loc[(1,'b')]

xs will work similarly (but allows you to specify only a partial key).

Jeffrey Tratner

Nov 2, 2013, 9:36:49 PM

to pyd...@googlegroups.com

We might look into the interaction between loc and MI (but it's somewhat complicated in some ways); however, the key takeaway is that, if you have a MultiIndex and you're trying to get something from it, you should use xs or pass tuples to loc.
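
For the original question (all rows whose first level is in a given list), one more option that sidesteps the loc/MultiIndex interaction entirely is a boolean mask built from the level values. This is not something proposed in the thread; it is a sketch using `Index.get_level_values` and `Index.isin`, with `x` and `mask` as hypothetical names:

```python
import pandas as pd

d = pd.DataFrame({
    "X": [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
    "Y": ["a", "b", "b", "c", "c", "c", "d", "d", "d", "d"],
})
d = d.set_index(["X", "Y"], drop=False)

# The list of first-level values to keep (hypothetical example values).
x = [1, 3]

# True for every row whose first index level is one of the values in x.
mask = d.index.get_level_values(0).isin(x)

# Boolean indexing keeps the full MultiIndex and the original row order.
selected = d[mask]
print(selected)
```

Because the mask is purely positional, it does not depend on the index being sorted or unique.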

Brendan Barnwell

Nov 2, 2013, 9:38:08 PM

to pyd...@googlegroups.com

Okay, but that's contrary to the documentation, which says:

"ALL of the labels for which you ask, must be in the index or a KeyError will be raised!"

Also, the value 1 *is* in the index, and d.loc[1] works. If d.loc[1] works and d.loc['b'] fails (which is the case), then I would expect that d.loc[[1, 'b']] should either fail (because one of the passed values is not available), or return the same as d.ix[(1, 'b')] (if it interprets the two values on separate index levels).

I still can't see how the result for d.loc[[1, 'b']] makes sense, nor why d.loc[[1, 2]] returns values by numerical index.
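
The two single-label lookups described above can be checked with a small sketch. The names `by_first` and `b_found` are assumptions for illustration, and the outcome may differ between pandas versions:

```python
import pandas as pd

d = pd.DataFrame({
    "X": [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
    "Y": ["a", "b", "b", "c", "c", "c", "d", "d", "d", "d"],
})
d = d.set_index(["X", "Y"], drop=False)

# 1 is a label in the first level, so this selects the four rows under it.
by_first = d.loc[1]

# 'b' only appears in the second level, so a bare lookup should fail.
try:
    d.loc["b"]
    b_found = True
except KeyError:
    b_found = False

print(by_first)
print("d.loc['b'] succeeded:", b_found)
```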

Jeff Reback

Nov 2, 2013, 9:45:20 PM

to pyd...@googlegroups.com

I believe it's a bug that loc falls back to integer indexing for MultiIndexes.

dartdog

Nov 3, 2013, 2:25:29 PM

to pyd...@googlegroups.com

I seem to have a similar lack of understanding that I'm trying to get to the bottom of on SO. The question is here, if anyone can take a look or a crack at it:

http://stackoverflow.com/questions/19756108/selecting-a-new-dataframe-via-a-multi-indexed-frame-in-pandas-using-index-names
