How to select rows in a non-index-sorted DataFrame in a world without df.ix[]


Leo

Jun 19, 2017, 8:39:24 AM
to pydata
Hi,

the docs (http://pandas.pydata.org/pandas-docs/stable/advanced.html)
say two things about label-based indexing:

1. "Indexing will work even if the data are not sorted, but will be
rather inefficient (and show a PerformanceWarning)."

2. "Furthermore if you try to index something that is not fully
lexsorted, this can raise [...] UnsortedIndexError: 'Key length (2)
was greater than MultiIndex lexsort depth (1)'"

Both statements appear imprecise and logically inconsistent. More
importantly, it seems that our soon-to-be-buried .ix[], while maybe
not so performant, seamlessly allows label-based selections in
unsorted DataFrames. So with the deprecation of .ix, this in itself
very useful feature seems to get lost. There are indeed compelling
reasons for leaving a DataFrame non-index-sorted.

1. Am I getting anything wrong?
2. If not, is there a workaround other than super-inefficiently
creating a new index-sorted DataFrame with the original index cols as
ordinary cols and using .isin?
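
For illustration, a minimal sketch of an .isin-based selection of this kind. The toy frame and its labels are made up, and using get_level_values on the index (rather than moving the index into ordinary columns) is an assumption about what would be acceptable here, not something taken from the pandaSDMX code:

```python
import pandas as pd

# Toy stand-in for a codelist frame; labels and values are made up.
# The two-level index is deliberately left unsorted.
index = pd.MultiIndex.from_tuples(
    [("UNIT", "PC_ACT"), ("AGE", "TOTAL"), ("SEX", "F"),
     ("AGE", "Y_LT25"), ("UNIT", "THS_PER")]
)
df = pd.DataFrame({"value": range(5)}, index=index)

# Boolean mask on the first index level; no sorting and no
# reset_index() round trip is needed for this.
mask = df.index.get_level_values(0).isin(["AGE", "UNIT"])
selected = df[mask]
print(selected)
```

The selected rows come back in their original (unsorted) order, not grouped by the requested labels.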

Thanks.

Leo

Pietro Battiston

Jun 19, 2017, 10:02:01 AM
to pyd...@googlegroups.com
On Mon, 19/06/2017 at 14:39 +0200, 'Leo' via PyData wrote:
> Hi,
>
> the docs (http://pandas.pydata.org/pandas-docs/stable/advanced.html)
> say two things about label-based indexing:
>
> 1. "Indexing will work even if the data are not sorted, but will be
> rather inefficient (and show a PerformanceWarning)."
>
> 2. "Furthermore if you try to index something that is not fully
> lexsorted, this can raise [...] UnsortedIndexError: 'Key length (2)
> was greater than MultiIndex lexsort depth (1)'"
>
> Both statements appear imprecise and logically inconsistent.

Well, the second refers to indexing by slices, but sure, this could be
more precise, and I guess you can file a bug.


> More
> importantly, it seems that our soon-to-be-buried .ix[], while maybe
> not so performant, seamlessly allows label-based selections in
> unsorted DataFrames. So with the deprecation of .ix, this in itself
> very useful feature seems to get lost. There are indeed compelling
> reasons for leaving a DataFrame non-index-sorted.
>
> 1. Am I getting anything wrong?
> 2. If not, is there a workaround other than super-inefficiently
> creating a new index-sorted DataFrame with the original index cols
> as ordinary cols and using .isin?
>


Could you provide a reproducible example in which you need .ix? In
particular, I'm trying to understand whether you need:
- simple indexing (single label)
- indexing by slicing on objects which Python (3) knows how to sort
- indexing by slicing on objects which Python (3) doesn't know how to
sort
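
A rough sketch of the first two cases on a made-up, deliberately unsorted MultiIndex. The frame and labels are hypothetical, and the behaviour of the unsorted slice is what I would expect from pandas rather than something confirmed in this thread, so the code only records whether it fails. The third case (labels Python 3 cannot order) is not covered:

```python
import pandas as pd

# Toy frame with a deliberately unsorted two-level index (made-up labels).
index = pd.MultiIndex.from_tuples(
    [("UNIT", "PC_ACT"), ("AGE", "TOTAL"), ("SEX", "F"),
     ("AGE", "Y_LT25"), ("UNIT", "THS_PER")]
)
df = pd.DataFrame({"value": range(5)}, index=index)

# Case 1: a single label on the first level; no sorting required.
age_rows = df.loc["AGE"]

# Case 2: a label slice on the unsorted index. This is expected to raise
# UnsortedIndexError (a KeyError subclass), so just record the outcome.
try:
    df.loc["AGE":"SEX"]
    slice_failed = False
except KeyError:
    slice_failed = True

# Sorting the index first makes the same slice well defined.
sorted_slice = df.sort_index().loc["AGE":"SEX"]

print(age_rows)
print("slice on unsorted index failed:", slice_failed)
print(sorted_slice)
```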


By the way: in general I do feel there is room to have .loc raise
more warnings and fewer errors with unsorted indexes; I'm just trying
to better understand your point.

Pietro

Dr. Leo

Jun 19, 2017, 6:03:36 PM
to pyd...@googlegroups.com
Thanks. Here's a code example from the pandaSDMX docs:


In [1]: from pandasdmx import *

In [2]: estat = Request('estat')

In [3]: dsd_resp = estat.datastructure('DSD_une_rt_a')

In [4]: df = dsd_resp.write().codelist

In [5]: df.ix[['AGE', 'UNIT']]
C:\Users\stefan\Anaconda3\envs\py35\Scripts\ipython-script.py:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix
if __name__ == '__main__':
Out[5]:
              dim_or_attr                             name
AGE  AGE                D                              AGE
     TOTAL              D                            Total
     Y25-74             D              From 25 to 74 years
     Y_LT25             D               Less than 25 years
UNIT UNIT               D                             UNIT
     PC_ACT             D  Percentage of active population
     PC_POP             D   Percentage of total population
     THS_PER            D                 Thousand persons

In [6]: df.loc[['AGE', 'UNIT']]
---------------------------------------------------------------------------
UnsortedIndexError Traceback (most recent call last)
<ipython-input-6-7994d8369b48> in <module>()
----> 1 df.loc[['AGE', 'UNIT']]

C:\Users\stefan\Anaconda3\envs\py35\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
1326 else:
1327 key = com._apply_if_callable(key, self.obj)
-> 1328 return self._getitem_axis(key, axis=0)
1329
1330 def _is_scalar_access(self, key):

C:\Users\stefan\Anaconda3\envs\py35\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
1543 # nested tuple slicing
1544 if is_nested_tuple(key, labels):
-> 1545 locs = labels.get_locs(key)
1546 indexer = [slice(None)] * self.ndim
1547 indexer[axis] = locs

C:\Users\stefan\Anaconda3\envs\py35\lib\site-packages\pandas\core\indexes\multi.py in get_locs(self, tup)
2267 'to be fully lexsorted tuple len ({0}), '
2268 'lexsort depth ({1})'
-> 2269 .format(len(tup), self.lexsort_depth))
2270
2271 # indexer

UnsortedIndexError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (1), lexsort depth (0)'


I'd like to see .loc do the job as .ix did, maybe with a performance
warning. Ideally the latter should be suppressible by setting an option
rather than wrapping everything in a context manager from the warnings
stdlib module.
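
To make the context-manager route concrete, a hedged sketch. It assumes a pandas release in which .loc accepts a list of first-level labels on an unsorted MultiIndex (the traceback above shows this failing in the 2017 release used here), and the toy frame is made up. pandas.errors.PerformanceWarning is the warning class being filtered:

```python
import warnings

import pandas as pd
from pandas.errors import PerformanceWarning

# Toy frame with a deliberately unsorted two-level index (made-up labels).
index = pd.MultiIndex.from_tuples(
    [("UNIT", "PC_ACT"), ("AGE", "TOTAL"), ("SEX", "F"),
     ("AGE", "Y_LT25"), ("UNIT", "THS_PER")]
)
df = pd.DataFrame({"value": range(5)}, index=index)

# Silence any PerformanceWarning pandas may emit inside this block only.
# Assumption: the installed pandas supports list selection here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=PerformanceWarning)
    picked = df.loc[["AGE", "UNIT"]]

print(picked)
```

A process-wide warnings.filterwarnings("ignore", category=PerformanceWarning) call would avoid the wrapping, but it is still not the pandas option asked for above.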

Leo

Pietro Battiston

Jun 19, 2017, 6:47:00 PM
to pyd...@googlegroups.com
On Tue, 20/06/2017 at 00:03 +0200, 'Dr. Leo' via PyData wrote:
Indeed, that's one case I think could and should be fixed: and anyway,
I think the error message is wrong (no slicing is taking place).

If you want to open a bug, I should be able to provide a simple fix in
the next few days.

Pietro

Dr. Leo

Jun 20, 2017, 4:35:29 AM
to pyd...@googlegroups.com