Upcoming Index repr changes

Joris Van den Bossche

Apr 17, 2015, 6:07:44 AM

to pyd...@googlegroups.com, panda...@python.org

Hi all,

We have a PR pending to unify the string representation of the different Index objects: https://github.com/pydata/pandas/pull/9901

The most important changes are:

- We propose to reduce the default number of values shown from 100 to 10 (controllable via the option pd.options.display.max_seq_items).
- The datetime-like indices (DatetimeIndex, TimedeltaIndex, PeriodIndex) have always been somewhat different; they get a new repr that is more consistent with that of other Index types such as Int64Index. This is the biggest change.

So for e.g. Int64Index not much changes (only 'name' is now also shown, and the number of shown values has changed), but for DatetimeIndex the change is larger.
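As a rough illustration of the option mentioned above, the following sketch toggles display.max_seq_items temporarily with pd.option_context and compares the resulting reprs. The exact repr text depends on the pandas version, so the snippet only checks for the ellipsis and the length attribute. A list is passed rather than a bare range because recent pandas versions would otherwise build a RangeIndex, whose repr does not list its values.

```python
import pandas as pd

# Build an integer index with more values than we want to display.
idx = pd.Index(list(range(104)), name="foo")

# Temporarily lower the limit: the repr should be truncated.
with pd.option_context("display.max_seq_items", 10):
    short = repr(idx)

# Temporarily raise the limit above len(idx): all values are shown.
with pd.option_context("display.max_seq_items", 200):
    full = repr(idx)

print(short)
print(full)
```

The same option can be set globally with pd.set_option("display.max_seq_items", 10).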

But we would like to get some feedback on this!

Do you like the changes? For DatetimeIndex? For the number of shown values?

Would you want different behaviour for repr() and str()?
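To make the repr()/str() question concrete, here is a minimal sketch of a class that defines the two differently. ToyIndex is a hypothetical stand-in, not a pandas class, and the two formats are only one possible choice.

```python
class ToyIndex:
    """Hypothetical stand-in for an index; not part of pandas."""

    def __init__(self, values, name=None):
        self.values = list(values)
        self.name = name

    def __repr__(self):
        # Constructor-like form, shown at the interactive prompt.
        return "ToyIndex(%r, name=%r)" % (self.values, self.name)

    def __str__(self):
        # Shorter, human-oriented form, used by print().
        return "%s: %s" % (self.name, ", ".join(map(str, self.values)))


idx = ToyIndex(range(3), name="foo")
print(repr(idx))
print(str(idx))
```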

Some examples of the changes with the current state of the PR are shown below:

Previous Behavior

In [1]: pd.get_option('max_seq_items')
Out[1]: 100

In [2]: pd.Index(range(4), name='foo')
Out[2]: Int64Index([0, 1, 2, 3], dtype='int64')

In [3]: pd.Index(range(104), name='foo')
Out[3]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')

In [4]: pd.date_range('20130101', periods=4, name='foo', tz='US/Eastern')
Out[4]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-01-04 00:00:00-05:00]
Length: 4, Freq: D, Timezone: US/Eastern

In [5]: pd.date_range('20130101', periods=104, name='foo', tz='US/Eastern')
Out[5]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-04-14 00:00:00-04:00]
Length: 104, Freq: D, Timezone: US/Eastern

New Behavior

In [1]: pd.get_option('max_seq_items')
Out[1]: 10

In [9]: pd.Index(range(4), name='foo')
Out[9]: Int64Index([0, 1, 2, 3], dtype='int64', name=u'foo')

In [10]: pd.Index(range(104), name='foo')
Out[10]: Int64Index([0, 1, ..., 102, 103], dtype='int64', name=u'foo', length=104)

In [11]: pd.date_range('20130101', periods=4, name='foo', tz='US/Eastern')
Out[11]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00', '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00'], dtype='datetime64[ns]', name=u'foo', freq='D', tz='US/Eastern')

In [12]: pd.date_range('20130101', periods=104, name='foo', tz='US/Eastern')
Out[12]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00', ..., '2013-04-13 00:00:00-04:00', '2013-04-14 00:00:00-04:00'], dtype='datetime64[ns]', name=u'foo', length=104, freq='D', tz='US/Eastern')
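The truncation seen in the examples above (a few leading values, an ellipsis, a few trailing values, plus a length attribute once values are hidden) can be sketched in plain Python. This is only an illustration of the idea, not the code in the PR; sketch_index_repr, its parameters and the "SketchIndex" label are made up for this example.

```python
def sketch_index_repr(values, name=None, max_items=10, head=2, tail=2):
    """Build a one-line, possibly truncated repr for a sequence of values."""
    values = list(values)
    truncated = len(values) > max_items
    if truncated:
        # Keep only the first and last few values around an ellipsis.
        shown = ([repr(v) for v in values[:head]]
                 + ["..."]
                 + [repr(v) for v in values[-tail:]])
    else:
        shown = [repr(v) for v in values]

    attrs = []
    if name is not None:
        attrs.append("name=%r" % name)
    if truncated:
        # Only report the length when some values are hidden.
        attrs.append("length=%d" % len(values))

    return "SketchIndex([%s]%s)" % (
        ", ".join(shown),
        "".join(", " + a for a in attrs),
    )


print(sketch_index_repr(range(4), name="foo"))
print(sketch_index_repr(range(104), name="foo"))
```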

Lorenzo De Leo

Apr 20, 2015, 5:13:26 PM

to pyd...@googlegroups.com, panda...@python.org

I like the changes you propose, the new version is much more readable. I used to be wary of calling df.index because it can be slow and the output is a bit messy, and I'm usually too lazy to select just a slice of it, so having something like this done by default is a welcome change.

Just one question: does it also apply to MultiIndexes?

Cheers!

John E

Apr 20, 2015, 8:37:01 PM

to pyd...@googlegroups.com, panda...@python.org

This is probably not the sort of comment you're looking for, but I'd like to see more of a table-style output. I can just put a '.values' at the end to get the more numpy-like output (which is easier to read IMO), but it won't stop at 10 or 100 unless I tell it to. Nevertheless, I think it's much easier to read this:

pd.date_range('20130101', periods=104, name='foo', tz='US/Eastern').values

Out[442]:
array(['2013-01-01T00:00:00.000000000-0500',
       '2013-01-02T00:00:00.000000000-0500',
       '2013-01-03T00:00:00.000000000-0500',
       '2013-01-04T00:00:00.000000000-0500',
       '2013-01-05T00:00:00.000000000-0500',

than this:

pd.date_range('20130101', periods=104, name='foo', tz='US/Eastern')

Out[443]:
[2013-01-01 00:00:00-05:00, ..., 2013-04-14 00:00:00-04:00]
Length: 104, Freq: D, Timezone: US/Eastern

Jeff

Apr 20, 2015, 8:53:17 PM

to pyd...@googlegroups.com, panda...@python.org

John, you are quoting the current implementation (which is shown first); the new one looks like this:

In [11]: pd.date_range('20130101',periods=4,name='foo',tz='US/Eastern')
Out[11]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00', '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00'], dtype='datetime64[ns]', name=u'foo', freq='D', tz='US/Eastern')

In [12]: pd.date_range('20130101',periods=104,name='foo',tz='US/Eastern')
Out[12]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00', ..., '2013-04-13 00:00:00-04:00', '2013-04-14 00:00:00-04:00'], dtype='datetime64[ns]', name=u'foo', length=104, freq='D', tz='US/Eastern')

Lorenzo, to answer your question, MultiIndexes are unchanged (and CategoricalIndex is new). We *could* make them a single line, but it would be pretty crowded.

Note that MultiIndex and CategoricalIndex have multi-line reprs and do not truncate sequences (e.g. of labels); this is consistent with previous versions (it is easy to change, though).

In [1]: MultiIndex.from_product([list('abcdefg'),range(10)],names=['first','second'])
Out[1]: 
MultiIndex(levels=[[u'a', u'b', u'c', u'd', u'e', u'f', u'g'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]],
           names=[u'first', u'second'])

In [4]: pd.CategoricalIndex(np.random.randint(0,5,size=100),name='foo')
Out[4]: 
CategoricalIndex([3, 0, 0, 3, 1, 3, 0, 4, 2, 3, 0, 4, 0, 1, 2, 0, 4, 1, 4, 2, 3, 1, 0, 4, 4, 3, 0, 3, 0, 1, 2, 3, 3, 1, 1, 0, 0, 4, 4, 1, 1, 3, 1, 1, 4, 4, 3, 0, 0, 0, 4, 4, 0, 1, 3, 1, 2, 0, 3, 1, 2, 2, 2, 1, 1, 4, 1, 0, 4, 3, 3, 0, 0, 0, 4, 4, 1, 4, 2, 2, 1, 4, 0, 0, 0, 4, 3, 0, 4, 0, 0, 0, 3, 3, 1, 2, 2, 3, 4, 1],
                 categories=[0, 1, 2, 3, 4],
                 ordered=False,
                 name=u'foo',
                 dtype='category')

Joris Van den Bossche

Apr 20, 2015, 8:59:36 PM

to pyd...@googlegroups.com, panda...@python.org

I like John's suggestion to have something more like the output of numpy arrays.

For example, the proposed repr:

In [12]: pd.date_range('20130101',periods=104,name='foo',tz='US/Eastern')
Out[12]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00', ..., '2013-04-13 00:00:00-04:00', '2013-04-14 00:00:00-04:00'], dtype='datetime64[ns]', name=u'foo', length=104, freq='D', tz='US/Eastern')

would then be something like this:

In [12]: pd.date_range('20130101',periods=104,name='foo',tz='US/Eastern')
Out[12]:
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00', ...,
               '2013-04-13 00:00:00-04:00', '2013-04-14 00:00:00-04:00'],
              dtype='datetime64[ns]', name=u'foo', length=104, freq='D', tz='US/Eastern')
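For illustration, a numpy-style layout like the one above could be produced by wrapping the items at a fixed width and aligning continuation lines under the first item. The sketch below is plain Python and only an assumption about how such wrapping might work; wrapped_repr and its width parameter are hypothetical and not taken from the PR.

```python
def wrapped_repr(name, items, attrs, width=80):
    """Lay out name([item, ...], attr, ...) over several aligned lines."""
    item_indent = " " * (len(name) + 2)   # under the first item
    attr_indent = " " * (len(name) + 1)   # under the opening bracket
    lines = []
    current = name + "(["
    for i, item in enumerate(items):
        last = i == len(items) - 1
        piece = item + ("]," if last else ",")
        # Start a new line when this piece would not fit any more.
        if len(current) + len(piece) > width and current.strip():
            lines.append(current.rstrip())
            current = item_indent
        current += piece + " "
    lines.append(current.rstrip())
    # Attributes go on their own line; this line is not wrapped here.
    lines.append(attr_indent + ", ".join(attrs) + ")")
    return "\n".join(lines)


items = ["'2013-01-01 00:00:00-05:00'", "'2013-01-02 00:00:00-05:00'", "...",
         "'2013-04-13 00:00:00-04:00'", "'2013-04-14 00:00:00-04:00'"]
attrs = ["dtype='datetime64[ns]'", "name='foo'", "length=104",
         "freq='D'", "tz='US/Eastern'"]
text = wrapped_repr("DatetimeIndex", items, attrs)
print(text)
```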

Joris Van den Bossche

May 21, 2015, 7:31:34 PM

to pyd...@googlegroups.com, panda...@python.org

A follow-up on this discussion: as you may have seen, the changes were released in 0.16.1 (see the whatsnew docs: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#index-representation).
In the end, we followed John's suggestion and went for a more numpy-style output.

There will probably still be some quirks and things to improve; you can report them in this follow-up issue: https://github.com/pydata/pandas/issues/10095

Joris
