[pandas] MultiIndex levels always stored as dtype('object')

Wouter Overmeire

Dec 1, 2011, 4:34:19 AM

to pystat...@googlegroups.com

MultiIndex always seems to store its level data as dtype('object').
When using DataFrame.delevel(), the columns added from the index also have dtype('object').
This makes it impossible to use df.delevel().corr() to look at the correlation between the original DataFrame columns and the index level values. Does anyone have an idea for working around this?

See example below:

In [1]: import pandas

In [2]: import numpy as np

In [3]: import itertools

In [4]: tuples = list(itertools.product(['foo', 'bar'], [10, 20], [1.0, 1.1]))

In [5]: index = pandas.MultiIndex.from_tuples(tuples, names=['prm0', 'prm1', 'prm2'])

In [6]: df = pandas.DataFrame(np.random.randn(8,3), columns=['A', 'B', 'C'], index=index)

In [7]: df
Out[7]:
                 A        B        C
prm0 prm1 prm2
foo  10   1.0    0.2074   0.3425  -1.295
          1.1    0.3194   0.8114   2.133
foo  20   1.0   -0.1798  -1.162    0.5774
          1.1   -0.4635   1.436    1.419
bar  10   1.0   -1.013    0.7605  -1.184
          1.1   -0.4716   0.6983   0.5209
bar  20   1.0   -0.87    -0.3788   0.272
          1.1    1.018   -0.4496   1.132

In [8]: df.corr()
Out[8]:
   A        B        C
A  1       -0.2445   0.3852
B -0.2445   1        0.08211
C  0.3852   0.08211  1

In [9]: df.delevel().corr()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
   2535         cols = self.columns
   2536         mat = self.as_matrix(cols).T
-> 2537         baseCov = np.cov(mat)
   2538
   2539         sigma = np.sqrt(np.diag(baseCov))

.../python2.7/site-packages/numpy/lib/function_base.pyc in cov(m, y, rowvar, bias, ddof)
   1920         raise ValueError("ddof must be integer")
   1921
-> 1922     X = array(m, ndmin=2, dtype=float)
   1923     if X.shape[0] == 1:
   1924         rowvar = 1

ValueError: setting an array element with a sequence.

My guess is that this exception occurs because corr() cannot work with strings.
So let's try it without the string column.

In [10]: df.delevel()[['prm1', 'prm2', 'A', 'B', 'C']]
Out[10]:
   prm1  prm2   A        B        C
0  10    1      0.2074   0.3425  -1.295
1  10    1.1    0.3194   0.8114   2.133
2  20    1     -0.1798  -1.162    0.5774
3  20    1.1   -0.4635   1.436    1.419
4  10    1     -1.013    0.7605  -1.184
5  10    1.1   -0.4716   0.6983   0.5209
6  20    1     -0.87    -0.3788   0.272
7  20    1.1    1.018   -0.4496   1.132

In [11]: df.delevel()[['prm1', 'prm2', 'A', 'B', 'C']].corr()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[...]
TypeError: function not supported for these types, and can't coerce safely to supported types

In [12]: df.delevel()['prm1'].values.dtype
Out[12]: dtype('object')

In [13]: df.delevel()['prm1']
Out[13]:
0    10
1    10
2    20
3    20
4    10
5    10
6    20
7    20
Name: prm1

In [14]: index.levels
Out[14]:
[Index([bar, foo], dtype=object),
Index([10, 20], dtype=object),
Index([1.0, 1.1], dtype=object)]

In [15]:
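
For readers on a current pandas release, here is a minimal sketch of one possible workaround for the question above. It is not taken from this thread and makes two assumptions: that reset_index() plays the role delevel() has in the session, and that the level columns come back as object dtype (forced by hand here to mimic the report). pd.to_numeric then converts the numeric level columns before corr() is called.

```python
import itertools

import numpy as np
import pandas as pd

# Same index layout as in the session above.
tuples = list(itertools.product(["foo", "bar"], [10, 20], [1.0, 1.1]))
index = pd.MultiIndex.from_tuples(tuples, names=["prm0", "prm1", "prm2"])
df = pd.DataFrame(np.random.randn(8, 3), columns=["A", "B", "C"], index=index)

# Assumption: reset_index() stands in for delevel() here.
flat = df.reset_index()

numeric_levels = ["prm1", "prm2"]

# Assumption: force object dtype to mimic the behaviour reported above.
for name in numeric_levels:
    flat[name] = flat[name].astype(object)

# Workaround: convert the numeric level columns explicitly.
for name in numeric_levels:
    flat[name] = pd.to_numeric(flat[name])

# Leave out the string level (prm0), then correlate as usual.
corr = flat[numeric_levels + ["A", "B", "C"]].corr()
print(corr)
```

The string level prm0 is left out on purpose, since a correlation with 'foo' and 'bar' is not defined without first encoding them as numbers.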

Wes McKinney

Dec 2, 2011, 5:16:47 PM

to pystat...@googlegroups.com

Created an enhancement issue here:

https://github.com/wesm/pandas/issues/440

I have a few tricks up my sleeve

Wes McKinney

Dec 2, 2011, 5:17:31 PM

to pystat...@googlegroups.com

BTW what do you think of the function name delevel? I'm not in love
with it. Maybe deindex?

Wouter Overmeire

Dec 3, 2011, 4:17:36 PM

to pystat...@googlegroups.com

On Friday, December 2, 2011 at 23:17:31 UTC+1, Wes McKinney wrote:

Pulled in the latest dev version and this now works fine (nice trick). Thanks, Wes.

Concerning the method name: DataFrame.delevel() is maybe too closely tied to MultiIndex.
DataFrame.deindex() is more general. Both are fine with me, and I have no problem with the name changing.

One small remark: for a MultiIndex, default column names are inserted when needed, whereas for a plain (non-MultiIndex) index without a name, an exception is raised. Would it not be cleaner to treat both cases the same way: either raise an exception when no names are set, or insert default column names?
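
As an illustration of the second option (default names in both cases), here is a small sketch against a current pandas release. It assumes reset_index() is the present-day counterpart of delevel(); the frames and the variable names are made up for the example and do not come from this thread.

```python
import numpy as np
import pandas as pd

data = np.arange(8).reshape(4, 2)

# Unnamed single-level index: the inserted column is called "index".
single = pd.DataFrame(data, columns=["A", "B"], index=[10, 20, 30, 40])
single_cols = list(single.reset_index().columns)

# Unnamed two-level index: the inserted columns are "level_0" and "level_1".
multi_index = pd.MultiIndex.from_tuples(
    [("foo", 1), ("foo", 2), ("bar", 1), ("bar", 2)]
)
multi = pd.DataFrame(data, columns=["A", "B"], index=multi_index)
multi_cols = list(multi.reset_index().columns)

print(single_cols)
print(multi_cols)
```

In this sketch neither call raises; both fall back to default column names.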
