[pandas] MultiIndex levels always stored as dtype('object')


Wouter Overmeire

Dec 1, 2011, 4:34:19 AM
to pystat...@googlegroups.com
MultiIndex seems to always store its level data as dtype('object').
When DataFrame.delevel() is used, the columns added from the index also have dtype('object').
This makes it impossible to use DataFrame.delevel().corr() to look at the correlation between the original DataFrame columns and the index level values. Does anyone have an idea for a workaround?

See example below:

In [1]: import pandas


In [2]: import numpy as np

In [3]: import itertools

In [4]: tuples = list(itertools.product(['foo', 'bar'], [10, 20], [1.0, 1.1]))

In [5]: index = pandas.MultiIndex.from_tuples(tuples, names=['prm0', 'prm1', 'prm2'])

In [6]: df = pandas.DataFrame(np.random.randn(8,3), columns=['A', 'B', 'C'], index=index)

In [7]: df
Out[7]:
                A       B       C
prm0 prm1 prm2
foo  10   1.0   0.2074  0.3425 -1.295
          1.1   0.3194  0.8114  2.133
foo  20   1.0  -0.1798 -1.162   0.5774
          1.1  -0.4635  1.436   1.419
bar  10   1.0  -1.013   0.7605 -1.184
          1.1  -0.4716  0.6983  0.5209
bar  20   1.0  -0.87   -0.3788  0.272
          1.1   1.018  -0.4496  1.132

In [8]: df.corr()
Out[8]:
   A       B        C
A  1      -0.2445   0.3852
B -0.2445  1        0.08211
C  0.3852  0.08211  1

In [9]: df.delevel().corr()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
   2535         cols = self.columns
   2536         mat = self.as_matrix(cols).T
-> 2537         baseCov = np.cov(mat)
   2538
   2539         sigma = np.sqrt(np.diag(baseCov))

.../python2.7/site-packages/numpy/lib/function_base.pyc in cov(m, y, rowvar, bias, ddof)
   1920         raise ValueError("ddof must be integer")
   1921
-> 1922     X = array(m, ndmin=2, dtype=float)
   1923     if X.shape[0] == 1:
   1924         rowvar = 1

ValueError: setting an array element with a sequence.

My guess is that this exception is caused by corr() not being able to handle strings.
So let's try it without the strings.


In [10]: df.delevel()[['prm1', 'prm2', 'A', 'B', 'C']]
Out[10]:
   prm1  prm2  A       B       C
0  10    1     0.2074  0.3425 -1.295
1  10    1.1   0.3194  0.8114  2.133
2  20    1    -0.1798 -1.162   0.5774
3  20    1.1  -0.4635  1.436   1.419
4  10    1    -1.013   0.7605 -1.184
5  10    1.1  -0.4716  0.6983  0.5209
6  20    1    -0.87   -0.3788  0.272
7  20    1.1   1.018  -0.4496  1.132

In [11]: df.delevel()[['prm1', 'prm2', 'A', 'B', 'C']].corr()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[...]
TypeError: function not supported for these types, and can't coerce safely to supported types

In [12]: df.delevel()['prm1'].values.dtype
Out[12]: dtype('object')

In [13]: df.delevel()['prm1']
Out[13]:
0    10
1    10
2    20
3    20
4    10
5    10
6    20
7    20
Name: prm1

In [14]: index.levels
Out[14]:
[Index([bar, foo], dtype=object),
 Index([10, 20], dtype=object),
 Index([1.0, 1.1], dtype=object)]

In [15]:
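
For readers on a later pandas release, here is a minimal sketch of one possible workaround: cast the numeric level columns to float explicitly before calling corr(). It assumes reset_index(), the name under which delevel() is available in later releases, and it is not necessarily the fix that went into pandas itself. The column names are the ones from the session above.

```python
import itertools

import numpy as np
import pandas as pd

# Same setup as in the session above.
tuples = list(itertools.product(['foo', 'bar'], [10, 20], [1.0, 1.1]))
index = pd.MultiIndex.from_tuples(tuples, names=['prm0', 'prm1', 'prm2'])
df = pd.DataFrame(np.random.randn(8, 3), columns=['A', 'B', 'C'], index=index)

# reset_index() moves the index levels into ordinary columns,
# like delevel() in the session above.
flat = df.reset_index()

# Cast the numeric level columns to float, whatever dtype they arrived with.
for col in ['prm1', 'prm2']:
    flat[col] = flat[col].astype(float)

# Leave out the string column prm0 and correlate the rest.
corr = flat[['prm1', 'prm2', 'A', 'B', 'C']].corr()
print(corr)
```

The explicit cast is harmless when the columns are already numeric, so the same snippet should work whether or not the levels come back as dtype('object').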

Wes McKinney

Dec 2, 2011, 5:16:47 PM
to pystat...@googlegroups.com

Created an enhancement issue here:

https://github.com/wesm/pandas/issues/440

I have a few tricks up my sleeve

Wes McKinney

Dec 2, 2011, 5:17:31 PM
to pystat...@googlegroups.com

BTW what do you think of the function name delevel? I'm not in love
with it. Maybe deindex?

Wouter Overmeire

Dec 3, 2011, 4:17:36 PM
to pystat...@googlegroups.com


On Friday, December 2, 2011 at 23:17:31 UTC+1, Wes McKinney wrote:

Pulled in the latest dev version and this now works fine. Nice trick, thanks Wes.

Concerning the method name: DataFrame.delevel() is maybe too closely tied to a MultiIndex.
DataFrame.deindex() is more general. Both are fine with me, and I have no problem with changing it either.

One small remark: for a MultiIndex, default column names are inserted when needed, whereas for a regular Index without a name an exception is raised. Would it not be cleaner to do the same for both: either raise an exception if no names are set, or insert default column names?
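
To make the comparison concrete, here is a small sketch that flattens an unnamed regular index and an unnamed MultiIndex and collects the resulting column names. It assumes a later pandas release where the method is called reset_index(), so it shows that later behavior rather than the dev version discussed in this thread. The frames `single` and `multi` are made up for illustration.

```python
import pandas as pd

# A frame with an unnamed regular index.
single = pd.DataFrame({'A': [1.0, 2.0]}, index=[10, 20])

# A frame with an unnamed two-level MultiIndex.
multi_index = pd.MultiIndex.from_tuples([('foo', 10), ('bar', 20)])
multi = pd.DataFrame({'A': [1.0, 2.0]}, index=multi_index)

# Move the index into columns in both cases and compare the names used.
single_cols = list(single.reset_index().columns)
multi_cols = list(multi.reset_index().columns)

print(single_cols)
print(multi_cols)
```

In later releases both cases get default names ('index' for the regular index, 'level_0' and 'level_1' for the MultiIndex) instead of one of them raising.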

 