Re: [pydata] MultiIndex and pivot_table (or data in index columns more generally)

903 views

Skip to first unread message

Wouter Overmeire

unread,

Oct 15, 2012, 6:02:04 AM10/15/12

to pyd...@googlegroups.com

2012/10/14 Hans-Martin von Gaudecker <hmgau...@gmail.com>

Hi,

I absolutely love Pandas for everyday work -- thanks for a great tool! This mail is a mixture between a specific question and a more general issue of comprehending its ways of working, hope my ramblings are more or less clear...

Specifically, I find myself repeatedly in the situation where I want to get a pivot table by a column that happens to be part of a MultiIndex (say I have a panel with persons that I see over many years; I would store these data with a MultiIndex containing person ID and age; say I want to have a pivot table by age and employment status, which is a "normal" column).

To have a concrete example, take something that is very close to (except for columns A, B, C adjusted so they contain a unique index) the reshaping example from the docs:

import numpy as np
import pandas as pd

index_cols =[np.array(['one'] * 6 + ['two'] * 6 + ['three'] * 6 + ['four'] * 6),
             np.array(['A', 'A', 'B', 'B', 'C', 'C'] * 4),
             np.array(['foo', 'bar'] * 12)]

data_cols = [np.random.randn(24), np.random.randn(24)]

# This is (almost) the example from the docs
df = pd.DataFrame(dict(zip(('A', 'B', 'C', 'D', 'E'),
                           index_cols + data_cols)))

# Now with the first three columns as MultiIndex
index=pd.MultiIndex.from_arrays(index_cols, names=('A', 'B', 'C'))
df2 = pd.DataFrame(dict(zip(('D', 'E'), data_cols)), index=index)

# This works great
df.pivot_table(values='D', rows=['A', 'B'], cols=['C'])

My specific question is whether there is a way to get the same pivot table for df2?

In general, I often struggle with the distinction between columns contained in a MultiIndex and "normal" columns. In SQL and this particular example, I would set a primary key and an "order by" on columns A, B, C, but otherwise there is no difference between these columns and any other columns. I can see why there is a stronger distinction being made in Pandas, but it seems like I can't get my head around how to harness it best. Am I using it in a fundamentally wrong way? Should I just keep duplicate "normal" columns? Or is there another recommended way?

Best,
Hans-Martin

--

You can use reset_index() to move the MultiIndex levels to regular columns.

In [88]: df2.reset_index().pivot_table(values='D', rows=['A', 'B'], cols=['C'])

Out[88]:

C bar foo

A B

four A 0.183273 0.501556

B -1.024678 -0.127654

C -1.494337 -0.521734

one A -1.018497 1.383526

B 0.364483 -0.860629

C -1.416624 -0.036285

three A -1.027290 0.325312

B -0.884827 1.138194

C 0.435088 -0.874509

two A 1.214402 -0.513534

B -0.477618 -1.117796

C -1.525965 0.627701

Alvaro Tejero Cantero

unread,

Oct 19, 2012, 11:38:47 AM10/19/12

to pyd...@googlegroups.com

Hi Hans Martin,

On 14 October 2012 21:39, Hans-Martin von Gaudecker

<hmgau...@gmail.com> wrote:
> In general, I often struggle with the distinction between columns contained
> in a MultiIndex and "normal" columns. In SQL and this particular example, I
> would set a primary key and an "order by" on columns A, B, C, but otherwise
> there is no difference between these columns and any other columns. I can
> see why there is a stronger distinction being made in Pandas, but it seems
> like I can't get my head around how to harness it best. Am I using it in a
> fundamentally wrong way? Should I just keep duplicate "normal" columns? Or
> is there another recommended way?

I totally agree.

It is also a problem for joins. Here is my experience:

Even if your index is named, it has to be singled out with e.g. left_index=True.

In most domains it is possible to keep unique global naming (because
of a reduced number of tables and/or columns) and then the columns
would be matched automatically by name ... if they were not at the
same time used as indexes.

I hadn't written to the list so far because I am not sure of what is
the best solution . Ideally one should be able, as you say, to treat
indexes as regular columns!

My current solution is to keep two copies, one as an index, one as a column

df.set_index(index_that_i_want, inplace=True, drop=False)

The drop option keeps the index amongst the columns. I can do this
because a) I have enough space (but who knows in the future...) and b)
because my columns are only ever 'bulk updated', there is no risk of
column and index getting out of sync (my scenario is more OLAP than
OLTP).

What are thoughts on this issue? Is this something really unavoidable?

Duality of column types (normal vs. index) seems to me the greatest
inconvenience so far in the use of Pandas, although it has been well
solved by drop=False so far.

Thanks for bringing this up!

-á.

Reply all

Reply to author

Forward

0 new messages