[pandas] Create single DataFrame from DataFrames with identical column and index

179 views
Skip to first unread message

Wouter Overmeire

unread,
Dec 22, 2011, 2:18:02 PM12/22/11
to pystat...@googlegroups.com
How to create a single DataFrame from DataFrames with identical column and index?

As an example i created 5 DataFrames with identical column and index and stored them in a list dfs.
I would like to create a single DataFrame containing all the data of the dfs DataFrames.
Had a look at Panel, my idea was to create a Panel and use DataFrame.unstack kind of method to go from Panel to DataFrame, but this does not exist.
Other idea was to add a 'foo' column to each DataFrame and do inverse of delevel() to create a MultiIndex, but that does not exist either.
Also looked for way to add an extra level to an existing index.

Created the single DataFrame by changing the index from each DataFrame in dfs to a MultiIndex after which i can just append them together.
Any other ideas?


In [79]: dfs

Out[79]:
[  baz  A       B
bar
a      0.2922  0.4178
b      0.2954  0.7714,
   baz  A       B
bar
a      0.7575  0.9414
b      0.4672  0.1876,
   baz  A       B
bar
a      0.6654  0.4004
b      0.8833  0.4481,
   baz  A       B
bar
a      0.9607  0.3983
b      0.1145  0.9141,
   baz  A       B
bar
a      0.6447  0.08133
b      0.4516  0.08101]

In [80]: for foo, df in enumerate(dfs):
   ....:     tuples = [(foo, idx) for idx in df.index]
   ....:     index = pandas.MultiIndex.from_tuples(tuples, names=['foo', 'bar'])
   ....:     df.index = index
   ....:

In [81]: dfs[0]
Out[81]:
  baz    A       B
foo bar
0   a    0.2922  0.4178
    b    0.2954  0.7714

In [82]: df = pandas.DataFrame()

In [83]: for df_foo in dfs:
   ....:     df = df.append(df_foo)
   ....:

In [84]: df
Out[84]:
  baz    A       B
foo bar
0   a    0.2922  0.4178
    b    0.2954  0.7714
1   a    0.7575  0.9414
    b    0.4672  0.1876
2   a    0.6654  0.4004
    b    0.8833  0.4481
3   a    0.9607  0.3983
    b    0.1145  0.9141
4   a    0.6447  0.08133
    b    0.4516  0.08101

In [85]:                  

Wes McKinney

unread,
Dec 22, 2011, 2:34:21 PM12/22/11
to pystat...@googlegroups.com

Panel is probably the right way to start:

p = Panel(dict(zip(range(5), dfs)))

In [29]: p.swapaxes(0, 2).swapaxes(2, 1).to_long()
Out[29]:
A B
0 a 0.02896 -0.01355
b -0.03215 0.01948
1 a 0.01272 -0.005954
b -0.01412 0.008555
2 a -0.05341 0.025
b 0.0593 -0.03593
3 a 0.3682 -0.1723
b -0.4087 0.2477
4 a -0.1489 0.0697
b 0.1653 -0.1002

Looks like to_long completely discards the index names. The returned
LongPanel object *is* a DataFrame, but I'm going to get rid of that
subclass altogether soon (probably 0.7.0)


In [33]: DataFrame(p.swapaxes(0, 2).swapaxes(2, 1).to_long())
Out[33]:
A B
0 a 0.02896 -0.01355
b -0.03215 0.01948
1 a 0.01272 -0.005954
b -0.01412 0.008555
2 a -0.05341 0.025
b 0.0593 -0.03593
3 a 0.3682 -0.1723
b -0.4087 0.2477
4 a -0.1489 0.0697
b 0.1653 -0.1002

issue for the to_long names being discarded:

https://github.com/wesm/pandas/issues/525

this part of the API (interplay between Panel and DataFrame) is pretty
malleable if you have some ideas for 3D<->2D manipulations

Wouter Overmeire

unread,
Dec 22, 2011, 3:30:34 PM12/22/11
to pystat...@googlegroups.com


2011/12/22 Wes McKinney <wesm...@gmail.com>

Panel was my first idea to use, but i got stuck rather quickly :-) I am unfamiliar with to_long() and LongPanel,
What you suggested is definitely more pathonic (pandas way of doing).

Concerning 3D<->2D manipulations API. First thing i looked for in Panel API was stack and unstack since these handle 2D<->1D manipulations on DataFrame,.
So i would reuse these method names. Just an idea...

Panel.stack(self, stack_ax=None, columns_ax=None, index_ax=None, dropna=True)
    stack_ax: int, panel ax that will be used to reduce from 3D<->2D, defaults to ...
    columns_ax: int, panel ax that will be used to create columns in resulting DataFrame, defaults to ...
    index_ax: int, panel ax that will be used to create index in resulting DataFrame, defaults to ...

Maybe better to use only two of the three axes, since one can only choose two freely.

Being able to select from which ax to stack or take column and index data avoids the need to swap axes before stacking.

Reply all
Reply to author
Forward
0 new messages