Efficient concatenation of pandas data frames


Chris Rodgers

Dec 31, 2011, 8:44:30 PM
to pystatsmodels
Hi,
I'm trying to load data frames from many different data files, each
with the same column structure, and then concatenate them into one
big DataFrame. I can't find the best way to do this in the
documentation.

The most obvious way is to append them one by one, but this takes too
long for the amount of data that I have. I've also tried:

big_dataframe = pandas.DataFrame(np.concatenate([df.values for df in df_list]),
                                 columns=df_list[0].columns)

And this seems to work and be much faster. Is this a general-purpose
solution? Are there other attributes that need to be set?

Actually, what I would ideally like to do is associate a unique key
with each individual dataframe, and then combine them such that each
row carries its key. I feel like there must be a built-in way to do
this but I can't figure it out.

Thanks!
Chris


--
Graduate Student
Helen Wills Neuroscience Institute
University of California - Berkeley

Chris Rodgers

Jan 2, 2012, 5:52:00 PM
to pystatsmodels
One problem with the solution that I posted using numpy.concatenate is
that heterogeneous dtype information is lost in the conversion to
ndarray. I end up with all columns of type "object" even if they
should be int.
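A minimal sketch of the problem, using made-up frames with an int and a string column (the intermediate ndarray has to pick a single common dtype, so everything comes back as object):

```python
import numpy as np
import pandas as pd

# Made-up frames with mixed dtypes: an int column and a string column.
df1 = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
df2 = pd.DataFrame({"a": [3, 4], "b": ["z", "w"]})

# Round-tripping through a single ndarray forces one common dtype on
# every column, so the int column comes back as object.
values = np.concatenate([df1.values, df2.values])
lossy = pd.DataFrame(values, columns=df1.columns)

print(lossy.dtypes["a"])                  # object
print(pd.concat([df1, df2]).dtypes["a"])  # int64
```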

So I rewrote it to reindex the first data frame to the size of all of
the combined data, and then assign each data frame one at a time. I
couldn't figure out how to assign an entire mixed-type data frame at
once ("setting on mixed-type frames only allowed with scalar values")
so I assign column-by-column within each frame.

Does this look right?

def concat2(df_list):
    # Resize the first data frame to the new size
    n_rows = sum([len(df) for df in df_list])
    res = df_list[0].reindex(index=pandas.Index(range(n_rows)))

    # Assign frame by frame
    start = 0
    for df in df_list:
        # Assign column by column in case of mixed types
        for col in df.columns:
            res[col][start:start + len(df)] = df[col]
        start += len(df)

    return res
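For reference, on later pandas releases (where `DataFrame.append` was eventually removed) this whole helper collapses to a single `pandas.concat` call; a hedged sketch with made-up frames:

```python
import pandas as pd

# Made-up stand-ins for the frames loaded from each data file.
df_list = [pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}),
           pd.DataFrame({"a": [3, 4], "b": ["z", "w"]})]

# ignore_index=True discards each frame's own index and builds a fresh
# 0..n-1 index on the result, which is what concat2 does by hand.
combined = pd.concat(df_list, ignore_index=True)
print(combined)
#    a  b
# 0  1  x
# 1  2  y
# 2  3  z
# 3  4  w
```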

Wouter Overmeire

Jan 3, 2012, 4:12:16 AM
to pystat...@googlegroups.com
What your code does is append dataframes while ignoring the index; pandas.DataFrame has a built-in solution for this -- see http://pandas.sourceforge.net/merging.html#appending-record-array-like-dataframes.
If you have a faster method for appending dataframes than DataFrame.append(df, ignore_index=True), I guess Wes would be interested; you could post it as an issue on GitHub.

Concerning "associate a unique key to each individual dataframe, and then combine them such that each row has this key": this thread shows an example. Note that the unique key is added to the index; if you want the key to be in a column, you can unstack the unique-key level from the index.
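A short sketch of that key-per-frame recipe, with hypothetical per-file frames: concat accepts a dict and uses its keys as an outer index level, and reset_index moves that level into a column:

```python
import pandas as pd

# Hypothetical per-file frames, keyed by the file they came from.
frames = {"file_a": pd.DataFrame({"x": [1, 2]}),
          "file_b": pd.DataFrame({"x": [3, 4]})}

# The dict keys become the outer level of a MultiIndex on the rows...
stacked = pd.concat(frames)

# ...and reset_index moves that level into an ordinary column (the
# unnamed level materializes as "level_0", renamed here to "source").
as_column = stacked.reset_index(level=0).rename(columns={"level_0": "source"})
print(as_column["source"].tolist())  # ['file_a', 'file_a', 'file_b', 'file_b']
```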

Wes McKinney

Jan 3, 2012, 11:27:05 PM
to pystat...@googlegroups.com

Stay tuned:

https://github.com/wesm/pandas/issues/41
https://github.com/wesm/pandas/issues/115
https://github.com/wesm/pandas/issues/218
https://github.com/wesm/pandas/issues/273
https://github.com/wesm/pandas/issues/479

The main reason I haven't done this yet is that there are thorny
corner cases that are unpleasant to deal with in the implementation.

But anyway, I've already got quite a bit of functioning concatenation
code done inside GroupBy, so I'm going to see if I can make it
user-friendly, fast, and ready for the imminent 0.7.0 release. Having
a working API function that isn't all that fast would be preferable to
no function at all, I guess -- I'd rather have you complaining about
it being slow than not being able to do it at all :)

- Wes

Chris Rodgers

Jan 6, 2012, 8:49:15 PM
to pystat...@googlegroups.com
Wouter: thanks for the code snippets! I think this concatenation with
a MultiIndex is a neat trick.

Wes: I had found your GH #479 when I googled this problem; that's how
I knew it was in the works. Looks like it has just been released --
great! So the new syntax is:

big_df = df_list[0].join(df_list[1:])

correct? I got it from GH #115. But I will have to re-install 0.7 from
source in order to try it myself.

Thanks!!
Chris

Wes McKinney

Jan 7, 2012, 11:55:05 AM
to pystat...@googlegroups.com

hi Chris,

yes, I finally did a proper job of implementing multi-joins and
multi-appends. There is now a single API function, concat, that does
all the hard labor. So you can do:

df_list[0].append(df_list[1:])

or simply

concat(df_list, axis=0)

With concat you have more control -- namely, you can choose how the
other axes should be handled (i.e. use the columns from the first
object, or union/intersect them).
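For instance, in the API as released, that other-axis handling is exposed through the join argument; a sketch with made-up, partially overlapping frames:

```python
import pandas as pd

# Made-up frames with partially overlapping columns.
left = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
right = pd.DataFrame({"b": [5, 6], "c": [7, 8]})

# join='outer' (the default) unions the other axis, filling holes with
# NaN; join='inner' keeps only the shared columns.
outer = pd.concat([left, right], join="outer", ignore_index=True)
inner = pd.concat([left, right], join="inner", ignore_index=True)

print(list(outer.columns))  # ['a', 'b', 'c']
print(list(inner.columns))  # ['b']
```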

The last thing I'm going to do is allow you to pass a list of keys for
the groups, to form a hierarchical index along the concatenation axis.
For example, if you have:

>>> df1
   a  b
0  1  2
1  3  4
2  5  6

>>> df2
    a   b
0   7   8
1   9  10
2  11  12

you might want:

>>> concat([df1, df2], axis=1, group_keys=['one', 'two'])
  one    two
    a  b    a   b
0   1  2    7   8
1   3  4    9  10
2   5  6   11  12

i.e. a hierarchical index along the concatenation index.
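In the API as it eventually shipped, this argument landed as keys= rather than group_keys=, and names= labels the resulting index levels; a sketch with the same made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 3, 5], "b": [2, 4, 6]})
df2 = pd.DataFrame({"a": [7, 9, 11], "b": [8, 10, 12]})

# keys= adds an outer column level; names= labels the two levels.
wide = pd.concat([df1, df2], axis=1, keys=["one", "two"],
                 names=["src", None])

print(wide["one"]["a"].tolist())  # [1, 3, 5]
print(wide["two"]["b"].tolist())  # [8, 10, 12]
```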

This will be part of the upcoming 0.7.0 release -- stay tuned. Of
course, everything but that last bit (which is already implemented for
groupby -- actually somewhat nontrivial -- but I need to expose it
with a reasonable API) is available in git master.

- W

Wes McKinney

Jan 7, 2012, 5:20:56 PM
to pystat...@googlegroups.com

To your question about associating a key with each DataFrame, this is
hot off the presses:

In [2]: df
Out[2]:
        0        1       2
0  1.6614 -0.71357  0.9032
1  0.9877 -0.43574 -1.8906
2  0.1742 -0.06604  0.5700
3  0.9013 -0.80383 -1.8286
4 -1.8021 -1.20078  0.4313

In [3]: concat([df, df], keys=[0, 1], axis=1)
Out[3]:
        0                          1
        0        1       2         0        1       2
0  1.6614 -0.71357  0.9032    1.6614 -0.71357  0.9032
1  0.9877 -0.43574 -1.8906    0.9877 -0.43574 -1.8906
2  0.1742 -0.06604  0.5700    0.1742 -0.06604  0.5700
3  0.9013 -0.80383 -1.8286    0.9013 -0.80383 -1.8286
4 -1.8021 -1.20078  0.4313   -1.8021 -1.20078  0.4313

In [4]: concat([df, df], keys=[0, 1], axis=0)
Out[4]:
          0        1       2
0 0  1.6614 -0.71357  0.9032
  1  0.9877 -0.43574 -1.8906
  2  0.1742 -0.06604  0.5700
  3  0.9013 -0.80383 -1.8286
  4 -1.8021 -1.20078  0.4313
1 0  1.6614 -0.71357  0.9032
  1  0.9877 -0.43574 -1.8906
  2  0.1742 -0.06604  0.5700
  3  0.9013 -0.80383 -1.8286
  4 -1.8021 -1.20078  0.4313

So it makes it very easy to concatenate DataFrame objects and
simultaneously index them with a MultiIndex based on some keys. For
example:

In [5]: glued = concat([df, df], keys=['foo', 'bar'], axis=1)

In [6]: glued
Out[6]:
      foo                        bar
        0        1       2        0        1       2
0  1.6614 -0.71357  0.9032   1.6614 -0.71357  0.9032
1  0.9877 -0.43574 -1.8906   0.9877 -0.43574 -1.8906
2  0.1742 -0.06604  0.5700   0.1742 -0.06604  0.5700
3  0.9013 -0.80383 -1.8286   0.9013 -0.80383 -1.8286
4 -1.8021 -1.20078  0.4313  -1.8021 -1.20078  0.4313

In [7]: glued['bar']
Out[7]:
        0        1       2
0  1.6614 -0.71357  0.9032
1  0.9877 -0.43574 -1.8906
2  0.1742 -0.06604  0.5700
3  0.9013 -0.80383 -1.8286
4 -1.8021 -1.20078  0.4313

- Wes
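A present-day footnote on the axis=0 case: when the keys run along the rows, a single group can be pulled back out with .xs (or .loc); a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0]})

# keys=[...] along axis=0 builds a row MultiIndex; .xs selects one
# group and drops the key level, recovering the original frame.
stacked = pd.concat([df, df], keys=["foo", "bar"])
foo = stacked.xs("foo")

print(foo["x"].tolist())  # [1.0, 2.0]
```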

Nathaniel Smith

Jan 7, 2012, 8:13:33 PM
to pystat...@googlegroups.com
On Sat, Jan 7, 2012 at 2:20 PM, Wes McKinney <wesm...@gmail.com> wrote:
> In [3]: concat([df, df], keys=[0, 1], axis=1)

Perhaps it'd be nice to support concat({0: df, 1: df}, axis=1) as well
(with no guaranteed order for the resulting concatenation, of course,
but lots of times there's no reason to care).

-- N

Wes McKinney

Jan 9, 2012, 6:21:43 PM
to pystat...@googlegroups.com
I'll bite. That's very consistent with the rest of the way that pandas works:

In [2]: df
Out[2]: 
   0       1        2     
0  0.1898  0.09703  1.4824
1 -0.2658 -0.15954  0.8485
2  1.2489  0.82036  2.2680
3  0.4756 -0.43297  1.4955

In [3]: concat({'foo' : df, 'bar' : df, 'baz' : df})
Out[3]: 
       0       1        2     
bar 0  0.1898  0.09703  1.4824
    1 -0.2658 -0.15954  0.8485
    2  1.2489  0.82036  2.2680
    3  0.4756 -0.43297  1.4955
baz 0  0.1898  0.09703  1.4824
    1 -0.2658 -0.15954  0.8485
    2  1.2489  0.82036  2.2680
    3  0.4756 -0.43297  1.4955
foo 0  0.1898  0.09703  1.4824
    1 -0.2658 -0.15954  0.8485
    2  1.2489  0.82036  2.2680
    3  0.4756 -0.43297  1.4955

In [4]: concat({'foo' : df, 'bar' : df, 'baz' : df}, axis=1)
Out[4]: 
   bar                      baz                      foo                    
   0       1        2       0       1        2       0       1        2     
0  0.1898  0.09703  1.4824  0.1898  0.09703  1.4824  0.1898  0.09703  1.4824
1 -0.2658 -0.15954  0.8485 -0.2658 -0.15954  0.8485 -0.2658 -0.15954  0.8485
2  1.2489  0.82036  2.2680  1.2489  0.82036  2.2680  1.2489  0.82036  2.2680
3  0.4756 -0.43297  1.4955  0.4756 -0.43297  1.4955  0.4756 -0.43297  1.4955

You can even "select out" a set of objects from a dict (this is very pandas-ic):

In [5]: concat({'foo' : df, 'bar' : df, 'baz' : df}, keys=['foo', 'bar'])
Out[5]: 
       0       1        2     
foo 0  0.1898  0.09703  1.4824
    1 -0.2658 -0.15954  0.8485
    2  1.2489  0.82036  2.2680
    3  0.4756 -0.43297  1.4955
bar 0  0.1898  0.09703  1.4824
    1 -0.2658 -0.15954  0.8485
    2  1.2489  0.82036  2.2680
    3  0.4756 -0.43297  1.4955


- W