Pandas Data Frame: how to build multi columns?

Michael Hooreman

unread,

Dec 21, 2015, 6:24:13 AM12/21/15

to pyd...@googlegroups.com

Hello,

I'm sorry, I have a simple question, but I'm honnestly a bit stuck in finding a reply in the documentation. I should be a bit blind..

I have a pandas DataFrame which has columns which looks like:

n_0
n_1
p_0
p_1

I want to transform it to have columns and subcolumns:

0

n
p

1

n
p

How can I do?

Thanks a lot.

P.S: The real world is more complex, but the layout is there.

--

Michaël Hooreman
mic...@hooreman.be - http://www.hooreman.be

Francesc Alted

unread,

Dec 21, 2015, 6:35:23 AM12/21/15

to pyd...@googlegroups.com

Hi Michael,

Pandas has powerful reshaping functionality that will help you here:

http://pandas.pydata.org/pandas-docs/stable/reshaping.html

Francesc

--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Francesc Alted

Michael Hooreman

unread,

Dec 21, 2015, 7:46:35 AM12/21/15

to pyd...@googlegroups.com

Hello,

Finally, I found a solution.

You can find the example script below.

    #!/usr/bin/env python3
    import pickle
    import pandas as pd
    import itertools
    import numpy as np

    data = pd.DataFrame(np.random.randn(10, 5), columns=('0_n', '1_n', '0_p', '1_p', 'x'))

    indices = set()
    groups = set()
    others = set()
    for c in data.columns:
        if '_' in c:
            (i, g) = c.split('_')
            c2 = pd.MultiIndex.from_tuples((i, g),)
            indices.add(int(i))
            groups.add(g)
        else:
            others.add(c)
    columns = list(itertools.product(groups, indices))
    columns = pd.MultiIndex.from_tuples(columns)
    ret = pd.DataFrame(columns=columns)
    for c in columns:
        ret[c] = data['%d_%s' % (int(c[1]), c[0])]
    for c in others:
        ret[c] = data['%s' % c]
    ret.rename(columns={'total': 'total_indices'}, inplace=True)

    print("Before:")
    print(data)
    print("")
    print("After:")
    print(ret)

Sorry for this...

2015-12-21 12:23 GMT+01:00 Michael Hooreman <mic...@hooreman.be>:

Joris Van den Bossche

unread,

Dec 22, 2015, 4:44:19 AM12/22/15

to PyData

If you have a recent pandas, you can do this a bit simpler with the following:

data = pd.DataFrame(np.random.randn(10, 5), columns=('0_n', '1_n', '0_p', '1_p', 'x'))

data.columns = data.columns.str.split('_', expand=True)
data.columns = data.columns.swaplevel(1,0)

This gives:

In [11]: data
Out[11]:
n p NaN
0 1 0 1 x
0 0.882079 -0.167304 -0.081573 -0.607486 -0.907488
1 -0.600067 0.285092 -1.592314 -0.049510 -1.050059
2 1.309180 -0.149241 0.428812 -2.046242 -0.141231
...

Only for the `x` columns this gives not fully the desired result. But therefore, it is maybe better to swap the strings first before splitting:

In [14]: data.columns.str[::-1].str.split('_', expand=True)
Out[14]:
MultiIndex(levels=[[u'n', u'p', u'x'], [u'0', u'1']],
           labels=[[0, 0, 1, 1, 2], [0, 1, 0, 1, -1]])

In [15]: data.columns = data.columns.str[::-1].str.split('_', expand=True)

In [16]: data
Out[16]:
          n                   p                   x
          0         1         0         1       NaN
0 0.994069 -1.525370 -0.001837 -0.358824 -0.040629
1 -1.133140 1.088123 -0.112497 -0.926222 0.291892
2 -0.655722 -1.236698 -2.019218 -1.832824 0.532997
...

--

Paul Hobson

unread,

Dec 22, 2015, 11:43:57 AM12/22/15

to pyd...@googlegroups.com

On Tue, Dec 22, 2015 at 1:44 AM, Joris Van den Bossche <jorisvand...@gmail.com> wrote:

If you have a recent pandas, you can do this a bit simpler with the following:

data = pd.DataFrame(np.random.randn(10, 5), columns=('0_n', '1_n', '0_p', '1_p', 'x'))

data.columns = data.columns.str.split('_', expand=True)
data.columns = data.columns.swaplevel(1,0)

This gives:

In [11]: data
Out[11]:
n p NaN
0 1 0 1 x
0 0.882079 -0.167304 -0.081573 -0.607486 -0.907488
1 -0.600067 0.285092 -1.592314 -0.049510 -1.050059
2 1.309180 -0.149241 0.428812 -2.046242 -0.141231
...

Joris, you just a blew couple of minds here in my office.

This is awesome.

-paul

Paul Hobson

unread,

Jun 20, 2018, 6:07:19 PM6/20/18

to PyData

On Tuesday, December 22, 2015 at 1:44:19 AM UTC-8, Joris Van den Bossche wrote:

If you have a recent pandas, you can do this a bit simpler with the following:

data = pd.DataFrame(np.random.randn(10, 5), columns=('0_n', '1_n', '0_p', '1_p', 'x'))

data.columns = data.columns.str.split('_', expand=True)
data.columns = data.columns.swaplevel(1,0)

This gives:

In [11]: data
Out[11]:
n p NaN
0 1 0 1 x
0 0.882079 -0.167304 -0.081573 -0.607486 -0.907488
1 -0.600067 0.285092 -1.592314 -0.049510 -1.050059
2 1.309180 -0.149241 0.428812 -2.046242 -0.141231
...

Sorry to resurrect an old thread. I noticed that in 0.22 you could do this to get a MultiIndex in the columns:

data = (

pd.DataFrame(np.random.randn(10, 5), columns=('0_n', '1_n', '0_p', '1_p', '0_x'))

.rename(columns=dict(zip(_raw.columns, _raw.columns.str.split('_', expand=True))))

)

This was quite convenient as it didn't break a long chain. But with 0.23, it's not longer working.

I'm currently working around this by piping to the following function:

def split_cols(df, split_char, names=None):

new = df.copy()

new.columns = df.columns.str.split(split, expand=True) # broken chain

return new

This works perfectly well, but I'm curious if 1) the behavior of .rename changed intentionally and 2) if there's a new preferred way return a new dataframe with split columns.

Thanks,

-Paul

Tom Augspurger

unread,

Jun 20, 2018, 9:35:35 PM6/20/18

to pyd...@googlegroups.com

Yes, I think this was intentional. See https://github.com/pandas-dev/pandas/pull/21029 and https://github.com/pandas-dev/pandas/issues/19497

Basically, doing `index=List[Tuple]` or `columns=List[Tuple]` is ambiguous. It could be a MultiIndex or an Index where the actual items are tuples.

The Index constructor currently has a `tupleize_cols` keyword to control this, but I think the plan is to eventually deprecate that in favor of a `dtype=object` keyword, so we don't want to add it to other methods.

Tom

--
You received this message because you are subscribed to the Google Groups "PyData" group.

To unsubscribe from this group and stop receiving emails from it, send an email to pydata+unsubscribe@googlegroups.com.

Paul Hobson

unread,

Jun 21, 2018, 1:10:37 PM6/21/18

to pyd...@googlegroups.com

Thanks, Tom. The insight is much appreciated.

I'll stick with piping to my helper function.

Cheers,

-Paul

To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "PyData" group.

To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.

Joris Van den Bossche

unread,

Jun 21, 2018, 6:02:37 PM6/21/18

to PyData

One other alternative with the existing methods is set_axis, for which you can specify the full Index object that you want to use to replace the existing one (as opposed the .rename() where the mapper/function only works on the single Index labels):

In [135]: df.set_axis(df.columns.str.split('_', expand=True), axis=1, inplace=False)
Out[135]:
0 1 0 1 0
n n p p x
0 -0.175580 -0.139883 1.010042 0.478417 -1.531008
1 -1.646657 0.526828 0.149565 0.561415 0.032743

With one caveat: when using in a method chain, this only works if the columns are still identical as in the starting dataframe in the chain (so if you have a reference to that)

Joris

To unsubscribe from this group and stop receiving emails from it, send an email to pydata+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "PyData" group.

To unsubscribe from this group and stop receiving emails from it, send an email to pydata+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "PyData" group.

To unsubscribe from this group and stop receiving emails from it, send an email to pydata+unsubscribe@googlegroups.com.

Paul Hobson

unread,

Jun 21, 2018, 6:24:41 PM6/21/18

to pyd...@googlegroups.com

Thanks, Joris.

That caveat is important, as I think it a lot of my cases those conditions won't be met. But this is nonetheless really helpful as I think it cleans up my helper function a lot.

(Note to future readers: at this time, it appears set_axis's default value of inplace is True, which is unusual for pandas. Be sure to explicitly pass inplace=False if you're a chainer)

import pandas as pd
import numpy as np

def split_cols(df, split_char, names):
    return (
        df.set_axis(df.columns.str.split(split_char, expand=True), axis='columns', inplace=False)
          .rename_axis(names, axis='columns')
    )

data = (
    pd.DataFrame(np.random.randn(3, 5), columns=('0_n', '1_n', '0_p', '1_p', '0_x'))
      .pipe(split_cols, '_', ['top', 'bottom'])

)

top            0         1         0         1         0
bottom         n         n         p         p         x
0       0.592875 -0.672805  0.717177  0.332351 -1.046899
1       1.273643 -0.966496  0.288252 -0.792620 -0.524999
2       0.979101 -0.097374  0.237005 -0.117168 -2.508691

To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "PyData" group.

To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "PyData" group.

To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "PyData" group.

To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.

Reply all

Reply to author

Forward