Pandas Data Frame: how to build multi columns?

1,759 views
Skip to first unread message

Michael Hooreman

unread,
Dec 21, 2015, 6:24:13 AM12/21/15
to pyd...@googlegroups.com
Hello,

I'm sorry, I have a simple question, but I'm honnestly a bit stuck in finding a reply in the documentation. I should be a bit blind..

I have a pandas DataFrame which has columns which looks like:
  • n_0
  • n_1
  • p_0
  • p_1
I want to transform it to have columns and subcolumns:
  • 0
    • n
    • p
  • 1
    • n
    • p
How can I do?

Thanks a lot. 

P.S: The real world is more complex, but the layout is there.

--

Francesc Alted

unread,
Dec 21, 2015, 6:35:23 AM12/21/15
to pyd...@googlegroups.com
Hi Michael,

Pandas has powerful reshaping functionality that will help you here:


Francesc

--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Francesc Alted

Michael Hooreman

unread,
Dec 21, 2015, 7:46:35 AM12/21/15
to pyd...@googlegroups.com
Hello,

Finally, I found a solution.

You can find the example script below.

    #!/usr/bin/env python3
    import pickle
    import pandas as pd
    import itertools
    import numpy as np

    data = pd.DataFrame(np.random.randn(10, 5), columns=('0_n', '1_n', '0_p', '1_p', 'x'))

    indices = set()
    groups = set()
    others = set()
    for c in data.columns:
        if '_' in c:
            (i, g) = c.split('_')
            c2 = pd.MultiIndex.from_tuples((i, g),)
            indices.add(int(i))
            groups.add(g)
        else:
            others.add(c)
    columns = list(itertools.product(groups, indices))
    columns = pd.MultiIndex.from_tuples(columns)
    ret = pd.DataFrame(columns=columns)
    for c in columns:
        ret[c] = data['%d_%s' % (int(c[1]), c[0])]
    for c in others:
        ret[c] = data['%s' % c]
    ret.rename(columns={'total': 'total_indices'}, inplace=True)

    print("Before:")
    print(data)
    print("")
    print("After:")
    print(ret)


Sorry for this...

2015-12-21 12:23 GMT+01:00 Michael Hooreman <mic...@hooreman.be>:

Joris Van den Bossche

unread,
Dec 22, 2015, 4:44:19 AM12/22/15
to PyData
If you have a recent pandas, you can do this a bit simpler with the following:

data = pd.DataFrame(np.random.randn(10, 5), columns=('0_n', '1_n', '0_p', '1_p', 'x'))

data.columns = data.columns.str.split('_', expand=True)
data.columns = data.columns.swaplevel(1,0)


This gives:

In [11]: data
Out[11]:
          n                   p                 NaN
          0         1         0         1         x
0  0.882079 -0.167304 -0.081573 -0.607486 -0.907488
1 -0.600067  0.285092 -1.592314 -0.049510 -1.050059
2  1.309180 -0.149241  0.428812 -2.046242 -0.141231
...


Only for the `x` columns this gives not fully the desired result. But therefore, it is maybe better to swap the strings first before splitting:

In [14]: data.columns.str[::-1].str.split('_', expand=True)
Out[14]:
MultiIndex(levels=[[u'n', u'p', u'x'], [u'0', u'1']],
           labels=[[0, 0, 1, 1, 2], [0, 1, 0, 1, -1]])

In [15]: data.columns = data.columns.str[::-1].str.split('_', expand=True)

In [16]: data
Out[16]:
          n                   p                   x
          0         1         0         1       NaN
0  0.994069 -1.525370 -0.001837 -0.358824 -0.040629
1 -1.133140  1.088123 -0.112497 -0.926222  0.291892
2 -0.655722 -1.236698 -2.019218 -1.832824  0.532997
...



--

Paul Hobson

unread,
Dec 22, 2015, 11:43:57 AM12/22/15
to pyd...@googlegroups.com
On Tue, Dec 22, 2015 at 1:44 AM, Joris Van den Bossche <jorisvand...@gmail.com> wrote:
If you have a recent pandas, you can do this a bit simpler with the following:

data = pd.DataFrame(np.random.randn(10, 5), columns=('0_n', '1_n', '0_p', '1_p', 'x'))

data.columns = data.columns.str.split('_', expand=True)
data.columns = data.columns.swaplevel(1,0)


This gives:

In [11]: data
Out[11]:
          n                   p                 NaN
          0         1         0         1         x
0  0.882079 -0.167304 -0.081573 -0.607486 -0.907488
1 -0.600067  0.285092 -1.592314 -0.049510 -1.050059
2  1.309180 -0.149241  0.428812 -2.046242 -0.141231
...



Joris, you just a blew couple of minds here in my office.

This is awesome.
-paul

Paul Hobson

unread,
Jun 20, 2018, 6:07:19 PM6/20/18
to PyData


On Tuesday, December 22, 2015 at 1:44:19 AM UTC-8, Joris Van den Bossche wrote:
If you have a recent pandas, you can do this a bit simpler with the following:

data = pd.DataFrame(np.random.randn(10, 5), columns=('0_n', '1_n', '0_p', '1_p', 'x'))

data.columns = data.columns.str.split('_', expand=True)
data.columns = data.columns.swaplevel(1,0)


This gives:

In [11]: data
Out[11]:
          n                   p                 NaN
          0         1         0         1         x
0  0.882079 -0.167304 -0.081573 -0.607486 -0.907488
1 -0.600067  0.285092 -1.592314 -0.049510 -1.050059
2  1.309180 -0.149241  0.428812 -2.046242 -0.141231
...



 
Sorry to resurrect an old thread. I noticed that in 0.22 you could do this to get a MultiIndex in the columns:

data = (
    pd.DataFrame(np.random.randn(10, 5), columns=('0_n', '1_n', '0_p', '1_p', '0_x'))
            .rename(columns=dict(zip(_raw.columns, _raw.columns.str.split('_', expand=True))))
)

This was quite convenient as it didn't break a long chain. But with 0.23, it's not longer working.

I'm currently working around this by piping to the following function:

def split_cols(df, split_char, names=None):

    new = df.copy()

    new.columns = df.columns.str.split(split, expand=True) # broken chain

    return new

This works perfectly well, but I'm curious if 1) the behavior of .rename changed intentionally and 2) if there's a new preferred way return a new dataframe with split columns.

Thanks,
-Paul

Tom Augspurger

unread,
Jun 20, 2018, 9:35:35 PM6/20/18
to pyd...@googlegroups.com

Basically, doing `index=List[Tuple]` or `columns=List[Tuple]` is ambiguous. It could be a MultiIndex or an Index where the actual items are tuples.

The Index constructor currently has a `tupleize_cols` keyword to control this, but I think the plan is to eventually deprecate that in favor of a `dtype=object` keyword, so we don't want to add it to other methods.

Tom


--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+unsubscribe@googlegroups.com.

Paul Hobson

unread,
Jun 21, 2018, 1:10:37 PM6/21/18
to pyd...@googlegroups.com
Thanks, Tom. The insight is much appreciated.

I'll stick with piping to my helper function.

Cheers,
-Paul

To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.

Joris Van den Bossche

unread,
Jun 21, 2018, 6:02:37 PM6/21/18
to PyData
One other alternative with the existing methods is set_axis, for which you can specify the full Index object that you want to use to replace the existing one (as opposed the .rename() where the mapper/function only works on the single Index labels):

In [135]: df.set_axis(df.columns.str.split('_', expand=True), axis=1, inplace=False)
Out[135]:
          0         1         0         1         0
          n         n         p         p         x
0 -0.175580 -0.139883  1.010042  0.478417 -1.531008
1 -1.646657  0.526828  0.149565  0.561415  0.032743


With one caveat: when using in a method chain, this only works if the columns are still identical as in the starting dataframe in the chain (so if you have a reference to that)

Joris




To unsubscribe from this group and stop receiving emails from it, send an email to pydata+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+unsubscribe@googlegroups.com.

Paul Hobson

unread,
Jun 21, 2018, 6:24:41 PM6/21/18
to pyd...@googlegroups.com
Thanks, Joris.

That caveat is important, as I think it a lot of my cases those conditions won't be met. But this is nonetheless really helpful as I think it cleans up my helper function a lot.

(Note to future readers: at this time, it appears set_axis's default value of inplace is True, which is unusual for pandas. Be sure to explicitly pass inplace=False if you're a chainer)

import pandas as pd
import numpy as np

def split_cols(df, split_char, names):
    return (
        df.set_axis(df.columns.str.split(split_char, expand=True), axis='columns', inplace=False)
          .rename_axis(names, axis='columns')
    )

data = (
    pd.DataFrame(np.random.randn(3, 5), columns=('0_n', '1_n', '0_p', '1_p', '0_x'))
      .pipe(split_cols, '_', ['top', 'bottom'])
    
)

top            0         1         0         1         0
bottom         n         n         p         p         x
0       0.592875 -0.672805  0.717177  0.332351 -1.046899
1       1.273643 -0.966496  0.288252 -0.792620 -0.524999
2       0.979101 -0.097374  0.237005 -0.117168 -2.508691



To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages