Convert values to indicator columns

Warren Weckesser

Dec 14, 2011, 1:41:47 PM

to pystat...@googlegroups.com

Hey stats modelers,

Does pandas (or anything else) already have a convenient way to convert a file like this:

gender,hand,color,height
male,right,green,5.75
female,right,blue,5.42
female,left,blue,5.58
male,right,grey,5.92
male,right,brown,5.83

into this:

gender.female,gender.male,hand.left,hand.right,color.blue,color.brown,color.green,height,age
0,1,0,1,0,0,1,5.75,23
1,0,0,1,0,1,0,5.42,27
1,0,1,0,0,0,1,5.58,31
0,1,0,1,0,1,0,5.92,39
0,1,0,1,1,0,0,5.83,33

That is, each categorical value becomes a column containing 0/1 (i.e., boolean) values.

Warren

Aman Thakral

Dec 14, 2011, 1:47:03 PM

to pystat...@googlegroups.com

Hi Warren,

I recently had to do this. I used the pivot function on each categorical column and then joined the resulting DataFrames together. I'm not sure it is the most efficient way, but it does the trick.

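import numpy as np

def create_nominal_variables(df, index, column):
    # Prefix each new column with the name of the source column.
    col_name = lambda x: '%s_%s' % (column, x)
    # Helper column of ones to pivot on (note: this is added to df in place).
    df['1'] = np.ones(len(df))
    df2 = df[[index, '1', column]]
    df2 = df2.pivot(index=index, columns=column, values='1').fillna(0)
    df2 = df2.rename(columns=col_name)
    return df2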
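
For anyone trying the pivot approach, here is a minimal usage sketch. It assumes the frame has a column of unique row keys to pivot on (the `id` column below is hypothetical, as is the sample data layout), and it works on a copy so the caller's frame is not modified. It is an illustration of the idea, not Aman's exact code.

```python
import numpy as np
import pandas as pd

def create_nominal_variables(df, index, column):
    """Pivot one categorical column into 0/1 indicator columns."""
    tmp = df[[index, column]].copy()      # copy, so df itself is untouched
    tmp["1"] = np.ones(len(tmp))          # helper column of ones to pivot on
    wide = tmp.pivot(index=index, columns=column, values="1").fillna(0)
    return wide.rename(columns=lambda x: "%s_%s" % (column, x))

df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],                # hypothetical unique key column
    "gender": ["male", "female", "female", "male", "male"],
    "hand": ["right", "right", "left", "right", "right"],
    "height": [5.75, 5.42, 5.58, 5.92, 5.83],
})

# Keep the numeric column, then join one indicator frame per categorical column.
result = df.set_index("id")[["height"]]
for col in ["gender", "hand"]:
    result = result.join(create_nominal_variables(df, "id", col))

print(result)
```

The pivot needs the key column to be unique per row; with duplicate keys, `pivot` raises an error.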

Warren Weckesser

Dec 14, 2011, 1:47:45 PM

to pystat...@googlegroups.com

On Wed, Dec 14, 2011 at 12:41 PM, Warren Weckesser <warren.w...@enthought.com> wrote:

> Hey stats modelers,
>
> Does pandas (or anything else) already have a convenient way to convert a file like this:
>
> gender,hand,color,height
> male,right,green,5.75
> female,right,blue,5.42
> female,left,blue,5.58
> male,right,grey,5.92
> male,right,brown,5.83

Argh--cut-and-pasted the wrong data snippet. Here's the input:

gender,hand,color,height,age
male,right,green,5.75,23
female,right,brown,5.42,27
female,left,green,5.58,31
male,right,brown,5.92,39
male,right,blue,5.83,33

Wes McKinney

Dec 14, 2011, 1:56:21 PM

to pystat...@googlegroups.com

On Wed, Dec 14, 2011 at 1:47 PM, Warren Weckesser

Here's a quick hack at it (not too dissimilar to Aman's code, it looks
like). I should find a place in the library to put this:

import numpy as np
from pandas import DataFrame

def make_dummies(data, cat_variables):
    # Start from the non-categorical columns, then append the dummies.
    result = data.drop(cat_variables, axis=1)

    for variable in cat_variables:
        dummies = _get_dummy_frame(data, variable)
        result = result.join(dummies)
    return result

def _get_dummy_frame(data, column):
    from pandas import Factor
    factor = Factor(data[column])
    # Row i of the identity matrix is the indicator vector for level i.
    dummy_mat = np.eye(len(factor.levels)).take(factor.labels, axis=0)
    dummy_cols = ['%s.%s' % (column, v) for v in factor.levels]
    dummies = DataFrame(dummy_mat, index=data.index,
                        columns=dummy_cols)

    return dummies

In [29]: df
Out[29]:
   gender   hand  color  height  age
0    male  right  green    5.75   23
1  female  right  brown    5.42   27
2  female   left  green    5.58   31
3    male  right  brown    5.92   39
4    male  right   blue    5.83   33

In [30]: make_dummies(df, ['gender', 'hand', 'color']).T
Out[30]:
                  0     1     2     3     4
height         5.75  5.42  5.58  5.92  5.83
age              23    27    31    39    33
gender.female     0     1     1     0     0
gender.male       1     0     0     1     1
hand.left         0     0     1     0     0
hand.right        1     1     0     1     1
color.blue        0     0     0     0     1
color.brown       0     1     0     1     0
color.green       1     0     1     0     0

(BTW I read in that data using df = read_clipboard(sep=','))
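
A note for readers on a current pandas: as far as I know, `Factor` is no longer available there, so the function above will not import as written. The sketch below uses `pandas.get_dummies` to get the same layout. The `prefix_sep="."` and `dtype=int` arguments are my choices to match the column names and 0/1 values shown above; they are not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female", "female", "male", "male"],
    "hand": ["right", "right", "left", "right", "right"],
    "color": ["green", "brown", "green", "brown", "blue"],
    "height": [5.75, 5.42, 5.58, 5.92, 5.83],
    "age": [23, 27, 31, 39, 33],
})

# Encode only the listed columns; the other columns are passed through first.
dummies = pd.get_dummies(
    df,
    columns=["gender", "hand", "color"],
    prefix_sep=".",   # gives names like "gender.female"
    dtype=int,        # 0/1 integers instead of booleans
)

print(dummies.T)
```

The untouched columns (`height`, `age`) come first, followed by the indicator columns in the order the categorical columns were listed.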

- Wes

Skipper Seabold

Dec 14, 2011, 2:03:13 PM

to pystat...@googlegroups.com

Hi Warren,

Unfortunately, what I have doesn't work well for the case of several variables and doesn't work on DataFrames, but if you file an issue I'll work something up. In the meantime, here is the second-best option:

from StringIO import StringIO
import pandas
import scikits.statsmodels.api as sm
import numpy as np

s = """gender,hand,color,height
male,right,green,5.75
female,right,blue,5.42
female,left,blue,5.58
male,right,grey,5.92
male,right,brown,5.83"""

recarr = np.genfromtxt(StringIO(s), delimiter=",", dtype=None, names=True)

gender = pandas.DataFrame.from_records(sm.categorical(recarr[['gender']], drop=True))
hand = pandas.DataFrame.from_records(sm.categorical(recarr[['hand']], drop=True))
color = pandas.DataFrame.from_records(sm.categorical(recarr[['color']], drop=True))

df = pandas.DataFrame.from_records(recarr[['height']])
df = df.join(gender)
df = df.join(hand)
df = df.join(color)

df.columns
Index([height, gender_female, gender_male, hand_left, hand_right,
       color_blue, color_brown, color_green, color_grey], dtype=object)

df.values
array([[ 5.75, 0. , 1. , 0. , 1. , 0. , 0. , 1. , 0. ],
       [ 5.42, 1. , 0. , 0. , 1. , 1. , 0. , 0. , 0. ],
       [ 5.58, 1. , 0. , 1. , 0. , 1. , 0. , 0. , 0. ],
       [ 5.92, 0. , 1. , 0. , 1. , 0. , 0. , 0. , 1. ],
       [ 5.83, 0. , 1. , 0. , 1. , 0. , 1. , 0. , 0. ]])

Skipper

Skipper Seabold

Dec 14, 2011, 2:07:44 PM

to pystat...@googlegroups.com

Is that a sly return of the Factor object I see? Is it being used anywhere else internally? I.e., could we just do this when reading data in?

Skipper

josef...@gmail.com

Dec 14, 2011, 2:56:27 PM

to pystat...@googlegroups.com

Here is my (numpy) version, used two weeks ago for combining groups
and getting integer labels for internal usage, not included (it is in
separate files):

dummy = (group_index[:,None] == np.arange(len(uniques)))

labels are tuples in the current version, I think.
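
To make the one-liner concrete, here is a small self-contained sketch. How `uniques` and `group_index` are computed is an assumption on my part (plain `np.unique` with `return_inverse=True` on a single column); the attached file may build them differently, e.g. from tuples of several columns.

```python
import numpy as np

color = np.array(["green", "brown", "green", "brown", "blue"])

# uniques: sorted distinct values; group_index: integer label of each row.
uniques, group_index = np.unique(color, return_inverse=True)

# Broadcasting compares every row label against every level index,
# giving an (n_rows, n_levels) boolean indicator matrix.
dummy = (group_index[:, None] == np.arange(len(uniques))).astype(int)

print(uniques)
print(dummy)
```

Each row of `dummy` has exactly one 1, in the column of that row's level.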

Josef

try_factor_indices.py

josef...@gmail.com

Dec 18, 2011, 1:46:12 PM

to pystat...@googlegroups.com

FYI, I started to collect my various group handling functions
https://github.com/josef-pkt/statsmodels/blob/mixed/scikits/statsmodels/tools/grouputils.py

pure numpy, and a bit of sparse, written mostly for internal use.

Josef


Warren Weckesser

Mar 16, 2012, 12:20:19 AM

to pystat...@googlegroups.com

Reviving an old thread to point out a snippet I put on scipy-central:

http://scipy-central.org/item/35/1/convert-categorical-data-in-a-structure-numpy-array-to-boolean-fields

It works on n-dimensional structured arrays, with minimal temporary memory used.
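
For readers who cannot reach that link, the following is a rough sketch of the general idea only. It is not the scipy-central snippet: the function name `to_indicator_fields` and the `field.value` naming scheme are hypothetical. It builds a new structured dtype with one boolean field per categorical value and fills it one field at a time, so the only temporary is a single boolean array per value.

```python
import numpy as np

def to_indicator_fields(arr, fields):
    """Return a structured array where each field named in `fields`
    is replaced by one boolean field per distinct value."""
    levels = {}
    descr = []
    for name in arr.dtype.names:
        if name in fields:
            levels[name] = np.unique(arr[name])
            descr.extend(("%s.%s" % (name, v), np.bool_) for v in levels[name])
        else:
            descr.append((name, arr.dtype[name]))   # keep field unchanged

    out = np.empty(arr.shape, dtype=descr)          # same shape, any ndim
    for name in arr.dtype.names:
        if name in fields:
            for v in levels[name]:
                out["%s.%s" % (name, v)] = (arr[name] == v)
        else:
            out[name] = arr[name]
    return out

arr = np.array(
    [("male", "right", 5.75), ("female", "left", 5.58)],
    dtype=[("gender", "U6"), ("hand", "U5"), ("height", "f8")],
)
out = to_indicator_fields(arr, ["gender", "hand"])
print(out.dtype.names)
print(out)
```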

Warren
