DataFrames: convert categorical variables to dummy ones?

Andrei

Oct 1, 2014, 4:23:04 AM
to julia...@googlegroups.com
Probably a simple question, but I can't find any reference.

What is the most convenient way to convert a categorical variable into a set of dummy variables?
So far I've been using

   int(indicatormat(df[:categoricalvar]))

but it's pretty annoying to do it for every variable, give every indicator a name, resolve collinearity, etc.
Is there any standard way to do these operations?
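
To make the three chores above concrete (building the indicators, naming them, and dropping one level to avoid collinearity), here is a minimal sketch in plain Julia with no packages. The function name `dummy_columns` and the `prefix_level` naming scheme are illustrative assumptions, not part of any library, and it is written for current Julia rather than the 2014 release used in this thread.

```julia
# Hypothetical helper: 0/1 indicator columns for one categorical vector.
# The first level in sorted order is dropped and acts as the reference
# category, so the columns are not collinear with an intercept.
function dummy_columns(values::AbstractVector, prefix::AbstractString)
    lvls = sort(unique(values))
    kept = lvls[2:end]                               # levels that get a column
    colnames = [Symbol(prefix, "_", l) for l in kept]
    # rows follow `values`, columns follow `kept`
    mat = [Int(v == l) for v in values, l in kept]
    return colnames, mat
end

colnames, mat = dummy_columns(["b", "a", "c", "a"], "x")
```

Here `colnames` holds one symbol per kept level and `mat` has one row per observation; rows whose value is the reference level are all zeros.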

Stefan Karpinski

Oct 1, 2014, 9:23:31 AM
to Julia Users
This is probably a better question for julia-stats, but there are likely some people here who can answer as well.

John Myles White

Oct 1, 2014, 10:27:33 AM
to julia...@googlegroups.com
Currently, the way to do this is via the GLM package (or at least its strategy for generating design matrices), which handles indicators for you.
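
As a rough illustration of the design-matrix route: in current package versions the formula machinery lives in StatsModels.jl (which GLM builds on), so the sketch below assumes recent DataFrames.jl and StatsModels.jl rather than the 2014 API discussed here. The column names `y` and `color` are made up for the example.

```julia
using DataFrames, StatsModels

# Toy data: `color` is a string column, which the formula machinery
# treats as categorical.
df = DataFrame(y = [1.0, 2.0, 3.0, 4.0],
               color = ["red", "green", "blue", "green"])

f  = @formula(y ~ 1 + color)
mf = ModelFrame(f, df)        # resolves the formula against the data
X  = ModelMatrix(mf).m        # numeric design matrix: intercept plus dummies
cols = coefnames(mf)          # names of the generated columns
```

With the default dummy coding, one level is used as the baseline, so three levels yield two indicator columns next to the intercept.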

Lots of improvements are possible, but we need better categorical data support at a lower level before we can work on the improvements: https://github.com/johnmyleswhite/CategoricalData.jl

 — John

Andrei Zh

Oct 1, 2014, 4:24:58 PM
to julia...@googlegroups.com
Thanks for your suggestions!
So far using GLM and pool!() with a list of categorical variables works fine for me.

Andrei Zh

Oct 11, 2014, 9:41:05 AM
to julia...@googlegroups.com
In some cases I found it convenient to flatten categorical variables into separate new columns instead of wrapping them in a PooledDataArray or CategoricalVariables. Here are some functions for doing this:

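function getdummy{R}(df::DataFrame, cname::Symbol, ::Type{R})
    darr = df[cname]
    # drop the first (sorted) level: it serves as the reference category,
    # which avoids collinearity with an intercept term
    vals = sort(levels(darr))[2:end]
    # map each remaining level to its column index
    namedict = Dict(vals, 1:length(vals))
    arr = zeros(R, length(darr), length(namedict))
    for i=1:length(darr)
        # rows holding the reference level stay all zeros
        if haskey(namedict, darr[i])
            arr[i, namedict[darr[i]]] = 1
        end
    end
    newdf = convert(DataFrame, arr)
    # name each dummy column "<variable>_<level>"
    names!(newdf, [symbol("$(cname)_$k") for k in vals])
    return newdf
end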

function convertdummy{R}(df::DataFrame, cnames::Array{Symbol}, ::Type{R})
    # treat every variable in cnames as categorical,
    # convert each one into a set of dummy variables,
    # and return a new DataFrame (other columns are carried over as-is)
    newdf = DataFrame()
    for cname in names(df)
        if !in(cname, cnames)
            newdf[cname] = df[cname]
        else
            dummydf = getdummy(df, cname, R)
            for dummyname in names(dummydf)
                newdf[dummyname] = dummydf[dummyname]
            end
        end
    end
    return newdf
end

convertdummy(df::DataFrame, cnames::Array{Symbol}) = convertdummy(df, cnames, Int32)
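
For readers on a current Julia release: the functions above use Julia 0.3-era syntax (`f{R}(...)` method parameters, `Dict(keys, values)`, `symbol`, `names!`) that no longer works. A rough port, assuming Julia 1.x and a recent DataFrames.jl, might look like the following. It keeps the same function names, drops the first sorted level, and defaults to Int32, but it is a sketch rather than the author's code; in particular it uses `sort(unique(...))` in place of `levels`.

```julia
using DataFrames

# One 0/1 column per level of `cname`, except the first (sorted) level,
# which is left out as the reference category.
function getdummy(df::DataFrame, cname::Symbol, ::Type{R}=Int32) where {R}
    col = df[!, cname]
    vals = sort(unique(col))[2:end]
    newdf = DataFrame()
    for v in vals
        newdf[!, Symbol(cname, "_", v)] = R.(col .== v)
    end
    return newdf
end

# Replace every column listed in `cnames` by its dummy columns and
# copy all other columns unchanged, preserving column order.
function convertdummy(df::DataFrame, cnames::Vector{Symbol}, ::Type{R}=Int32) where {R}
    newdf = DataFrame()
    for cname in propertynames(df)
        if cname in cnames
            dummies = getdummy(df, cname, R)
            for dname in propertynames(dummies)
                newdf[!, dname] = dummies[!, dname]
            end
        else
            newdf[!, cname] = df[:, cname]   # `:` makes a copy of the column
        end
    end
    return newdf
end

df = DataFrame(id = 1:4, color = ["red", "green", "blue", "green"])
out = convertdummy(df, [:color])
```

In this toy call, "blue" sorts first and becomes the reference level, so `out` has the columns `id`, `color_green` and `color_red`.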