RFC: Generics for indicators, constrasts, etc. in StatsBase?

127 views

Skip to first unread message

Douglas Bates

unread,

Jul 13, 2016, 12:48:55 PM7/13/16

to julia-stats

I opened https://github.com/JuliaStats/DataFrames.jl/issues/1012 because the ModelFrame/ModelMatrix code badly needs refactoring. It was written early in the development of Julia when I was still thinking R while writing Julia.

Part of the motivation is to fix problems with ModelMatrix in Julia v0.5.0-dev using the new DataFrames forumlation using NullableArrays and CategoricalArrays.

One issue I encountered is expanding columns from terms involivng a CategoricalArray or a PooledDataArray. If you have a NominalArray in a model with an intercept, that term should generate k - 1 columns in the model matrix. In R the reduced set of columns are called the contrasts. Some will argue with that name (technically contrasts columns are defined as being orthogonal to a constant column but that is no longer important).

One way of generating contrasts is first to generate the matrix of indicators then generate the desired contrasts. Sometimes it is simpler to generate the contrasts matrix directly.

Contrasts can be defined by a k by k-1 matrix. The default in R for nominal arrays are the "treatment contrasts". The matrix defining these is obtained by dropping the first column of an identity of size k. To reproduce the parameterization used in SAS the last column of the identity is dropped. For ordinal arrays polynomial contrasts are sometimes used.

Currently there is an indicatormat generic in StatsBase that creates a Matrix{Bool}, either sparse or dense, that is the transpose of the matrix of indicator columns. That is, it is the Matrix{Bool} of indicator rows, not columns.

julia> indicatormat(repeat([1,2,3], inner=2))
3×6 Array{Bool,2}:
  true   true  false  false  false  false
 false  false   true   true  false  false
 false  false  false  false   true   true

I suggest that indicatormat methods be defined for PooledDataArray and CategoricalArray types too.

Regarding contrasts, I think the contrasts generic should also be defined in StatsBase. Methods would be defined in packages like DataArrays and CategoricalArrays because they depend on the internal representation of the array type. The primary method would be like

function contrasts{T <: Number}(m::Matrix{T}, a::NominalArray)
    km1, k = size(m)
    nlev = length(levels(a))
    if k ≠ nlev || km1 ≠ k - 1
        throw(DimensionMismatch("m of size $(size(m)) should be $(nlev - 1) × $nlev"))
    end
    m[:, a.refs]
end

Contrast types can be expressed as functions that map k to a k - 1 x k matrix.

contrasts(f::Function, a::NominalArray) = contrasts(f(length(levels(a))), a)

contrTreatment(k) = eye(k)[2:end, :]

contrasts(a::NominalArray) = contrasts(contrTreatment, a)

Most uses require the contrasts columns so we could consider whether to stick with the indicatormat convention or to return contrast columns rather than contrast rows.

Dave Kleinschmidt

unread,

Jul 21, 2016, 2:58:15 PM7/21/16

to julia-stats

Have you looked at https://github.com/JuliaStats/DataFrames.jl/pull/870? Many (but not all, I think) of these ideas are incorporated there.

Reply all

Reply to author

Forward

0 new messages