I opened
https://github.com/JuliaStats/DataFrames.jl/issues/1012 because the ModelFrame/ModelMatrix code badly needs refactoring. It was written early in the development of Julia when I was still thinking R while writing Julia.
Part of the motivation is to fix problems with ModelMatrix in Julia v0.5.0-dev using the new DataFrames forumlation using NullableArrays and CategoricalArrays.
One issue I encountered is expanding columns from terms involivng a CategoricalArray or a PooledDataArray. If you have a NominalArray in a model with an intercept, that term should generate k - 1 columns in the model matrix. In R the reduced set of columns are called the contrasts. Some will argue with that name (technically contrasts columns are defined as being orthogonal to a constant column but that is no longer important).
One way of generating contrasts is first to generate the matrix of indicators then generate the desired contrasts. Sometimes it is simpler to generate the contrasts matrix directly.
Contrasts can be defined by a k by k-1 matrix. The default in R for nominal arrays are the "treatment contrasts". The matrix defining these is obtained by dropping the first column of an identity of size k. To reproduce the parameterization used in SAS the last column of the identity is dropped. For ordinal arrays polynomial contrasts are sometimes used.
Currently there is an indicatormat generic in StatsBase that creates a Matrix{Bool}, either sparse or dense, that is the transpose of the matrix of indicator columns. That is, it is the Matrix{Bool} of indicator rows, not columns.
julia> indicatormat(repeat([1,2,3], inner=2))
3×6 Array{Bool,2}:
true true false false false false
false false true true false false
false false false false true true
I suggest that indicatormat methods be defined for PooledDataArray and CategoricalArray types too.
Regarding contrasts, I think the contrasts generic should also be defined in StatsBase. Methods would be defined in packages like DataArrays and CategoricalArrays because they depend on the internal representation of the array type. The primary method would be like
function contrasts{T <: Number}(m::Matrix{T}, a::NominalArray)
km1, k = size(m)
nlev = length(levels(a))
if k ≠ nlev || km1 ≠ k - 1
throw(DimensionMismatch("m of size $(size(m)) should be $(nlev - 1) × $nlev"))
end
m[:, a.refs]
end
Contrast types can be expressed as functions that map k to a k - 1 x k matrix.
contrasts(f::Function, a::NominalArray) = contrasts(f(length(levels(a))), a)
contrTreatment(k) = eye(k)[2:end, :]
contrasts(a::NominalArray) = contrasts(contrTreatment, a)
Most uses require the contrasts columns so we could consider whether to stick with the indicatormat convention or to return contrast columns rather than contrast rows.