categorical variables: dummy coding vs effect coding

christoph...@ucsf.edu

Sep 23, 2014, 10:42:04 PM

to pystat...@googlegroups.com

Hi

Is there any way to force "effect coding" of a categorical variable?

effect coding = -1/1, dummy coding = 0/1.
outcomeVariable=DayNight+eastWest+DayNight*eastWest.

Using `from statsmodels.formula.api import ols`,
the p-values for DayNight and DayNight*eastWest stay consistent,
but the p-value for eastWest changes depending on whether I use -1/1, 1/2 or 0.25/0.75 to encode the choices for genotype.

I would just change "east"/"west" and "Day"/"Night" to -1/1 explicitly in my input file, but I'm not sure that will be handled correctly in an interaction term. Arithmetic tells me that multiplying -1/1 by -1/1 only gives two outcomes, where I'd like four (A*A, A*B, B*A, B*B). Am I misunderstanding how an interaction variable works?

christoph...@ucsf.edu

Sep 24, 2014, 5:51:20 PM

to pystat...@googlegroups.com

I've read up some on this, and -1/1 coding seems the way to go for interactions of two categorical variables.
See:
http://en.wikipedia.org/wiki/Categorical_variable#Categorical_variables_and_regression

Be careful about blindly using 0/1 dummy coded variables in interaction terms as happens by default.
Your interaction term might not be capturing exactly what you want. Out of the four combinations (A*A, A*B, B*A, and B*B) it might
have an interpretable meaning for only one combination.

To change the encoding you can edit your input design matrix.
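# dmatrices returns an (outcome, predictors) pair; return_type="dataframe" gives pandas DataFrames
outcome, designMatrixForEditing = patsy.dmatrices(formula, MyDataFrame, return_type="dataframe")
# "eastWest" must match the generated column name; for string-valued data patsy names it like "eastWest[T.west]"
designMatrixForEditing["eastWest"] = designMatrixForEditing["eastWest"].replace(to_replace=0, value=-1)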

josef...@gmail.com

Sep 25, 2014, 6:39:00 AM

to pystatsmodels

patsy offers different ways of encoding categorical variables.

Skipper had written this introduction for our documentation
http://statsmodels.sourceforge.net/devel/contrasts.html

I'm not much of an expert on encoding schemes, they are mainly linear
transformations to me.
We can always get the parameters or hypothesis tests for any different
encoding through linear restrictions/transformation/contrast using
`t_test`. But we don't have premade constraints yet in the style of
LSMEANS in SAS.
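
As a rough sketch of that idea (not a premade statsmodels feature): fit the model with the default treatment coding, then recover the sum-to-zero "main effect" of one factor through `t_test` with a hand-written contrast row. The data frame, the names A, B and y, and the effect sizes below are all made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical balanced 2x2 data; names and effect sizes are invented.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": np.repeat(["A1", "A2"], 20),
    "B": np.tile(np.repeat(["B1", "B2"], 10), 2),
})
df["y"] = (
    rng.normal(size=len(df))
    + 1.0 * (df["A"] == "A2")
    + 0.5 * (df["B"] == "B2")
)

# Default (treatment) coding. Parameter order:
# Intercept, A[T.A2], B[T.B2], A[T.A2]:B[T.B2]
fit_treat = smf.ols("y ~ A * B", data=df).fit()

# Effect of A1 -> A2 averaged over both levels of B, as a linear contrast:
# A beta + 0.5 * interaction beta.
avg_A = fit_treat.t_test(np.array([[0.0, 1.0, 0.0, 0.5]]))

# Same model refit with sum-to-zero coding, for comparison.
fit_sum = smf.ols("y ~ C(A, Sum) * C(B, Sum)", data=df).fit()
name = "C(A, Sum)[S.A1]"

# The contrast equals -2 times the sum-coded coefficient; p-values agree.
print(avg_A.effect, avg_A.pvalue)
print(-2 * fit_sum.params[name], fit_sum.pvalues[name])
```

Both codings describe the same fitted model, so the p-value of a "main effect" differs between them only because each coding tests a different linear combination of cell means.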

Josef

Nathaniel Smith

Sep 25, 2014, 8:30:50 AM

to pystatsmodels

On 24 Sep 2014 22:51, <christoph...@ucsf.edu> wrote:
>
>
> I've read up some on this, and -1/1 coding seems the way to go for interactions of two categorical variables.
> See:
> http://en.wikipedia.org/wiki/Categorical_variable#Categorical_variables_and_regression

I'm not very familiar with effects coding. A quick skim of that
Wikipedia article makes me somewhat wary of recommending it, though,
given that in the "weighted" form you and it describe, the
interpretation of the beta values becomes dependent on the exact
structure of your design. (E.g. if you have a missing value in some
row somewhere, then this changes which groups are included in which
proportions in the grand mean, and thus the exact pattern of
missingness changes the *meaning* of all betas. This seems
theoretically inappropriate for designed experiments in general, and a
major practical trap for the unwary.)

Also, the link you give doesn't say anything about how interactions
end up working with effects coding, or why they would be preferred in
this case.

Can you elaborate on what it is about effects coding that makes you
recommend it, esp. for interactions?

> Be careful about blindly using 0/1 dummy coded variables in interaction terms as happens by default.
> Your interaction term might not be capturing exactly what you want.

This is good advice for all defaults :-).

> Out of the four combinations (A*A, A*B, B*A, and B*B) it might
> have an interpretable meaning for only one combination.

I'm not sure what these combinations are that you're referring to, but
I do know how to interpret interactions that use 0/1 coding. There are
actually two very different coding schemes that use just the values 0
and 1: treatment coding and (what I strictly call) dummy coding.

Treatment coding: suppose we have two variables A and B, which take on
values A1, A2 and B1, B2 respectively. If we write the patsy formula
"A*B", then this is interpreted as the same as "1 + A + B + A:B", and
each of those four terms generates a single column of the design
matrix, giving us four betas in total. These are interpreted as:

Intercept beta: gives the mean of the A1,B1 data.

A beta: gives the difference between the means A2,B1 - A1,B1, i.e.,
estimates how changing A from A1 to A2 affects things (keeping B=B1
constant).

B beta: gives the difference between the means A1,B2 - A1,B1, i.e.,
estimates how changing B from B1 to B2 affects things (keeping A=A1
constant).

A:B beta: given the previous betas, we can make a guess at how going
from A1,B1 to A2,B2 might work, by following a path like:
A1,B1 -> A2,B1 -> A2,B2
we know how A1,B1 acts from the intercept, we know how changing A1->A2
acts from the A beta, and we know how B1->B2 acts from the B beta;
therefore, adding all these betas together gives us an idea about how
the A2,B2 data will act. But, this is only an approximation -- notice
that our B beta is a measurement of how going from B1->B2 affects
things when keeping A=A1 constant. Here A=A2. So this way of
approximating the A2,B2 data is valid if and only if the effect of
B1->B2 is the same regardless of the value of A, i.e., it's valid if
and only if A and B do not interact. The A:B beta gives the difference
between the actual behaviour of the A2,B2 data and the estimate you
would get by applying the above logic. (So, ignoring noise, it will be
zero if and only if A and B do not interact.)

Since the situation is symmetric, you can also rewrite the above
paragraph swapping A and B everywhere for a somewhat different
interpretation of the same situation -- the A:B beta also checks
whether the effect of A1->A2 is the same regardless of the value of B,
etc.
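
A small check of the interpretation above, as a sketch on invented numbers: with a saturated 2x2 model the fitted betas reproduce differences of cell means exactly, so they can be compared against a plain groupby. The data frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: two observations per cell, values invented.
df = pd.DataFrame({
    "A": ["A1", "A1", "A2", "A2", "A1", "A1", "A2", "A2"],
    "B": ["B1", "B1", "B1", "B1", "B2", "B2", "B2", "B2"],
    "y": [1.0, 3.0, 4.0, 6.0, 2.0, 4.0, 9.0, 11.0],
})
cell = df.groupby(["A", "B"])["y"].mean()

fit = smf.ols("y ~ A * B", data=df).fit()
# Order: Intercept, A[T.A2], B[T.B2], A[T.A2]:B[T.B2]
b0, bA, bB, bAB = fit.params.values

# Intercept is the A1,B1 mean; A and B betas are differences from it.
print(b0, cell.loc[("A1", "B1")])
print(bA, cell.loc[("A2", "B1")] - cell.loc[("A1", "B1")])
print(bB, cell.loc[("A1", "B2")] - cell.loc[("A1", "B1")])

# Additive guess for A2,B2 via the path A1,B1 -> A2,B1 -> A2,B2,
# and the interaction beta as "actual minus guess".
additive_guess = b0 + bA + bB
print(bAB, cell.loc[("A2", "B2")] - additive_guess)
```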

This is slightly complicated, but it's definitely interpretable. You
can also pick arbitrarily which levels you want to treat as the
"reference".
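
For instance, a sketch of choosing the reference level using patsy's `Treatment(reference=...)`, again on invented data. Here A2 becomes the reference for A, so the intercept moves to the A2,B1 cell mean and the A beta flips sign.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: two observations per cell, values invented.
df = pd.DataFrame({
    "A": ["A1", "A1", "A2", "A2", "A1", "A1", "A2", "A2"],
    "B": ["B1", "B1", "B1", "B1", "B2", "B2", "B2", "B2"],
    "y": [1.0, 3.0, 4.0, 6.0, 2.0, 4.0, 9.0, 11.0],
})
cell = df.groupby(["A", "B"])["y"].mean()

# Treat A2 (instead of the default first level, A1) as the reference for A.
fit_ref = smf.ols("y ~ C(A, Treatment(reference='A2')) * B", data=df).fit()

intercept = fit_ref.params["Intercept"]   # mean of the A2,B1 cell
a_beta = fit_ref.params.iloc[1]           # A1,B1 mean minus A2,B1 mean
print(intercept, cell.loc[("A2", "B1")])
print(a_beta, cell.loc[("A1", "B1")] - cell.loc[("A2", "B1")])
```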

That was treatment coding. For dummy coding sensu stricto, in patsy we
write something like "0 + A:B", which gives us a single term that
generates 4 columns. The first column is 1 for A1,B1 and 0 otherwise,
the second column is 1 for A2,B1 and 0 otherwise, etc. The resulting
betas just give us the means of each cell. This doesn't provide as
obvious a way to test for interactions, though.
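
A quick sketch of the "0 + A:B" case on invented data, to show that the four betas are just the four cell means. The data frame and values are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: two observations per cell, values invented.
df = pd.DataFrame({
    "A": ["A1", "A1", "A2", "A2", "A1", "A1", "A2", "A2"],
    "B": ["B1", "B1", "B1", "B1", "B2", "B2", "B2", "B2"],
    "y": [1.0, 3.0, 4.0, 6.0, 2.0, 4.0, 9.0, 11.0],
})
cell = df.groupby(["A", "B"])["y"].mean()

# No intercept, one indicator column per (A, B) cell.
fit_cells = smf.ols("y ~ 0 + A:B", data=df).fit()

print(fit_cells.params)  # one beta per cell
print(cell)              # the same numbers, computed directly
```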

See
http://patsy.readthedocs.org/en/latest/formulas.html#redundancy

> To change the encoding you can edit your input design matrix.
> designMatrixForEditing= patsy.dmatrices(formula, MyDataFrame)
>
> designMatrixForEditing["eastWest"]=designMatrixForEditing["eastWest"].replace(to_replace=0,value=-1)

An easier way is to simply tell patsy that you'd like this coding
scheme (which it knows as "sum to zero" coding):

formula = "C(A, Sum) * C(B, Sum)"

See:
http://patsy.readthedocs.org/en/latest/categorical-coding.html
http://patsy.readthedocs.org/en/latest/API-reference.html#patsy.Sum
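
To see what that formula produces, here is a minimal sketch that builds the design matrix for one row per cell (the data frame is invented). The interaction column on its own only takes the values -1 and 1, but together with the intercept and the two main-effect columns each of the four cells gets its own distinct row, so all four combinations are still distinguished.

```python
import numpy as np
import pandas as pd
import patsy

# Hypothetical data: one row per (A, B) cell.
df = pd.DataFrame({
    "A": ["A1", "A2", "A1", "A2"],
    "B": ["B1", "B1", "B2", "B2"],
})

X = patsy.dmatrix("C(A, Sum) * C(B, Sum)", df, return_type="dataframe")
print(X)

# Columns: Intercept, A main effect, B main effect, A:B interaction.
interaction_values = set(X.iloc[:, 3])
n_distinct_rows = np.unique(X.values, axis=0).shape[0]
rank = np.linalg.matrix_rank(X.values)
print(interaction_values, n_distinct_rows, rank)
```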

;-)

-n
