Patsy and the Dummy Variable Trap


Daniel Hoynoski

May 30, 2018, 9:53:14 PM
to PyData
I have been reading the Patsy documentation, but I am still a bit perplexed about how Patsy decides which column vector to drop when dealing with the "dummy variable trap." Let me illustrate my question with the following Python example:

Let's say I have two independent categorical variables x and z with a dependent variable of vector y:

import numpy as np

patsy_equation = "y ~ C(x) + C(z) - 1"
weight = np.array([0.37, 0.37, 0.53, 0.754])  # defined for the fit, not used below
y = np.array([0.23, 0.55, 0.66, 0.88])
x = np.array([3, 3, 4, 5])
z = np.array([9, 10, 10, 11])
data = {"x": x.tolist(), "z": z.tolist(), "y": y.tolist()}

Patsy (via patsy.dmatrix) creates the following design matrix:
[[1. 0. 0. 0. 0.]
 [1. 0. 0. 1. 0.]
 [0. 1. 0. 1. 0.]
 [0. 0. 1. 0. 1.]]
  3  4  5  10 11  <-- level each column represents
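For completeness, here is a runnable sketch of the setup above (assuming numpy and patsy are installed); the variable names match the snippet in the question:

```python
import numpy as np
from patsy import dmatrix

y = np.array([0.23, 0.55, 0.66, 0.88])
x = np.array([3, 3, 4, 5])
z = np.array([9, 10, 10, 11])
data = {"x": x.tolist(), "z": z.tolist(), "y": y.tolist()}

# "- 1" removes the intercept; C() marks x and z as categorical.
X = np.asarray(dmatrix("C(x) + C(z) - 1", data))
print(X)  # columns: x=3, x=4, x=5, z=10, z=11 (z=9 is dropped)
```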

The first three columns represent "C(x)" (levels 3, 4, and 5), and the last two columns represent levels 10 and 11 of "C(z)". Level 9 is dropped: if all three z levels were encoded, each group of columns would sum to a column of ones, making the two groups perfectly collinear.

Although I understand why it dropped a column, how does patsy decide which column to drop?

Among the valid ways to avoid the dummy variable trap, I could instead keep the 9 from "C(z)" and drop the 11:
[[1. 0. 0. 1. 0.]
 [1. 0. 0. 0. 1.]
 [0. 1. 0. 0. 1.]
 [0. 0. 1. 0. 0.]]
  3  4  5  9  10  <-- level each column represents

However, this gives me a different set of parameters (i.e. a different best-fit line), although the R² value stays the same.

How does patsy decide which column to drop? And are the two matrices above basically the same at the end of the day?
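One way to check whether the two matrices are "the same" is to fit both with plain least squares (a numpy-only sketch; X1 and X2 are the two matrices from the post, with z=9 and z=11 dropped respectively):

```python
import numpy as np

y = np.array([0.23, 0.55, 0.66, 0.88])

# Encoding 1: columns x=3, x=4, x=5, z=10, z=11 (z=9 dropped).
X1 = np.array([[1, 0, 0, 0, 0],
               [1, 0, 0, 1, 0],
               [0, 1, 0, 1, 0],
               [0, 0, 1, 0, 1]], dtype=float)

# Encoding 2: columns x=3, x=4, x=5, z=9, z=10 (z=11 dropped).
X2 = np.array([[1, 0, 0, 1, 0],
               [1, 0, 0, 0, 1],
               [0, 1, 0, 0, 1],
               [0, 0, 1, 0, 0]], dtype=float)

b1, *_ = np.linalg.lstsq(X1, y, rcond=None)
b2, *_ = np.linalg.lstsq(X2, y, rcond=None)

# The coefficients differ, but the fitted values (and hence R²) agree,
# because both encodings span the same column space.
print(np.allclose(X1 @ b1, X2 @ b2))  # True
```

The intuition: within each encoding, the dropped column equals the all-ones vector minus the kept columns of that factor, so both matrices span the same subspace and produce identical predictions.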

-Dan

Nathaniel Smith

May 30, 2018, 11:25:45 PM
to pyd...@googlegroups.com
Patsy processes the predictors from low-order to high-order, and from
left-to-right. In this case all the predictors are first order (no
intercept, no interaction terms), so it's just left to right.

So it processes 'x' first, and uses a full-rank encoding. Then it
processes 'z', and it notices that since 'x' had a full-rank encoding,
if it gives 'z' a full-rank encoding, that will create collinearity.
So it decides to use a reduced-rank encoding for 'z'.

Then it has to actually generate the values. Since you haven't
specified any particular coding system, it uses the default coding
system, which is patsy.Treatment:

https://patsy.readthedocs.io/en/latest/API-reference.html#patsy.Treatment

When the Treatment class gets asked to produce a full-rank encoding,
it uses full-rank dummy encoding. When it gets asked to produce a
reduced-rank encoding, it does that by dummy coding all the levels
except for the "reference level". By default the reference level is
whichever level comes first, and by default levels are sorted. So in
your data, "9" is the first level (because it's the smallest), and
that's what patsy.Treatment uses as the reference level.
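For instance, a sketch of overriding the reference level (assuming numpy and patsy are installed); Treatment(reference=11) should make level 11 the dropped column, reproducing the second matrix from the question:

```python
import numpy as np
from patsy import dmatrix

data = {"x": [3, 3, 4, 5], "z": [9, 10, 10, 11]}

# Default: z=9 (first sorted level) is the reference and gets dropped.
X_default = np.asarray(dmatrix("C(x) + C(z) - 1", data))

# Override: make z=11 the reference, so columns for z=9 and z=10 are kept.
X_ref11 = np.asarray(dmatrix("C(x) + C(z, Treatment(reference=11)) - 1", data))
print(X_ref11)  # columns: x=3, x=4, x=5, z=9, z=10
```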

You can change the ordering of levels by using C(data, levels=[...]),
or if you pass in a pandas Categorical then patsy respects its level
ordering. You can change Treatment coding to use a different reference
level -- see the link above for examples. You can switch to a
different coding system entirely, by passing it like C(data,
coding_system), and this coding system could be one of the ones that
ship as part of patsy (e.g. patsy.Poly for polynomial encoding), or
you can define your own custom system:

https://patsy.readthedocs.io/en/latest/categorical-coding.html
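A sketch of the levels override (the ordering [11, 9, 10] is an arbitrary illustration, chosen so that 11 comes first and therefore becomes the reference level under the default Treatment coding):

```python
import numpy as np
from patsy import dmatrix

data = {"x": [3, 3, 4, 5], "z": [9, 10, 10, 11]}

# With 11 listed first, it becomes the reference level that gets dropped.
X = np.asarray(dmatrix("C(x) + C(z, levels=[11, 9, 10]) - 1", data))
print(X)  # columns: x=3, x=4, x=5, z=9, z=10
```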

Hope that helps.

-n



--
Nathaniel J. Smith -- https://vorpus.org