Dropped categorical variable in ols


Ryan Nelson

Apr 27, 2017, 6:20:38 PM4/27/17
to pystatsmodels
I have a question about dropped categorical variable levels in ols, which I hope has a very simple answer. I have a DataFrame with two columns of text-based categorical variables. As per the docs (http://www.statsmodels.org/devel/example_formulas.html), one category should be dropped to make room for an intercept, but it seems like one level is being removed from both categorical columns. Even after adding a "-1" to my formula to remove the intercept, I'm still missing one level. Below is a self-contained example using some M&M data from the internet. The summary output is also shown below; "color[T.blue]" is missing from the table. I'm using statsmodels version 0.8.0 and Python 3.5 from the Anaconda distro.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('http://stat.pugetsound.edu/hoard/datasets/mms.csv')
formula = 'mass ~ type + color - 1'
fit = smf.ols(formula, data=df).fit()
print(fit.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   mass   R-squared:                       0.922
Model:                            OLS   Adj. R-squared:                  0.921
Method:                 Least Squares   F-statistic:                     1361.
Date:                Thu, 27 Apr 2017   Prob (F-statistic):               0.00
Time:                        18:06:42   Log-Likelihood:                 156.53
No. Observations:                 816   AIC:                            -297.1
Df Residuals:                     808   BIC:                            -259.4
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
type[peanut]            2.6029      0.022    116.682      0.000       2.559       2.647
type[peanut butter]     1.8044      0.021     84.241      0.000       1.762       1.846
type[plain]             0.8673      0.018     48.910      0.000       0.833       0.902
color[T.brown]         -0.0033      0.024     -0.141      0.888      -0.049       0.043
color[T.green]          0.0410      0.023      1.754      0.080      -0.005       0.087
color[T.orange]        -0.0228      0.024     -0.932      0.351      -0.071       0.025
color[T.red]           -0.0163      0.026     -0.621      0.535      -0.068       0.035
color[T.yellow]        -0.0312      0.024     -1.301      0.194      -0.078       0.016
==============================================================================
Omnibus:                      159.700   Durbin-Watson:                   1.844
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              761.978
Skew:                           0.809   Prob(JB):                    3.46e-166
Kurtosis:                       7.449   Cond. No.                         5.79
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

josef...@gmail.com

Apr 27, 2017, 6:47:53 PM4/27/17
to pystatsmodels
The current handling is correct: one level from each categorical variable
has to be dropped to avoid a singular design.
The full dummy set for a categorical variable sums to 1 for every
observation, so it is collinear with the constant, and this holds for
each categorical variable in the regression. It also means the full
dummy sets of two categorical variables are collinear with each other,
so even with the intercept removed, one level of the second variable
still has to be dropped.
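To see the collinearity concretely, here is a small sketch with made-up two-level and three-level categoricals (a hypothetical stand-in, not the actual M&M file). Stacking the full dummy sets of both variables, with no level dropped, gives a design matrix whose rank is one less than its number of columns:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two categorical columns.
df = pd.DataFrame({
    'type':  ['plain', 'peanut', 'plain', 'peanut', 'plain', 'peanut'],
    'color': ['blue', 'red', 'green', 'blue', 'red', 'green'],
})

# Build the FULL dummy sets for both variables (no level dropped).
X = pd.concat([pd.get_dummies(df['type']),
               pd.get_dummies(df['color'])], axis=1)

# Each dummy set sums to 1 in every row, so the two sets are
# linearly dependent: 5 columns, but only rank 4 -> singular design.
print(X.shape[1])                                # 5
print(np.linalg.matrix_rank(X.to_numpy(float)))  # 4
```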

If there are interaction terms, then additional levels in the
joint/interaction dummy set have to be dropped.

patsy does this automatically in the formula handling.
(One limitation is that patsy doesn't allow us to specify an
over-parameterized model that is singular but keeps the full dummy
sets. That would be useful if, for example, we want to use constrained
least squares.)

Aside: patsy doesn't check whether the actual data produce a singular
design matrix. For example, if there are missing cells in interaction
terms, patsy still includes them as columns of zeros.
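A quick illustration of that failure mode, building the full interaction dummy set by hand with plain pandas (hypothetical data; patsy itself isn't used here). The level pair that never occurs in the data becomes an all-zero column, which makes the design matrix singular:

```python
import itertools
import pandas as pd

# Hypothetical data: the combination (a='y', b='v') never occurs.
df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': ['u', 'v', 'u']})

# Interaction dummies over the full Cartesian product of levels.
cols = {f'{a}:{b}': ((df['a'] == a) & (df['b'] == b)).astype(int)
        for a, b in itertools.product(['x', 'y'], ['u', 'v'])}
X = pd.DataFrame(cols)

# The missing cell shows up as a column of zeros.
print(X['y:v'].sum())  # 0
```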

Josef

Ryan Nelson

Apr 27, 2017, 10:11:19 PM4/27/17
to pystatsmodels
Got it! Thanks Josef! I learn something new every day ;)

Ryan