Dropped categorical variable in ols


Ryan Nelson

Apr 27, 2017, 6:20:38 PM4/27/17
to pystatsmodels
I have a question about dropped categorical variable levels in ols, which I hope has a very simple answer. I have a DataFrame with two columns of text-based categorical variables. As per the docs (http://www.statsmodels.org/devel/example_formulas.html), one category should be dropped to make room for an intercept, but it seems like one level is being removed from both categorical columns. Even after adding a "-1" to my formula to remove the intercept, I'm still missing one level. Below is a self-contained example using some M&M data from the internet. The summary output is also shown below; "color[T.blue]" is missing from the table. I'm using statsmodels version 0.8.0 and Python 3.5 from the Anaconda distro.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('http://stat.pugetsound.edu/hoard/datasets/mms.csv')
formula = 'mass ~ type + color - 1'
fit = smf.ols(formula, data=df).fit()
print(fit.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   mass   R-squared:                       0.922
Model:                            OLS   Adj. R-squared:                  0.921
Method:                 Least Squares   F-statistic:                     1361.
Date:                Thu, 27 Apr 2017   Prob (F-statistic):               0.00
Time:                        18:06:42   Log-Likelihood:                 156.53
No. Observations:                 816   AIC:                            -297.1
Df Residuals:                     808   BIC:                            -259.4
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
type[peanut]            2.6029      0.022    116.682      0.000       2.559       2.647
type[peanut butter]     1.8044      0.021     84.241      0.000       1.762       1.846
type[plain]             0.8673      0.018     48.910      0.000       0.833       0.902
color[T.brown]         -0.0033      0.024     -0.141      0.888      -0.049       0.043
color[T.green]          0.0410      0.023      1.754      0.080      -0.005       0.087
color[T.orange]        -0.0228      0.024     -0.932      0.351      -0.071       0.025
color[T.red]           -0.0163      0.026     -0.621      0.535      -0.068       0.035
color[T.yellow]        -0.0312      0.024     -1.301      0.194      -0.078       0.016
==============================================================================
Omnibus:                      159.700   Durbin-Watson:                   1.844
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              761.978
Skew:                           0.809   Prob(JB):                    3.46e-166
Kurtosis:                       7.449   Cond. No.                         5.79
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

josef...@gmail.com

Apr 27, 2017, 6:47:53 PM4/27/17
to pystatsmodels
The current handling is correct: one level from each categorical variable
has to be dropped to avoid a singular design.
The full dummy set for a categorical variable sums to 1 for every
observation, so it is collinear with the constant, and this holds for
each categorical variable in the regression. It also means the full
dummy sets of two categorical variables are collinear with each other,
so even with the intercept removed, one level of the second variable
still has to be dropped.
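To see the collinearity concretely, here is a small sketch with made-up two-level and three-level categoricals (a hypothetical stand-in, not the actual M&M file). Stacking the full dummy sets of both variables, with no level dropped, gives a design matrix whose rank is one less than its number of columns:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two categorical columns.
df = pd.DataFrame({
    'type':  ['plain', 'peanut', 'plain', 'peanut', 'plain', 'peanut'],
    'color': ['blue', 'red', 'green', 'blue', 'red', 'green'],
})

# Build the FULL dummy sets for both variables (no level dropped).
X = pd.concat([pd.get_dummies(df['type']),
               pd.get_dummies(df['color'])], axis=1)

# Each dummy set sums to 1 in every row, so the two sets are
# linearly dependent: 5 columns, but only rank 4 -> singular design.
print(X.shape[1])                                # 5
print(np.linalg.matrix_rank(X.to_numpy(float)))  # 4
```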

If there are interaction terms, then additional levels in the
joint/interaction dummy set have to be dropped.

patsy does this automatically in the formula handling.
(One limitation is that patsy doesn't allow us to specify an
over-parameterized model that is singular but keeps the full dummy
sets. That would be useful if, for example, we want to use constrained
least squares.)

Aside: patsy doesn't check whether the actual data produce a singular
design matrix. For example, if there are missing cells in interaction
terms, patsy still includes them as columns of zeros.
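A quick illustration of that failure mode, building the full interaction dummy set by hand with plain pandas (hypothetical data; patsy itself isn't used here). The level pair that never occurs in the data becomes an all-zero column, which makes the design matrix singular:

```python
import itertools
import pandas as pd

# Hypothetical data: the combination (a='y', b='v') never occurs.
df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': ['u', 'v', 'u']})

# Interaction dummies over the full Cartesian product of levels.
cols = {f'{a}:{b}': ((df['a'] == a) & (df['b'] == b)).astype(int)
        for a, b in itertools.product(['x', 'y'], ['u', 'v'])}
X = pd.DataFrame(cols)

# The missing cell shows up as a column of zeros.
print(X['y:v'].sum())  # 0
```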

Josef

Ryan Nelson

Apr 27, 2017, 10:11:19 PM4/27/17
to pystatsmodels
Got it! Thanks Josef! I learn something new every day ;)

Ryan