I have a problem with dropped categorical variable levels in ols, which I hope has a very simple answer. I have a DataFrame with two columns of text-based categorical variables. As per the docs, one category should be dropped and absorbed into the intercept, but it seems like one level from each of the two categorical columns is being removed. After adding "-1" to my formula to remove the intercept, I'm still missing one level. Below is a self-contained example using some M&M data from the internet. The summary output is also shown below, and "color[T.blue]" is missing from the table. I'm using statsmodels version 0.8.0 and Python 3.5 from the Anaconda distribution.
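A minimal sketch of this kind of setup, assuming synthetic stand-in data rather than the M&M dataset. The column names `mass`, `type`, and `color` match the summary output; the values, the sample size, and the `base_mass` lookup are made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the M&M dataset): every type/color pair, repeated.
types = ["peanut", "peanut butter", "plain"]
colors = ["blue", "brown", "green", "orange", "red", "yellow"]
rows = [(t, c) for t in types for c in colors] * 10
df = pd.DataFrame(rows, columns=["type", "color"])

# Hypothetical mean mass per type, plus a little noise.
base_mass = {"peanut": 2.6, "peanut butter": 1.8, "plain": 0.9}
rng = np.random.RandomState(0)
df["mass"] = df["type"].map(base_mass) + rng.normal(scale=0.1, size=len(df))

# "- 1" removes the intercept from the formula.
model = smf.ols("mass ~ type + color - 1", data=df).fit()

# Names of the fitted terms, in design-matrix order.
param_names = list(model.params.index)
print(param_names)
```

With this formula, the printed names should contain all three `type[...]` levels but only five `color[T....]` terms, which mirrors the coefficient table in the summary.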
                            OLS Regression Results
==============================================================================
Dep. Variable:                   mass   R-squared:                       0.922
Model:                            OLS   Adj. R-squared:                  0.921
Method:                 Least Squares   F-statistic:                     1361.
Date:                Thu, 27 Apr 2017   Prob (F-statistic):               0.00
Time:                        18:06:42   Log-Likelihood:                 156.53
No. Observations:                 816   AIC:                            -297.1
Df Residuals:                     808   BIC:                            -259.4
Df Model:                           7
Covariance Type:            nonrobust
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
type[peanut]            2.6029      0.022    116.682      0.000       2.559       2.647
type[peanut butter]     1.8044      0.021     84.241      0.000       1.762       1.846
type[plain]             0.8673      0.018     48.910      0.000       0.833       0.902
color[T.brown]         -0.0033      0.024     -0.141      0.888      -0.049       0.043
color[T.green]          0.0410      0.023      1.754      0.080      -0.005       0.087
color[T.orange]        -0.0228      0.024     -0.932      0.351      -0.071       0.025
color[T.red]           -0.0163      0.026     -0.621      0.535      -0.068       0.035
color[T.yellow]        -0.0312      0.024     -1.301      0.194      -0.078       0.016
==============================================================================
Omnibus:                      159.700   Durbin-Watson:                   1.844
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              761.978
Skew:                           0.809   Prob(JB):                    3.46e-166
Kurtosis:                       7.449   Cond. No.                         5.79
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.