Categorical Data vs. Numerical Data in Regression Models


Daniel Hoynoski

May 25, 2018, 4:07:25 PM
to pystatsmodels
When dealing with categorical data in mathematics, dummy variables are assigned to allow "objects" to operate in formulas. However, Statsmodels handles categorical data differently from numerical data. Take the following example:

(See attachment for code)


Leaving the patsy expression to enforce categorical data, as in "C(x)", yields a different model than what I obtain from the Numpy output. However, if I remove the "C()" and just write "x", I finally get the same model as Numpy.


This makes absolutely no sense to me. Can anyone explain what Statsmodels is doing differently with categorical data?


statsmodelnumpytest.py

josef...@gmail.com

May 25, 2018, 4:18:14 PM
to pystatsmodels
patsy is handling the formulas for statsmodels

Essentially, if a column is a string or similar, then patsy automatically assumes that it is a categorical variable and creates the corresponding encoding, e.g. the dummy variables.

C(x) forces categorical encoding for any variable type, i.e. it assumes each unique number is a separate category and creates the corresponding encoding or dummy variables.

In your example x is just the constant 3, so there is only one category or level. Because of the "-1", the dummy coding for that level is a vector of 1's, i.e. it's the same as a column for the intercept.
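A quick sketch to see this directly with patsy (illustrative only, using patsy.dmatrix):

# With x constant at 3, "C(x) - 1" has a single level, so the design
# matrix is a single column of ones, the same as an intercept column.
import numpy as np
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({"x": [3, 3, 3, 3]})
print(np.asarray(dmatrix("C(x) - 1", df)))
# [[1.]
#  [1.]
#  [1.]
#  [1.]]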

BTW: It's better to copy the code as text than to attach a picture of it.

Josef
 

josef...@gmail.com

May 25, 2018, 4:21:35 PM
to pystatsmodels
about the R squared:
If there is a constant column in the design matrix, then statsmodels uses R squared relative to the model with only a constant.

Because you don't have any slope explanatory variables, the R-squared is zero, i.e. the (non-existing) slope variables explain nothing.
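In other words, a minimal (unweighted) sketch of the definition:

# rsquared = 1 - ssr / centered_tss, relative to the constant-only model
import numpy as np
y = np.array([0.23, 0.55, 0.66, 0.88])  # any response values
fitted = np.full_like(y, y.mean())      # constant-only "model"
ssr = np.sum((y - fitted)**2)
centered_tss = np.sum((y - y.mean())**2)
print(1 - ssr / centered_tss)           # 0.0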

Josef
 




Daniel Hoynoski

May 29, 2018, 11:56:27 AM
to pystatsmodels
Hey Josef,

I understand exactly what you are saying, but after I made my data a bit more diverse (i.e. no longer using a model with only a constant) I am not able to get the R² values of Numpy and Statsmodels to match. Take the following code as a quick example:

#-------------------------- Code BEGIN --------------------------#
import numpy as np
import statsmodels.formula.api as smf
import pandas as pd

# Test Data
patsy_equation "y ~ C(x) - 1" # Use minus one to get ride of hidden intercept of "+ 1"
weight = np.array([0.370.370.53, 0.754])
= np.array([0.23, 0.55, 0.660.88])
= np.array([3345])
d = {"x": x.tolist(), "y": y.tolist()}
data_df = pd.DataFrame(data=d)

# Weighted Least Squares from Statsmodel API
statsmodel_model smf.wls(formula=patsy_equation, weights=weight, data=data_df)
statsmodel_r2 = statsmodel_model.fit().rsquared      

# Weighted Least Squares from Numpy API
= np.array([[11, 0, 0],[0010],[0001]])
Aw = x.T * np.sqrt(weight[:, np.newaxis]) # Apply weights to x matrix
Bw = y * np.sqrt(weight)
numpy_model, numpy_resid = np.linalg.lstsq(Aw, Bw, rcond=None)[:2]
numpy_r2 = 1 - numpy_resid / (Bw.size * Bw.var())


print("Statsmodels Params: " + str(statsmodel_model.fit().params) + "\nStatsmodel R²: " + str(statsmodel_r2))
print("\nNumpy Params: " + str(numpy_model) + "\nNumpy R²: " + str(numpy_r2[0]))
#-------------------------- Code END --------------------------#

I tried to colorize the code text above to make it more readable, and I have also attached it as a file for convenience.
This code produces the following output:

"Statsmodels Params: C(x)[3] 0.39 
C(x)[4] 0.66 
C(x)[5] 0.88 
dtype: float64 
Statsmodel R²: 0.8259065761528327 

Numpy Params: [0.39 0.66 0.88] 
Numpy R²: 0.9086856516773343"

The two models agree with each other; however, their R² values do not. I have been reading up on this, and supposedly Statsmodels computes R² differently. Am I computing the R² from the Numpy model incorrectly? Or am I missing something?

In addition, if I remove the "-1" from the patsy equation, only one of my three parameters matches. Why is this? I thought the "-1" was just removing the hidden "+ 1" intercept?

-Dan
statsmodelnumpytest.py

Daniel Hoynoski

May 29, 2018, 12:06:51 PM
to pystatsmodels
Hey Josef,

I just noticed that my attached python file is missing the "-1" in the patsy equation. In case you were running that, slide in the "-1" to get the same results as I show above.

My question still stands: do you have any idea why the R² values do not agree?

-Dan

josef...@gmail.com

May 29, 2018, 12:37:13 PM
to pystatsmodels
On Tue, May 29, 2018 at 11:55 AM, Daniel Hoynoski <daniel....@nielsen.com> wrote:
Hey Josef,

I understand exactly what you are saying, but after I made my data a bit more diverse (i.e. no longer using a model with only a constant) I am not able to get the R² values of Numpy and Statsmodels to match. Take the following code as a quick example:

#-------------------------- Code BEGIN --------------------------#
import numpy as np
import statsmodels.formula.api as smf
import pandas as pd

# Test Data
patsy_equation = "y ~ C(x) - 1" # Use minus one to get ride of hidden intercept of "+ 1"
weight = np.array([0.37, 0.37, 0.53, 0.754])
y = np.array([0.23, 0.55, 0.66, 0.88])
x = np.array([3, 3, 4, 5])
d = {"x": x.tolist(), "y": y.tolist()}
data_df = pd.DataFrame(data=d)

# Weighted Least Squares from Statsmodel API
statsmodel_model = smf.wls(formula=patsy_equation, weights=weight, data=data_df)
statsmodel_r2 = statsmodel_model.fit().rsquared      

# Weighted Least Squares from Numpy API
x = np.array([[1, 1, 0, 0],[0, 0, 1, 0],[0, 0, 0, 1]])
Aw = x.T * np.sqrt(weight[:, np.newaxis]) # Apply weights to x matrix
Bw = y * np.sqrt(weight)
numpy_model, numpy_resid = np.linalg.lstsq(Aw, Bw, rcond=None)[:2]
numpy_r2 = 1 - numpy_resid / (Bw.size * Bw.var())


Using Bw.var() is wrong, or at least not what statsmodels is doing.

I don't remember the math right now.
Essentially, the mean that goes into Bw.var needs to be correctly weighted.

statsmodels is using this for centered_tss instead of Bw.var():

        if weights is not None:
            return np.sum(weights * (
                model.endog - np.average(model.endog, weights=weights))**2)
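So in your numpy version, something along these lines should reproduce the statsmodels number (a sketch, reusing weight, y, and numpy_resid from your script):

w_mean = np.average(y, weights=weight)           # weighted mean of the response
centered_tss = np.sum(weight * (y - w_mean)**2)  # weighted centered total sum of squares
numpy_r2 = 1 - numpy_resid[0] / centered_tss     # should match rsquared (about 0.826)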




 


print("Statsmodels Params: " + str(statsmodel_model.fit().params) + "\nStatsmodel R²: " + str(statsmodel_r2))
print("\nNumpy Params: " + str(numpy_model) + "\nNumpy R²: " + str(numpy_r2[0]))
#-------------------------- Code END --------------------------#

I tried to colorize the code text above to make it more readable. I have attached it for added usability. 
This code produces the following output:

"Statsmodels Params: C(x)[3] 0.39 
C(x)[4] 0.66 
C(x)[5] 0.88 
dtype: float64 
Statsmodel R²: 0.8259065761528327 

Numpy Params: [0.39 0.66 0.88] 
Numpy R²: 0.9086856516773343"

The two models agree with each other; however, their R² values do not. I have been reading up on this, and supposedly Statsmodels computes R² differently. Am I computing the R² from the Numpy model incorrectly? Or am I missing something?

In addition, if I remove the "-1" from the patsy equation, only one of my three parameters matches. Why is this? I thought the "-1" was just removing the hidden "+ 1" intercept?

The parameterization of categorical variables differs depending on whether the -1 is there or not (if there is only one categorical variable, or for the first categorical variable).

Adding -1, i.e. removing the constant, requires that all levels are included. The estimated params are then the means for the different levels.

Without -1, i.e. including the constant, patsy drops one reference level to avoid a singular design matrix. The constant is then the mean of the response for the reference level, and the other coefficients are the differences in means relative to the reference level.
You can check this by adding the constant to the other coefficients, then you should get back the "-1" parameters.
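A rough sketch of that check with your data (the parameter names are what patsy's default treatment coding produces):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"x": [3, 3, 4, 5], "y": [0.23, 0.55, 0.66, 0.88]})
w = np.array([0.37, 0.37, 0.53, 0.754])

no_const = smf.wls("y ~ C(x) - 1", weights=w, data=df).fit().params
with_const = smf.wls("y ~ C(x)", weights=w, data=df).fit().params

# Intercept is the mean of the reference level (x=3); the other coefficients
# are differences from it, so adding them back recovers the level means.
print(no_const.values)
print(with_const["Intercept"] + np.array([0.0, with_const["C(x)[T.4]"], with_const["C(x)[T.5]"]]))
# both should be [0.39, 0.66, 0.88]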

Josef

 

-Dan

Daniel Hoynoski

May 29, 2018, 2:12:01 PM
to pystatsmodels
Hey Josef,

Thank you very much for your help. It looks like you were right: I wasn't taking the weights into account when calculating the R² value.

I have another question. When I start extending the patsy equation to incorporate more than one categorical variable, I am a bit perplexed (i.e. confused) by the input matrix produced by patsy.
For example, suppose I have the following data:

patsy_equation = "y ~ C(x) + C(z) - 1"
weight = np.array([0.37, 0.37, 0.53, 0.754])
y = np.array([0.23, 0.55, 0.66, 0.88])
x = np.array([3, 3, 4, 5])
z = np.array([9, 10, 10, 11])
d = {"x": x.tolist(), "z": z.tolist(), "y": y.tolist()}

Why does patsy (i.e. from patsy import dmatrix) create the following input matrix?:
[[1. 0. 0. 0. 0.]
 [1. 0. 0. 1. 0.]
 [0. 1. 0. 1. 0.]
 [0. 0. 1. 0. 1.]]
  3  4  5  10 11  <-- represents

I understand the first three columns represent "C(x)" (i.e. 3, 4, and 5) and the last two columns represent "10" and "11" respectively, but why is the value "9" skipped? 

Why is the matrix not like the following?:
[[1. 0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 1.]]
  3  4  5  9  10 11  <-- represents

I have attached some working code below to help illustrate what I am talking about.
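For reference, a minimal call that produces the matrix above is roughly:

import numpy as np
import pandas as pd
from patsy import dmatrix

d = {"x": [3, 3, 4, 5], "z": [9, 10, 10, 11], "y": [0.23, 0.55, 0.66, 0.88]}
print(np.asarray(dmatrix("C(x) + C(z) - 1", pd.DataFrame(d))))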

-Dan
morecategoricalvariables.py

josef...@gmail.com

May 29, 2018, 2:19:30 PM
to pystatsmodels
look up "dummy variable trap"

The first 3 columns and the last 3 columns (in your proposed 6-column matrix) each add up to a column of ones, so they are perfectly collinear; the matrix rank is smaller than the number of columns.

To prevent this, patsy removes a reference level when needed.

That's why I put the qualifier "(if there is only one categorical or for the first categorical variable)" in my previous message, to be correct without having to elaborate.
I.e. the -1 doesn't affect the second categorical, because the first categorical already sums to a constant, and patsy needs to drop columns for all further categoricals to avoid collinearity.
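A quick way to see the redundancy in your second (6-column) matrix:

import numpy as np
X_full = np.array([[1, 0, 0, 1, 0, 0],
                   [1, 0, 0, 0, 1, 0],
                   [0, 1, 0, 0, 1, 0],
                   [0, 0, 1, 0, 0, 1]], dtype=float)
print(X_full[:, :3].sum(axis=1))      # [1. 1. 1. 1.]
print(X_full[:, 3:].sum(axis=1))      # [1. 1. 1. 1.]  -> same vector, perfect collinearity
print(np.linalg.matrix_rank(X_full))  # 4, less than the number of columns (6)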

Josef

Daniel Hoynoski

May 29, 2018, 2:54:11 PM
to pystatsmodels
Hey Josef,

Thank you, I understand why it dropped a column, but how does patsy decide which column to drop?

Taking from the previous example and code, patsy gives the answer:
[[1. 0. 0. 0. 0.]
 [1. 0. 0. 1. 0.]
 [0. 1. 0. 1. 0.]
 [0. 0. 1. 0. 1.]]
  3  4  5  10 11  <-- Reference(s)

However, under your explanation and the usual strategies for avoiding the dummy variable trap, I could instead include the 9 from "C(z)" and drop the 11:
[[1. 0. 0. 1. 0.]
 [1. 0. 0. 0. 1.]
 [0. 1. 0. 0. 1.]
 [0. 0. 1. 0. 0.]]
  3  4  5  9  10  <-- Reference(s)

However, this gives me a new set of parameters (i.e. a new best-fit line). The R² values did stay the same, though.

How does patsy decide which column to get rid of? Are the above matrices basically the same at the end of the day?

-Dan

Alan G. Isaac

May 29, 2018, 5:42:39 PM
to pystat...@googlegroups.com
http://patsy.readthedocs.io/en/latest/overview.html
http://patsy.readthedocs.io/en/latest/overview.html#contact

On 5/29/2018 2:54 PM, Daniel Hoynoski wrote:
> *how does patsy decide which column to drop?*

Nathaniel Smith

May 29, 2018, 7:39:25 PM
to pystatsmodels
On Tue, May 29, 2018 at 11:54 AM, Daniel Hoynoski
<daniel....@nielsen.com> wrote:
> How does patsy decide which column to get rid of?

If you want the full details, they're here:

https://patsy.readthedocs.io/en/latest/formulas.html
(especially: https://patsy.readthedocs.io/en/latest/formulas.html#redundancy-and-categorical-factors)

https://patsy.readthedocs.io/en/latest/categorical-coding.html

-n