(See attachment for code)
Leaving the patsy expression to enforce categorical data, as in "C(x)", yields a different model than what I obtain from the NumPy output. However, if I remove the "C()" and just write "x", I finally get the same model as NumPy.
This makes absolutely no sense to me. Can anyone explain what Statsmodels is doing differently with categorical data?
Josef
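For anyone following along, the difference between "x" and "C(x)" can be inspected directly in the design matrices that patsy builds. A small sketch (the toy data here is illustrative, not from the attachment):

```python
import numpy as np
from patsy import dmatrix

x = np.array([3, 3, 4, 5])

# Numeric "x": a single slope column next to the intercept.
numeric = dmatrix("x", {"x": x})
print(numeric.design_info.column_names)  # ['Intercept', 'x']

# Categorical "C(x)": one dummy column per level, minus the reference level.
categorical = dmatrix("C(x)", {"x": x})
print(categorical.design_info.column_names)  # ['Intercept', 'C(x)[T.4]', 'C(x)[T.5]']
```

With "x", the model fits one slope over the numeric values 3, 4, 5; with "C(x)", each level gets its own column, so the fitted parameters (and the model) are different unless the relationship happens to be exactly linear in x.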
Hey Josef,

I understood exactly what you are saying, but after I made my data a bit more diverse (i.e. no longer using the model with only a constant) I am not able to get the R² values of NumPy and Statsmodels to match. Take the following code as a quick example:

#-------------------------- Code BEGIN --------------------------#
import numpy as np
import statsmodels.formula.api as smf
import pandas as pd

# Test Data
patsy_equation = "y ~ C(x) - 1"  # Use minus one to get rid of the hidden "+ 1" intercept
weight = np.array([0.37, 0.37, 0.53, 0.754])
y = np.array([0.23, 0.55, 0.66, 0.88])
x = np.array([3, 3, 4, 5])
d = {"x": x.tolist(), "y": y.tolist()}
data_df = pd.DataFrame(data=d)

# Weighted Least Squares from Statsmodels API
statsmodel_model = smf.wls(formula=patsy_equation, weights=weight, data=data_df)
statsmodel_r2 = statsmodel_model.fit().rsquared

# Weighted Least Squares from NumPy API
x = np.array([[1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]])
Aw = x.T * np.sqrt(weight[:, np.newaxis])  # Apply weights to x matrix
Bw = y * np.sqrt(weight)
numpy_model, numpy_resid = np.linalg.lstsq(Aw, Bw, rcond=None)[:2]
numpy_r2 = 1 - numpy_resid / (Bw.size * Bw.var())

print("Statsmodels Params: " + str(statsmodel_model.fit().params) + "\nStatsmodel R²: " + str(statsmodel_r2))
print("\nNumpy Params: " + str(numpy_model) + "\nNumpy R²: " + str(numpy_r2[0]))
#-------------------------- Code END --------------------------#

I tried to colorize the code text above to make it more readable. I have attached it for added usability.

This code produces the following output:

"Statsmodels Params:
C(x)[3]    0.39
C(x)[4]    0.66
C(x)[5]    0.88
dtype: float64
Statsmodel R²: 0.8259065761528327

Numpy Params: [0.39 0.66 0.88]
Numpy R²: 0.9086856516773343"

The two models agree with each other; however, their R² values do not. I have been reading up on this, and supposedly Statsmodels computes R² differently. Am I computing the R² from the NumPy model incorrectly? Or am I missing something?

In addition, if I remove "-1" from the patsy equation, only one of my three parameters matches. Why is this? I thought the "-1" was just removing the "hidden +1 intercept"?

-Dan
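For reference, the gap between the two R² values above can be reproduced by hand. Statsmodels' rsquared for WLS centers y at its *weighted* mean and weights the total sum of squares, whereas the NumPy snippet centers the whitened Bw at its ordinary mean; both use the same residual sum of squares. A sketch with the same data as above (the helper arithmetic here is my own, not statsmodels internals):

```python
import numpy as np

weight = np.array([0.37, 0.37, 0.53, 0.754])
y = np.array([0.23, 0.55, 0.66, 0.88])
x = np.array([3, 3, 4, 5])

# Fitted values for "y ~ C(x) - 1" are the weighted group means per level.
levels = np.array([3, 4, 5])
means = np.array([np.average(y[x == lvl], weights=weight[x == lvl]) for lvl in levels])
fitted = means[np.searchsorted(levels, x)]

# Weighted residual sum of squares -- identical for both computations.
ssr = np.sum(weight * (y - fitted) ** 2)

# Statsmodels-style R²: total SS centered at the weighted mean of y.
ybar_w = np.average(y, weights=weight)
tss_w = np.sum(weight * (y - ybar_w) ** 2)
r2_statsmodels = 1 - ssr / tss_w      # ≈ 0.8259

# The NumPy snippet's R²: total SS of whitened Bw around its plain mean.
Bw = y * np.sqrt(weight)
r2_numpy = 1 - ssr / np.sum((Bw - Bw.mean()) ** 2)  # ≈ 0.9087
```

So neither number is "wrong" for its own definition; they simply disagree on what the total sum of squares of the null (constant-only) model is once weights are involved.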
[[1. 0. 0. 0. 0.]
 [1. 0. 0. 1. 0.]
 [0. 1. 0. 1. 0.]
 [0. 0. 1. 0. 1.]]
  3  4  5 10 11   <-- represents

[[1. 0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 1.]]
  3  4  5  9 10 11   <-- represents

[[1. 0. 0. 0. 0.]
 [1. 0. 0. 1. 0.]
 [0. 1. 0. 1. 0.]
 [0. 0. 1. 0. 1.]]
  3  4  5 10 11   <-- Reference(s)

[[1. 0. 0. 1. 0.]
 [1. 0. 0. 0. 1.]
 [0. 1. 0. 0. 1.]
 [0. 0. 1. 0. 0.]]
  3  4  5  9 10   <-- Reference(s)
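The reference-level behavior illustrated by these matrices is also why dropping the "-1" makes only one parameter "match": with an intercept present, patsy switches to treatment (dummy) coding, so the intercept absorbs the first level's mean and the remaining coefficients become differences from it. A sketch with the thread's data:

```python
import numpy as np
from patsy import dmatrix

x = np.array([3, 3, 4, 5])

# Without the intercept ("- 1"): one indicator column per level,
# so the fitted parameters are the group means 0.39, 0.66, 0.88.
full = dmatrix("C(x) - 1", {"x": x})
print(full.design_info.column_names)   # ['C(x)[3]', 'C(x)[4]', 'C(x)[5]']

# With the implicit intercept: level 3 becomes the reference category,
# so the fit gives Intercept = 0.39, C(x)[T.4] = 0.66 - 0.39,
# C(x)[T.5] = 0.88 - 0.39 -- the same model, reparameterized.
treat = dmatrix("C(x)", {"x": x})
print(treat.design_info.column_names)  # ['Intercept', 'C(x)[T.4]', 'C(x)[T.5]']
```

The "-1" does remove the intercept, but removing it also changes the coding of C(x) from treatment contrasts to a full set of indicators, which is why the reported parameter values change even though the fitted values do not.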