Accessing predicted endog


david.wa...@cimenviro.com

Nov 14, 2018, 5:24:25 PM
to pystatsmodels
Hi

I'm an R user moving to Python, so I've been learning the statsmodels API. I'm having difficulty accessing the endog data after doing a prediction, e.g.:

import numpy as np
import statsmodels.formula.api as smf
import statsmodels.tools as smt
import statsmodels.tools.eval_measures  # makes smt.eval_measures available

ols = smf.ols("np.log(demand) ~ temperature", data=train_data).fit()

# in-sample: ols.model.endog holds the training endog
y_train_pred = ols.predict(train_data)
y_train = ols.model.endog
print('train error (rmse): \n', np.round(np.sqrt(ols.mse_resid), 2))

# out-of-sample: this is where the problem shows up
y_test_pred = ols.predict(test_data)
y_test = ols.model.endog
print('test error (rmse): \n', np.round(np.sqrt(smt.eval_measures.mse(y_test, y_test_pred)), 2))

In the last block, y_test is wrong: ols.model.endog still holds the original training endog, not anything derived from test_data. I've deliberately transformed endog in the example to show why it's important to access the transformed endog rather than use test_data.demand directly, but it also matters if ols.predict() drops any rows, e.g. due to NAs.

In R, for example, predict() returns a prediction object containing the endog / exog and fitted values for the original data, and predict(newdata) updates the endog / exog and fitted values for the new data. Is there some way of doing this with statsmodels?

Regards
David

josef...@gmail.com

Nov 14, 2018, 7:17:20 PM
to pystat...@googlegroups.com
If I understand correctly (I have never looked at how R handles your last question):

statsmodels currently has no general support for endog transformations, except in a few models where the model itself does the transformation.
The models receive the processed endog and exog from patsy when the formula interface is used. We reuse patsy for the exog transformation in predict, but predict does not know anything about what the corresponding observed endog for new data would be, since the new data is not part of the training data.

Also, statsmodels never updates the training endog and exog to include new observations. (Exceptions to this might exist in some time series models that allow updating and extending the dataset.)

Given that there is no specific support for endog transformations, the easiest approach, and the one usually taken, is for the user to create a new variable, e.g. log_demand or log_price, use it as a regular variable, and apply the transformation before splitting (or apply the same transformation to both train and test data).
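A minimal sketch of that workaround, reusing the train_data / test_data names from earlier in the thread (log_demand is just an illustrative column name):

import numpy as np
import statsmodels.formula.api as smf

# transform explicitly, consistently for both train and test data
train_data["log_demand"] = np.log(train_data["demand"])
test_data["log_demand"] = np.log(test_data["demand"])

res = smf.ols("log_demand ~ temperature", data=train_data).fit()

# the observed test endog is now just a regular column
y_test = test_data["log_demand"]
y_test_pred = res.predict(test_data)
rmse = np.sqrt(np.mean((y_test - y_test_pred) ** 2))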

I don't know if it would even be possible to support this at the moment. Transformations are handled by patsy, and I never looked at what information is available for endog. But AFAIK patsy just `eval`uates the LHS expression and might not know what the function does. There might, however, be a way to evaluate the same function on new data.
patsy is increasing the amount of formula information that is available about the design matrix, and we are slowly adding support for it, but there is nothing about endog transformations yet.

Issues about endog transformations have shown up several times, but it is not clear what we should support and how. There are some starts in time series models, which need to be linear. For non-time-series models there is the alternative of using a GLM link function, which avoids the endog transformation and instead models E(y|x) = exp(x b), as in Poisson.
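A minimal sketch of that GLM alternative (assuming a recent statsmodels where the log link class is spelled Log; older versions spell it log):

import statsmodels.api as sm
import statsmodels.formula.api as smf

# models E(demand | temperature) = exp(b0 + b1 * temperature);
# demand itself stays untransformed, so predict() is on the original scale
res = smf.glm("demand ~ temperature", data=train_data,
              family=sm.families.Gaussian(link=sm.families.links.Log())).fit()
pred = res.predict(test_data)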

(A somewhat related aside: most of the time it's easier to figure out what can be supported in general by starting with specific models. For example, log transformation of endog shows up in parametric survival models (accelerated failure time models). In that case there is an option to either look at the Gaussian model for log Y or at the log-normal model for Y. That is still only at an experimental stage.)


Josef

 


david.wa...@cimenviro.com

Nov 14, 2018, 9:48:59 PM
to pystatsmodels
Thanks Josef

I'll follow your suggestion and explicitly transform the endog, although it would be nice to be able to do it implicitly.

I just want to confirm one thing: below is how I've been using sklearn + patsy. Can you confirm this is how statsmodels operates? I believe the build_design_matrices() call is important because cr() is stateful.

Regards
David

import numpy as np
from patsy import dmatrices, build_design_matrices
from sklearn.linear_model import LinearRegression

y, X = dmatrices("np.log(demand) ~ cr(temperature, df=3)", data)
lm = LinearRegression().fit(X, y)
lm.predict(X)

...

# build_design_matrices returns a list of matrices, so unpack it
(X_new,) = build_design_matrices([X.design_info], newdata)
lm.predict(X_new)

Note: if newdata is a test / dev set (i.e. it contains a demand column) you can do the following to transform newdata into X and y

y_new, X_new = build_design_matrices([y.design_info, X.design_info], newdata)
lm.predict(X_new)
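For completeness, the test-error computation this enables might look like the following (a sketch; np.asarray is used because build_design_matrices returns DesignMatrix objects):

import numpy as np

y_new, X_new = build_design_matrices([y.design_info, X.design_info], newdata)
resid = np.asarray(y_new).ravel() - lm.predict(X_new).ravel()
print("test error (rmse):", np.round(np.sqrt(np.mean(resid ** 2)), 2))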

josef...@gmail.com

Nov 15, 2018, 12:10:18 AM
to pystat...@googlegroups.com
On Wed, Nov 14, 2018 at 9:49 PM <david.wa...@cimenviro.com> wrote:
Thanks Josef

I'll follow your suggestion and explicitly transform the endog, although it would be nice to be able to do it implicitly.

I just want to confirm one thing: below is how I've been using sklearn + patsy. Can you confirm this is how statsmodels operates? I believe the build_design_matrices() call is important because cr() is stateful.

Yes, all splines are stateful transformations, and we store the design_info in the model.data instance so we can reuse it in predict.
If users go through the formula interface to the model, then patsy handles the stateful transforms, and statsmodels just delegates to it in predict.

One consequence is that, to "interpret" the design matrix (or the endog transformation in this case), we only have the information that patsy provides to us in the `design_info`.
This is now pretty good for categorical factors, and we also reuse it in some Wald tests. There is some information about the splines in design_info, but it is not used in the GAM PR.
But there is no way for statsmodels to know that the endog transformation is a log, and that we could use our log-link support for it.
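To make that delegation concrete, a minimal sketch (reusing train_data / test_data from earlier in the thread; predictions come back on the transformed LHS scale):

import numpy as np
import statsmodels.formula.api as smf

# patsy records the stateful cr() basis in design_info at fit time
res = smf.ols("np.log(demand) ~ cr(temperature, df=3)", data=train_data).fit()

# predict() rebuilds the spline basis for the new exog from the stored design_info
pred_log = res.predict(test_data)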

A related problem is that the code in patsy is focused on constructing the design matrix, which is not enough for us to support extras. If we need to provide statistical support, for example for a log-transformed variable, then we would need additional methods similar to the ones available in our link classes, specifically inverse transforms and derivatives.
Another example: GAM includes wrappers of, or copies code from, patsy's splines because we need to add derivatives and similar methods to support penalization.
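Purely for illustration, such a transform object might look like the following (the class and method names here are invented, mirroring the interface of statsmodels' link classes):

import numpy as np

class LogTransform:
    # hypothetical endog transform carrying the extras a link class provides
    def __call__(self, y):
        return np.log(y)   # forward transform, as patsy would evaluate it
    def inverse(self, z):
        return np.exp(z)   # back-transform predictions to the original scale
    def deriv(self, y):
        return 1.0 / y     # derivative, e.g. for delta-method standard errors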

In R, formulas have a much longer history and are much more tightly integrated with the models, while our direct interface to models is DataFrames or just numpy arrays, or anything `array_like`. In the array case we just get numbers and have no information at all about whether there is structure behind the data.
And one more example:
It would be very helpful to provide some automatic support for polynomials in the design matrix, i.e. multiple columns that are a transformation of some underlying explanatory variable. Right now there is no way for the models to get that information in a general way, and it will take design and implementation work before we can add those features.
(GAM requires that penalized splines are provided as separate argument and not as part of the design matrix, so we know exactly which splines to support.)

 
Regards
David

y, X = dmatrices("np.log(demand) ~ cr(temperature, df=3)", data)
lm = LinearRegression().fit(X, y)
lm.predict(X)

...

# build_design_matrices returns a list of matrices, so unpack it
(X_new,) = build_design_matrices([X.design_info], newdata)
lm.predict(X_new)

Note: if newdata is a test / dev set (i.e. it contains a demand column) you can do the following to transform newdata into X and y

y_new, X_new = build_design_matrices([y.design_info, X.design_info], newdata)
lm.predict(X_new)


We use dmatrix directly:

exog = dmatrix(design_info, exog, return_type="dataframe")

and I used it the same way in a GAM example that doesn't have formula support yet.

AFAIK, we still use the code that was initially added by Skipper and Nathaniel, and I just keep copying the same pattern.

I didn't know about using it for endog as well, because that is not used in the statsmodels code.

Josef