TSA: usability / polish


Chad Fulton

Jan 6, 2017, 9:06:33 PM
to Statsmodels Mailing List
I'm going to start working towards improving the usability / polish of some of the TSA package (at least starting with state space models where I know the code well, but maybe others too). If there are things that people know that don't work, let me know.

Although little things are also important, here are two examples of bigger things I recently was reminded of when I was doing some forecasting with tsa models.

- Model selection (esp. information criteria, maybe including pseudo-out-of-sample RMSE or something).

- Dates for `fit` vs dates of given data: I wonder if it would pay to be more flexible with the data for tsa models. For example, we may want to fit the model on a subset of the available date range (for example for pseudo out-of-sample forecasting), or we may want to allow passing in "too much" `exog` data, so that we don't have to worry about passing new exog back in when forecasting.

I find it so cumbersome to have to split my exog so that it fits into the model, and then try and figure out how to grab just the part I need for forecasting.

I have created a project for this topic https://github.com/statsmodels/statsmodels/projects/3 (hopefully helpful for organization, although I still don't quite understand how projects work).

I'm hoping that as people run across things that may work but not very "nicely", or if there are common tasks that are annoying to perform, that we could collect those things and hopefully improve some of them.

Chad

josef...@gmail.com

Jan 7, 2017, 12:04:09 AM
to pystatsmodels
On Fri, Jan 6, 2017 at 9:06 PM, Chad Fulton <chadf...@gmail.com> wrote:
> I'm going to start working towards improving the usability / polish of some of the TSA package (at least starting with state space models where I know the code well, but maybe others too). If there are things that people know that don't work, let me know.
>
> Although little things are also important, here are two examples of bigger things I recently was reminded of when I was doing some forecasting with tsa models.
>
> - Model selection (esp. information criteria, maybe including pseudo-out-of-sample RMSE or something).

Better automatic or well-supported model selection would be in high demand, based on questions on Stack Overflow and other places.
 

> - Dates for `fit` vs dates of given data: I wonder if it would pay to be more flexible with the data for tsa models. For example, we may want to fit the model on a subset of the available date range (for example for pseudo out-of-sample forecasting), or we may want to allow passing in "too much" `exog` data, so that we don't have to worry about passing new exog back in when forecasting.


I have also wished for this a few times for backtesting-type forecast evaluation; however, see the next item.

> I find it so cumbersome to have to split my exog so that it fits into the model, and then try and figure out how to grab just the part I need for forecasting.

The main question I see here is whether this should be included in the models or live outside them, like the cross-validation framework of scikit-learn.
Adding it to the models would make them more complex, and IMO that is only worth it if we can get additional computational savings. If there is not much performance gain, then better, more user-friendly external support, as in cross-validation, would be preferable.

Related: repeated forecasting without re-estimating the parameters each time has also been requested for a long time. Based on https://mail.google.com/mail/u/0/?tab=wm#label/pystatsmodels_maint/158f9035c0f44306 it looks like it can be done with the current implementation of the statespace models with some helper functions, without needing additional code inside the models.
 

> I have created a project for this topic https://github.com/statsmodels/statsmodels/projects/3 (hopefully helpful for organization, although I still don't quite understand how projects work).
>
> I'm hoping that as people run across things that may work but not very "nicely", or if there are common tasks that are annoying to perform, that we could collect those things and hopefully improve some of them.

One item that might be useful is support for transfer function models, i.e. lag polynomials in the exog. But I'm not sure whether that is removing an annoyance or a feature request.

Another possible item is more built-in support for nonlinear transformations of endog (we have a Box-Cox pull request that I had forgotten about since summer). I don't know where this belongs in the tsa models.

Josef
 

> Chad

Dave Hirschfeld

Jan 11, 2017, 7:32:06 PM
to pystatsmodels
Cross-validation for timeseries models would be a very welcome addition.

IIUC, it may require in-depth knowledge of the model structure, so it may be difficult to implement generically outside of statsmodels?



-Dave

Chad Fulton

Jan 12, 2017, 7:12:54 PM
to Statsmodels Mailing List


On Wed, Jan 11, 2017 at 7:32 PM, Dave Hirschfeld <nov...@gmail.com> wrote:
> Cross-validation for timeseries models would be a very welcome addition.
>
> IIUC, it may require in-depth knowledge of the model structure so may be difficult to implement generically, outside of statsmodels?
>
> -Dave


Thanks for the suggestion. I only skimmed the paper and it sounds interesting, but it also looked like it wasn't a general approach - maybe just for ARMA models? If you've read the paper more carefully, do you know how general their method is?

Chad

josef...@gmail.com

Jan 13, 2017, 9:10:20 AM
to pystatsmodels
Also from a quick skimming:
It's purely for autoregressive processes with a finite number of lags, small relative to the sample size AFAIU, including nonlinear models.
It doesn't apply to ARMA or exponential smoothing, nor to most of our TSA models.

The basic idea is to leave out rows when lagged values are among the regressors, as for example when estimating an AR(p) by OLS (the regressor matrix is a lagmat, as in VAR for the multivariate case).

(aside: I was thinking of leaving out rows like this for outlier-robust estimation, but didn't look for it in references.)

In terms of implementation for statespace models, an approach to try out would be to mark the left-out segment as missing/NaN in the estimation (or, if available, set weights=0). I have no idea what the theoretical properties are (future observations still contain information about the missing values in the non-purely-autoregressive case).

Josef

> Chad

Brock Mendel

Feb 28, 2017, 2:13:44 PM
to pystatsmodels


On Friday, January 6, 2017 at 6:06:33 PM UTC-8, Chad Fulton wrote:
> I'm going to start working towards improving the usability / polish of some of the TSA package (at least starting with state space models where I know the code well, but maybe others too). If there are things that people know that don't work, let me know.
>
> Although little things are also important, here are two examples of bigger things I recently was reminded of when I was doing some forecasting with tsa models.
>
> - Model selection (esp. information criteria, maybe including pseudo-out-of-sample RMSE or something).

What do you think about a mixin class for the information criteria?

1) cut boilerplate code that shows up in a bunch of different places,
2) some classes calculate AIC etc. using formulas equivalent to those in tools.eval_measures; others use formulas equivalent to those in tsa.vector_ar.var_model. It would be nice to have them in one place to clarify when each is appropriate.
3) Similar situation with resid vs wresid etc.
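Roughly like this? (A hypothetical sketch: class and attribute names are made up, not existing statsmodels API; the formulas are the generic loglike-based ones.)

```python
# Hypothetical information-criteria mixin: derive AIC/BIC/HQIC in one
# place from `llf`, `nobs` and a parameter count. Names are made up.
import numpy as np

class InformationCriterionMixin:
    @property
    def aic(self):
        return -2 * self.llf + 2 * self.k_params

    @property
    def bic(self):
        return -2 * self.llf + np.log(self.nobs) * self.k_params

    @property
    def hqic(self):
        return -2 * self.llf + 2 * np.log(np.log(self.nobs)) * self.k_params


class DummyResults(InformationCriterionMixin):
    """Stand-in for a fitted results class that would mix this in."""
    def __init__(self, llf, nobs, k_params):
        self.llf, self.nobs, self.k_params = llf, nobs, k_params


res = DummyResults(llf=-500.0, nobs=100, k_params=3)
```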

 
> - Dates for `fit` vs dates of given data: I wonder if it would pay to be more flexible with the data for tsa models. For example, we may want to fit the model on a subset of the available date range (for example for pseudo out-of-sample forecasting), or we may want to allow passing in "too much" `exog` data, so that we don't have to worry about passing new exog back in when forecasting.
 
A few days ago when plotting a forecast from a VAR I noticed that the labels on the X-axis were not date-like.  I'll look into this sometime in the next month or so; working with matplotlib always feels like pulling teeth.
 
> I find it so cumbersome to have to split my exog so that it fits into the model, and then try and figure out how to grab just the part I need for forecasting.
>
> I have created a project for this topic https://github.com/statsmodels/statsmodels/projects/3 (hopefully helpful for organization, although I still don't quite understand how projects work).
>
> I'm hoping that as people run across things that may work but not very "nicely", or if there are common tasks that are annoying to perform, that we could collect those things and hopefully improve some of them.

Any preference between bringing these up here vs. Issues vs the project page?
 

josef...@gmail.com

Feb 28, 2017, 3:13:50 PM
to pystatsmodels
On Tue, Feb 28, 2017 at 2:13 PM, Brock Mendel <jbrock...@gmail.com> wrote:


> On Friday, January 6, 2017 at 6:06:33 PM UTC-8, Chad Fulton wrote:
>> I'm going to start working towards improving the usability / polish of some of the TSA package (at least starting with state space models where I know the code well, but maybe others too). If there are things that people know that don't work, let me know.
>>
>> Although little things are also important, here are two examples of bigger things I recently was reminded of when I was doing some forecasting with tsa models.
>>
>> - Model selection (esp. information criteria, maybe including pseudo-out-of-sample RMSE or something).
>
> What do you think about a mixin class for the information criteria?
>
> 1) cut boilerplate code that shows up in a bunch of different places,
> 2) some classes calculate AIC etc using formulas equivalent to those from tools.eval_measures; others use formulas equivalent to tsa.vector_ar.var_model. It would be nice to have them in one place to clarify when each is appropriate.
> 3) Similar situation with resid vs wresid etc.

Mixin classes need to provide substantial benefits to justify making the code more complex or complicated.
So far we have either ended up using inheritance from the appropriate model, or writing a method that mainly calls the corresponding reusable standalone function.
The current trend is towards reusable functions for features that are not directly tied into a model.

I don't think information criteria are "worth" a mixin class, and, except for the generic definition based on the loglikelihood, there are versions that are specific to one category of models; e.g., GLM and VAR define other versions.

I think what we need are more methods with options, instead of cached attributes, to handle variation or different definitions of results.


 

 
>> - Dates for `fit` vs dates of given data: I wonder if it would pay to be more flexible with the data for tsa models. For example, we may want to fit the model on a subset of the available date range (for example for pseudo out-of-sample forecasting), or we may want to allow passing in "too much" `exog` data, so that we don't have to worry about passing new exog back in when forecasting.
>
> A few days ago when plotting a forecast from a VAR I noticed that the labels on the X-axis were not date-like. I'll look into this sometime in the next month or so; working with matplotlib always feels like pulling teeth.
>
>> I find it so cumbersome to have to split my exog so that it fits into the model, and then try and figure out how to grab just the part I need for forecasting.
>>
>> I have created a project for this topic https://github.com/statsmodels/statsmodels/projects/3 (hopefully helpful for organization, although I still don't quite understand how projects work).
>>
>> I'm hoping that as people run across things that may work but not very "nicely", or if there are common tasks that are annoying to perform, that we could collect those things and hopefully improve some of them.
>
> Any preference between bringing these up here vs. Issues vs the project page?
 

We use both, but the trend has been that discussions that are not general design issues happen in the github issues, which makes them easier to keep track of and search. Also, I sometimes post on the mailing list when the topic is too vague to pin down a specific issue that might be relevant for implementation.

Josef
