TSA models and carrying around your training data


Nowan Ilfideme

Aug 11, 2017, 10:39:48 AM
to pystatsmodels
Note: If I say something that's factually wrong, please correct me.
I'll be referring to statsmodels/pystatsmodels as SM.

Problem: The time series models in SM "carry around" their data, i.e. the model object contains all the data it was trained/fitted on. For many applications this behavior is fine - you normally want to work with one model per time series, R does exactly the same, etc.
However, in a "machine learning" context this becomes a problem. If one wants to save a model (e.g. ARIMA) in SM, they can't just save the parameters (e.g. the ARIMA coefficients and, maybe, starting values) - they must save the whole time series. There is no apparent way to transfer parameters from one model to another.

scikit-learn, by contrast, has a "model" object which can be "fitted"; fitting changes the internal parameter values but does NOT store the entire dataset in the "model" object (unless the estimator is something like k-nearest-neighbors).
Another, really old, thread (here) has some arguments for SM's architecture; however, I really need the model to be decoupled from the data. Is there currently any way to do this with SM?
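To illustrate the scikit-learn convention I mean, here's a minimal numpy sketch (the class name and setup are made up for illustration, not real scikit-learn code): fit() stores only the learned parameters, never the training data.

```python
import numpy as np

class TinyOLS:
    """Sketch of the scikit-learn convention: fit() keeps only the
    learned parameters (coef_), not the dataset it was fitted on."""

    def fit(self, X, y):
        # Least-squares solve; only the coefficients are retained.
        self.coef_, *_ = np.linalg.lstsq(X, y, rcond=None)
        return self

    def predict(self, X):
        return X @ self.coef_

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])        # exactly y = 1 + 2*x
model = TinyOLS().fit(X, y)
```

Saving such a model means pickling a single small array of coefficients, which is exactly what I can't do with SM's time series models.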

I've been trying to find a workaround from the API side, but I haven't found one. It feels a bit like I'll have to modify the internals by hand, which would be unfortunate. Or write it myself, which would be even more unfortunate. ;)

P.S. I was about to go on a rant about a way to appease both camps, but it would require significant API changes, and I'm pretty sure backwards compatibility is really important to a lot of folks, so I won't waste my breath/typing/time. :)

josef...@gmail.com

Aug 11, 2017, 11:12:27 AM
to pystatsmodels
The main general reason statsmodels keeps the data is to be able to compute additional, lazily evaluated results about the estimation, which scikit-learn does not provide.
To support predict in those cases, I added the `remove_data` option.

However, time series models have the additional problem that we need to keep part of the historical data around to be able to make a forecast. scikit-learn doesn't handle this type of (auto)correlated data and doesn't need to worry about it. (Nonparametrics is another case where the "parameters" are the entire dataset.)

The older tsa models like ARIMA have no support for keeping the minimal information required for forecasting.
In the new statespace models the `state` plus the parameters can serve as that minimal requirement, but there is not much premade, out-of-the-box support.
One case where pickling and updating the state worked, AFAIR, is in this issue: https://github.com/statsmodels/statsmodels/issues/2867
This might work in more general statespace models, but I haven't seen any other examples. I don't know enough about this area to say what user-friendly helper functions for this use case would look like.

The old models like ARIMA (which will most likely be deprecated in favor of SARIMAX and the new statespace models) would need two sets of functions: one to save the minimal information needed for forecasting, and a user-facing predict function that uses that minimal information.

The above would work for continuing to forecast the same time series (as in the issue above).
For a new time series, the historical information needs to be used to estimate the `state` even when the parameters are given. For example, to get the contribution of the moving average part we need the past one-step forecast errors, or innovations. This means it should be almost equivalent to estimating a new model with fit(start_params=unpickled_params, maxiter=0).
The only cases where this is not necessary are pure AR or similar models.
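To make the `state` point concrete, here is a minimal numpy sketch (the ARMA(1,1) parameter values are made up for illustration) of the conditional innovations recursion: the history has to be replayed to recover the past one-step errors the MA part needs, whereas a pure AR(p) model would only need the last p observations.

```python
import numpy as np

# Illustrative ARMA(1,1): y_t - mu = phi*(y_{t-1} - mu) + e_t + theta*e_{t-1}
mu, phi, theta = 10.0, 0.6, 0.3

# Simulate a series so we have a "history" with known innovations.
rng = np.random.default_rng(0)
e_true = rng.normal(size=200)
y = np.empty(200)
y[0] = mu + e_true[0]
for t in range(1, 200):
    y[t] = mu + phi * (y[t - 1] - mu) + e_true[t] + theta * e_true[t - 1]

# Replay the history to recover the innovations (conditional on e_0 = y_0 - mu):
e = np.empty_like(y)
e[0] = y[0] - mu
for t in range(1, len(y)):
    e[t] = y[t] - mu - phi * (y[t - 1] - mu) - theta * e[t - 1]

# The last innovation is the "state" the MA part needs for forecasting;
# a pure AR model would need only the last observations instead.
forecast = mu + phi * (y[-1] - mu) + theta * e[-1]
```

With only the parameters pickled, this replay step is what fit(start_params=..., maxiter=0) would effectively redo on the new series.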

Josef



Nowan Ilfideme

Aug 14, 2017, 5:27:23 AM
to pystatsmodels

Thanks for the information. I more or less understood the reasoning behind it; it's just a bit frustrating. The current solution I'm working with uses ARIMA - I might end up moving to SARIMAX and writing my own wrapper. And I need to look at what you mentioned as the "new statespace models"... But for now I'm stuck with the ARIMA class.

What would be nice from the user perspective (assuming models which don't keep their 'state') is to be able to supply the past of the time series I want to forecast. An example implementation would estimate the state at the end of the time series (for ARIMA, that would be the last q errors) and then predict N steps ahead (with ARIMA, obviously, eventually converging towards the mean).
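A rough sketch of the interface I have in mind (the `forecast_arma` name, the ARMA(1,1) restriction, and the parameter values are my own illustration, not an existing statsmodels API):

```python
import numpy as np

def forecast_arma(history, params, n_ahead):
    """Hypothetical stateless forecast helper: takes the raw history plus
    fixed ARMA(1,1) parameters (mu, phi, theta), re-estimates the
    end-of-sample innovation, and predicts n_ahead steps."""
    mu, phi, theta = params
    # Estimate the state at the end of the series: replay the history
    # to get the last one-step error (for ARIMA(p,d,q), the last q errors).
    e = history[0] - mu
    for t in range(1, len(history)):
        e = history[t] - mu - phi * (history[t - 1] - mu) - theta * e
    # Forecast: future innovations are zero in expectation, so the
    # path decays geometrically towards the mean mu.
    out = []
    y_prev = history[-1]
    for _ in range(n_ahead):
        y_hat = mu + phi * (y_prev - mu) + theta * e
        out.append(y_hat)
        y_prev, e = y_hat, 0.0
    return np.array(out)

fc = forecast_arma(np.array([9.0, 11.0, 10.5]), (10.0, 0.5, 0.2), 5)
```

Nothing here needs the model object that produced the parameters - the history is just an argument.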

Thanks again!

josef...@gmail.com

Aug 14, 2017, 9:23:35 AM
to pystatsmodels
This would be relatively easy to implement without bells and whistles (e.g. datetime handling, confidence intervals). scipy.signal.lfilter or tsa.ArmaProcess can be used for the filtering, and the ARIMA module has a standalone (internal) forecast function. It should be relatively easy to combine them.
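As a sketch of the lfilter route (the ARMA(1,1) coefficients below are illustrative): for an ARMA model a(L)*y = b(L)*e, passing the AR and MA polynomials to lfilter as numerator and denominator in one order simulates the process, and swapping them inverts it to recover the innovations.

```python
import numpy as np
from scipy.signal import lfilter

phi, theta = 0.6, 0.3               # illustrative ARMA(1,1) parameters
ar = np.array([1.0, -phi])          # a(L) = 1 - phi*L
ma = np.array([1.0, theta])         # b(L) = 1 + theta*L

e = np.random.default_rng(1).normal(size=500)
y = lfilter(ma, ar, e)              # simulate: y = (b/a) * e, zero initial state
e_back = lfilter(ar, ma, y)         # invert:   e = (a/b) * y
# e_back matches e up to floating-point error
```

The recovered innovations are exactly the end-of-sample state needed for the standalone forecast step.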

Adding datetime and exog/trend handling, integration, and confidence intervals would require replicating most of the model and results classes, and wouldn't gain much compared to fit(maxiter=0) or maxiter=1.

(maxiter=0 is not yet supported by all classes AFAIK.)

Josef
 

