I am glad to see interest in automatic forecasting for GSOC 2018! I thought I'd write a brief description of where we are and what I think we'd like to see integrated. I'm not an expert on this topic, though, so hopefully other people will reply with things they'd like to see. Also, students should feel free to put other features in their proposal.

References
----------

The basic reference is Hyndman and Khandakar (2008), which can be found at https://www.jstatsoft.org/article/view/v027i03/v27i03.pdf

This is implemented in the R `forecast` package in the function auto.arima (https://www.rdocumentation.org/packages/forecast/versions/8.1/topics/auto.arima). We cannot use / translate code from the forecast package (including this function) because it has an incompatible license. However, we can look at the signature and description on this link.

E-views implements Hyndman and Khandakar (2008), and they describe their process here: http://www.eviews.com/help/helpintro.html#page/content/series-Automatic_ARIMA_Forecasting.html

Hyndman also allows for automatic selection of exponential smoothing models, as in the function `ets`, see e.g. https://www.rdocumentation.org/packages/forecast/versions/8.1/topics/ets and also https://otexts.org/fpp2/estimation-and-model-selection.html

Models in Statsmodels
---------------------

There are primarily three types of models from which we'll want to consider forecasts:

- SARIMAX
- Unobserved components (UC)
- Exponential smoothing (ES)

I anticipate that GSOC proposals will probably use these models and will not construct new models, but that doesn't have to be the case if a student has something particular in mind.

Within SARIMAX / UC models information criteria (IC) can be used to select a model, and within ES models IC can be used, but IC cannot be used to select between e.g. SARIMAX and ES. See for example https://otexts.org/fpp2/arima-ets.html.
However, we can produce a comparison using out-of-sample forecasting exercises (e.g. estimate the parameters on a subset of the data and then compare on MSE of h-step ahead forecasts on the remaining data). This is what Hyndman refers to as time series cross validation, see https://www.otexts.org/fpp/2/5 and https://robjhyndman.com/hyndsight/tscv/.
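
The out-of-sample comparison described above can be sketched roughly as follows. This is only an illustration in plain Python: the two forecasters (`naive_forecast`, `mean_forecast`) and the helper `rolling_origin_mse` are hypothetical stand-ins, not statsmodels functions; in practice the candidates would be fitted SARIMAX, UC, or ES models.

```python
def naive_forecast(history, h):
    """Hypothetical candidate 1: forecast h steps ahead with the last value."""
    return history[-1]


def mean_forecast(history, h):
    """Hypothetical candidate 2: forecast h steps ahead with the sample mean."""
    return sum(history) / len(history)


def rolling_origin_mse(y, forecaster, h=1, min_train=5):
    """MSE of h-step-ahead forecasts over an expanding training window.

    For each forecast origin, the forecaster only sees y[:origin] and is
    scored against the observation h steps later.
    """
    errors = []
    for origin in range(min_train, len(y) - h + 1):
        train = y[:origin]
        actual = y[origin + h - 1]
        errors.append((actual - forecaster(train, h)) ** 2)
    return sum(errors) / len(errors)


# Toy data: a deterministic upward drift.
y = [float(t) for t in range(20)]

scores = {
    "naive": rolling_origin_mse(y, naive_forecast),
    "mean": rolling_origin_mse(y, mean_forecast),
}
best = min(scores, key=scores.get)
print(scores, "->", best)
```

Because both candidates are scored on the same held-out observations, this comparison does not depend on how each model class defines its likelihood.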
On Sat, Mar 3, 2018 at 11:08 PM, Chad Fulton <chadf...@gmail.com> wrote:
> Within SARIMAX / UC models information criteria (IC) can be used to select a model, and within ES models IC can be used, but IC cannot be used to select between e.g. SARIMAX and ES. See for example https://otexts.org/fpp2/arima-ets.html.

Not necessarily true. The crucial part there is "and the likelihood is computed in different ways".

In our MLE models we use a consistent definition across models: llf is always the full log-likelihood value, and we don't drop terms that are irrelevant for the optimization but necessary for the comparison across models (with the same distributional assumption).

Also, I think in linear models the sum of squares definition for "quasi-normal" models should allow a consistent comparison across models.

A possible inconsistency can arise from whether auxiliary parameters like scale are counted in k_params or not.
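
To make the point about the full likelihood and k_params concrete, here is a small sketch assuming Gaussian errors. The residuals and parameter counts are made up for illustration, and `gaussian_llf` and `aic` are hypothetical helpers, not the statsmodels implementation.

```python
import math


def gaussian_llf(residuals):
    """Full Gaussian log-likelihood at the MLE of the variance.

    No constant terms are dropped, so values are comparable across models
    that share the same data and distributional assumption.
    """
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2.0 * math.pi) + math.log(sigma2) + 1.0)


def aic(llf, k_params):
    """Akaike information criterion from a log-likelihood value."""
    return -2.0 * llf + 2.0 * k_params


# Made-up one-step-ahead residuals from two hypothetical models.
resid_a = [0.5, -0.4, 0.3, -0.6, 0.2, -0.1, 0.4, -0.3]
resid_b = [0.45, -0.35, 0.3, -0.55, 0.2, -0.1, 0.35, -0.3]

llf_a = gaussian_llf(resid_a)
llf_b = gaussian_llf(resid_b)

# Consistent convention: both models count the scale parameter.
aic_a = aic(llf_a, k_params=2 + 1)
aic_b = aic(llf_b, k_params=3 + 1)

# Inconsistent convention: model B forgets to count the scale parameter,
# which shifts its AIC by exactly 2 and can flip a close comparison.
aic_b_no_scale = aic(llf_b, k_params=3)

print(aic_a, aic_b, aic_b_no_scale)
```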
Are the ES models actually implemented and released, i.e. in statsmodels 0.8.0?
Hi Chad,

Thanks for the advice. I haven't started working on the proposal yet; I'll get started on it right now. I just had a look at the proposal template. For the code sample part, I haven't made enough patches to the sub-org, except for this one: https://github.com/statsmodels/statsmodels/pull/4290
If you have any issues related to this project, can you please help me find them? I am new to the code base.
Apart from this, are there any other suggestions for the proposal that you want me to look into?
On Wed, Mar 7, 2018 at 8:02 PM, Abhijeet Panda <abhijeet...@gmail.com> wrote:
> In the code sample part, I haven't made enough patches to the sub-org except for the one here https://github.com/statsmodels/statsmodels/pull/4290,
> if you have any issues related to this project, can you please help me find them. I am new to the code base.

Do you have any background already in econometrics or statistics? That could help me find an appropriate place for you to make an initial contribution for your code sample (but note that the code sample does not have to be very large or complex for you to have a successful proposal).

> Apart from this, Is there any other suggestions for the proposal that you want me to look into?

I gave a link to Eviews' documentation in my original e-mail. In my mind the eventual implementation in Statsmodels should look a lot like that (in terms of supported features and our general approach).
Thank you for this advice, I'll look through the documentation and put my first version of the proposal as soon as possible.
For the ARIMA model it would be better to focus on SARIMAX and include
choosing seasonal order and differencing.
ARIMA.
I don't see anything about choosing a trend which will be important if
there is a trend and no differencing is used, otherwise ARMA might not
converge or estimate inappropriate parameters.
On Thu, Mar 22, 2018 at 11:14 PM, Abhijeet Panda
<abhijeet...@gmail.com> wrote:
> Hi Josef,
>
>> For the ARIMA model it would be better to focus on SARIMAX and include
>> choosing seasonal order and differencing.
>> ARIMA.
>>
> I have updated the document for SARIMAX and how we can choose the seasonal
> order and differencing parameter by using successive unit root tests.
> Are the tests already available in statsmodels, or shall we write them?
We don't have seasonal unit root tests yet in statsmodels.
One problem will be to get the tables for the distribution and p-values
for different season lengths, AFAIR.
If we only need them for a rough preliminary specification search, then having
very good p-values might not be necessary.

There might be some simple preliminary tests, like checking whether
seasonal polynomials are significant.
If users don't provide a frequency, or even if they do, a check for
spikes in the spectral density might also be useful.
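
A check for spikes in the spectral density could look roughly like this. It is a plain-Python sketch using a raw periodogram; `periodogram` and `dominant_period` are hypothetical helper names, and real data would probably need detrending or differencing first, since a trend puts most of the power at the lowest frequencies.

```python
import cmath
import math


def periodogram(y):
    """Raw periodogram of the demeaned series at the Fourier frequencies k/n."""
    n = len(y)
    mean = sum(y) / n
    centered = [v - mean for v in y]
    power = {}
    for k in range(1, n // 2 + 1):
        coef = sum(
            v * cmath.exp(-2j * math.pi * k * t / n)
            for t, v in enumerate(centered)
        )
        power[k] = abs(coef) ** 2 / n
    return power


def dominant_period(y):
    """Period (in observations) of the largest periodogram spike."""
    power = periodogram(y)
    k_best = max(power, key=power.get)
    return len(y) / k_best


# Toy monthly-style series: a pure cycle of length 12 around a constant level.
y = [10.0 + 3.0 * math.sin(2.0 * math.pi * t / 12) for t in range(120)]
print(dominant_period(y))
```

The detected period could then serve as a candidate seasonal length when the user does not provide a frequency, or as a sanity check when they do.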
>
>> I don't see anything about choosing a trend which will be important if
>> there is a trend and no differencing is used, otherwise ARMA might not
>> converge or estimate inappropriate parameters.
>>
> Can you help me with like how to automatically choose the trend?
> One approach is to see the p-values and know if the time variable is
> statistically significant.
> Using this we can decide if there is a linear trend or not.
I would just run an initial regression on a trendline, or test whether means
differ across sequential subsamples.
If it works, it can be done in the context of SARIMAX or similar models by
choosing the constant/trend options based on significance or cross-validation.
However, we got several reports about convergence and non-stationarity
problems in fitting ARMA/SARIMAX and similar models, when a stationary ARMA is
just not appropriate because there is a trend or some other non-stationary
pattern in the data. These are problems that we should be able to avoid
by some preliminary testing or diagnostics.
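
The initial trendline regression mentioned above can be sketched in a few lines. This is a rough screening device only: `trend_t_statistic` is a hypothetical helper, the cutoff of 2 is an arbitrary assumption, and plain OLS t-statistics are not reliable under strong serial correlation.

```python
import math


def trend_t_statistic(y):
    """OLS of y on a constant and a linear time index; returns (slope, t-stat)."""
    n = len(y)
    t_mean = (n - 1) / 2.0
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    ssr = sum((v - intercept - slope * t) ** 2 for t, v in enumerate(y))
    std_err = math.sqrt(ssr / (n - 2) / sxx)
    return slope, slope / std_err


# Deterministic "noise" so the example is reproducible without a RNG.
noise = [1.0 if t % 2 == 0 else -1.0 for t in range(40)]
trending = [0.5 * t + e for t, e in enumerate(noise)]
flat = [5.0 + e for e in noise]

for name, series in (("trending", trending), ("flat", flat)):
    slope, t_stat = trend_t_statistic(series)
    # Assumed rule of thumb: treat |t| > 2 as evidence of a linear trend.
    print(name, round(slope, 4), round(t_stat, 2), abs(t_stat) > 2.0)
```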
It might also be possible to start with a SARIMAX model that includes
trend and seasonal components, which might avoid some of the
convergence problems and then drop parts if they are not needed or don't improve
prediction.
I never did a systematic reading of the automatic forecasting literature.
A related issue, where I did some reading for the PR, is about using
the Box-Cox transformation, where the implemented method checks
what makes the variance stable in subsamples:
https://github.com/statsmodels/statsmodels/pull/3477
and the discussion in the issues and PRs leading up to this.
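
As a rough illustration of the "variance stable in subsamples" idea, the sketch below picks the Box-Cox lambda that makes std / mean**(1 - lambda) as constant as possible across consecutive subsamples, in the spirit of the approach usually attributed to Guerrero. It has not been checked against the implementation in the PR; the function name, the grid, and the toy data are assumptions, and the data must be strictly positive.

```python
import math


def _mean(x):
    return sum(x) / len(x)


def _std(x):
    m = _mean(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / (len(x) - 1))


def variance_stabilizing_lambda(y, window, grid=None):
    """Grid search for the Box-Cox lambda that stabilizes subsample variance.

    For each lambda, compute std_i / mean_i ** (1 - lambda) in every
    non-overlapping subsample and return the lambda with the smallest
    coefficient of variation of these ratios.
    """
    if grid is None:
        grid = [i / 10.0 for i in range(-10, 21)]  # -1.0, -0.9, ..., 2.0
    groups = [y[i:i + window] for i in range(0, len(y) - window + 1, window)]
    means = [_mean(g) for g in groups]
    stds = [_std(g) for g in groups]
    best_lam, best_cv = None, float("inf")
    for lam in grid:
        ratios = [s / m ** (1.0 - lam) for s, m in zip(stds, means)]
        cv = _std(ratios) / _mean(ratios)
        if cv < best_cv:
            best_lam, best_cv = lam, cv
    return best_lam


# Toy data: the spread grows proportionally with the level, so a log
# transform (lambda = 0) should stabilize the variance.
pattern = [0.9, 1.0, 1.1, 1.0]
y = [level * p for level in (10.0, 20.0, 40.0, 80.0) for p in pattern]
print(variance_stabilizing_lambda(y, window=4))
```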
IMO: choosing the order in (S)AR(I)MA(X) or options for other models
is the core of the automatic forecasting specification search.
But if we and users throw arbitrary data like sales data at it, then
there will be messy data that might not fit to many of those candidates.
Josef
>
>
> Abhijeet
Hi Chad Fulton, we should look into PDEs for further improvement, and into big-O notation for more efficiency and less run time.
On Sun, Mar 25, 2018 at 11:16 AM, <ja20...@gmail.com> wrote:
> So it means, the proposals on this topic has been closed and I should look
> forward to another topic?
No, nothing is decided yet, of course. It affects the chances of being accepted.
If there are two very good proposals on the same topic, then we still have to
choose only one of them.
If there is one good proposal on another topic, then, assuming it is good
enough for a GSOC project, it will compete for the number of available
slots but doesn't have to compete on the same topic.
Josef