SARIMAX - Warning messages and model suitability


steven_e...@homedepot.com

May 9, 2018, 3:17:49 PM
to pystatsmodels
Hello!
I'm currently developing some time series models using SARIMAX.
I only have 2 weeks' worth of data at this point, but I still wanted to start the modeling process.
The grain of the data is hourly over the 2 weeks (a full 7 days per week).
Based on the seasonal order documentation I've read, I thought it appropriate to use 24 as the seasonal period, to account for the hour-of-day variation.

The SARIMAX model specification is:
mod = sm.tsa.statespace.SARIMAX(view_hour['distinct_freq_sum'], order=(5,0,0),seasonal_order=(3,0,0,24))

The warnings are:
1. ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
2. Covariance matrix calculated using the outer product of gradients (complex-step).
3. Covariance matrix is singular or near-singular, with condition number 1.82e+31. Standard errors may be unstable.

I've run the standard diagnostics before and after fitting the model.
On the post-fit side: although the residuals are not normally distributed (Jarque-Bera, stats.normaltest), they do not show any serial correlation (Durbin-Watson).

Below is the output graph. On closer inspection you will see that the accuracy of the model is not very good; the RMSE is 3218.933.
As I write this I'm beginning to think that the primary problem may be the weekends.
So, in summary, I'd like to better understand what is driving the warnings, and how the seasonal order can handle the weekends.

I've attached the csv data file if you'd like to reproduce this. The code to create the datetime index so the file runs through SARIMAX is:

import pandas as pd

view_hour = pd.read_csv('view_hour.csv')  # the attached file
view_hour['datetime'] = pd.to_datetime(view_hour['date_hour'])
view_hour.reset_index(inplace=True)
view_hour = view_hour.set_index('datetime')
view_hour.sort_index(inplace=True)


Thanks for your time and consideration.
Steve


view_hour.csv

Chad Fulton

May 9, 2018, 11:09:34 PM
to Statsmodels Mailing List
Hi Steve,

I agree that the weekends are the primary modeling issue here. Before getting to that, though, the first thing to think about is that your current approach does not include a non-stationary component and so eventually your forecasts will go to zero. I think that's not what you want, so first you might want to set seasonal differencing to 1 (e.g. you might use seasonal_order=(3, 1, 0, 24)).
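For concreteness, here is a minimal sketch of that change (it reuses the view_hour series and the AR orders from your first message; the orders are carried over as-is, not tuned):

import statsmodels.api as sm

# Same model as before, but with seasonal differencing (D=1) so the forecasts
# track the level of the daily cycle instead of decaying towards zero.
mod = sm.tsa.statespace.SARIMAX(
    view_hour['distinct_freq_sum'],
    order=(5, 0, 0),
    seasonal_order=(3, 1, 0, 24),
)
res = mod.fit()
print(res.summary())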

Modeling the weekends with SARIMAX is tough, because the weekend effect acts more like a multiplicative term than a level shift: the troughs sit at roughly the same level all week (slightly lower on the weekends), whereas the peaks are very different on weekdays vs weekends.

I think you might want to model this with multiple seasonal effects - one for the daily cycle and one for the weekly cycle. In fact, we recently merged in a PR that allows flexible seasonal modeling in the `UnobservedComponents` models, and this might be a good use case.
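A rough sketch of what that could look like (this needs statsmodels >= 0.9, and the harmonic count for the weekly term below is an illustrative choice, not a tuned one):

import statsmodels.api as sm

# Local level plus two frequency-domain seasonal components:
# a daily cycle (period 24) and a weekly cycle (period 24*7) with truncated harmonics.
mod = sm.tsa.UnobservedComponents(
    view_hour['distinct_freq_sum'],
    level='local level',
    freq_seasonal=[
        {'period': 24},
        {'period': 24 * 7, 'harmonics': 4},
    ],
)
res = mod.fit()
print(res.summary())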

Here's a link to an example notebook:


Best,
Chad


josef...@gmail.com

May 9, 2018, 11:58:45 PM
to pystatsmodels


On Wed, May 9, 2018 at 11:09 PM, Chad Fulton <chadf...@gmail.com> wrote:
Given that I like to open issues for semi-random thoughts
I played for a bit with hourly electricity load forecasting with daily and weekly cycles, but I think in many real datasets there are many other factors that need to be included in the exog part when the data does not follow a nice enough pattern for SARIMA and seasonal polynomials.
E.g., for electricity forecasting, temperature or the weather in general can shift the weekly cycle up or down, and the daily pattern can get messed up by holidays.
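One hedged sketch of that idea, applied to Steve's hourly series with a weekend dummy as exog (the dummy and the orders here are illustrative assumptions, not a recommended specification):

import statsmodels.api as sm

# Weekend indicator built from the hourly DatetimeIndex created earlier.
weekend = (view_hour.index.dayofweek >= 5).astype(float)

mod = sm.tsa.statespace.SARIMAX(
    view_hour['distinct_freq_sum'],
    exog=weekend,
    order=(5, 0, 0),
    seasonal_order=(3, 1, 0, 24),
)
res = mod.fit()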

Josef
 

steven_e...@homedepot.com

May 11, 2018, 3:47:05 PM
to pystatsmodels
Hey Chad,
Thanks so much for your prompt reply.... very helpful !
I tried the simplest way to use the new UC function by cutting and pasting your structural.py into my local ./statespace folder.
I restarted my IDE and executed the set of imports, including statsmodels.  
Unfortunately, I'm still missing a new function....
 File "/Users/swe03/anaconda3/lib/python3.6/site-packages/statsmodels/tsa/statespace/structural.py", line 20, in <module>
    from .tools import (

ImportError: cannot import name 'prepare_exog'

I'm hoping I can simply include this function in the new structural.py code.  
Please advise....

Thanks,
Steve

josef...@gmail.com

May 11, 2018, 4:00:10 PM
to pystatsmodels
On Fri, May 11, 2018 at 3:47 PM, <steven_e...@homedepot.com> wrote:
Hey Chad,
Thanks so much for your prompt reply.... very helpful !
I tried the simplest way to use the new UC function by cutting and pasting your structural.py into my local ./statespace folder.
I restarted my IDE and executed the set of imports, including statsmodels.  
Unfortunately, I'm still missing a new function....
 File "/Users/swe03/anaconda3/lib/python3.6/site-packages/statsmodels/tsa/statespace/structural.py", line 20, in <module>
    from .tools import (

ImportError: cannot import name 'prepare_exog'

I'm hoping I can simply include this function in the new structural.py code.  
Please advise....


There are too many changes since 0.8, so copying just some parts from 0.9 to 0.8 will not work in most cases.

The 0.9 release will hopefully be out by the end of the weekend, but I just got stuck for a day or three in bug hunting, and have not yet completely finished the hunt.

Josef

steven_e...@homedepot.com

May 11, 2018, 4:37:43 PM
to pystatsmodels
Hey Josef,
Thanks for your prompt reply and great service !
You guys ROCK !
Steve 

daniel mackay

May 15, 2018, 10:26:55 AM
to pystat...@googlegroups.com
Just started using the new statsmodels package to do some SARIMA-errors work and have run into a problem I can't fathom out. Basically, I can't reconcile the Stata output with the statsmodels output (the R output is similar to Stata's). I'm estimating a SARIMA model with AR lags at 1, 2, 5, 8 and 12, with exogenous regressors: a level shift variable, 'ban', an underlying trend, 't', and a post-ban trend variable, 'postbanXslope'.


The output from statsmodels 0.9 is:

X =df[['ban','postbanXslope','t']]

ar = (1,1,0,0,1,0,0,1,0,0,0,1)
ma = (0)
mod1 = sm.tsa.statespace.SARIMAX(df['LOGTGT60_ALL_RATE'], tend=('c'), exog=X, order=(ar,0,ma), enforce_stationarity=True, 
                                 mle_regression=True, harvey_representation=True)
ressBASE = mod1.fit()
print(ressBASE.summary())
                                  Statespace Model Results                                
===========================================================================================
Dep. Variable:                   LOGTGT60_ALL_RATE   No. Observations:                  192
Model:             SARIMAX((1, 2, 5, 8, 12), 0, 0)   Log Likelihood                  33.731
Date:                             Tue, 15 May 2018   AIC                            -49.462
Time:                                     15:10:10   BIC                            -20.144
Sample:                                 01-31-2000   HQIC                           -37.588
                                      - 12-31-2015                                        
Covariance Type:                               opg                                        
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
ban                -2.2390      0.188    -11.930      0.000      -2.607      -1.871
postbanXslope    -0.0828      0.024     -3.442      0.001      -0.130      -0.036
t                     0.0828      0.011      7.361      0.000       0.061       0.105
ar.L1             0.9070      0.135      6.707      0.000       0.642       1.172
ar.L2             0.0570      0.164      0.347      0.728      -0.264       0.378
ar.L5             0.0027      0.130      0.021      0.984      -0.252       0.258
ar.L8            -0.0063      0.122     -0.051      0.959      -0.245       0.232
ar.L12           -0.0063      0.061     -0.104      0.917      -0.126       0.113
sigma2            0.0335      0.005      6.187      0.000       0.023       0.044
===================================================================================
Ljung-Box (Q):                       11.06   Jarque-Bera (JB):             47337.02
Prob(Q):                              1.00   Prob(JB):                         0.00
Heteroskedasticity (H):               0.12   Skew:                             7.73
Prob(H) (two-sided):                  0.00   Kurtosis:                        78.35
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Comparing this with the R output, you can see that the coefficients are a long way out:

Regression with ARIMA(12,0,0) errors 

Coefficients:
         ar1     ar2  ar3  ar4      ar5  ar6  ar7     ar8  ar9  ar10  ar11    ar12  intercept      ban        t  postbanXslope
      0.4454  0.1416    0    0  -0.1670    0    0  0.1284    0     0     0  0.2541     4.2119  -0.0871  -0.0026         0.0019
s.e.  0.0707  0.0711    0    0   0.0559    0    0  0.0554    0     0     0  0.0588     0.0439   0.0386   0.0009         0.0012

sigma^2 estimated as 0.00374:  log likelihood=271.39
AIC=-522.78   AICc=-521.57   BIC=-490.21

Training set error measures:
                       ME       RMSE        MAE         MPE    MAPE      MASE        ACF1
Training set 0.0001202479 0.05855246 0.04660284 -0.02059052 1.17613 0.6937953 -0.01028234

The Stata output is:

Sample:  1 - 192                                 Number of obs   =        192
                                                 Wald chi2(8)    =     279.41
Log likelihood = 271.3919                        Prob > chi2     =     0.0000

                  |               OPG
LOGTGT60_ALL_RATE |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------+---------------------------------------------------------------
LOGTGT60_ALL_RATE |
                t |  -.0025935   .0008257    -3.14   0.002    -.0042119   -.0009752
              ban |  -.0870792   .0360625    -2.41   0.016    -.1577603    -.016398
    postbanXslope |   .0019448   .0011521     1.69   0.091    -.0003133    .0042028
            _cons |   4.211966   .0372866   112.96   0.000     4.138885    4.285046
------------------+---------------------------------------------------------------
ARMA              |
               ar |
              L1. |   .4455028   .0789996     5.64   0.000     .2906664    .6003391
              L2. |    .141507   .0788004     1.80   0.073     -.012939    .2959529
              L5. |  -.1670361    .058251    -2.87   0.004    -.2812059   -.0528663
              L8. |   .1284578   .0539661     2.38   0.017     .0226862    .2342293
             L12. |   .2541029   .0518126     4.90   0.000      .152552    .3556538
------------------+---------------------------------------------------------------
           /sigma |    .058553   .0032622    17.95   0.000     .0521592    .0649467

This is very similar to R. Diagnostics on the R and Stata models are excellent. The sum of the AR coefficients is close to 1, but that isn't a problem here: the series is trend stationary, and if it had a unit root it would show up in the residuals, which look fine.

Is anyone able to point out why the statsmodels estimates are so different from Stata and R? (I tend to keep my errors models parsimonious, hence why I don't include all the consecutive lags.)

Many thanks for any help.


Dan

josef...@gmail.com

May 15, 2018, 10:37:36 AM
to pystatsmodels
On Tue, May 15, 2018 at 10:26 AM, daniel mackay <dm64...@gmail.com> wrote:
Just started using the new statsmodels package to do some SARIMA-errors work and have run into a problem I can't fathom out. Basically, I can't reconcile the Stata output with the statsmodels output (the R output is similar to Stata's). I'm estimating a SARIMA model with AR lags at 1, 2, 5, 8 and 12, with exogenous regressors: a level shift variable, 'ban', an underlying trend, 't', and a post-ban trend variable, 'postbanXslope'.


The output from statsmodels 0.9 is:

X =df[['ban','postbanXslope','t']]

ar = (1,1,0,0,1,0,0,1,0,0,0,1)
ma = (0)
mod1 = sm.tsa.statespace.SARIMAX(df['LOGTGT60_ALL_RATE'], tend=('c'), exog=X, order=(ar,0,ma), enforce_stationarity=True, 


try with  `enforce_stationarity=False`

I'm not sure how imposing stationarity works in SARIMAX, but there is a problem with reparameterization for stationarity, at least according to the description of another R package.

Josef

Chad Fulton

May 15, 2018, 10:49:08 AM
to Statsmodels Mailing List
One thing is that you have set `tend=('c')` rather than `trend=('c')` as I think you wanted. 

daniel mackay

May 15, 2018, 11:28:24 AM
to pystat...@googlegroups.com
Thanks Josef and Chad for getting back to me so quickly. Josef, which R package is it that you are referring to, so I can take a look?

Re-running with your suggestions results in a closer match, but it's still way out for some of the coefficients... I'm really not sure what the problem is here.

ar = (1,1,0,0,1,0,0,1,0,0,0,1)
ma = (0)
mod1a = sm.tsa.statespace.SARIMAX(df['LOGTGT60_ALL_RATE'], trend=('c'), exog=X, order=(ar,0,ma), enforce_stationarity=False, 
                                 mle_regression=True, harvey_representation=True)

ressBASE = mod1a.fit()



print(ressBASE.summary())

                                  Statespace Model Results                                
===========================================================================================
Dep. Variable:                   LOGTGT60_ALL_RATE   No. Observations:                  192
Model:             SARIMAX((1, 2, 5, 8, 12), 0, 0)   Log Likelihood                 243.843
Date:                             Tue, 15 May 2018   AIC                           -467.685
Time:                                     16:21:15   BIC                           -435.756
Sample:                                 01-31-2000   HQIC                          -454.739
                                      - 12-31-2015                                        
Covariance Type:                               opg                                        
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
intercept         0.2836     21.022      0.013      0.989     -40.919      41.486
ban              -0.1038      0.040     -2.606      0.009      -0.182      -0.026
postbanXslope     0.0003      0.006      0.052      0.958      -0.012       0.012
t                -0.0721      4.793     -0.015      0.988      -9.467       9.323
ar.L1             0.7191      0.093      7.774      0.000       0.538       0.900
ar.L2            -0.0357      0.091     -0.394      0.693      -0.213       0.142
ar.L5            -0.0381      0.075     -0.510      0.610      -0.184       0.108
ar.L8             0.1096      0.066      1.664      0.096      -0.019       0.239
ar.L12            0.2469      0.060      4.121      0.000       0.129       0.364
sigma2            0.0042      0.001      7.540      0.000       0.003       0.005
===================================================================================
Ljung-Box (Q):                       48.83   Jarque-Bera (JB):                 0.15
Prob(Q):                              0.16   Prob(JB):                         0.93
Heteroskedasticity (H):               1.03   Skew:                            -0.05
Prob(H) (two-sided):                  0.92   Kurtosis:                         2.90
===================================================================================

Chad Fulton

May 15, 2018, 11:43:18 AM
to Statsmodels Mailing List
It looks like R and Stata are finding better parameters (i.e. higher likelihood), and I would guess that this is probably because their starting parameter estimation method is better for your particular problem. It's hard to say more without having access to the dataset. Any chance you can post it publicly?

Chad

josef...@gmail.com

May 15, 2018, 11:46:53 AM
to pystatsmodels
On Tue, May 15, 2018 at 11:28 AM, daniel mackay <dm64...@gmail.com> wrote:
Thanks Josef and Chad for getting back to me so quickly. Josef, which R package is it that you are referring to so I can take a look...

It was a long time ago that I looked at this, so I don't remember which reference it was.
I'm pretty sure it was by McLeod.

Essentially, AFAIR, he argued that reparameterization to impose stationarity does not preserve the zero constraints on the lag coefficients. He uses or recommends zero constraints on the transformed coefficients.

(I didn't try to understand the math, so this is IIRC and IIUC.)


Josef

josef...@gmail.com

May 15, 2018, 11:56:04 AM
to pystatsmodels
If it's the start_params and not a different optimization method, then using the Stata or R parameter estimates as start_params would show it.

Josef

daniel mackay

May 15, 2018, 12:24:18 PM
to pystat...@googlegroups.com
Hi Chad,

See attached csv file.

sarimadata.csv

Chad Fulton

May 15, 2018, 1:10:05 PM
to Statsmodels Mailing List
On Tue, May 15, 2018 at 12:24 PM, daniel mackay <dm64...@gmail.com> wrote:
Hi Chad,

See attached csv file.

Thanks!

The main issue appears to just be that since your exog contains the trend and the post-period dummy variable, you'll also want to include the constant as part of `exog` rather than using the `trend` argument.

e.g. you could do:

X = sm.add_constant(df[['ban','postbanXslope','t']])
mod1 = sm.tsa.statespace.SARIMAX(df['LOGTGT60_ALL_RATE'], exog=X, order=(ar,0,ma))

Then ressBASE = mod1.fit() gets me pretty close (log-likelihood = 270.862).

You can get to the same value as Stata or R by doing a second call to fit using a different optimization algorithm, e.g.:

initial_fit = mod1.fit()  # this gets us to 270.862
ressBASE = mod1.fit(initial_fit.params, method='nm', maxiter=1000)  # this gets us all the way to 271.392
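For reference, the second call restarts the optimizer with Nelder-Mead from the first optimum, and comparing the log likelihoods of the two results makes the improvement visible (the values in the comments are the ones quoted above):

print(initial_fit.llf)   # ~270.862 from the default fit
print(ressBASE.llf)      # ~271.392 after the Nelder-Mead refit, matching Stata/R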

Best,
Chad

daniel mackay

May 15, 2018, 1:29:02 PM
to pystat...@googlegroups.com
Excellent! Many, many thanks! It simply didn't occur to me to add the constant into the exog... I guess that's my newbie Python naivety :(

Thanks again for the support. It's much appreciated!


atb

Dan

Chad Fulton

May 15, 2018, 2:14:37 PM
to Statsmodels Mailing List
Just to follow up on this, I was incorrect about the starting parameters. In fact, R appears to set the starting parameters for the exogenous coefficients to the OLS estimates and the AR coefficients to 0, and the scale is concentrated out of the likelihood function.

(e.g.:

dta <- read.csv('sarimadata.csv')
res <- arima(dta$y, order=c(12, 0, 0), fixed=c(NA, NA, 0, 0, NA, 0, 0, NA, 0, 0, 0, NA, NA, NA, NA, NA), xreg=dta[c('t', 'step', 'post')], optim.control=list(maxit=0))
)

But when I use those starting parameters, I actually get a slightly worse result from SARIMAX (259.779 vs 270.862 using our default start parameters):

# endog, exog and mod are the series, regressors and SARIMAX model from above
import numpy as np
tmp = sm.OLS(endog, exog).fit()
start_params = np.r_[tmp.params.values, [0] * 5, (tmp.resid**2).sum()]
res_alt = mod.fit(start_params)

In general, better tuning of the starting parameters / optimization for the state space models is probably a good idea, but I don't know too much about it.

Best,
Chad

daniel mackay

May 16, 2018, 4:15:50 PM
to pystat...@googlegroups.com
OK Chad. Thanks very much for all your assistance with this. I agree about the tuning... nonetheless, it's reassuring to know that you can get close to the other software in a roundabout sort of way. Maybe fine-tuning for SARIMAX could be put on the statsmodels 0.10 wish list.

thanks again

Dan

Chad Fulton

May 16, 2018, 4:37:26 PM
to pystat...@googlegroups.com
On Wed, May 16, 2018 at 4:15 PM daniel mackay <dm64...@gmail.com> wrote:
OK Chad. Thanks very much for all your assistance with this. I agree about the tuning... nonetheless, it's reassuring to know that you can get close to the other software in a roundabout sort of way. Maybe fine-tuning for SARIMAX could be put on the statsmodels 0.10 wish list.

thanks again

Dan


I should also mention that this kind of noise around the maximum likelihood estimates is pretty common in my experience across R, Stata, and statsmodels, and oftentimes we come up with superior results. Nonetheless, better tuning is always a good goal.

Chad

