statsmodels predict start and end indices


user1...@gmail.com

Nov 26, 2013, 3:19:51 PM
to pystat...@googlegroups.com
I am trying to use the prediction function from the statsmodels package for an ARIMA model

    prediction = results.predict(start=1,end=len(test),exog=test)

The dates of the input (test) and the output (prediction) are inconsistent. I get 1/4/2012 to 7/25/2012 for the former and 4/26/2013 to 11/13/2013 for the latter. Part of the difficulty is that I don't have a completely regular frequency - I have daily values excluding weekends and holidays. What is the appropriate way to set the indices?

    x = psql.frame_query(query,con=db)
    x = x.set_index('date')

    train = x[0:len(x)-50]
    test = x[len(x)-50:len(x)]

    arima = tsa.ARIMA(train['A'], exog=train, order = (2,1,1))
    results = arima.fit()
    prediction = results.predict(start=test.index[0],end=test.index[-1],exog=test)

I get the error

    There is no frequency for these dates and date 2013-04-26 00:00:00 is not in dates index. Try giving a date that is in the dates index or use an integer

Here's the first set of data

    2013-04-26   -0.9492
    2013-04-29    2.2011
    ...
    2013-11-12    0.1178
    2013-11-13    2.0449

Skipper Seabold

Nov 26, 2013, 4:47:22 PM
to pystat...@googlegroups.com
On Tue, Nov 26, 2013 at 8:19 PM, <user1...@gmail.com> wrote:
I am trying to use the prediction function from the statsmodels package for an ARIMA model

    prediction = results.predict(start=1,end=len(test),exog=test)

The dates of the input (test) and the output (prediction) are inconsistent. I get 1/4/2012 to 7/25/2012 for the former and 4/26/2013 to 11/13/2013 for the latter. Part of the difficulty is that I don't have a completely regular frequency - I have daily values excluding weekends and holidays. What is the appropriate way to set the indices?

    x = psql.frame_query(query,con=db)
    x = x.set_index('date')

    train = x[0:len(x)-50]
    test = x[len(x)-50:len(x)]

    arima = tsa.ARIMA(train['A'], exog=train, order = (2,1,1))

It looks like you're including train['A'] in both the endogenous and exogenous variable. This probably isn't what you want to do.
 
    results = arima.fit()
    prediction = results.predict(start=test.index[0],end=test.index[-1],exog=test)

I get the error

    There is no frequency for these dates and date 2013-04-26 00:00:00 is not in dates index. Try giving a date that is in the dates index or use an integer

You can't use dates if there's no frequency for the index. You can either not use a pandas object, e.g.,

arima = tsa.ARIMA(train['A'].values, order=...)

or do what the error message says and use an integer index instead of a date for start and end.

user1...@gmail.com

Nov 26, 2013, 6:21:03 PM
to pystat...@googlegroups.com
So if I choose not to use pandas objects

arima = tsa.ARIMA(train['A'].values, order=...)

Can I still have exogenous variables?

Also, if I choose to index by integers, would I do something like this?

prediction = results.predict(start=len(xc),end=len(x),exog=test)

user1...@gmail.com

Nov 27, 2013, 12:33:40 AM
to pystat...@googlegroups.com
I tried as you suggested

prediction = results.predict(start=1,end=len(x),exog=x.drop('A',axis=1))

but the prediction and actual dates don't line up. Here's prediction[len(train):len(x)]

2012-01-13    -2.589300
2012-01-16    -1.818400
2012-01-17    21.388807

Here's actual test[test.columns[0]]

2012-01-12    0.5879
2012-01-13   -0.7385
2012-01-17    0.5744



user1...@gmail.com

Nov 27, 2013, 12:49:47 AM
to pystat...@googlegroups.com
In addition, when I try a larger set of examples, I get this

330    0.064591
331   -0.705979
...
469   -1.380294
470    0.179332

In this case, the prediction isn't even indexed by dates. And the actual is

2013-04-26   -0.9492
2013-04-29    2.2011
...
2013-11-12    0.1178
2013-11-13    2.0449

I'm completely lost!


Skipper Seabold

Nov 27, 2013, 3:02:13 AM
to pystat...@googlegroups.com
On Tue, Nov 26, 2013 at 11:21 PM, <user1...@gmail.com> wrote:
So if I choose not to use pandas objects

arima = tsa.ARIMA(train['A'].values, order=...)

Can I still have exogenous variables?

Certainly. Just don't include your endogenous variable in your exogenous variables.
 

Also, if I choose to index by integers, would I do something like this?

prediction = results.predict(start=len(xc),end=len(x),exog=test)


Yep. See the docstring for predict (or maybe ARIMA.predict).
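Putting those two answers together, a rough sketch (illustrative rather than a tested recipe, reusing the tsa namespace and the train/test split from earlier in the thread; the exact index semantics are spelled out in the predict docstring):

    # fit on plain arrays so no date index is attached, and keep the
    # endogenous column out of the exogenous matrix
    arima = tsa.ARIMA(train['A'].values,
                      exog=train.drop('A', axis=1).values,
                      order=(2, 1, 1))
    results = arima.fit()

    # with integer indexing, observation 0 is the first training row, so
    # the first out-of-sample forecast is at index len(train)
    prediction = results.predict(start=len(train),
                                 end=len(train) + len(test) - 1,
                                 exog=test.drop('A', axis=1).values,
                                 typ='levels')  # levels of the original series,
                                                # not the differences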

Skipper Seabold

Nov 27, 2013, 3:10:36 AM
to pystat...@googlegroups.com
On Wed, Nov 27, 2013 at 5:49 AM, <user1...@gmail.com> wrote:
In addition, when I try a larger set of examples, I get this

330    0.064591
331   -0.705979
...
469   -1.380294
470    0.179332

In this case, the prediction isn't even indexed by dates. And the actual is

2013-04-26   -0.9492
2013-04-29    2.2011
...
2013-11-12    0.1178
2013-11-13    2.0449

I'm completely lost!

The prediction won't be indexed by dates if you don't give dates to the ARIMA model. If you just use .values or an array-like object with no indexing (in the endogenous AND the exogenous), then you'll get no dates. This is likely what you'll have to do in this situation, and then fill in the dates yourself afterwards.

Think of it this way: if you give an index of times that has weekends missing and random holes (holidays) in the data, how can predict know what the out-of-sample dates should be? It can't. Many holidays are locale specific and some even move from year to year. Predict needs a *periodic* frequency to be able to fill in dates automatically. I'm actually surprised that this produces any dates at all; I imagine you should get the error message from before in these cases.

However, if you can use business days only, i.e., skip the weekends but fill in the holidays with interpolation or something, then you can use pandas' business-day frequency and it should work fine. You might be able to get somewhere with custom business days, as I mentioned before, but I've never tried. If you give it a go and it doesn't work, post a fully reproducible example and I'd be happy to try to support these as much as we can.
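A rough sketch of that approach, assuming x is the date-indexed frame from the first message and reusing the tsa namespace (whether interpolating over holidays is acceptable is a judgement call about your data):

    import pandas as pd

    # put the series on a full business-day calendar; holidays become NaN,
    # which are filled here by linear interpolation (one possible choice)
    bdays = pd.bdate_range(x.index[0], x.index[-1])
    x_b = x.reindex(bdays).interpolate()

    # with a regular business-day index, ARIMA can attach a frequency
    arima = tsa.ARIMA(x_b['A'], exog=x_b.drop('A', axis=1),
                      order=(2, 1, 1), freq='B')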

user1...@gmail.com

Nov 27, 2013, 4:37:34 AM
to pystat...@googlegroups.com
Thanks a lot for the reply. I've been banging my head on the wall over this, and I really appreciate your help. I guess the part I'm still confused on is how to repopulate the dates. For example, when predict returns values indexed 1...470, what dates does each index correspond to? How do I know which row corresponds to which date? Then given this, how would I ultimately translate it to the days that I am interested in? If you can provide any insight on how to code this, it'd be extremely appreciated.

Skipper Seabold

Nov 27, 2013, 5:02:12 AM
to pystat...@googlegroups.com
On Wed, Nov 27, 2013 at 9:37 AM, <user1...@gmail.com> wrote:
Thanks a lot for the reply. I've been banging my head on the wall over this, and I really appreciate your help. I guess the part I'm still confused on is how to repopulate the dates. For example, when predict returns values indexed 1...470, what dates does each index correspond to? How do I know which row corresponds to which date? Then given this, how would I ultimately translate it to the days that I am interested in? If you can provide any insight on how to code this, it'd be extremely appreciated.

Well, yes, that's the problem you need to solve! The predictions have absolutely no notion of dates. The estimation isn't doing anything with the dates; they're there solely as a convenience for indexing and prediction. So each prediction at time t is just for time t + 1. It's up to you whether you want to interpret t + 1 as tomorrow, Monday to Friday, or skipping holidays. There's simply no automatic way to do it given that you have holidays in there (unless you try the custom calendar I sent in previous e-mails). You need to recreate the dates. Presumably you know what the next date in your series is, and then you go forward. You can probably get there by creating a pandas PeriodIndex of business days, deleting the holidays, and adding it to your series at the end.

To be clear, a naive ARIMA is going to treat a Wednesday prediction made from Tuesday, a Monday prediction made from the previous Friday, and an Easter Monday prediction all the same as any other one-step-ahead prediction.
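One way to reattach dates afterwards (a sketch using a business-day DatetimeIndex rather than a PeriodIndex; forecast, last_date, and the holiday list are hypothetical names you would supply yourself):

    import pandas as pd

    h = len(forecast)                          # plain array of h out-of-sample forecasts
    holidays = pd.to_datetime(['2013-11-28'])  # your own (hypothetical) holiday calendar

    # generate more business days than needed, starting after the last
    # training date, drop the holidays, and keep the first h
    candidates = pd.bdate_range(last_date, periods=2 * h + 2)[1:]
    future = candidates[~candidates.isin(holidays)][:h]

    forecast = pd.Series(forecast, index=future)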

Skipper

user1...@gmail.com

Nov 27, 2013, 6:01:05 AM
to pystat...@googlegroups.com
Is there any simple interpolation method I could use to fill in the holidays/holes? Does statsmodels or pandas provide this? Would this significantly hurt the end results?

Skipper Seabold

Nov 27, 2013, 6:08:35 AM
to pystat...@googlegroups.com
On Wed, Nov 27, 2013 at 11:01 AM, <user1...@gmail.com> wrote:
Is there any simple interpolation method I could use to fill in the holidays/holes? Does statsmodels or pandas provide this? Would this significantly hurt the end results?

I have no idea. It's your data after all... Holidays will be locale dependent, likely industry dependent, etc.

Skipper

Skipper Seabold

Nov 27, 2013, 6:14:40 AM
to pystat...@googlegroups.com

user1...@gmail.com

Nov 27, 2013, 7:58:41 PM
to pystat...@googlegroups.com
Thanks for the help. I took your advice and interpolated the missing days so now I have a dataset of business days. It worked much better. However, the predictions I got seem really weird. They are not even in the same ballpark as the actual values, even for the set that I trained on. Here is my code

endogenous = 'A'

x = psql.frame_query(query,con=db)
x = x.set_index('date')

# interpolate for missing days

train = x[0:len(x)-140]
test = x[len(x)-140:len(x)]

arima = tsa.ARIMA(train[endogenous], exog=train.drop(endogenous,axis=1), order=(2,2,0),freq='B')
results = arima.fit()
prediction = results.predict(start=1,end=len(x)-1,exog=x.drop(endogenous,axis=1))

My actual dataset is this

2012-01-05    659.010
2012-01-06    650.020
2012-01-09    622.940
...
2013-11-08    1016.03
2013-11-11    1010.59
2013-11-12    1011.78
2013-11-13    1032.47

Prediction gives me this 

2012-01-05   -10.551134
2012-01-06    -8.937889
2012-01-09   -27.941221
...
2013-11-08    14.739148
2013-11-11    22.567270
2013-11-12     1.844993
2013-11-13   -42.794671

The output (after a bunch of iterations) is

CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH             

 Warning:  more than 10 function and gradient
   evaluations in the last line search.  Termination
   may possibly be caused by a bad search direction.

 Cauchy                time 0.000E+00 seconds.
 Subspace minimization time 0.000E+00 seconds.
 Line search           time 1.853E+01 seconds.

 Total User time 1.865E+01 seconds.

user1...@gmail.com

Nov 27, 2013, 9:01:21 PM
to pystat...@googlegroups.com
Just to be clear on what I am trying to do... I want to train the model on 'train' and then evaluate the model on 'test'. I am computing predict on 'x' because I want to see how well the model does on samples that it trained on and also the test set.



josef...@gmail.com

Nov 28, 2013, 9:09:32 AM
to pystatsmodels
Can you try the forecast with only one level of differencing, order=(2,1,0),
instead of order=(2,2,0)?
Do the fitted parameters look "reasonable"?

From a quick look at the forecast function, I have the impression that
we integrate (cumsum) only once.
But I'm not very familiar with this part of statsmodels.

Josef

josef...@gmail.com

Nov 28, 2013, 10:15:44 AM
to pystatsmodels
looks like a bug to me

it's a bit tricky to integrate twice, since we need the starting
values and difference as integration constants.

>>> a = np.random.randn(10)
>>> a - np.cumsum(np.r_[a[0], np.cumsum(np.r_[a[1]-a[0], np.diff(a, n=2)])])
array([ 0.00000000e+00, 0.00000000e+00, 1.11022302e-16,
2.22044605e-16, 3.33066907e-16, 2.22044605e-16,
1.11022302e-16, 2.77555756e-16, 5.55111512e-16,
8.88178420e-16])


>>> a - np.concatenate((a[:2], np.cumsum(np.cumsum(np.diff(a, n=2))) + a[0] + np.arange(2, len(a)) * (a[1] - a[0])))
array([ 0.00000000e+00, 0.00000000e+00, 1.11022302e-16,
2.22044605e-16, 3.33066907e-16, 2.22044605e-16,
1.11022302e-16, 2.77555756e-16, 5.55111512e-16,
1.77635684e-15])

There might be a better calculation for 2nd-order difference
equations, but I cannot see it.

Josef




Skipper Seabold

Nov 28, 2013, 10:25:13 AM
to pystat...@googlegroups.com
Quite possible. I'll have to look.
 

it's a bit tricky to integrate twice, since we need the starting
values and difference as integration constants.

>>> a = np.random.randn(10)
>>> a - np.cumsum(np.r_[a[0], np.cumsum(np.r_[a[1]-a[0], np.diff(a, n=2)])])
array([  0.00000000e+00,   0.00000000e+00,   1.11022302e-16,
         2.22044605e-16,   3.33066907e-16,   2.22044605e-16,
         1.11022302e-16,   2.77555756e-16,   5.55111512e-16,
         8.88178420e-16])


>>> a - np.concatenate((a[:2], np.cumsum(np.cumsum(np.diff(a, n=2))) + a[0] + np.arange(2, len(a)) * (a[1] - a[0])))
array([  0.00000000e+00,   0.00000000e+00,   1.11022302e-16,
         2.22044605e-16,   3.33066907e-16,   2.22044605e-16,
         1.11022302e-16,   2.77555756e-16,   5.55111512e-16,
         1.77635684e-15])

There might be a better calculation for 2nd-order difference
equations, but I cannot see it.

I recall going through this, though I thought I handled it. If someone files a ticket, I'll take a look at some point.
 

Josef




user1...@gmail.com

Nov 28, 2013, 1:22:06 PM
to pystat...@googlegroups.com
Hey, I tried (2,1,0) and I still didn't get reasonable results. I'm not sure it converged since I got the following

ABNORMAL_TERMINATION_IN_LNSRCH                              

 Line search cannot locate an adequate point after 20 function
  and gradient evaluations.  Previous x, f and g restored.
 Possible causes: 1 error in function or gradient evaluation;
                  2 rounding error dominate computation.

 Cauchy                time 0.000E+00 seconds.
 Subspace minimization time 0.000E+00 seconds.
 Line search           time 0.000E+00 seconds.

 Total User time 3.532E+00 seconds.

The parameter values that I got were

const                             -1.471887
B                                    5.621082
C                                   -1.004346
D                                   -0.000910
E                                   -0.254927
F                                   -0.000000
G                                    0.287348
H                                   -0.135181
I                                     0.038895
J                                    0.000248
K                                   -0.041795
L                                    0.061735
M                                  -0.002211
N                                    0.088741
O                                   0.000180
P                                   -0.078926
Q                                   0.000320
R                                  -0.093938
S                                  -0.001125
T                                   0.201757
U                                  -0.001633
ar.L1.D.A                       0.086583
ar.L2.D.A                      -0.031486
dtype: float64

josef...@gmail.com

Nov 28, 2013, 2:17:19 PM
to pystatsmodels
Are your B to U all exogenous/X variables?
Are they well behaved?
>>> from scipy import linalg
>>> linalg.svdvals(exog)

The other possibility is to try out different optimizers (the `solver`
argument) to figure out what might be going wrong.
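For instance (a hedged sketch; exog here stands for the matrix of the B..U columns, arima is the model from earlier, and the solver names are the ones ARMA/ARIMA fit accepts):

    from scipy import linalg

    svals = linalg.svdvals(exog)
    print(svals[0] / svals[-1])   # rough condition number: a very large ratio
                                  # hints at near-collinear or badly scaled exog

    # try an alternative optimizer and give it more iterations
    results = arima.fit(solver='nm', maxiter=500, disp=1)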

The AR coefficients are very small, and with no MA part, this should
be easy to solve.

Josef

user1...@gmail.com

Nov 28, 2013, 2:58:06 PM
to pystat...@googlegroups.com
Thanks for looking into this for me. Yes, B through U are the exog variables and A is the endogenous variable. I also think I will need an MA term eventually to get a good fit for my data. I just used that order as an example, since with all the order combinations I tried I was getting a poor fit (I looped through a bunch of orders and the predictions were still way off). svdvals gave me this

[  2.20197023e+07   3.29160617e+06   2.84961784e+04   8.34724044e+03
   1.27207218e+03   6.42009179e+02   4.85474456e+02   2.72485319e+02
   2.67255957e+02   2.11043415e+02   9.34207150e+01   8.09238281e+01
   6.18004101e+01   4.89865842e+01   4.79419346e+01   2.48602160e+01
   2.16720812e+01   6.76227674e+00   4.72595509e+00   4.01329897e+00
   3.62018005e+00]

I'm not sure how to interpret this. How can I tell if the data is well-behaved? I tried all the solvers and none of them improved the result from the default one. I tried orders (0,0,1) through (0,0,5) and I got slightly better results.

2011-06-01    525.609592
2011-06-02    528.045381
2011-06-03    523.161218
...
2013-11-08     609.294063
2013-11-11     613.876106
2013-11-12     607.953707
2013-11-13    1033.129234

Thanks for the help. I'm completely lost on this and really appreciate it.

josef...@gmail.com

Nov 28, 2013, 3:40:00 PM
to pystatsmodels
As I said, I'm not familiar with this part.
My impression is that we integrate/cumsum x*beta, so exog in the
forecast function should be np.diff(exog) if ARIMA order is (p,1,q).


What you could try, given that you have many explanatory variables, is
to estimate y on x with OLS and then fit the residuals with an ARIMA
without exog. To forecast, you can add OLSresults.predict +
ARIMAresults.forecast.
This should give a good idea of what the forecast should approximately
look like, even if it's maybe not a statistically correct procedure.
It will also give a check for the ARIMAX parameters and whether the
starting values for the X part in ARIMA need to be changed.

You could also test the OLS residuals to see whether there is even any autocorrelation left.

Josef

user1...@gmail.com

Nov 28, 2013, 7:49:47 PM
to pystat...@googlegroups.com
Do you mean something like this?

(a,b,c) = (0,0,4)
olsResults = sm.OLS(train[endogenous],train.drop(endogenous,axis=1)).fit()
prediction = olsResults.predict(x.drop(endogenous,axis=1))
arima = tsa.ARIMA(train[endogenous],order=(a,b,c),freq='B')
results = arima.fit(transparam=True, dynamic=True)
prediction = prediction[b:] + results.predict(start=b,end=len(x)-1)

OLS by itself did much better than ARIMA by itself. Doing the above procedure negligibly improved the results from OLS. Also, I don't quite understand the intuition behind this. Is this a way to approximate ARIMA? 

josef...@gmail.com

Nov 28, 2013, 8:43:33 PM
to pystatsmodels
On Thu, Nov 28, 2013 at 7:49 PM, <user1...@gmail.com> wrote:
> Do you mean something like this?
>
> (a,b,c) = (0,0,4)
> olsResults = sm.OLS(train[endogenous],train.drop(endogenous,axis=1)).fit()
> prediction = olsResults.predict(x.drop(endogenous,axis=1))
> arima = tsa.ARIMA(train[endogenous],order=(a,b,c),freq='B')

If I understand your variables correctly, I meant:

arima = tsa.ARIMA(olsResults.resid, order=(a, b, c), freq='B')


> results = arima.fit(transparam=True, dynamic=True)
> prediction = prediction[b:] + results.predict(start=b,end=len(x)-1)
>
> OLS by itself did much better than ARIMA by itself. Doing the above
> procedure negligibly improved the results from OLS. Also, I don't quite
> understand the intuition behind this. Is this a way to approximate ARIMA?

Using the residuals in ARIMA will give you approximately the full ARIMAX.

(Doing it in two stages won't be quite right. OLS will not give
unbiased or consistent parameter estimates unless the exog are
strictly or strongly (?) exogenous or there is no autocorrelation in
the residuals. The standard errors are also wrong.)

The idea is that you use OLS to get the effect from your explanatory
variables. However, if there is autocorrelation in the residuals, then
the past residuals still contain information that you can use to get a
better short-term forecast. Using ARMA on the residuals can capture
that part of the forecast.
ARIMAX combines both in an efficient way.


ARIMA or ARIMAX is largely used for univariate forecasting. If you
have to forecast the next several periods, then you also need a
forecast of your explanatory variables. It might be difficult to
forecast the explanatory variables well enough that we actually do
better than the univariate ARIMA forecast.

(Alternatively, we could also regress on our explanatory variables
after they have been lagged by the number of periods that we want to
forecast; then we don't need future explanatory variables.)
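A tiny sketch of that lagged-regressor idea, reusing x and endogenous from earlier in the thread (the horizon and the new variable names are arbitrary):

    import pandas as pd
    import statsmodels.api as sm

    horizon = 5
    # regress today's A on explanatory variables lagged by the forecast horizon,
    # so no future values of the explanatory variables are needed at forecast time
    lagged = x.drop(endogenous, axis=1).shift(horizon)
    data = pd.concat([x[endogenous], lagged], axis=1).dropna()
    ols_lagged = sm.OLS(data[endogenous], data.drop(endogenous, axis=1)).fit()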

Josef

user1...@gmail.com

Nov 28, 2013, 10:20:31 PM
to pystat...@googlegroups.com
Sorry for so much back and forth. So I updated as advised:

olsResults = sm.OLS(train[endogenous], train.drop(endogenous,axis=1)).fit()
prediction = olsResults.predict(x.drop(endogenous,axis=1))
arima = tsa.ARIMA(olsResults.resid,order=(a,b,c),freq='B')
results = arima.fit(transparam=True, dynamic=True)
prediction = prediction[b:] + results.predict(start=b,end=len(x)-1)

The prediction part due to ARIMA had a negligible effect and most of the prediction was due to OLS. The prediction for ARIMA was

2009-01-05   -0.000028
2009-01-06    0.651377
2009-01-07   -0.276330
2009-01-08    2.556484
2009-01-09    0.508202
2009-01-12    0.678381
...
2013-11-05   -0.000028
2013-11-06   -0.000028
2013-11-07   -0.000028
2013-11-08   -0.000028
2013-11-11   -0.000028
2013-11-12   -0.000028
2013-11-13   -0.000028

Interestingly, it varies a lot in the training set, but it's more or less constant in the test set.

josef...@gmail.com

Nov 28, 2013, 10:57:01 PM
to pystatsmodels
On Thu, Nov 28, 2013 at 10:20 PM, <user1...@gmail.com> wrote:
> Sorry for the so much back and forth. So I updated as advised
>
> olsResults = sm.OLS(train[endogenous], train.drop(endogenous,axis=1)).fit()
> prediction = olsResults.predict(x.drop(endogenous,axis=1))
> arima = tsa.ARIMA(olsResults.resid,order=(a,b,c),freq='B')
> results = arima.fit(transparam=True, dynamic=True)
> prediction = prediction[b:] + results.predict(start=b,end=len(x)-1)
>
> The prediction part due to ARIMA had a negligible effect and most of the
> prediction was due to OLS. The prediction for ARIMA was

Yes, already from your small ARMA params in the initial version it
looked like there is little serial dependence left after taking
account of the explanatory variables.

You could run http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.diagnostic.acorr_breush_godfrey.html
on the OLS residuals to test if there is any autocorrelation in the
OLS residuals.
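A minimal sketch of that check (the lag count here is an arbitrary choice, and olsResults is the OLS fit from earlier; note that older statsmodels releases spell the function acorr_breush_godfrey, as in the link above):

    from statsmodels.stats import diagnostic

    # Breusch-Godfrey LM test for serial correlation in the OLS residuals;
    # returns (lm stat, lm p-value, F stat, F p-value)
    lm, lm_pval, fval, f_pval = diagnostic.acorr_breusch_godfrey(olsResults, nlags=5)
    print(lm_pval, f_pval)   # small p-values reject "no autocorrelation"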


>
> 2009-01-05 -0.000028
> 2009-01-06 0.651377
> 2009-01-07 -0.276330
> 2009-01-08 2.556484
> 2009-01-09 0.508202
> 2009-01-12 0.678381
> ...
> 2013-11-05 -0.000028
> 2013-11-06 -0.000028
> 2013-11-07 -0.000028
> 2013-11-08 -0.000028
> 2013-11-11 -0.000028
> 2013-11-12 -0.000028
> 2013-11-13 -0.000028
>
> Interestingly, it varies a lot in the training set, but it's more or less
> constant in the test set.

That's pretty much expected. If the estimated ARMA process is
stationary, then the forecast will eventually just converge to the
estimated mean.

Note, AFAICS you are doing long-term forecasting with the ARMA. ARMA
is usually more useful for short-term forecasting.
For example, if you want forecasts for the next five periods (business
days), then it would be more useful to roll the forecast over the test
set: make a 5-period forecast at day one, at day 2, ..., and compare
with the corresponding actual values.

In the training set, estimation essentially minimizes the one-period-ahead
forecast error (this might not be exactly true), and the one-step-ahead
forecasts will fluctuate a lot more.
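A rough sketch of such a rolling evaluation, reusing the variable names (x, endogenous, train, test, a, b, c) and the tsa namespace from earlier in the thread, and leaving out the exog part for brevity (refitting every day is slow; you could refit less often):

    horizon = 5
    errors = []
    for t in range(len(test) - horizon):
        # all observations up to "today"
        history = x[endogenous][:len(train) + t]
        fit = tsa.ARIMA(history, order=(a, b, c), freq='B').fit(disp=0)
        forecast = fit.forecast(steps=horizon)[0]   # point forecasts only
        actual = x[endogenous][len(train) + t : len(train) + t + horizon]
        errors.append(forecast - actual.values)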

Josef

user1...@gmail.com

Nov 28, 2013, 11:58:42 PM
to pystat...@googlegroups.com
Thanks for explaining all of this to me. I ran the autocorrelation test on the OLS results object and got the following

(32.066543637900963, 0.042594398399338193, 1.5842717051199988, 0.049691768714850607)

How do I interpret that?

I understand your explanation, but the fact that ARIMAX did so poorly and that this new approach to estimating ARIMAX isn't much different from OLS makes me really suspicious that I'm not doing this correctly. Time series is a pretty crucial part of my data and I am just really surprised ARIMAX isn't performing much better (even on test samples just a few days after the training period).

josef...@gmail.com

Nov 29, 2013, 12:23:59 AM
to pystatsmodels
On Thu, Nov 28, 2013 at 11:58 PM, <user1...@gmail.com> wrote:
> Thanks for explaining all of this to me. I tried the autocorrelation for the
> OLS results object and got the following
>
> (32.066543637900963, 0.042594398399338193, 1.5842717051199988,
> 0.049691768714850607)
>
> How do I interpret that?

0.043 and 0.0497 are the p-values for the test with the null hypothesis
that there is no serial correlation in the residuals, for the two
versions of the test.
This means we only barely reject the hypothesis of no autocorrelation. The
serial correlation that is picked up by ARIMA might be pretty small
then.


>
> I understand your explanation, but the fact that ARIMAX did so poorly and
> that this new approach to estimating ARIMAX isn't much different from OLS
> makes me really suspicious that I'm not doing this correctly. Time series is
> a pretty crucial part of my data and I am just really surprised ARIMAX isn't
> performing much better (even on test samples just a few days after the
> training period).

Without knowing anything about your data it's difficult to tell. It
could be that your explanatory variables already capture all the time
series features there are.

Suppose we want to predict whether it's raining tomorrow, and as the
explanatory variable we use whether people on the street carry
umbrellas.
Regress rain on umbrellas with OLS.
Then forecast the probability of rain for the next day, assuming that
we know (today) whether people will carry umbrellas tomorrow.
If there will be many umbrellas, then we predict rain. There is no
need for any additional time-series-based weather forecast.

(Minor problem: we still have to predict how many umbrellas will be on
the street.)

Josef