Opposite of np.diff(np.log(data), axis=0)


burakb

Aug 22, 2012, 10:00:08 AM
to pystat...@googlegroups.com
Hi again. After forecasting using statsmodels.tsa, how do I reverse np.diff(np.log(data), ..) on the predicted time series, since the prediction will be on this new scale? Should I first do np.cumsum and then np.exp?

Nathaniel Smith

Aug 22, 2012, 10:13:01 AM
to pystat...@googlegroups.com
np.cumsum is not quite the inverse of np.diff, because np.diff throws
away information -- you can't tell from the output of np.diff what the
first entry in the original array was. So you need to stick that value
back on the front of the array before you call cumsum.

In [18]: d = np.random.exponential(size=10)
In [19]: transformed_d = np.diff(np.log(d))
In [20]: np.allclose(d,
np.exp(np.cumsum(np.concatenate(([np.log(d[0])], transformed_d)))))
Out[20]: True
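Nathaniel's recipe can be wrapped in a small reusable helper. Note that `undiff_log` is a name made up here for illustration, not a numpy function:

```python
import numpy as np

def undiff_log(transformed, first_value):
    # Rebuild the log-series by prepending log(first_value) to the
    # differences, then cumulative-sum and exponentiate back to levels.
    logs = np.concatenate(([np.log(first_value)], transformed))
    return np.exp(np.cumsum(logs))

d = np.random.exponential(size=10)
recovered = undiff_log(np.diff(np.log(d)), d[0])
print(np.allclose(d, recovered))  # True
```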

-n

Skipper Seabold

Aug 22, 2012, 10:19:10 AM
to pystat...@googlegroups.com
On Wed, Aug 22, 2012 at 10:00 AM, burakb <burakb...@gmail.com> wrote:
Hi again. After forecasting using statsmodels.tsa, how do I reverse np.diff(np.log(data), ..) on the predicted time series, since the prediction will be on this new scale? Should I first do np.cumsum and then np.exp?

You've turned the data into (roughly) percentage changes with diff(log(.))

In [5]: (52.5 - 51.1)/51.1
Out[5]: 0.027397260273972573

In [6]: np.log(52.5) - np.log(51.1)
Out[6]: 0.027028672387919173


So you're forecasting percentage changes. You can treat them as such to back out the quantity that you want. For instance, you could do something like this to get the predicted cumulative change in levels, based on some known starting point (e.g., the last value of the series you have):

base_level = 50 # in original units
predicted_change = np.array([.05, .12, .025])
predicted_level = base_level * np.cumprod(1+predicted_change)

hth,

Skipper

burakb

Aug 22, 2012, 10:46:10 AM
to pystat...@googlegroups.com
Thank you both

Skipper Seabold

Aug 22, 2012, 11:06:30 AM
to pystat...@googlegroups.com
On Wed, Aug 22, 2012 at 10:58 AM, burakb <burakb...@gmail.com> wrote:
Hmm.. now that things are working, the prediction seems to be way off. It climbs too fast.

import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.api import VAR
import matplotlib.pyplot as plt
 
def pad(data):
    bad_indexes = np.isnan(data)
    good_indexes = np.logical_not(bad_indexes)
    good_data = data[good_indexes]
    interpolated = np.interp(bad_indexes.nonzero()[0], good_indexes.nonzero()[0], good_data)
    data[bad_indexes] = interpolated
    return data

data = np.genfromtxt('/home/burak/daily3.csv', skip_header=1, delimiter=',')

a = np.squeeze(data[:,0])
b = np.squeeze(data[:,1])

data = np.vstack((a,b)).T
last = data[-1,:]
print "last", last

data = np.diff(np.log(data), axis=0)
data = np.apply_along_axis(pad, 0, data)

model = VAR(data)
res = model.fit(4)

f = res.forecast(data[4:], 30)
print f

a=np.append(a,last[0] * np.cumprod(1+f[:,0]))
b=np.append(b,last[1] * np.cumprod(1+f[:,1]))

plt.plot(range(len(a)),a,'.')
plt.show()
plt.plot(range(len(b)),b,'.')
plt.show()

If you want to run it, I can send the data.

Is the model you estimated stationary?

np.abs(res.roots)

or

res.is_stable()

josef...@gmail.com

Aug 22, 2012, 11:22:50 AM
to pystat...@googlegroups.com
And how large and well estimated is the constant in the VAR in
differences? What is its standard deviation? Is it much different from
the mean of the differenced series?

Nathaniel's version would use the same transformation as the original.
I don't know what the approximation error is for discrete steps in
Skipper's product version; it depends on the size of f. (I think it
can be large if f is large, and there might even be a way to calculate
the direction of the bias. It won't be relevant if f is small.)
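A quick numeric illustration of this point (my own sketch): since exp(f) >= 1 + f, the cumprod version always undershoots the exact exp(cumsum(.)) inverse, and the gap grows with the step size:

```python
import numpy as np

base = 50.0
for step in (0.005, 0.05, 0.3):          # small vs large log-differences
    f = np.full(10, step)
    exact = base * np.exp(np.cumsum(f))  # exact inverse of diff(log(.))
    approx = base * np.cumprod(1 + f)    # percentage-change approximation
    rel_err = (approx[-1] - exact[-1]) / exact[-1]
    print(step, rel_err)                 # negative, growing in magnitude
```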

Josef

burakb

Aug 22, 2012, 3:22:59 PM
to pystat...@googlegroups.com
I tested the code with macrodata.csv too, and the direction of the prediction is always the same. It sort of fits realcons, but the realinv forecast does not fit. There is no interpolation in this one, so the error cannot be there. Unless I am doing something else wrong, of course. I used 4 lags and predicted 30 quarters ahead.


import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.api import VAR
import matplotlib.pyplot as plt

data = np.genfromtxt('macrodata.csv', skip_header=1, delimiter=',')
data = data[:,3:5]
orig = data[:]

data = np.diff(np.log(data), axis=0)


model = VAR(data)
res = model.fit(4)

f = res.forecast(data[4:], 30)
print f

a=np.append(orig[:,0],orig[-1,0] * np.cumprod(1+f[:,0]))
b=np.append(orig[:,1],orig[-1,1] * np.cumprod(1+f[:,1]))


plt.plot(range(len(a)),a,'.')
plt.show()
plt.plot(range(len(b)),b,'.')
plt.show()


Skipper Seabold

Aug 22, 2012, 3:34:29 PM
to pystat...@googlegroups.com
On Wed, Aug 22, 2012 at 3:22 PM, burakb <burakb...@gmail.com> wrote:
I tested the code with macrodata.csv too, and the direction of the prediction is always the same. It sort of fits realcons, but the realinv forecast does not fit. There is no interpolation in this one, so the error cannot be there. Unless I am doing something else wrong, of course. I used 4 lags and predicted 30 quarters ahead.


import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.api import VAR
import matplotlib.pyplot as plt

data = np.genfromtxt('macrodata.csv', skip_header=1, delimiter=',')
data = data[:,3:5]
orig = data[:]

data = np.diff(np.log(data), axis=0)


model = VAR(data)
res = model.fit(4)

f = res.forecast(data[4:], 30)

Again, you want negative indexing here to forecast beyond the series, unless this is intended for comparison.
 
print f

a=np.append(orig[:,0],orig[-1,0] * np.cumprod(1+f[:,0]))
b=np.append(orig[:,1],orig[-1,1] * np.cumprod(1+f[:,1]))


plt.plot(range(len(a)),a,'.')
plt.show()
plt.plot(range(len(b)),b,'.')
plt.show()


What do you expect this to look like? Both of the original series have trends in them. This continued upward trend is expected, but as you see, the prediction isn't very useful beyond a few periods out. It just converges to the long-run expectation of the data so that the predicted level becomes (log-) linear.
 

justin

Aug 22, 2012, 6:19:10 PM
to pystat...@googlegroups.com
Here is my final blog post for GSOC.

http://landopystat.blogspot.com/

As I noted on the blog, I only have one pull request (out of 4 branches)
ready to be merged, but do not fear: just because GSOC is over doesn't
mean I'm done.

burakb

Aug 23, 2012, 2:34:09 AM
to pystat...@googlegroups.com
I ran

print np.abs(res.roots)
print res.is_stable()

on the fitted model for macrodata.csv; the results are

[ 2.04788417  1.71790585  1.66206368  1.66206368  1.65242305  1.65242305
  1.62915139  1.62915139]

True

burakb

Aug 23, 2012, 2:37:52 AM
to pystat...@googlegroups.com
Negative indexing? Do you mean using

f = res.forecast(data[-4:], 30)

instead of

f = res.forecast(data[4:], 30)

burakb

Aug 23, 2012, 7:12:54 AM
to pystat...@googlegroups.com

This is the R version

library("vars")

file = "daily.csv"
a <- read.csv(file, header = TRUE, sep = ",", na.strings="")

impute.med <- function(x) {
    z <- median(x, na.rm = TRUE)
    x[is.na(x)] <- z
    return(x)
}

a2 <- sapply(a, function(x){
    if(is.numeric(x) & any(is.na(x))){
            impute.med(x)
        } else {
            x
        }
    }
)

out <- VAR(a2, p = 2, type = "const")
out.prd <- predict(out, n.ahead = 30, ci = 0.95)

The out.prd variable contains the 30-day forecasts. I didn't have to remember last values, log, diff, etc.

Skipper Seabold

Aug 23, 2012, 11:32:44 AM
to pystat...@googlegroups.com
On Thu, Aug 23, 2012 at 2:37 AM, burakb <burakb...@gmail.com> wrote:
Negative indexing? Do you mean using

f = res.forecast(data[-4:], 30)

instead of

f = res.forecast(data[4:], 30)

Yes, you probably want to use the last four values of your series to forecast from, not everything from index 4 onward.

In [1]: a = range(1, 11)

In [2]: a[4:]
Out[2]: [5, 6, 7, 8, 9, 10]

In [3]: a[-4:]
Out[3]: [7, 8, 9, 10]


Skipper Seabold

Aug 23, 2012, 11:33:04 AM
to pystat...@googlegroups.com
I'm not sure what you're getting at here. You can do the same thing in Python and not have to think about your data too.

import statsmodels.api as sm
import pandas

a = pandas.read_csv("daily_2.csv", header=None)
a2 = a.fillna(dict(zip(a.columns, a.median(axis=0))))

var_mod = sm.tsa.VAR(a2.values)
var_res = var_mod.fit(4)

Unfortunately, I can't fit this model in Python or R with the data you sent me earlier because it's singular after the imputation. If you care to resend, I can see what their predict can do that ours cannot.
