sparse vector autoregressive model


Sean Gibbons

May 9, 2016, 8:06:06 PM
to pystatsmodels
Hello,

I noticed a recent method for estimating sparse VAR models: http://amstat.tandfonline.com/doi/abs/10.1080/10618600.2015.1092978

I would love to see this incorporated into statsmodels! I'd offer to help, but I'm afraid my coding skills are meager.

Additionally, I wonder if there is an easy way to apply L1 and L2 regularization to VAR fitting? I notice there is a 'method' argument in the VAR.fit() function, but the only option is 'ols'. I tried passing other model fitting methods from scipy (e.g. ridge or lasso), but it didn't appear to have any effect (no matter what I set the method to, the output summary showed 'OLS').

cheers,
Sean

josef...@gmail.com

May 9, 2016, 9:12:33 PM
to pystatsmodels
brief answer: no, there is no alternative estimation method

longer answer: I never looked at what would need to be done to
implement this, or whether another VAR-type model, SVAR or VARMAX,
could be used and more easily extended for it.


Essentially, VAR is like OLS and just uses simple linear algebra to
get the estimates. It can do this because it only implements the
special case of multivariate endog where all equations depend on
exactly the same regressors.
In econometrics, this is seemingly unrelated regressions (SUR) with
identical regressors. The theoretical result is that MLE or GLS
(generalized least squares) reduces to just using OLS.
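As a minimal illustration (my own sketch with made-up data, not from the original message), the 'ols' fit of VAR is numerically the same as running one OLS per equation on the shared matrix of lagged values:

import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.api import VAR
from statsmodels.tsa.tsatools import lagmat

rng = np.random.default_rng(0)
y = rng.standard_normal((400, 3))  # three toy series

res_var = VAR(y).fit(2, trend='c')

# the same numbers by hand: one OLS per equation on the stacked lags
x = sm.add_constant(lagmat(y, maxlag=2, trim='both'), prepend=True)
for k in range(y.shape[1]):
    beta_k = sm.OLS(y[2:, k], x).fit().params
    assert np.allclose(beta_k, res_var.params[:, k])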

Once we deviate from the simple assumptions, e.g. by imposing
equation-specific penalization or parameter restrictions, we can
either use GLS with more complicated linear algebra or MLE with
numerical optimization.

The new VARMAX is built as a statespace model and uses MLE, AFAIR.
However, the statespace representation currently uses dense matrices,
and the statespace could get pretty large for a sparse VAR, I think.
(My guess is also that the statespace computation will be considerably
slower than an extension of the straightforward VAR, in spite of the
statespace code being in C/Cython, just because it does a lot of
computation in each period for the updating. The statespace framework
has a lot of other advantages that are more visible in models more
complicated than VAR.)


I don't think extending VAR would be really difficult, because we have
most of the setup and helper functions, but it would still be quite a
bit of work to get it implemented and tested.
Except for some maintenance I haven't worked on the VAR code, so the
above is all I know right now. We have a Vector Error Correction Model
(VECM) project this year in GSOC, and I will be looking a bit more at
VAR starting soon.


A quick glance at the paper: 54 pages is too long to print and read.
Appendix equations A.4 and A.5 have the two GLS equations that are
iterated over for the constrained estimation, and they refer to the
Lutkepohl book, which is the standard reference. I guess adding L2
regularization would be just a bit of adjustment in A.4 by adding an
informative "prior". I have no idea about L1; I haven't read anything
yet.

What's your use case and the dimension of your estimation problem,
i.e. the number of observations and the number of variables?


Josef


>
> cheers,
> Sean

josef...@gmail.com

May 9, 2016, 9:39:28 PM
to pystatsmodels
browsing a bit of code:
SVAR doesn't help, AFAICS; it just uses the same OLS VAR estimate but
imposes the constraints on the error covariance matrix.

VARMAX does standard MLE by nonlinear optimization and doesn't seem to
impose any restrictions/assumptions on the model the way VAR does. So I
think it would be possible to subclass it and add penalization and/or
add the restrictions on the parameters by transformation.

Josef

Sean Gibbons

May 9, 2016, 9:53:46 PM
to pystatsmodels
Hey Josef,

Thanks for the quick response! I work on microbial community data sets, so the dimensionality can be quite large. I have a time series with ~400 time points and ~500 species abundances per time point. I can trim that down a bit by just looking at the top X most abundant species.

If there were an easy pythonic way to implement the method from the sparse VAR paper, I'd be very happy :)

Please keep me apprised!

-sean

josef...@gmail.com

May 9, 2016, 10:56:56 PM
to pystatsmodels
On Mon, May 9, 2016 at 9:53 PM, Sean Gibbons <monsie...@gmail.com> wrote:
> Hey Josef,
>
> Thanks for the quick response! I work on microbial community data sets, so
> the dimensionality can be quite large. I have a time series with ~400 time
> points, with ~500 species abundances per time point. I can trim that down a
> bit by just looking at the top X abundant species.

400 data points sounds very small for the number of variables (species).

Are there assumptions on the covariance matrix of the error?
With X variables, an unconstrained error covariance matrix has
X * (X + 1) / 2 free parameters, and each equation has X * nlags mean parameters.
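For example, with X = 50 and nlags = 4, the numbers used later in the thread (my arithmetic):

>>> 50 * 51 // 2
1275 free parameters in the error covariance matrix

>>> 50 * 4
200 mean parameters per equation, i.e. 10,000 across the 50 equations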

(a bit later)
I looked a bit at the Google Flu example in the article to see how
they handle a large number of variables. The estimated model looks
very sparse relative to the full parameter space. I guess they also
estimate a lot of sparsity in the covariance matrix.

This kind of sparsity is not really an area where statsmodels has any
good foundation or examples yet, and I'm not sure whether anyone will
get to something like this soon. There are some interesting links to
spatial and network applications.

Josef

Sean Gibbons

May 9, 2016, 11:08:11 PM
to pystatsmodels
I usually work with the ~50 most abundant species when I try to model the time series. We can assume that the covariance matrix is sparse for these data.

josef...@gmail.com

May 10, 2016, 1:07:08 AM
to pystatsmodels
This looks like an interesting hot topic, with several new R packages
(unfortunately GPL licensed) and several papers or articles from
recent years.

sparsevar uses ENET and SCAD
BigVAR uses structured LASSO
Davis et al. don't seem to have the R code referenced in the article
online yet, AFAICS

Josef

Sean Gibbons

May 10, 2016, 8:48:48 AM
to pystatsmodels
Hey Josef,

Attached is the R code from the Davis paper. They were kind enough to send it to me. Their manuscript was recently accepted, so I assume they will make this publicly available very soon.

Do you think it would be statistically appropriate to sparsify a VAR by zeroing out coefficients that don't have a significant p-value (t-test), and then refitting the filtered model to generate new residuals? Is this something that the current code would accommodate?

-sean
sparseVAR_code_public.zip

josef...@gmail.com

May 10, 2016, 10:10:01 AM
to pystatsmodels
On Tue, May 10, 2016 at 8:48 AM, Sean Gibbons <monsie...@gmail.com> wrote:
> Hey Josef,
>
> Attached is the R code from the Davis paper. They were kind enough to send
> it to me. Their manuscript was recently accepted, so I assume they will make
> this publicly available very soon.

Thanks, I browsed it briefly to see what kind of functions they are
using. The code looks pretty long and seems to do everything from
scratch.
But I'm not allowed to read it if it's not license compatible.


>
> Do you think it would be statistically appropriate to sparsify a VAR by
> zeroing-out coefficients that don't give a significant p-value (t-test), and
> then feeding this filtered model back to generate new residuals from the

It's possible, and I also thought that pairwise Granger causality
testing could be used to pre-screen variables that are not related at
all.
However, that's the same issue as stepwise regression versus variable
selection by penalization. The latter usually has better theoretical
(oracle) properties for recovering the correct model, while the former
might select a model based just on the selection path, which might be
OK but comes with no guarantee of finding the best model.

> filtered model? Is this something that the current code would accommodate?

No, that was my initial reply. The current models are not set up for
zero or other constraints directly.

400 by 50 might look small, but the unrestricted VAR with 4 lags has
(if my quick calculations are correct)

>>> 50**2 * 4
10,000 parameters

>>> 50**2 * 4 * 400
4,000,000 elements in the unrestricted regressor matrix

>>> 50**2 * 4 * 400 * 50
200,000,000 elements in the stacked regressor matrix

which, most likely, rules out all approaches that just put zero
coefficients in the parameters of the current models. (Maybe it's
still feasible on a larger computer than mine.)

The error covariance matrix is only 50 by 50, so no problem there. My
guess is that the impulse response function and similar post-estimation
results also shouldn't cause any size problems, but the current code
might need adjustment to not assume all the data is available.

My best bet for estimation would be whitening the variables with the
error covariance and iterating over single-equation OLS. I merged
Kerby's elastic net last week, so L1/L2 with coordinate descent is
available for OLS. (We don't have group coordinate descent yet.)
However, this estimates the model in the transformed variables, and I
haven't checked what they do to get sparseness in the original
variables.
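A rough single-equation sketch of that direction (my own illustration, not Josef's plan, and skipping the GLS whitening step; alpha and l1_wt are placeholder tuning values):

import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.tsatools import lagmat

def penalized_var_by_equation(y, lags=4, alpha=0.1, l1_wt=1.0):
    # one elastic-net penalized OLS per equation on the shared lag matrix
    x = sm.add_constant(lagmat(y, maxlag=lags, trim='both'), prepend=True)
    coefs = []
    for k in range(y.shape[1]):
        res = sm.OLS(y[lags:, k], x).fit_regularized(
            method='elastic_net', alpha=alpha, L1_wt=l1_wt)
        coefs.append(res.params)
    return np.asarray(coefs)  # shape (neqs, 1 + neqs * lags)

With l1_wt=1.0 this is a plain Lasso per equation; l1_wt=0.0 gives ridge.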

Josef

josef...@gmail.com

May 10, 2016, 12:42:49 PM
to pystatsmodels
I don't see a real problem with iterating using single equation OLS.

I started to read parts of
Basu, Sumanta, and George Michailidis. "Regularized estimation in
sparse high-dimensional time series models." The Annals of Statistics
43.4 (2015): 1535-1567.
(to catch up with some theory, and referenced by sparsevar package)

They (section 4.2) switch to the moment matrices X'X and X'y, or the
weighted versions, which would reduce memory requirements a lot, but
we don't have estimators yet that are based only on sufficient
statistics.
(I didn't check what algorithm for block coordinate descent they use
in the online appendix.)
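A toy illustration of why that helps (my own, not from the paper): plain OLS only needs X'X and X'y, which can be accumulated in chunks so the full regressor matrix never has to be in memory at once:

import numpy as np

def ols_from_moments(chunks):
    # chunks yields (X_block, y_block) pairs read from disk or built on the fly
    xtx = xty = None
    for xb, yb in chunks:
        if xtx is None:
            xtx = np.zeros((xb.shape[1], xb.shape[1]))
            xty = np.zeros(xb.shape[1])
        xtx += xb.T @ xb
        xty += xb.T @ yb
    return np.linalg.solve(xtx, xty)

The penalized estimators would need their algorithms rewritten in terms of these moment matrices, which is what section 4.2 of the paper does.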


I need to stop reading and get back to the unscheduled detours that
I'm already working on.
I'm planning to return to penalization and regularization, but not yet.

Maybe someone else can get involved. BigVAR in R was a GSOC project.

Josef

josef...@gmail.com

May 10, 2016, 3:36:06 PM
to pystatsmodels
Correction: I was thinking of a different kind of transformation. IIRC, whitening transformations don't change the parameter space.

Josef

josef...@gmail.com

May 10, 2016, 5:54:52 PM
to pystatsmodels


I opened an issue to keep track of this
https://github.com/statsmodels/statsmodels/issues/2933

Josef

josef...@gmail.com

May 11, 2016, 10:16:26 PM
to pystatsmodels
(I couldn't resist)

I read their 2012 arXiv version because it has nicer formatting for reading on an iPad.

I didn't realize before that they were negative on LASSO, nice :)
(The typical biased results for parameter estimation by LASSO also occur in a time series context.)

a few comments:

Their distinction between Lasso-SS and Lasso-LL is more like OLS versus GLS. In Lasso-SS they don't even allow for different variances across equations, AFAICS. LL is not full MLE; it's iterative GLS.

I don't know how easy it is to implement the partial spectral coherence. My attempts at multivariate FFT 6 or 7 years ago were not successful, but that's a long time ago and scipy has changed in the meantime. nipy.nitime specialized in frequency-domain statistics and coherence (and Granger causality in the frequency domain, IIRC).
(One computational worry I have with this is that it needs to be calculated for many frequencies, as many as the number of time periods if I read it correctly.)

To calculate a large sparse VAR, I think going the route of single-equation OLS loops will be the easiest, and it will be relatively easy to plug in different forms of penalization or first-stage variable selection.
Both the Davis et al. and the Basu and Michailidis papers do something like this, either explicitly or implicitly.

Lasso is not very good at parameter estimation and variable selection in settings like this, but I guess it could be a good device for the initial screening, essentially as an alternative to the PSC.

SCAD or adaptive Lasso penalization wouldn't have the same bias/distortion problems as plain Lasso, and I would expect those to be competitive with any "ad-hoc" variable selection.

In terms of implementation:

If we can hold the entire array of regressors for a single equation in memory, then we can use existing tools that just use all regressors or loop over them, and we have the helper function to build the full regressor matrix with lags in one step.
In this case one tricky part will be keeping track of the indices for what is in and what is out.
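A small sketch of that housekeeping (a hypothetical helper, not existing statsmodels API): build the lag matrix once with labeled columns and keep a per-equation mask of which regressors are "in":

import numpy as np
from statsmodels.tsa.tsatools import lagmat

def build_var_design(y, names, lags):
    x = lagmat(y, maxlag=lags, trim='both')    # shape (nobs - lags, neqs * lags)
    cols = ['L%d.%s' % (lag, name) for lag in range(1, lags + 1) for name in names]
    in_mask = np.ones((y.shape[1], len(cols)), dtype=bool)  # start with everything included
    return x, cols, in_mask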

If we cannot fit the full regressor array in memory, then we need to create it on the fly, which makes the housekeeping more difficult, and existing models would need a callback instead of an array of explanatory variables.

As I mentioned before, the VAR post-estimation shouldn't have inherent memory problems, but the classes might not be designed to work without the full data arrays; I haven't checked this.

The Davis et al. working paper doesn't mention selection in the error covariance matrix and seems to use the full covariance matrix. Basu and Michailidis have a section on correlation thresholding that I didn't read. There is nothing in statsmodels for that yet, but AFAIK scikit-learn has several functions for it.


Aside:
Before this I only knew of factor VAR models for the sparse case. They are not sparse in individual effects; instead the dynamics are driven by a few factors (principal components). This is more appropriate if there are just a few shocks or driving factors in the environment that affect all individual pieces, like a recession or boom in a macroeconomy, or a drought, or not enough snow.
The sparsity in the case above will be more appropriate for spatial and network effects.
(Wasn't there a request for panel or spatial VAR?)


Josef

josef...@gmail.com

May 12, 2016, 10:22:24 AM
to pystatsmodels
Completely aside:
GEE would also estimate the same kind of VAR model for the full parameterization. It allows for restrictions, but, AFAIR, only on all equations' regressors at the same time
(interpreting groups as time periods, so we have contemporaneous or spatial correlation, but no intertemporal correlation).

---
Stata allows constraints in var, and the manual has an example for zero constraints based on p-values.


I'm putting this on my GSOC todo list now, but only the core large sparse VAR with restrictions on individual parameters or on single equations without cross-equation restrictions, and only for the pretest estimator, i.e. variable selection by t- and F-tests without penalization, in the first round of coding.
(The excuse is that I need to get back into VAR for mentoring, and this looks like a digestible amount of work.)
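A naive sketch of that pretest idea (my illustration only, not the planned statsmodels implementation): per equation, drop regressors whose OLS p-value exceeds a threshold and refit on the survivors.

import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.tsatools import lagmat

def pretest_equation(y, k, lags=4, alpha=0.05):
    x = sm.add_constant(lagmat(y, maxlag=lags, trim='both'), prepend=True)
    yk = y[lags:, k]
    full = sm.OLS(yk, x).fit()
    # keep significant regressors plus the constant in column 0
    keep = np.union1d(np.flatnonzero(full.pvalues <= alpha), [0])
    return keep, sm.OLS(yk, x[:, keep]).fit()

As noted earlier in the thread, this inherits the usual pretest/stepwise caveats compared to penalization.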

Josef
