Proposal to speed up grid-based GPs


Mike Lawrence

Mar 1, 2017, 7:19:26 AM3/1/17
to stan-...@googlegroups.com
Hi folks,

I came up with an idea to speed up sampling for Gaussian processes when X is sampled on a grid, in which case the pairwise distance matrix has high redundancy (in the 1D case, only nX-1 unique nonzero distances repeated across the (nX^2-nX)/2 cells of the upper triangle). By pre-computing the unique pairwise distances and their indices in the full pairwise distance matrix, you can compute the covariance for each unique distance and then fill the full covariance matrix using the aforementioned indices.

This ends up being about 2x faster than a standard GP (also faster than a GP where the pairwise distances are pre-computed, but uniqueness isn't considered) when the covariance matrix is computed by hand.
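Here's a rough sketch of what I mean in Stan (variable names are illustrative; the actual code is in the gist linked below). The unique squared distances and an index matrix mapping each cell of the covariance matrix onto its unique distance are pre-computed and passed as data, so only n_unique kernel evaluations are needed per iteration:

data {
  int<lower=1> n;                              // number of grid points
  vector[n] y;                                 // observations
  int<lower=1> n_unique;                       // number of unique pairwise distances (including 0)
  vector[n_unique] unique_dist_sq;             // squared unique distances
  int<lower=1, upper=n_unique> dist_id[n, n];  // which unique distance each cell uses
}
parameters {
  real<lower=0> eta;   // marginal SD
  real<lower=0> rho;   // length-scale
}
model {
  matrix[n, n] Sigma;
  vector[n_unique] k_unique;
  for (u in 1:n_unique)
    k_unique[u] = square(eta) * exp(-unique_dist_sq[u] / (2 * square(rho)));
  for (i in 1:n) {
    Sigma[i, i] = k_unique[dist_id[i, i]] + 1e-8;  // jitter on the diagonal
    for (j in (i + 1):n) {
      Sigma[i, j] = k_unique[dist_id[i, j]];
      Sigma[j, i] = Sigma[i, j];
    }
  }
  eta ~ normal(0, 1);
  rho ~ normal(0, 1);
  y ~ multi_normal(rep_vector(0, n), Sigma);
}

The index matrix dist_id and the vector unique_dist_sq would be computed once (in R, or in transformed data) before sampling.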

Now, I will note that computing on the unique distances and then filling the covariance matrix by hand *isn't* much faster than using the built-in cov_exp_quad() function. I have observed a 30% speedup over cov_exp_quad() in some situations, but that's not much to rave about. But since I have no idea what magic is happening inside cov_exp_quad() to make it so fast, I wonder if there might be room for further speedups (in, say, a new cov_exp_quad_grid() function that takes the unique distances and their indices) by combining my idea with whatever sorcery cov_exp_quad() is doing internally.

Thoughts?

Oh, and here's a gist demonstrating the unique-distances approach, with comparisons to cov_exp_quad() and the non-unique-but-pre-computed-distances approach:


Mike

--
Mike Lawrence
Graduate Student
Department of Psychology & Neuroscience
Dalhousie University

~ Certainty is (possibly) folly ~

Andrew Gelman

Mar 1, 2017, 8:29:35 PM3/1/17
to stan-...@googlegroups.com, Aki Vehtari, Seth Flaxman, Ben Goodrich, Jonah Sol Gabry, Rob Trangucci, Michael Betancourt
Cool!  We should perhaps create a GP wiki where we put all these ideas.  In the meantime I'm cc-ing Team GP.
A


Bob Carpenter

Mar 1, 2017, 9:12:22 PM3/1/17
to stan-...@googlegroups.com
It's always a good idea to share repeated computations.

The built-in C++ functions can also include analytic gradients,
which can make things faster.

I hope someone who understands GPs better than I do will have
a better idea of what to do.

What we really need to do is build some of these ideas into
the underlying matrix math efficiently.

The problem we're facing now is how to control the sprawl of
specialized functions in Stan.

- Bob


Andrew Gelman

Mar 1, 2017, 9:21:36 PM3/1/17
to stan-...@googlegroups.com
Regarding speed-ups, one thing I've been thinking about lately is the way that lme4 speeds up computation by pre-computing X^tX.
In each step of the outer iteration, lme4 needs to compute X_*^tX_*, where X_* is an augmented version of X. But in typical problems, most of this computation comes from the X^tX part.

The thing that's bugging me, though, is that X^tX is part of the (analytic) solution of the conditional MLE but it's _not_ part of the likelihood computation. So it's not quite clear how Stan could make use of this pre-computation. And maybe it's not possible to do this at all. But I'd like to think it is.

A

Seth Flaxman

Mar 2, 2017, 5:41:53 AM3/2/17
to Stan users mailing list
Hey Mike, are you talking about a regular grid? Sounds to me like what
you have in mind is Toeplitz / circulant structure (e.g.
https://web.stanford.edu/~shenoy/GroupPublications/CunninghamShenoySahaniICML2008.pdf)
and indeed there are significant speed-ups to be had by exploiting
this structure. But the problem, as Bob discusses, is that getting any
of this implemented in Stan has proved to be quite the project in
terms of the plethora of specialized functions it all requires. And in
this case, I don't think Eigen has circulant / Toeplitz built in (but
maybe I just missed it while searching), so that's another factor to
consider.

Seth

Mike Lawrence

Mar 2, 2017, 10:57:13 AM3/2/17
to stan-...@googlegroups.com
The unique-distances trick works best on a regular grid, but so long as there is redundancy amongst the pairwise distances, there will be reduced computation compared to the naive approach. 

Unfortunately that paper is very much beyond my current state of expertise to decipher.



Aki Vehtari

Mar 2, 2017, 1:18:16 PM3/2/17
to Stan users mailing list
> But since I have no idea what magic is happening inside cov_exp_quad() to make it so fast,

In the current version, most of the speedup comes from making the gradient computation faster by using analytic gradients for the end result of the computation instead of autodiffing through each small part of the computation.
A similar speed-up can be obtained in other cases where such a compound function can considerably reduce the size of the autodiff expression tree.

> I came up with an idea to speed up sampling for Gaussian processes when X is sampled on a grid, in which case the pairwise distance matrix has high redundancy (in the 1D case, only nX-1 unique nonzero distances repeated across the (nX^2-nX)/2 cells of the upper triangle). By pre-computing the unique pairwise distances and their indices in the full pairwise distance matrix, you can compute the covariance for each unique distance and then fill the full covariance matrix using the aforementioned indices.

This is called a Toeplitz matrix: https://en.wikipedia.org/wiki/Toeplitz_matrix
In addition to speeding up the construction of the Toeplitz matrix, you can do more.
For an SPD Toeplitz matrix you need to store only n elements instead of the usual n^2.
You can solve matrix equations and compute a matrix decomposition in O(n^2) instead of the usual O(n^3).
And if you embed the Toeplitz matrix in a circulant matrix, you can compute matrix-vector products with the FFT in O(n log(n)).

Thus even if you don't go to the FFT, you could still get a significant speedup from a reduced autodiff tree and from avoiding the O(n^3) Cholesky, even without writing a compound function.

Mike, if you are interested in looking into this further, google "cholesky decomposition of toeplitz matrix" and you will find simple algorithms for computing the Cholesky in O(n^2); a rough sketch follows below.
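For concreteness, here is a minimal sketch of that O(n^2) recursion as a Stan user-defined function (the name toeplitz_cholesky is illustrative, not an existing built-in). It takes the first column of an SPD Toeplitz matrix and returns the lower-triangular Cholesky factor via the classical Schur/Bareiss-style generator recursion:

functions {
  // Lower Cholesky factor of an SPD Toeplitz matrix given its first column r.
  // Runs in O(n^2) time by updating a pair of length-n generator vectors.
  matrix toeplitz_cholesky(vector r) {
    int n = rows(r);
    matrix[n, n] L = rep_matrix(0, n, n);
    row_vector[n] g1 = (r / sqrt(r[1]))';
    row_vector[n] g2 = g1;
    g2[1] = 0;
    L[1:n, 1] = g1';
    // shift g1 one place to the right
    for (j in 1:(n - 1)) g1[n - j + 1] = g1[n - j];
    g1[1] = 0;
    for (i in 2:n) {
      real rho = -g2[i] / g1[i];
      real scale = sqrt((1 - rho) * (1 + rho));
      for (j in i:n) {
        real a = g1[j];
        real b = g2[j];
        g1[j] = (a + rho * b) / scale;
        g2[j] = (rho * a + b) / scale;
      }
      L[i:n, i] = g1[i:n]';
      if (i < n) {
        for (j in 1:(n - i)) g1[n - j + 1] = g1[n - j];
        g1[i] = 0;
      }
    }
    return L;
  }
}

With the covariance built from the unique distances (the first row of the Toeplitz matrix), this replaces the O(n^3) cholesky_decompose() with an O(n^2) loop, though the gradients would still be autodiffed through the loop unless it were implemented in C++ with analytic gradients.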

Aki

Bob Carpenter

Mar 2, 2017, 11:59:21 PM3/2/17
to stan-...@googlegroups.com
It's trivial to precompute X' * X.

data {
  int<lower=1> K;
  matrix[K, K] X;
}
transformed data {
  matrix[K, K] Xtr_X = X' * X;
}

But you didn't say what X_* was or how X_*' * X_* relates
to X' * X.

- Bob

Andrew Gelman

Mar 4, 2017, 9:23:56 PM3/4/17
to stan-...@googlegroups.com
Hi, X_* and Sigma_* are defined in equation (14.24) of BDA3. We can discuss more at the Stan meeting, perhaps.

Aki Vehtari

Mar 5, 2017, 1:04:01 PM3/5/17
to Stan users mailing list, gel...@stat.columbia.edu
On Thursday, March 2, 2017 at 4:21:36 AM UTC+2, Andrew Gelman wrote:
> Regarding speed-ups, one thing I've been thinking about lately is the way that lme4 speeds up computation by pre-computing X^tX.
> In each step of the outer iteration, lme4 needs to compute X_*^tX_*, where X_* is an augmented version of X. But in typical problems, most of this computation comes from the X^tX part.
>
> The thing that's bugging me, though, is that X^tX is part of the (analytic) solution of the conditional MLE but it's _not_ part of the likelihood computation. So it's not quite clear how Stan could make use of this pre-computation. And maybe it's not possible to do this at all. But I'd like to think it is.

You can re-use the computation easily if the prior for the linear model weights is Gaussian and the observation model is Gaussian. In the case of a non-Gaussian observation model it's still possible, but it then requires a latent-variable approach, which may remove the computational benefit when using MCMC to integrate over the latent-value posterior.
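To spell out the fully Gaussian case: with y ~ N(X beta, sigma^2 I) and prior beta ~ N(0, tau^2 I), the posterior for beta depends on the data only through X^tX and X^ty,

  beta | y ~ N(A^{-1} X^ty / sigma^2, A^{-1}),  where A = X^tX / sigma^2 + I / tau^2,

so X^tX and X^ty can be pre-computed once (e.g. in transformed data) and reused at every evaluation.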

Aki. 