
Mar 1, 2017, 7:19:26 AM

to stan-...@googlegroups.com

Hi folks,

I came up with an idea to speed up sampling in Gaussian processes when X is sampled on a grid, in which case the pairwise distance matrix has high redundancy (nX unique distances repeated over the (nX^2-nX)/2 off-diagonal cells of the matrix in the 1D case). By pre-computing the unique pairwise distances and their indices in the full pairwise distance matrix, you can compute the unique values of the subsequent covariance matrix and then fill the full covariance matrix using the aforementioned indices.

This ends up being about 2x faster than a standard GP (also faster than a GP where the pairwise distances are pre-computed, but uniqueness isn't considered) when the covariance matrix is computed by hand.
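To illustrate the idea, here is a minimal pure-Python sketch of the unique-distances trick for a squared-exponential kernel (the name cov_exp_quad_grid and all details are illustrative only, not Stan code):

```python
import math

def cov_exp_quad_grid(x, alpha, rho):
    """Squared-exponential covariance, evaluating the kernel only once
    per unique pairwise distance. Illustrative sketch, not Stan code."""
    n = len(x)
    uniq = {}                          # distance -> index into the unique list
    idx = [[0] * n for _ in range(n)]  # idx[i][j] points at the unique distance
    for i in range(n):
        for j in range(n):
            # exact float keys are fine here because grid differences are exact
            d = abs(x[i] - x[j])
            if d not in uniq:
                uniq[d] = len(uniq)
            idx[i][j] = uniq[d]
    # kernel evaluated once per unique distance, then scattered back
    dists = sorted(uniq, key=uniq.get)
    k = [alpha ** 2 * math.exp(-0.5 * (d / rho) ** 2) for d in dists]
    return [[k[idx[i][j]] for j in range(n)] for i in range(n)]
```

On a regular 1D grid of n points the kernel is evaluated only n times instead of O(n^2) times, which is where the saving comes from.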

Now, I will note that computing on the unique distances and then filling the covariance matrix by hand *isn't* much faster than using the built-in cov_exp_quad() function. Well, I have observed a 30% speedup over cov_exp_quad() in some situations, but that's not much to rave about. But while I have no idea what magic is happening in cov_exp_quad() to make it so fast, I wonder if there might be room to achieve further speedups (in, say, a new cov_exp_quad_grid() function that takes in the unique distances and the indices thereof) by combining my ideas with whatever sorcery cov_exp_quad() is doing internally.

Thoughts?

Oh, and here's a gist demonstrating the unique-distances approach, with comparisons to cov_exp_quad() and to the non-unique-but-pre-computed-distances approach:

Mike

--

Mike Lawrence

Graduate Student

Department of Psychology & Neuroscience

Dalhousie University

~ Certainty is (possibly) folly ~


Mar 1, 2017, 8:29:35 PM

to stan-...@googlegroups.com, Aki Vehtari, Seth Flaxman, Ben Goodrich, Jonah Sol Gabry, Rob Trangucci, Michael Betancourt

Cool! We should perhaps create a GP wiki where we put all these ideas. In the meantime I'm cc-ing Team GP.

A

--

You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.

To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.

To post to this group, send email to stan-...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Mar 1, 2017, 9:12:22 PM

to stan-...@googlegroups.com

It's always a good idea to share repeated computations.

The built-in C++ functions can also include analytic gradients, which can make things go faster.

I hope someone who understands GPs better than me will have a better idea of what to do. What we really need to do is build some of these ideas into the underlying matrix math efficiently. The problem we're facing now is how to control the sprawl of specialized functions in Stan.

- Bob


Mar 1, 2017, 9:21:36 PM

to stan-...@googlegroups.com

Regarding speed-ups, one thing I've been thinking about lately is the way that lme4 speeds up computation by pre-computing X^tX.

In each step of the outer iteration, lme4 needs to compute X_*^tX_*, where X_* is an augmented version of X. But in typical problems, most of this computation comes from the part of this that is X^tX.

The thing that's bugging me, though, is that X^tX is part of the (analytic) solution of the conditional MLE but it's _not_ part of the likelihood computation. So it's not quite clear how Stan could make use of this pre-computation. And maybe it's not possible to do this at all. But I'd like to think it is.

A


Mar 2, 2017, 5:41:53 AM

to Stan users mailing list

Hey Mike, are you talking about a regular grid? Sounds to me like what you have in mind is Toeplitz / circulant structure (e.g. https://web.stanford.edu/~shenoy/GroupPublications/CunninghamShenoySahaniICML2008.pdf), and indeed there are significant speed-ups to be had by exploiting this structure. But the problem, as Bob discusses, is that getting any of this implemented in Stan has proved to be quite the project in terms of the plethora of specialized functions it all requires. And in this case, I don't think Eigen has circulant / Toeplitz built in (but maybe I just missed it while searching), so that's another factor to consider.

Seth


Mar 2, 2017, 10:57:13 AM

to stan-...@googlegroups.com

The unique-distances trick works best on a regular grid, but so long as there is redundancy amongst the pairwise distances, there will be reduced computation compared to the naive approach.

Unfortunately that paper is very much beyond my current state of expertise to decipher.


Mar 2, 2017, 1:18:16 PM

to Stan users mailing list

> But where I have no idea what magic is happening in cov_exp_quad() to make it so fast,

In the current version most of the speedup comes from making the gradient computation faster by using analytic gradients for the end result of the computation instead of autodiffing through the small parts of the computation.

A similar speed-up can be obtained in other cases where such a compound function can considerably reduce the size of the autodiff expression tree.

> I came up with an idea to speed up sampling on Gaussian Processes when X is sampled on a grid, in which case the pairwise distance matrix has high redundancy (nX unique distances repeated over (nX^2-nX)/2 cells of the matrix for the 1D X case). By pre-computing the unique pairwise distances and their indices in the full pairwise distance matrix, you can compute unique values for the subsequent covariance matrix then fill the full covariance matrix using the aforementioned indices.

This is called a Toeplitz matrix: https://en.wikipedia.org/wiki/Toeplitz_matrix

In addition to speeding up the construction of the Toeplitz matrix, you can do more.

For an SPD Toeplitz matrix you need to store only n elements instead of the usual n^2.

You can solve a matrix equation and compute a matrix decomposition in O(n^2) instead of the usual O(n^3).

And if you embed the Toeplitz matrix in a circulant matrix, you can compute matrix-vector products with the FFT in O(n log(n)).

Thus even if you don't go all the way to the FFT, you could still get a significant speedup from the reduced autodiff tree and from avoiding the O(n^3) Cholesky, even without writing a compound function.

Mike, if you are interested in looking into this more, then google "cholesky decomposition of toeplitz matrix" and you can find simple algorithms for computing the Cholesky in O(n^2).

Aki
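To make the circulant-embedding point above concrete, here is a minimal pure-Python sketch (toy radix-2 FFT, so n must be a power of two; all names are illustrative only, not Stan code). A symmetric Toeplitz matrix is stored as its first column only, embedded in a 2n circulant, and multiplied against a vector in O(n log n):

```python
import cmath

def fft(a):
    """Recursive radix-2 FFT (length must be a power of two)."""
    n = len(a)
    if n == 1:
        return a[:]
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(a):
    """Inverse FFT via the conjugate trick."""
    n = len(a)
    y = fft([x.conjugate() for x in a])
    return [x.conjugate() / n for x in y]

def toeplitz_matvec(first_col, v):
    """Multiply a symmetric Toeplitz matrix (given by its first column,
    i.e. only n stored elements) by v, via embedding in a 2n circulant."""
    n = len(first_col)
    # circulant first column: [c0, ..., c_{n-1}, pad, c_{n-1}, ..., c1]
    c = first_col + [0.0] + first_col[:0:-1]
    vv = v + [0.0] * n
    fc = fft([complex(x) for x in c])
    fv = fft([complex(x) for x in vv])
    prod = ifft([a * b for a, b in zip(fc, fv)])
    return [x.real for x in prod[:n]]
```

The FFT turns the circular convolution into an elementwise product, which is where the O(n log n) cost comes from; a production version would of course use an optimized FFT rather than this toy recursion.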

Mar 2, 2017, 11:59:21 PM

to stan-...@googlegroups.com

It's trivial to precompute X' * X.

data {
  int<lower=1> K;
  matrix[K, K] X;
}
transformed data {
  matrix[K, K] Xtr_X = X' * X;
}

But you didn't say what X_* was or how X_*^t X_* relates to X^t X.

- Bob


Mar 4, 2017, 9:23:56 PM

to stan-...@googlegroups.com

Hi, X_* and Sigma_* are defined in equation (14.24) of BDA3. Perhaps we can discuss more at the Stan meeting.

Mar 5, 2017, 1:04:01 PM

to Stan users mailing list, gel...@stat.columbia.edu

On Thursday, March 2, 2017 at 4:21:36 AM UTC+2, Andrew Gelman wrote:

> Regarding speed-ups, one thing I've been thinking about lately is the way that lme4 speeds up computation by pre-computing X^tX.
>
> In each step of the outer iteration, lme4 needs to compute X_*^tX_*, where X_* is an augmented version of X. But in typical problems, most of this computation comes from the part of this that is X^tX.
>
> The thing that's bugging me, though, is that X^tX is part of the (analytic) solution of the conditional MLE but it's _not_ part of the likelihood computation. So it's not quite clear how Stan could make use of this pre-computation. And maybe it's not possible to do this at all. But I'd like to think it is.

You can re-use the computation easily if the prior for the linear model weights is Gaussian and the observation model is Gaussian. With a non-Gaussian observation model it's still possible, but it then requires a latent-variable approach, which may remove the computational benefit when using MCMC to integrate over the latent-value posterior.

Aki.
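As a concrete sketch of the Gaussian-Gaussian case described above: with beta ~ N(0, tau^2 I) and y ~ N(X beta, sigma^2 I), the posterior precision is X^tX/sigma^2 + I/tau^2, so X^tX and X^ty can be computed once and re-used for any (sigma, tau). Pure Python with illustrative names only (a real implementation would use a linear-algebra library):

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (naive)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ridge_posterior_mean(XtX, Xty, sigma2, tau2):
    """Posterior mean of beta under beta ~ N(0, tau2 I), y ~ N(X beta, sigma2 I).
    XtX and Xty are precomputed once; only this K x K solve repeats."""
    K = len(XtX)
    A = [[XtX[i][j] / sigma2 + (1.0 / tau2 if i == j else 0.0)
          for j in range(K)] for i in range(K)]
    b = [v / sigma2 for v in Xty]
    return solve(A, b)
```

Note that the per-iteration cost depends only on K (the number of predictors), not on the number of observations, which is exactly the saving lme4 exploits.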
