Issues with inference on lengthscale parameter of Gaussian process in Stan

Rachael Meager

Mar 21, 2016, 11:38:43 AM
to Stan users mailing list
Hi Folks, 

First, thank you all so much for rstan and for staffing this forum. My research would be approximately 5000 times harder without you and I'm forever grateful.

Right now, I'm simulating Gaussian processes in rstan and trying to fit models for inference about the covariance functions. I find that this sometimes produces relatively poor inference on the covariance kernel's lengthscale parameter ("rho_sq"), and I'm wondering if I'm making some obvious mistake that I just don't have the experience to spot. I've attached my R file and the session log file, including the execution of sessionInfo().

To generate data, I'm using the Stan simulation model from section 15.2 of the manual (page 160 in version 2.9.0). First I construct an "x" vector of 100 points on [0,1], then draw from a GP on that space, like this:

data {
  int<lower=1> N;
  real x[N];
}
transformed data {
  vector[N] mu;
  cov_matrix[N] Sigma;
  for (i in 1:N)
    mu[i] <- 0;
  for (i in 1:N)
    for (j in 1:N)
      Sigma[i,j] <- exp(-0.1 * pow(x[i] - x[j], 2))
                    + if_else(i==j, 3, 0.0);
}
parameters {
  vector[N] y;
}
model {
  y ~ multi_normal(mu, Sigma);
}


So you can see I've set eta to 1, rho_sq to 0.1, and sigma_sq to 3. I then fit the following simplified version of the Stan model from section 15.3 of the manual to one of the (vector) draws from the GP above. This is model 1 in the R file:
data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}
transformed data {
  vector[N] mu;
  for (i in 1:N) mu[i] <- 0;
}
parameters {
  real<lower=0> inv_rho_sq;
  real<lower=0> sigma_sq;
}
transformed parameters {
  real<lower=0> rho_sq;
  rho_sq <- inv(inv_rho_sq);
}
model {
  matrix[N,N] Sigma;
  // off-diagonal elements
  for (i in 1:(N-1)) {
    for (j in (i+1):N) {
      Sigma[i,j] <- exp(-rho_sq * pow(x[i] - x[j], 2));
      Sigma[j,i] <- Sigma[i,j];
    }
  }
  // diagonal elements
  for (k in 1:N)
    Sigma[k,k] <- 1 + sigma_sq; // + jitter

  inv_rho_sq ~ cauchy(0, 5);
  sigma_sq ~ cauchy(0, 5);
  y ~ multi_normal(mu, Sigma);
}

This model works pretty well; however, the inference on rho_sq is patchy relative to the inference on sigma_sq. The mean of the rho_sq posterior jumps around because of the heavy right tail, and it is quite sensitive to the dispersion of the priors (compare the output of models 1 and 2 in the attached file). I tried to get rid of the heavy tail using uniform priors, but that makes inference worse overall (see the output of model 3). The inference breaks down entirely if I do either of two things: (1) use improper uniform priors to do MLE (model 4), or (2) put a prior on rho_sq instead of inv_rho_sq (model 5).

Obviously you can see I've dispensed with inference on the eta parameter for now, setting it to its correct value of 1. I did this because at first I thought the instances of poor inference were due to a weak identification issue with eta in there, but this model is not performing any better without it. Fortunately, the inference on sigma_sq is always really good, which is the main goal of my project -- but I do want to know what's going on with rho_sq. 

Does anyone know if there's a way to avoid the fat right tail on the rho_sq without the pathologies of the uniform? Should I be using normals here instead - and would that be generally helpful or just for this specific rho_sq? And do we know why the posterior gives up when I put a prior on rho_sq directly, or use an improper uniform prior? 

As I said, I'm attaching an R file that does all of this together. I usually save the Stan models separately, as the manual suggests, but I figured it would be more convenient for you to have it all in one place. I'm also attaching the log file from running this R script on the Unix server I have access to at MIT.

I hope this is clear; let me know if I should provide further information or clarification. Thanks in advance for any input or suggestions you have!

Cheers
Rachael 
GP_in_rstan_log_file.txt
GP_in_rstan_manual_simulations_no_eta_for_stan_users_group.R

Aki Vehtari

Mar 21, 2016, 12:27:07 PM
to Stan users mailing list
The lengthscale in a squared exponential (exponentiated quadratic) kernel is often weakly identified. Here you have larger noise variance than signal variance, and thus it is likely that the lengthscale can be large and only the prior limits how large. The half-Cauchy prior likely has too thick a tail. Try a half-Student-t (with degrees of freedom 4 < nu < 7) or a half-normal. Also, I recommend setting the prior on the lengthscale, not the squared lengthscale.
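
For concreteness, here is a minimal sketch of that suggestion applied to the fitting model above. This is illustrative, not code from the thread's attachments: the kernel is rewritten as exp(-d^2 / (2 * l^2)) so that l is the lengthscale itself, and the prior scale of 1 is a guess appropriate for x on [0,1].

data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}
transformed data {
  vector[N] mu;
  for (i in 1:N) mu[i] <- 0;
}
parameters {
  real<lower=0> l;         // lengthscale itself, not its square
  real<lower=0> sigma_sq;  // noise variance
}
model {
  matrix[N,N] Sigma;
  for (i in 1:(N-1)) {
    for (j in (i+1):N) {
      Sigma[i,j] <- exp(-pow(x[i] - x[j], 2) / (2 * pow(l, 2)));
      Sigma[j,i] <- Sigma[i,j];
    }
  }
  for (k in 1:N)
    Sigma[k,k] <- 1 + sigma_sq;

  // half-Student-t prior (nu = 4): the <lower=0> constraint on l
  // truncates the Student-t to the positive half-line
  l ~ student_t(4, 0, 1);
  sigma_sq ~ cauchy(0, 5);
  y ~ multi_normal(mu, Sigma);
}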

Aki

Andrew Gelman

Mar 21, 2016, 12:33:58 PM
to stan-...@googlegroups.com
Hi, just a quick thought which Seth F and I have been bouncing around recently . . . you could try just setting the length scale parameter to a fixed value.  There are a few reasons for this.  Most obviously it should make computing go faster.  At the same time, in any particular application such as yours, I think there is a lot of prior information so you should be able to choose a reasonable value for this parameter.

To put it another way:  from a Bayesian perspective, the right thing to do is to use an informative prior.  But in this case the prior info might be much stronger than the data anyway, so why not just treat this parameter as data?  Alternatively, a strong prior (for example, lognormal centered at the prior estimate with a small sd) should also work well.  In that case, just make sure to run Stan with reasonable inits.
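
In Stan, that kind of strong prior is a one-liner. A tiny self-contained sketch (the center of 0.2 and the sd of 0.1 are purely illustrative, not values anyone has recommended here):

parameters {
  real<lower=0> rho;  // lengthscale
}
model {
  // lognormal centered at a prior point estimate, with a small sd
  // so that the prior, rather than the data, pins down rho
  rho ~ lognormal(log(0.2), 0.1);
}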

A

Mike Lawrence

Mar 21, 2016, 12:58:05 PM
to stan-...@googlegroups.com
Not sure this will make much difference, but I like to:

(1) first rescale the x-axis data to range from 0-1 

(2) reparameterize in terms of inv_rho_sq (i.e., `Sigma[i,j] <-  exp(-pow(x[i] - x[j],2)/inv_rho_sq)`) so that values toward zero reflect more linear data

(3) use a cauchy(0,1) prior (implemented with the tan trick, p. 214 of manual v2.9.0) so that the peak of the prior reflects a linear prior, and half of the prior mass reflects lengthscales less than the range of the data (see the sketch below)
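
Putting (2) and (3) together, a sketch of the whole model might look like this (illustrative; the cauchy(0,1) on inv_rho_sq is implied by the tan transform of a uniform variable, so no explicit sampling statement is needed):

data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}
transformed data {
  vector[N] mu;
  for (i in 1:N) mu[i] <- 0;
}
parameters {
  real<lower=0, upper=pi()/2> inv_rho_sq_unif;  // uniform on (0, pi/2)
  real<lower=0> sigma_sq;
}
transformed parameters {
  real<lower=0> inv_rho_sq;
  inv_rho_sq <- tan(inv_rho_sq_unif);  // tan trick: implies inv_rho_sq ~ cauchy(0, 1)
}
model {
  matrix[N,N] Sigma;
  for (i in 1:(N-1)) {
    for (j in (i+1):N) {
      Sigma[i,j] <- exp(-pow(x[i] - x[j], 2) / inv_rho_sq);
      Sigma[j,i] <- Sigma[i,j];
    }
  }
  for (k in 1:N)
    Sigma[k,k] <- 1 + sigma_sq;

  sigma_sq ~ cauchy(0, 5);
  y ~ multi_normal(mu, Sigma);
}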


--
Mike Lawrence
Graduate Student
Department of Psychology & Neuroscience
Dalhousie University

~ Certainty is (possibly) folly ~

Rachael Meager

Mar 21, 2016, 3:47:56 PM
to Stan users mailing list
Thanks Aki! I'll give it a shot with that! 

Rachael Meager

Mar 21, 2016, 3:51:55 PM
to Stan users mailing list, gel...@stat.columbia.edu
Hi Andy, thanks for the reply, I've been considering that too! Frequentists set these kinds of parameters to fixed values in order to impose smoothing all the time, so it's not like we'd be out of order to do the same -- or to put a very strong prior that encourages smoothing, which is a gentler way of doing the frequentist thing. Given that I only really care about sigma, and the rho doesn't seem to affect the sigma estimate much, I may well go down this path. I eventually want to build a hierarchical GP and so will need every advantage I can get to wring out the signal in a reasonable computation time, so thanks for your encouragement, I will be bumping that strategy up to the top of my list.

Rachael Meager

Mar 21, 2016, 3:52:45 PM
to Stan users mailing list
Thanks Mike, the x-axis data is already from 0-1 but I will try out the rest of it and see if it helps. 

Seth Flaxman

Mar 21, 2016, 3:59:01 PM
to stan-...@googlegroups.com
Hi Rachael. Just to clarify what you have in mind--your fake data simulation creates 2000 draws from a GP, but you're interested in the standard case that you're modeling, which is just inference based on a single draw, right? (If you could use all 2000 draws that would certainly help with identification, but I'm guessing that's not the case.)

As Aki says, the noise variance (3) is overwhelming the signal variance (1). But another issue is that with the covariance parameterized as exp(-0.1 d^2), the lengthscale is about 2.24 in the parameterization I like, which is exp(-0.5 d^2 / l^2): solving 0.5 / l^2 = 0.1 gives l = sqrt(5) ≈ 2.24. The data you're observing sit on a regular grid in [0,1], so that lengthscale is far too long to recover.

I switched it to a noise variance of 0.1 and a lengthscale of 0.2, put the prior on the lengthscale, and now it does a fine job recovering the parameters.

Broadly, my thinking, which Andrew mentioned, is that informative priors on the hyperparameters make a lot of sense, and fixing those hyperparameters is just a very informative prior, which can be reasonable in many cases. (As a practical matter, it can also help with finding bugs!) Another thing that I find very helpful when testing out / debugging models is to estimate the hyperparameters from the data ("empirical Bayes"), which as you note has a long history, and not just in frequentist settings. In Stan, I've had success learning parameters using optimizing(...), an example of which I also included in the attached.

In terms of scalability in a hierarchical setting, you might also get some ideas from our paper, which has Stan code at the end: http://sethrf.com/files/fast-hierarchical-GPs.pdf
Seth
gp.stan
gp.r

Michael Betancourt

Mar 21, 2016, 5:21:59 PM
to stan-...@googlegroups.com
I’m going to have to complicate things by disagreeing with Andy and Seth a nontrivial amount.

As most people have noted, GPs are horrendously non-identified. This is in fact a fundamental property of these objects! It turns out that the functions supported by any two distinct sets of hyperparameters (even ones varying by an epsilon amount) are completely different; in other words, you'll never see the same function sampled from a GP with two different sets of hyperparameters.

Hence by taking only a single set of hyperparameters you are ignoring a giant space of functions that could interpolate between your data just as well as those supported by your distinguished hyperparameters, but in different ways. This in turn manifests as _drastic_ underestimates of your predictive uncertainty away from the data.

If you care only about predictive performance near your data then you might be okay, but
if you really want to know what’s going on between your data then you really do have to
marginalize over the hyperparameters.  Empirical Bayes has the same problem of misrepresenting
the posterior uncertainty.  WON’T SOMEBODY THINK OF THE UNCERTAINTY? <insert
Helen Lovejoy>.

Also, as Seth, Mike, and Aki note, spending some time to think about the relevant length scales
can help you build some really informative priors to clean things up.  As in Seth’s example, if
you can put together some good priors then you might be able to do a full marginalization in
practice.


Andrew Gelman

Mar 22, 2016, 11:03:39 AM
to stan-...@googlegroups.com
Mike:

I don’t see how you’re really disagreeing with me!  You write, “spending some time to think about the relevant length scales can help you build some really informative priors to clean things up.”  An informative prior is great.  A delta-function prior (that is, treating the lengthscale parameter as known) can be a useful approximation and speed up computation; the researcher can then replace it with a strongly informative prior later.  It’s the usual “pick your battles” strategy.  A GP with a reasonable fixed lengthscale will do the smoothing that is needed, so the researcher can focus on the more important aspects of the model.

When we’re looking for a generic lowess replacement, it’s another story: we’d like the model to “learn” the lengthscale, requiring only a minimal amount of prior information.  But for specific applications, more and more I’ve been leaning toward running hierarchical models with fixed hyperparameters, or with strong priors on the hyperparameters.

A

Michael Betancourt

Mar 22, 2016, 11:25:47 AM
to stan-...@googlegroups.com
We disagree that a delta prior is useful and lets the user worry about other parts of the model!  I don't think the lengthscale is ever sufficiently well known for such an approximation not to induce significant underestimates of predictive uncertainty that affect inference.

James Savage

Mar 22, 2016, 11:53:16 AM
to Stan users mailing list
This is an interesting discussion. 

For what it's worth, I tried reparameterising the model slightly, and using slightly tighter priors. As Aki mentioned above, once you increase the signal, the parameter seems to be fairly identifiable. 

The issue is that with the posterior for the lengthscale parameter being so skewed, the modal estimate is quite wrong--optimisers beware. The relevant image is attached. 

If you've got rstan, reshape, dplyr and ggplot2 installed, this should run out of the box: 


JS
whywedobayes.pdf

Andrew Gelman

Mar 22, 2016, 1:54:35 PM
to stan-...@googlegroups.com
Thanks!  Is it possible to set up these demos so that the Stan models are in separate .stan files rather than embedded in the R script?
A

Rachael Meager

Mar 22, 2016, 2:53:25 PM
to stan-...@googlegroups.com
Hi all! I have a few responses. First, I really understand this situation a lot better now; thank you all. The lengthscale I chose to simulate with just doesn't seem to be strongly identifiable on the domain [0,1], for the reasons Seth outlined. I've tried various priors, including tight normals, and they don't fix the problem.

What's interesting is that my main goal is to do inference on sigma_sq and all these strategies (fixing rho to a given value, tight prior on rho, normal prior on rho) change the inference on rho but don't really change the sigma_sq inference. That's good news but I guess it's confusing because I think it must mean the model can successfully separate out the information about sigma from the information (or lack of it) about rho. Is that right? 

Question for the thread: Are there, analogously, values of sigma_sq that we wouldn't be able to identify in a GP setting? 

Seth -- thanks for replying and for your explanation of what is happening! It really helped. Yes, I can only use one draw from the GP for this problem; you're right that I'm simulating more than I need. I'm curious about your example, because you said the inference in the example you posted was good -- but when I run it, it's still only good on sigma. The true lengthscale is at the 2.5th percentile of the posterior for rho, which is not a good situation in my book. Is that what you get when you run it? I'm attaching the output I got. (PS: your paper on hierarchical GPs was/is on my to-read list!)

Michael - thanks for replying! It's true that fixing rho doesn't help us with rho, and maybe it will hurt posterior predictive inference, but if I just want inference on sigma_sq it seems like it should be ok -- no? Or are you saying that our posterior distribution on sigma_sq is inappropriately tight due to the (incorrect) assumption that we know the true rho? That sounds intuitive, but in this very specific case I'm not sure that's true, given that changing rho doesn't seem to change sigma_sq inference very much even in the tails. But perhaps I'm just not looking at the right output carefully enough, and have overlooked something important?

Jim - thanks for replying; the reparameterisation is really neat, and obviously I am a fan of anything that produces a graph called "Why we do full Bayes".  (Also, you helped me realise I made an error before when I conflated Bayes with improper priors and MLE -- I was heuristically extrapolating from the models with unimodal symmetric posterior distributions in basic stats, but of course it's not generally true outside that class.)  I'm playing around with your example to see if I can break the inference on rho again despite the reparameterisation, and I'll report back if I find anything interesting. 


seth_gp_output.txt

Andrew Gelman

Mar 22, 2016, 3:06:43 PM
to stan-...@googlegroups.com
Hi just a couple of quick comments.
1.  I doubt it makes sense to think in terms of sigma_sq.  I think it will be better to think in terms of sigma.
2.  Yes, it makes sense that rho and sigma are not strongly dependent in the posterior.
A

Bob Carpenter

Mar 22, 2016, 3:08:57 PM
to stan-...@googlegroups.com

> On Mar 21, 2016, at 12:34 PM, Andrew Gelman <Gel...@stat.columbia.edu> wrote:
>
> Hi, just a quick thought which Seth F and I have been bouncing around recently . . . you could try just setting the length scale parameter to a fixed value. There are a few reasons for this. Most obviously it should make computing go faster.

Just wanted to point out that this isn't always true. It will make the
calculation per iteration faster, but unless you choose the right scale,
it may be slower overall than fitting the scale. At least that's what we've found
when comparing hierarchical priors to fixed priors with too broad a variance. For
instance, in IRT models, taking

theta ~ normal(0, 2.5);

would be slower than

theta ~ normal(0, sigma);
sigma ~ cauchy(0, 2.5);

when the simulated value for sigma was around 1.

- Bob


Andrew Gelman

Mar 22, 2016, 3:21:28 PM
to stan-...@googlegroups.com
It should be no problem in any particular application to pick a reasonable value for the lengthscale parameter, I think.

Michael Betancourt

Mar 22, 2016, 5:33:49 PM
to stan-...@googlegroups.com

> Michael - thanks for replying! It's true that fixing rho doesn't help us with rho, and maybe it will hurt posterior predictive inference, but if I just want inference on sigma_sq it seems like it should be ok -- no? Or are you saying that our posterior distribution on sigma_sq is inappropriately tight due to the (incorrect) assumption that we know the true rho? That sounds intuitive, but in this very specific case I'm not sure that's true, given that changing rho doesn't seem to change sigma_sq inference very much even in the tails. But perhaps I'm just not looking at the right output carefully enough, and have overlooked something important?

I wouldn’t be surprised if sigma_sq ends up being reasonably identifiable
with moderately informative priors.  There’s always the possibility that
sigma_sq could get very large (in which case the rest of the GP would
be unconstrained) or get really small (in which case you overfit your
data) but if you’re not seeing either of those cases then you could
have enough data to constrain the natural variation.

If you’re using the kernel

k(x, y) = alpha * exp( - sum_{n} rho_n (x_n - y_n)^2 ) + sigma2 * delta_{xy}

then the biggest problem is usually a nonidentifiability between the rho and the alpha.  So it would make sense that sigma2 is the first parameter to become identifiable as you add more data.  That said, if you _fix_ rho to specific values then you might get underestimated variance on sigma2, as you lose the correlation between the varying rho and the residual noise needed to ensure a reasonable fit.
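
A sketch of that kernel in Stan for D-dimensional inputs (illustrative only; the Cauchy priors are placeholders, not recommendations):

data {
  int<lower=1> N;
  int<lower=1> D;
  vector[D] x[N];  // one D-dimensional input point per observation
  vector[N] y;
}
transformed data {
  vector[N] mu;
  for (i in 1:N) mu[i] <- 0;
}
parameters {
  real<lower=0> alpha;       // signal variance
  vector<lower=0>[D] rho;    // one inverse squared lengthscale per dimension
  real<lower=0> sigma2;      // noise variance
}
model {
  matrix[N,N] K;
  for (i in 1:N) {
    for (j in i:N) {
      K[i,j] <- alpha * exp(-dot_product(rho,
                    (x[i] - x[j]) .* (x[i] - x[j])));
      K[j,i] <- K[i,j];
    }
    K[i,i] <- K[i,i] + sigma2;  // the sigma2 * delta_{xy} term
  }

  // placeholder weakly informative priors
  alpha ~ cauchy(0, 5);
  rho ~ cauchy(0, 5);
  sigma2 ~ cauchy(0, 5);
  y ~ multi_normal(mu, K);
}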

Ultimately in these systems I don’t trust my intuition, so I calculate and see.  If you can run a full MCMC and sigma2 looks good despite large variance among the rho and the alpha, then it might be okay to fix rho for this application.  (Am I hedging enough yet?  :-p)

Rachael Meager

Mar 23, 2016, 10:10:35 AM
to Stan users mailing list
That is some impressive hedging, Michael! But it's good to be cautious and precise, and I really appreciate your careful explanation of the situation. Thanks again.

I'm going to set up a full MCMC to investigate the sensitivity of the posterior inference on sigma_sq to various strategies for rho, including fixing rho, and hopefully display the information in a useful way. Also, I am going to reintroduce alpha (called eta in the Stan manual models) into the problem. I'll update the thread when I have something useful to share. 

Andy mentioned moving to inference on sigma rather than sigma_sq; I'll check that out too. Andy also mentioned it should be no problem to pick a reasonable lengthscale. Speaking only for myself, I don't find it intuitive to pick a reasonable value, but hopefully getting more experience working with GPs will fix that. If anyone has a favoured reference for how we should think about picking these kinds of parameters, Bayesian or frequentist, I'd be keen to take a look at it. 