Gaussian Processes: Joint Hyperparameter Fitting and Predictive Inference


Daniel Emaasit

Jan 31, 2017, 2:51:57 PM
to Stan users mailing list
Hello Folks,
I followed the example in Section 17.4 of the Stan manual, "Joint Hyperparameter Fitting and Predictive Inference," to fit a GP to multivariate input data. Although the chains converge (based on Rhat values), some errors were produced that I don't understand.
Here's the reproducible code:

 ### Step 1: Prepare the input data
```{r}
library(dplyr)
mtcars$id <- 1:nrow(mtcars)
train <- dplyr::sample_frac(mtcars, 0.7) # training dataset
test  <- dplyr::anti_join(mtcars, train, by = 'id') # testing dataset

y1 <- train$mpg
X1 <- as.matrix(train[, 3:7])
N1 <- nrow(X1)
D <- ncol(X1)

X2 <- as.matrix(test[, 3:7])
N2 <- nrow(X2)

my_data <- list(x1 = X1, N1 = N1, D = D, y1 = y1, x2 = X2, N2 = N2)
```

### Step 2: Build the model

```{r}
gp_joint <- "
data {
  int<lower=1> D;       // number of input dimensions
  int<lower=1> N1;
  vector[D] x1[N1];     // training inputs
  vector[N1] y1;
  int<lower=1> N2;
  vector[D] x2[N2];     // test inputs
}
transformed data {
  int<lower=1> N;
  vector[D] x[N1+N2];
  vector[N1+N2] mu;
  // cov_matrix[N1+N2] Sigma;
  N = N1 + N2;
  for (n in 1:N1)
    x[n] = x1[n];
  for (n in 1:N2)
    x[N1 + n] = x2[n];
  for (i in 1:N)
    mu[i] = 20;
}
parameters {
  vector[N2] y2;
  real<lower=0> eta_sq;
  real<lower=0> inv_rho_sq;
  real<lower=0> sigma_sq;
}
transformed parameters {
   real<lower=0> rho_sq;
   rho_sq = inv(inv_rho_sq);
}
model {
   vector[N] y;
   matrix[N, N] Sigma;


  // off-diagonal elements
  for (i in 1:(N-1)) {
    for (j in (i+1):N) {
      Sigma[i,j] = eta_sq * exp(-rho_sq * dot_self(x[i] - x[j]));
      Sigma[j,i] = Sigma[i,j];
    }
  }


  // diagonal elements
  for (k in 1:N)
    Sigma[k,k] = eta_sq + sigma_sq; // + jitter


  eta_sq ~ cauchy(0,5);     // prior on eta
  inv_rho_sq ~ cauchy(0,5);     // prior
  sigma_sq ~ cauchy(0,5);   // prior

  for (n in 1:N1)
    y[n] = y1[n];
  for (n in 1:N2)
    y[N1 + n] = y2[n];

  y ~ multi_normal(mu, Sigma);
}
"
```

### Step 3: Sample from the posterior
```{r}
my_fit_joint <- stan(model_code = gp_joint, data = my_data, iter = 200, chains = 3)
```

Here are the errors produced during sampling:
DIAGNOSTIC(S) FROM PARSER:
Warning (non-fatal):
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If it does, you need to include a target += statement with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:
    y ~ multi_normal(...)

And

The following numerical problems occured the indicated number of times on chain 1
                                                                                                     count
Exception thrown at line 58: multi_normal_log: LDLT_Factor of covariance parameter is not positive definite     1
When a numerical problem occurs, the Hamiltonian proposal gets rejected.
See http://mc-stan.org/misc/warnings.html#exception-hamiltonian-proposal-rejected
If the number in the 'count' column is small, do not ask about this message on stan-users.

Bob Carpenter

Jan 31, 2017, 3:18:25 PM
to stan-...@googlegroups.com
Just to be clear, those are WARNINGS, not errors.

The first one indicates that, because y is
a derived quantity, you may need a Jacobian adjustment for
it. In this case, you don't, because you're just munging
data shape and there's no nonlinear transform. Stan can't
always figure that out. Rather than this:

> for (n in 1:N1)
> y[n] = y1[n];
> for (n in 1:N2)
> y[N1 + n] = y2[n];
>
> y ~ multi_normal(mu, Sigma);

you can just do this:

y[1:N1] = y1;
y[(N1 + 1):(N1 + N2)] = y2;

I'm afraid we don't have append for arrays, but we should,
so I'm adding a feature request:

https://github.com/stan-dev/math/issues/481
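Since y here is declared as a vector (not an array), append_row can already do the concatenation; a sketch:

```stan
// equivalent to the slicing assignments, using vector concatenation
vector[N] y;
y = append_row(y1, y2);
y ~ multi_normal(mu, Sigma);
```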

The second issue is because there were numerical problems with the
run. If you only get a handful of those, the problem can usually
be mitigated by increasing the adapt_delta control parameter.

You might also want to add tighter or narrower-tailed priors,
which can keep parameters from drifting into extreme values.
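For example, from R (0.95 is just an illustrative value; the default is 0.8):

```r
my_fit_joint <- stan(model_code = gp_joint, data = my_data,
                     iter = 200, chains = 3,
                     control = list(adapt_delta = 0.95))
```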

- Bob
> --
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> To post to this group, send email to stan-...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Daniel Emaasit

Feb 1, 2017, 7:27:16 PM
to stan-...@googlegroups.com
Thanks Bob.
Now I have an issue when I use a dataset with 1 variable and 200 points. Sampling takes really long (over 30 minutes); I'd expected it to take seconds. I tried using the Cholesky decomposition as the manual suggests, but there's no significant speed improvement. My plan is to test this model on small data first, then use it on a much bigger dataset with N = 5000 and 15 or more variables. Is there another way of speeding up GPs?

Andrew Gelman

Feb 1, 2017, 8:02:12 PM
to stan-...@googlegroups.com
Hi, yes, this does sound slow.  I expect there are ways to rewrite your Stan program to make it much more efficient.
A

Daniel Lee

Feb 2, 2017, 9:16:39 AM
to stan-...@googlegroups.com
Hi Daniel,

Did it converge? If not, looking at wall time isn't useful.

Have you tried a tighter prior? My guess is that there isn't enough data to pin down estimates for the hyperparameters, and the heavy-tailed Cauchy priors are letting them wander to very large values.


Daniel



andre.p...@googlemail.com

Feb 4, 2017, 12:08:26 AM
to Stan users mailing list
Daniel,

  for (n in 1:N1)
    y[n] = y1[n];
  for (n in 1:N2)
    y[N1 + n] = y2[n];

  y ~ multi_normal(mu, Sigma);

y is a parameter and y1 is data, so you are assigning data to a parameter. I would compute y2 in generated quantities and put
y1 ~ multi_normal(mu, Sigma) in the model block.

The time is ok. I get ~3 hours of runtime in Stan with my GP, which slows down my development process.

One more advanced idea: use the Cholesky decomposition, take the inverse, and map it to normal(0, 1); since the Cholesky factor is triangular, the corresponding Jacobian adjustment involves only the diagonal.
I tried that once and it gave some speedup.
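A closely related trick that needs no Jacobian adjustment at all is the non-centered parameterization of a latent GP, f = L * eta with eta ~ normal(0, 1). An untested sketch against Daniel's model (names follow that model; the jitter value is an assumption):

```stan
parameters {
  vector[N1] eta;               // standard-normal innovations
  real<lower=0> eta_sq;
  real<lower=0> inv_rho_sq;
  real<lower=0> sigma_sq;
}
model {
  matrix[N1, N1] Sigma;
  matrix[N1, N1] L;
  vector[N1] f;
  // build Sigma from the training inputs as before, but with
  // Sigma[k, k] = eta_sq + 1e-10 (jitter only; noise enters below)
  // ...
  L = cholesky_decompose(Sigma);
  f = mu + L * eta;             // latent GP values
  eta ~ normal(0, 1);           // implies f ~ multi_normal(mu, Sigma)
  y1 ~ normal(f, sqrt(sigma_sq));
}
```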

Andre

Daniel Emaasit

Feb 6, 2017, 1:31:59 PM
to stan-...@googlegroups.com
Thanks Daniel,
Yes, it converged after about 2 hours. Is that a normal amount of time for Gaussian processes in Stan? This is the first time I am running GPs in Stan.
Tighter priors reduce sampling time, but not significantly. Any other suggestions for improving sampling time?
(FYI, I will be giving a series of meetup talks on Bayesian modeling with Stan at the Las Vegas R user groups.)

-- Daniel

Daniel Emaasit

Feb 6, 2017, 1:36:53 PM
to stan-...@googlegroups.com
Thanks Andre.
Is it normal for sampling from GPs in Stan to take this long (>= 2 hours)? This is the first time I am running GPs in Stan.
I am now going to try your suggested improvement.

-- Daniel

Bob Carpenter

Feb 6, 2017, 3:48:35 PM
to stan-...@googlegroups.com
If you're giving a talk on Stan, please post to the users list!

It all depends on the model being fit and the number of data points.

- Bob


Bob Carpenter

Feb 6, 2017, 3:49:26 PM
to stan-...@googlegroups.com
If that's possible for GPs, we should rewrite our manual
examples (which I first wrote when I was just learning GPs,
so sorry if they're inefficient).

- Bob

Rob Trangucci

Feb 6, 2017, 5:44:35 PM
to stan-...@googlegroups.com
Yes, I need to redraft the GP section of the manual to mirror the approach outlined here: https://github.com/rtrangucci/gps_in_stan/blob/master/forecasts_in_stan.pdf

Daniel, 

If you have a normally distributed outcome variable and you want to put a Gaussian process prior on the regression function, the most efficient way to code it in Stan is like so:

data {
  int<lower=1> N;
  vector[N] y;
  real x[N];
  real<lower=0> A; // prior length-scale shape
  real<lower=0> B; // prior length-scale rate
  real<lower=0> C; // prior alpha scale   
  real<lower=0> D; // prior sigma scale
}
transformed data {
  vector[N] zeros;

  zeros = rep_vector(0, N);
}
parameters {
  real<lower=0> length_scale;
  real<lower=0> alpha;
  real<lower=0> sigma;
}
model {
  matrix[N, N] L_cov;
  {
    matrix[N, N] cov;
    cov = cov_exp_quad(x, alpha, length_scale);
    for (n in 1:N)
      cov[n, n] = cov[n, n] + square(sigma);
    L_cov = cholesky_decompose(cov);
  }
  length_scale ~ gamma(A, B);
  alpha ~ normal(0, C);
  sigma ~ normal(0, D);
  y ~ multi_normal_cholesky(zeros, L_cov);
}

Efficiently sampling from the joint posterior over length-scale, Gaussian process marginal standard deviation (alpha), and noise standard deviation (sigma) can require informative priors for length-scale and alpha depending on how informative your dataset is about those two parameters. Definitely avoid flat priors on these hyperparameters.

Rob



Daniel Emaasit

Feb 6, 2017, 5:52:41 PM
to Stan users mailing list
AWESOME!! Thanks Rob.
Could you please add a generated quantities block for prediction on test (new) data?

-- Daniel


Rob Trangucci

Feb 6, 2017, 10:51:24 PM
to stan-...@googlegroups.com
No problem.

functions {
  vector gp_pred_rng(real[] x_pred,
                     vector y,
                     real[] x,
                     real alpha,
                     real length_scale,
                     real sigma) {
    vector[size(x_pred)] f_pred;
    int N_pred;
    int N;
    N_pred = size(x_pred);
    N = rows(y);

    {
      matrix[N, N] L_Sigma;
      vector[N] K_div_y;
      matrix[N, N_pred] k_x_x_pred;
      matrix[N, N_pred] v_pred;
      vector[N_pred] f_pred_mu;
      matrix[N_pred, N_pred] cov_f_pred;
      matrix[N_pred, N_pred] nug_pred;
      matrix[N, N] Sigma;
      Sigma = cov_exp_quad(x, alpha, length_scale);
      for (n in 1:N)
        Sigma[n, n] = Sigma[n,n] + square(sigma);
      L_Sigma = cholesky_decompose(Sigma);
      K_div_y = mdivide_left_tri_low(L_Sigma, y);
      K_div_y = mdivide_right_tri_low(K_div_y',L_Sigma)';
      k_x_x_pred = cov_exp_quad(x, x_pred, alpha, length_scale);
      f_pred_mu = (k_x_x_pred' * K_div_y); 
      v_pred = mdivide_left_tri_low(L_Sigma, k_x_x_pred);
      cov_f_pred = cov_exp_quad(x_pred, alpha, length_scale) - v_pred' * v_pred;
      nug_pred = diag_matrix(rep_vector(1e-12,N_pred));

      f_pred = multi_normal_rng(f_pred_mu, cov_f_pred + nug_pred);
    }
    return f_pred;
  }
}
data {
  int<lower=1> N;
  int<lower=1> N_pred;
  vector[N] y;
  real x[N];
  real x_pred[N_pred];
  real<lower=0> A; // prior length-scale shape
  real<lower=0> B; // prior length-scale rate
  real<lower=0> C; // prior alpha scale   
  real<lower=0> D; // prior sigma scale
}
transformed data {
  vector[N] zeros;

  zeros = rep_vector(0, N);
}
parameters {
  real<lower=0> length_scale;
  real<lower=0> alpha;
  real<lower=0> sigma;
}
model {
  matrix[N, N] L_cov;
  {
    matrix[N, N] cov;
    cov = cov_exp_quad(x, alpha, length_scale);
    for (n in 1:N)
      cov[n, n] = cov[n, n] + square(sigma);
    L_cov = cholesky_decompose(cov);
  }
  length_scale ~ gamma(A, B);
  alpha ~ normal(0, C);
  sigma ~ normal(0, D);
  y ~ multi_normal_cholesky(zeros, L_cov);
}
generated quantities {
  vector[N_pred] f_pred;
  vector[N_pred] y_pred;

  f_pred = gp_pred_rng(x_pred, y, x, alpha, length_scale, sigma);
  for (n in 1:N_pred)
    y_pred[n] = normal_rng(f_pred[n], sigma);
}
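A sketch of calling this from R with the earlier mtcars split, using a single predictor column (the model-string name `gp_pred_model` and the prior hyperparameter values are placeholders, not recommendations):

```r
gp_data <- list(N = N1, N_pred = N2,
                y = y1,
                x = as.vector(X1[, 1]),      # single predictor column
                x_pred = as.vector(X2[, 1]),
                A = 4, B = 4, C = 5, D = 5)  # placeholder prior values
fit <- stan(model_code = gp_pred_model, data = gp_data,
            chains = 4, iter = 2000)
```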


Rob


andre.p...@googlemail.com

Feb 8, 2017, 2:04:38 AM
to stan-...@googlegroups.com
    K_div_y = mdivide_left_tri_low(L_Sigma, y);
    K_div_y = mdivide_right_tri_low(K_div_y', L_Sigma)';

Rob,

I am having trouble understanding the second line.

Thank you; your code is a good template.
Andre

Daniel Emaasit

Feb 8, 2017, 8:16:45 PM
to Stan users mailing list
Rob,
Instead of a single predictor (x and x_pred), I modified the code to handle multiple input variables. In the signature of gp_pred_rng I changed `real[] x_pred` to `vector[] x_pred` and `real[] x` to `vector[] x`, and my data block now looks like this:

  int<lower=1> N;
  int<lower=1> D;
  int<lower=1> N_pred;
  vector[N] y;
  vector[D] x[N];
  vector[D] x_pred[N_pred];

Although the chains converge and the results look good, I get the following WARNING:

                                                                                   count
Exception thrown at line 69: cholesky_decompose: Matrix m is not positive definite     3
Exception thrown at line 66: cov_exp_quad: sigma is 0, but must be > 0!                1
When a numerical problem occurs, the Hamiltonian proposal gets rejected.
See http://mc-stan.org/misc/warnings.html#exception-hamiltonian-proposal-rejected
If the number in the 'count' column is small, do not ask about this message on stan-users.

Rob Trangucci

Feb 14, 2017, 7:46:30 PM
to stan-...@googlegroups.com
Hi Andre,

Below is algorithm 2.1 of the GPML book for calculating the predictive mean and variance at new data points (the book is available at http://www.gaussianprocess.org/gpml/chapters/RW.pdf). The algorithm efficiently calculates the conditional mean and variance of a multivariate normal density. The two lines you referenced are the steps required to produce the vector \alpha in step 3 of algorithm 2.1.
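In symbols (K is the training covariance matrix, k_* the train-to-test covariances, and backslash denotes a triangular solve), the steps map onto the Stan code like this:

```latex
L &= \operatorname{cholesky}(K + \sigma^2 I)   && \text{(L\_Sigma)} \\
\alpha &= L^\top \backslash (L \backslash y)   && \text{(K\_div\_y, the two solves)} \\
\bar{f}_* &= k_*^\top \alpha                   && \text{(f\_pred\_mu)} \\
v &= L \backslash k_*                          && \text{(v\_pred)} \\
\operatorname{cov}[f_*] &= K_{**} - v^\top v   && \text{(cov\_f\_pred)}
```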






Rob

On Wed, Feb 8, 2017 at 2:04 AM, andre.pfeuffer via Stan users mailing list <stan-...@googlegroups.com> wrote:
    K_div_y = mdivide_left_tri_low(L_Sigma, y);
    K_div_y = mdivide_right_tri_low(K_div_y', L_Sigma)';
