Efficiently draw Monte Carlo Samples from Posterior Predictive Distribution


terrance savitsky

Mar 11, 2016, 11:06:05 AM
to stan-...@googlegroups.com
Hi Stanners,  I'm using vb() in rstan to estimate a GP model, similar to the one described in the manual, for NT = 2500 observations with P = 2 predictors under predictor-specific length scales in a squared exponential covariance function.  The Stan script below builds a model for the parameters of the covariance function.  My interest is in extracting the latent (de-noised) function, which I sample from the posterior predictive distribution in the generated quantities block.  The ADVI estimate is returned in about 2 hours, but after a day the run has not yet finished generating 1000 samples.  The long run time is not due to returning samples of the parameters (I have run that separately and validated it) but to the Monte Carlo draws from the posterior predictive for the latent function.  Since the generated quantities block only draws Monte Carlo samples, I assume there are no gradients to compute.  I'd appreciate any tips on how I might reconfigure the generated quantities block to speed up the Monte Carlo draws.

/*
  y[c] ~ N(u[c], sigma2_y[c]),  where c = 1,..,NT indexes domain-time cases; the sigma2_y[c] are known
  // squared exponential covariance function for the NT x NT covariance matrix, Sigma
  Sigma[i,j] = 1/eta * exp( -sum_{p=1}^{P} 1/rho[p] * (x[i,p] - x[j,p])^{2} )
  (u[1], ..., u[NT]) ~ N_{NT}(0, Sigma),
  eta     ~ Gamma(1,1) - precision parameter of the squared exponential covariance function
  rho[p]  ~ Gamma(1,1) - predictor-indexed length-scale parameters
  
  We may marginalize over the u to achieve
  y ~ N_{NT}(0, Sigma + diag(sigma2_y))
  
  Marginalizing over u adds the known response variances, sigma2_y, to the diagonal,
  which stabilizes the Cholesky computation for Sigma (which is very high-dimensional).
  We use the conditioning properties of the joint Gaussian in which the nonparametric GP
  is embedded to draw u as a post-processing step using sampled values of (rho, eta)

*/

data{
  int<lower=1> NT; // number of domain-months
  int<lower=1> P; // number of predictors
  vector[NT] y; // vectorized response variable, with T faster than N
  matrix[NT,P] x; // Set of NT, 1 x P predictors
  vector<lower=0>[NT] sigma2_y; // known domain variances
  real<lower=0> jitter; // stabilizing constant to add to diagonals of GP cov, Sigma
} /* end data block */
  

transformed data{
  vector[NT] zros_NT;
  zros_NT <- rep_vector(0,NT); /* vector of zeros for the GP mean of y */
} /* end transformed data block */

parameters{
  // parameters for squared exponential covariance function
  real<lower=0> eta; /* squared exponential precision parameter */
  real<lower=0> rho[P]; /* length-scale parameters, indexed by predictor, x[,p] */
} /* end parameters block */

transformed parameters{
  matrix[NT,NT] Sigma; /* GP covariance matrix for NT x 1, u */
  matrix[NT,NT] K; /* GP covariance matrix for u */
  matrix[NT,NT] L; /* cholesky_decompose(Sigma) */
  
  { /* local block */
    real covformula_ijp;
    real s_ij;
     // off-diagonal
    for( i in 1:(NT-1) )
    {
      for( j in (i+1):NT )
      {
        s_ij <- 0;
        for( p in 1:P )
        {
          covformula_ijp      <- inv(rho[p])* pow(x[i,p] - x[j,p],2);
          s_ij                <- s_ij + covformula_ijp ;
        }/* end loop p over dimensions of x to fill cells of Sigma */
        s_ij       <-  exp( -s_ij ) ;
        s_ij       <-  inv(eta) * s_ij;
        K[i,j]     <- s_ij;
        K[j,i]     <- s_ij;
      } /* end loop j over columns to fill upper triangular cells of Sigma */
    } /* end loop i over rows to fill upper triangular cells of Sigma */
    
    // diagonal
    for( k in 1:NT )
    {
      K[k,k]     <- inv(eta) + jitter; /* diagonal does not depend on p */
    } /* end loop k over diagonal entries of Sigma */
  } /* end local block for building Sigma */
  
  /* add model noise on the diagonal to achieve the marginal distribution */
  /* of y */
  Sigma     <- K + diag_matrix( sigma2_y );
  L         <- cholesky_decompose( Sigma );
  
} /* end transformed parameters block */


model{
  
  // priors for mean and covariance cluster location 
  eta       ~ gamma( 1.0, 1.0 );
  rho       ~ gamma( 1.0, 1.0 );
  
   
  // observed response likelihood
  y         ~ multi_normal_cholesky(zros_NT,L);
  
  
  
} /* end model{} block */

generated quantities{
  matrix[NT,NT] K_transpose_div_Sigma; /* K'Sigma^-1 */
  matrix[NT,NT] Tau; /* posterior predictive covariance matrix for u */
  matrix[NT,NT] L_u;
  vector[NT] mu_u; /* predictive mean of NT x 1, u */
  vector[NT] u; /* GP function drawn from posterior predictive */
  /* compute MSE for fit evaluation */
  real<lower=0> TSE;
  real<lower=0> MSE;
  /* posterior predictive quantities */
  K_transpose_div_Sigma  <- K' / Sigma;
  mu_u                   <- K_transpose_div_Sigma * y;
  Tau                    <- K - K_transpose_div_Sigma * K;
  L_u                    <- cholesky_decompose( Tau );
  u   <- multi_normal_cholesky_rng(mu_u,L_u);
  TSE <- squared_distance(u,y);
  MSE <- TSE/(NT);
  
  
} /* end generated quantities block */

I also used a variation on the above generated quantities block that reuses the Cholesky decomposition of Sigma already computed in transformed parameters, but there was no speed improvement.

generated quantities{
  matrix[NT,NT] div_L_K; /* L^-1 * K */
  matrix[NT,NT] Tau; /* posterior predictive covariance matrix for u */
  matrix[NT,NT] L_u; /* cholesky of Tau */
  vector[NT] mu_u; /* predictive mean of NT x 1, u */
  vector[NT] u; /* GP function drawn from posterior predictive */
  /* compute MSE for fit evaluation */
  real<lower=0> TSE;
  real<lower=0> MSE;
  /* posterior predictive quantities */
  div_L_K                <- mdivide_left_tri_low(L,K);
  mu_u                   <- div_L_K' * mdivide_left_tri_low(L,y);
  Tau                    <- K - div_L_K' * div_L_K;
  // for( i in 1:(NT-1))
  // {
  //   for( j in (i+1):NT)
  //   {
  //     Tau[i,j]  <- Tau[j,i];
  //   }
  // }
  L_u <- cholesky_decompose(Tau);
  u   <- multi_normal_cholesky_rng(mu_u,L_u);
  TSE <- squared_distance(u,y);
  MSE <- TSE/(NT);
  
  
} /* end generated quantities block */
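For anyone checking the algebra: the two generated quantities formulations compute the same posterior predictive moments. Here is a small numpy sketch (NT, eta, rho, and sigma2_y values are made up for illustration) verifying that the direct K' / Sigma version and the mdivide_left_tri_low version agree:

```python
import numpy as np

# Small made-up problem; all values are illustrative only.
rng = np.random.default_rng(1)
NT, P = 5, 2
x = rng.normal(size=(NT, P))
eta, rho = 1.5, np.array([0.8, 1.3])
sigma2_y = np.full(NT, 0.2)
y = rng.normal(size=NT)

# K[i,j] = (1/eta) * exp(-sum_p (1/rho[p]) * (x[i,p]-x[j,p])^2), as in the script
d2 = ((x[:, None, :] - x[None, :, :]) ** 2 / rho).sum(axis=-1)
K = np.exp(-d2) / eta
Sigma = K + np.diag(sigma2_y)

# Version 1: K' Sigma^{-1} directly (K' / Sigma in the first script)
A = np.linalg.solve(Sigma, K).T          # equals K' Sigma^{-1} since Sigma is symmetric
mu_1 = A @ y
Tau_1 = K - A @ K

# Version 2: through L = chol(Sigma), as in the second script
L = np.linalg.cholesky(Sigma)
div_L_K = np.linalg.solve(L, K)          # plays the role of mdivide_left_tri_low(L, K)
mu_2 = div_L_K.T @ np.linalg.solve(L, y)
Tau_2 = K - div_L_K.T @ div_L_K

print(np.allclose(mu_1, mu_2), np.allclose(Tau_1, Tau_2))  # True True
```

Both versions give the same mu_u and Tau, so any timing difference would come only from how the linear algebra is factored, not from the result.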

Thanks very much for your help and comments.

Terrance

--
Thank you, Terrance Savitsky

Bob Carpenter

Apr 7, 2016, 5:58:35 PM
to stan-...@googlegroups.com
Can someone who knows more about GPs look at this?

What we could really use is a more detailed GP case study than
is in the manual.

- Bob
> --
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> To post to this group, send email to stan-...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Aki Vehtari

Apr 8, 2016, 7:17:15 AM
to Stan users mailing list
Hi,

Based on your description we can focus on the generated quantities block, so the slowness is not due to the covariance matrix computation or to gradients. I don't see anything there that should take so long to compute. On my computer a Cholesky of a 2500x2500 matrix takes about 0.2s (using several cores). The other computations should not be slower than the Cholesky, so I would guess one iteration of generated quantities should take on the order of 1s, and 1000 samples should then take less than 30 minutes. However, you are generating and writing to the csv file 1000 x 3 matrices of size 2500x2500, which (if I calculate correctly) makes about 1.7e11 bytes, which may take some time to write to the file. So my guess is that this is an I/O problem, and I guess you can solve it by using locally declared variables, which are not stored in the output.
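The back-of-the-envelope arithmetic behind that 1.7e11 figure can be reproduced as follows (the characters-per-value figure for CSV text output is an assumption):

```python
# Rough size of the CSV output if three NT x NT matrices are saved per iteration.
n_iters = 1000
n_matrices = 3
NT = 2500
values = n_iters * n_matrices * NT * NT   # number of values written
chars_per_value = 9                        # assumed average, incl. separator
approx_bytes = values * chars_per_value
print(f"{approx_bytes:.2e}")               # ~1.7e11 bytes
```

At that scale, serializing and writing the output dominates the 0.2s-per-Cholesky arithmetic by a wide margin.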

Aki

terrance savitsky

Apr 8, 2016, 11:41:04 AM
to stan-...@googlegroups.com
Hi Aki and Bob, 

I attach the .stan script version I used, below.  My experience is as Aki predicted (pun intended).  In this application, I wish to perform inference on the function values (under a GP prior) at the training points.  Since we marginalize out the function for posterior inference, I sample the function values from the posterior predictive in the generated quantities block.  I use ADVI to approximate the joint posterior; the subsequent time to generate samples is slower than under parametric priors (e.g., a truncated DP prior) because of the Choleskys, but it is fast enough.

The issue is that all of the Choleskys - over the iterations used to draw samples from the approximate posterior - are retained in RAM, no matter whether they are declared as temporary / local objects.   You will see that I have used local blocks to declare these objects as temporary/local.  I find doing so speeds up the time it takes ADVI to find the local optimum, but it has no impact on the memory storage of the Choleskys - which I don't return.  The result is that when I increase the sample to N = 4000, the amount of local storage exceeds 64GB.    I wondered whether objects - even temporary objects - are written to R memory from C++ on each sampling iteration.  Would it be possible - sometime in the future - to explicitly free memory at the C++ layer for local objects on each sampling iteration, without transferring them to R?

Terrance

/* (data and transformed data blocks elided; same as in the first script) */
parameters{
  real<lower=0> eta; /* squared exponential formula precision parameter */
  real<lower=0> rho[P]; /* length-scale parameters, indexed by predictor, x[,p] */
} /* end parameters block */

transformed parameters{
  matrix[NT,NT] K; /* GP covariance matrix for u */
  matrix[NT,NT] Sigma; /* marginal covariance matrix for y */
  matrix[NT,NT] L; /* cholesky_decompose(Sigma) */

  /* (local block filling K elided; same as in the first script) */

  /* add model noise on the diagonal to achieve the marginal distribution */
  /* of y */
  Sigma     <- K + diag_matrix( sigma2_y );
  L         <- cholesky_decompose( Sigma );
  
  
} /* end transformed parameters block */


model{
  
  // priors for mean and covariance cluster location 
  eta       ~ gamma( 1.0, 1.0 );
  rho       ~ gamma( 1.0, 1.0 );
  
   
  // observed response likelihood
  y         ~ multi_normal_cholesky(zros_NT,L);
  
  
  
} /* end model{} block */

generated quantities{
  vector[NT] u; /* GP function drawn from posterior predictive */
  /* compute MSE for fit evaluation */
  real<lower=0> TSE;
  real<lower=0> MSE;
  /* posterior predictive quantities */
  { /* environment to declare local variables */
    matrix[NT,NT] div_L_K; /* L^-1 * K */
    matrix[NT,NT] Tau; /* posterior predictive covariance matrix for u */
    matrix[NT,NT] L_u; /* cholesky of Tau */
    vector[NT] mu_u; /* predictive mean of NT x 1, u */
    div_L_K                <- mdivide_left_tri_low(L,K);
    mu_u                   <- div_L_K' * mdivide_left_tri_low(L,y);
    Tau                    <- K - div_L_K' * div_L_K;
    L_u <- cholesky_decompose(Tau);
    u   <- multi_normal_cholesky_rng(mu_u,L_u);
    TSE <- squared_distance(u,y);
    MSE <- TSE/(NT);
  } /* end local variable block */
  
  
} /* end generated quantities block */

Aki Vehtari

Apr 9, 2016, 4:43:06 AM
to Stan users mailing list
If memory is the problem, then I guess the next step is to make all NTxNT matrices local (Bob can verify how generated quantities behaves when only local variables are big). Currently L is not local. To make L local you need to recompute the covariance matrix in generated quantities. Even with the additional computation, it is possible that saving the memory makes it faster. If that doesn't work, then you need to move the generated quantities computations out of Stan.
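A sketch of what that all-local generated quantities block might look like - assuming the covariance-filling code from transformed parameters is repeated inside the local scope (untested):

```stan
generated quantities{
  vector[NT] u; /* only u is stored in the output */
  { /* all NT x NT matrices local, so none are written to the csv */
    matrix[NT,NT] K;
    matrix[NT,NT] Sigma;
    matrix[NT,NT] L;
    matrix[NT,NT] div_L_K;
    matrix[NT,NT] Tau;
    vector[NT] mu_u;
    /* ... rebuild K from eta, rho, x exactly as in transformed parameters ... */
    Sigma   <- K + diag_matrix( sigma2_y );
    L       <- cholesky_decompose( Sigma );
    div_L_K <- mdivide_left_tri_low(L,K);
    mu_u    <- div_L_K' * mdivide_left_tri_low(L,y);
    Tau     <- K - div_L_K' * div_L_K;
    u       <- multi_normal_cholesky_rng(mu_u, cholesky_decompose(Tau));
  } /* end local block */
}
```

The covariance build is duplicated, but since generated quantities involves no gradients the recomputation should be cheap relative to the I/O saved.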

Aki

terrance savitsky

Apr 9, 2016, 9:39:10 AM
to stan-...@googlegroups.com
L can't be local because it's used in both the model block and generated quantities.   I suppose I could compute L twice in order to make it local.  I hadn't been motivated to do this because I don't believe the locally-declared covariance matrices are actually treated as local in memory management.  I'll try it, anyway, and let you know.


Bob Carpenter

Apr 9, 2016, 4:40:51 PM
to stan-...@googlegroups.com
Local variables are indeed local. They're translated to local
variables in C++. But if they are declared anywhere other than
the transformed data block or generated quantities, they will create
variables on the autodiff stack that last until the derivatives are
evaluated. That's why it's better to define variables in transformed
data if at all possible.

Recalculations in generated quantities are cheap because no derivatives
are involved.

- Bob

terrance savitsky

Apr 9, 2016, 6:20:34 PM
to stan-...@googlegroups.com
Your comments are very helpful.  Thanks, and I'll let you know.  Perhaps my application could serve as a template for a simple case where one wants to estimate the covariance parameters.   Since I estimate the parameters, I can't declare L as transformed data, but building L twice, so that it can be local in each block where it is used, should prevent memory usage from accumulating at the C++ layer as L is computed on each iteration.

Bob Carpenter

Apr 9, 2016, 7:09:52 PM
to stan-...@googlegroups.com
The dynamic peak memory use is almost always dominated
by the cost to evaluate the gradient of the log density.
Stan uses on the order of 40 bytes per subexpression, all of
which needs to be stored at once. Plus another 24 bytes
per parameter (plus terms for Jacobian if it's constrained).
So -((a - b) / c)^2 has three variables and four subexpressions.
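Working that example through with Bob's order-of-magnitude figures (the byte counts are his rough estimates, not exact sizes):

```python
# Rough autodiff memory for evaluating the gradient of -((a - b) / c)^2,
# using the ~40 bytes/subexpression and ~24 bytes/parameter figures above.
n_params = 3        # a, b, c
n_subexprs = 4      # (a - b), (a - b) / c, (...)^2, and the negation
approx_bytes = n_subexprs * 40 + n_params * 24
print(approx_bytes)  # 232
```

The point being that peak memory scales with the total number of subexpressions in the log density, all of which must be held until the gradient is evaluated.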

You can check out our arXiv paper on autodiff if you want
the gory details.

Also, generated quantities is only evaluated once per iteration,
not once per leapfrog step in HMC. Everything there gets translated
straightforwardly into C++ code and will behave as you expect.

- Bob