Should the sum-to-zero constraint be added before or after sampling?


Jiang Du

May 21, 2016, 8:57:32 PM
to Stan users mailing list
Hi, I hope I can explain things clearly. I've seen different ways to add the sum-to-zero constraint, so I roughly wrote out two example versions:
# If I am monitoring alpha_stz and beta_stz: use the raw parameters to sample, just monitor the stz versions
parameters{
  vector[N] alpha; # random effect, MVN prior
  real beta;
  real<lower=1e-6> sigma2;
}
transformed parameters{
  vector[N] alpha_stz;
  real beta_stz;
  real mean_alpha;
  mean_alpha <- mean(alpha);
  alpha_stz <- alpha - mean_alpha;
  beta_stz <- beta + mean_alpha;
}
model{
  alpha ~ multi_normal_prec(zeros, Tau); # zeros, Tau defined somewhere else
  beta ~ normal(1, 2);
  y ~ normal(alpha + beta, sigma2);
}
# Another way: use the stz versions in the sampling statements
parameters{
  vector[N] alpha; # random effect, MVN prior
  real beta;
  real<lower=1e-6> sigma2;
}

model{
  vector[N] alpha_stz;
  real beta_stz;
  real mean_alpha;

  mean_alpha <- mean(alpha);
  alpha_stz <- alpha - mean_alpha;
  beta_stz <- beta + mean_alpha;

  alpha ~ multi_normal_prec(zeros, Tau); # zeros, Tau defined somewhere else
  beta ~ normal(1, 2);
  y ~ normal(alpha_stz + beta_stz, sigma2);
}


So basically, I have a vector of random effects, alpha (for example, a spatial effect), and an intercept, beta. In order to identify the two, I want to add a sum-to-zero constraint to alpha. Here are two ways to do it:
1. Ignore the constraint during sampling. After getting the posterior samples, apply the constraint to every iteration of the samples.
2. In every iteration, use the constraint to create local variables alpha_stz and beta_stz, and use those to compute the log likelihood.

I understand the probability density will be the same in either case, since the sum of alpha and beta never changes. However, there is some difference between these two methods that I can't describe clearly. What is the correct way to do this?

Thanks,

Jiang



Bob Carpenter

May 22, 2016, 10:20:20 PM
to stan-...@googlegroups.com
I'd suggest the approach described in the Stan manual.
There's a section in the regression chapter: Parameterizing
Centered Vectors.

The approach you suggest won't work in Bayesian inference
because the parameters in your model won't have a proper
posterior. The general issue is discussed in the problematic
posteriors chapter of the manual.
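
For reference, the K-1 free-parameter version from that section looks roughly like this (N and alpha_raw are just placeholder names):

parameters {
  vector[N - 1] alpha_raw;   # N - 1 free components
}
transformed parameters {
  vector[N] alpha;           # sums to zero by construction
  for (n in 1:(N - 1))
    alpha[n] <- alpha_raw[n];
  alpha[N] <- -sum(alpha_raw);
}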

- Bob

Jiang Du

May 23, 2016, 2:02:17 AM
to Stan users mailing list
Hi Bob,

I did read those sections, but the code I wrote probably follows habits from WinBUGS. I also referred to Kyle Foreman's Stan code for the constraint (in this post: https://groups.google.com/forum/#!searchin/stan-users/spatial%7Csort:date/stan-users/M7T7EIlyhoo/jli3CRobjeIJ):


The reason I stopped using the "K-1" version of the raw parameters is that I want a joint prior on the random effect. I also don't know what will happen if I use the translated and scaled simplex method.

Jiang

On Sunday, May 22, 2016 at 9:20:20 PM UTC-5, Bob Carpenter wrote:

Bob Carpenter

May 23, 2016, 11:55:22 PM
to stan-...@googlegroups.com
It is something people did in WinBUGS, but it's going to
be a problem in Stan because the posterior over the parameters
(that is, what's defined in the parameters block) isn't proper.
In particular, the no-U-turn sampler won't make U-turns --- it'll
just keep running.

- Bob

Jiang Du

May 24, 2016, 11:18:59 PM
to Stan users mailing list
I see. Thank you, Bob. I think the simplex method can be used, but I'm worried about the prior.
Suppose my vector u ~ MVN(0, Sigma), where Sigma is a covariance matrix; then how do I define the "scale" in the transformation? Is it the Cholesky factor L? And will the simplex restriction change the prior at all?

There is only a univariate example in the manual.
Thanks,
Jiang

On Monday, May 23, 2016 at 10:55:22 PM UTC-5, Bob Carpenter wrote:

Bob Carpenter

May 25, 2016, 10:57:24 AM
to stan-...@googlegroups.com
There are a couple of multivariate examples in the GP chapter
and in the change-of-variables chapter.

But I'm not sure what you want to do or what in particular
you're worried about. You can just put a prior on the centered
vector components directly. The transform is linear, so there's
no Jacobian needed.

- Bob

Jiang Du

May 25, 2016, 2:25:54 PM
to Stan users mailing list
Hi Bob,
I think it's better to clarify my problem in code. I just wrote a sketch and there may be some mistakes, but I want to show the idea I have in mind; I'm still testing and modifying the code.

The whole idea of my model is to jointly model two spatial variables by explicitly defining their cross-covariance matrix (not the precision matrix, which is why the inverse cannot be avoided). Suppose I have two spatial variables, y1 and y2, and let Y = c(y1, y2). Y is modeled in a Poisson regression framework using the log link, so
log_p = alpha + u + error;
with a different alpha for each variable.
u is the spatial effect for y1 and y2 and follows MVN(0, Sigma), where Sigma is a matrix of 4 blocks. The diagonal blocks are the usual covariance matrices for spatial CAR models, and the off-diagonal blocks describe how the two spatial variables are correlated across maps.

In order to identify alpha and u, I need to center u. That's why I followed Kyle Foreman's code.
Here is a sketch of the code:
data{
  int<lower=0> J1; # number of spatial regions for map 1
  int<lower=0> J2; # number of spatial regions for map 2
  int Y[J1+J2]; # event counts for the two maps, modeled in Poisson regression
  int pop[J1+J2]; # corresponding population, offset
  matrix[J1, J1] W_1; # adjacency matrix for the first map
  matrix[J2, J2] W_2; # adjacency matrix for the second map
  matrix[J1, J1] D_1; # diagonal matrix with diagonal elements being the number of neighbors
  matrix[J2, J2] D_2; # diagonal matrix with diagonal elements being the number of neighbors
  matrix[J1, J2] Psi_12; # a matrix indicating how the regions in the two maps are correlated
}
transformed data {
  # zeros for MVN prior
  vector[J1+J2] zeros;
  zeros <- rep_vector(0.0, J1+J2);
}
parameters{
  real<lower=0, upper=1> rho[2]; # strength of spatial correlation, one per map
  vector[J1+J2] u; # spatial effects for the two maps
  real<lower=-0.3, upper=0.3> gamma12; # correlation between the two maps, restricted for positive definiteness
  vector[2] alpha; # intercept for each spatial map
  vector[J1+J2] epsilon; # standard normal
  real<lower=0> sigma2; # variance of the error term in the log mean of the Poisson
  vector[2] delta; # variance component of the CAR model
}
transformed parameters{
  real<lower=0> sigma; # sqrt of sigma2
  sigma <- sqrt(sigma2);
}
model{
   matrix[J1, J1] B_1; # precision matrix for the 1st map
   matrix[J2, J2] B_2; # precision matrix for the 2nd map
   matrix[J1+J2, J1+J2] Sigma; # unscaled covariance matrix for the spatial effects
   matrix[J1+J2, J1+J2] Sigma_full; # scaled covariance matrix for the spatial effects
   vector[J1+J2] log_p; # Poisson mean on the log scale
   vector[J1+J2] alpha_vec; # broadcast the intercepts
   vector[J1+J2] delta_vec; # broadcast the variance components
   matrix[J1+J2, J1+J2] DELTA; # delta_vec as a diagonal matrix

   delta_vec[1:J1] <- rep_vector(delta[1], J1);
   delta_vec[(J1+1):(J1+J2)] <- rep_vector(delta[2], J2);
   DELTA <- diag_matrix(delta_vec);
   alpha_vec[1:J1] <- rep_vector(alpha[1], J1);
   alpha_vec[(J1+1):(J1+J2)] <- rep_vector(alpha[2], J2);
   B_1 <- D_1 - rho[1] * W_1;
   B_2 <- D_2 - rho[2] * W_2;
   # assign Sigma in blocks:
   # Sigma_full = DELTA %*%
   #   [ B_1_inv              gamma12 * Psi_12 ]
   #   [ t(gamma12 * Psi_12)  B_2_inv          ]
   # %*% DELTA
   Sigma[1:J1, 1:J1] <- inverse_spd(B_1);
   Sigma[(J1+1):(J1+J2), (J1+1):(J1+J2)] <- inverse_spd(B_2);
   Sigma[1:J1, (J1+1):(J1+J2)] <- gamma12 * Psi_12;
   Sigma[(J1+1):(J1+J2), 1:J1] <- (gamma12 * Psi_12)';
   Sigma_full <- DELTA * Sigma * DELTA;
   # prior for the spatial effects
   u ~ multi_normal(zeros, Sigma_full);
   epsilon ~ normal(0, 1);
   log_p <- log(to_vector(pop)) + alpha_vec + u + sigma * epsilon;
   Y ~ poisson_log(log_p); # likelihood
}


And based on your suggestions, is this the correct way to do the centering?
parameters{
  simplex[J1+J2] z;#std normal
}

model{
  matrix[J1+J2, J1+J2] L; # Cholesky factor of Sigma_full
  z ~ normal(0,1);
  L <- cholesky_decompose(Sigma_full);
  u <- L*z;
}

Thanks. Jiang



On Wednesday, May 25, 2016 at 9:57:24 AM UTC-5, Bob Carpenter wrote:

Bob Carpenter

May 26, 2016, 3:33:23 PM
to stan-...@googlegroups.com
That's not the non-centered parameterization of a multivariate.
You don't need to restrict to a simplex.

I thought you meant how to represent a centered vector v
such that mean(v) = 0.
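
The usual non-centered multivariate pattern is roughly this (reusing the names from your scratch code, with Sigma_full built as before):

parameters {
  vector[J1 + J2] z;   # unconstrained, not a simplex
}
model {
  vector[J1 + J2] u;
  matrix[J1 + J2, J1 + J2] L;
  z ~ normal(0, 1);                      # standard normal on each component
  L <- cholesky_decompose(Sigma_full);
  u <- L * z;                            # implies u ~ multi_normal(0, Sigma_full)
}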

- Bob

Bob Carpenter

May 26, 2016, 3:33:24 PM
to stan-...@googlegroups.com
Also, you don't ever want to be computing inverses
or doing full quadratic rescaling. We have a quad_form_diag()
function to rescale a covariance matrix. But you're better
off working with Cholesky factors of everything.
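
In the scratch code above, that would mean replacing the DELTA construction and the DELTA*Sigma*DELTA line with something like:

   Sigma_full <- quad_form_diag(Sigma, delta_vec); # = diag_matrix(delta_vec) * Sigma * diag_matrix(delta_vec)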

- Bob

Jiang Du

May 26, 2016, 9:51:13 PM
to Stan users mailing list
Thanks for your advice. I had already switched to quad_form_diag while running the code.

The way I want to build the covariance requires inverting some matrix, say B. I even tried using its Cholesky factor L (LL' = B) and then inverting L; I don't think Stan has a dedicated inverse for triangular matrices. But the matrix inverse is still really slow, taking more than an hour for 100 iterations. So I gave up some interpretability of my covariance matrix, found a way to work with the precision matrix instead, and the model gave me reliable results (an hour for 5,000 iterations with convergence met).

Here are some of my questions:
1. About the inverse: I've done Gibbs sampling in R that involves the inverse of the covariance matrix as above, and the speed was not that bad. Is it computing the gradients that makes Stan slow at the inverse? For a matrix that isn't huge (under 500x500), the inverse should be fast in C++ or R using the Cholesky-based inverse. (I do try to avoid computing the inverse whenever I can, but this time the inverse is a must.)

2. When sampling from a multivariate normal distribution, people usually use the Cholesky decomposition to transform a vector of N(0,1) random variables, which is fast. In Stan, if u ~ multi_normal(0, Sigma), which way is faster or better?
  • One way is to just specify it as is;
  • another way is u <- L*z, where z ~ N(0,1) and LL' = Sigma;
  • or something like u ~ multi_normal_cholesky(0, L), where LL' = Sigma.
And when using multi_normal(0, Sigma), will it be expensive because Stan calculates the inverse under the hood? Or, how does Stan calculate the log density of multi_normal when given a covariance matrix rather than a precision matrix?

Thanks,

Jiang 


On Thursday, May 26, 2016 at 2:33:24 PM UTC-5, Bob Carpenter wrote:

Bob Carpenter

May 28, 2016, 3:17:48 PM
to stan-...@googlegroups.com
1. Yes, it's taking the gradients that makes things slow.
If you wind up doing something like inv(a) * b, it's better
to do a \ b and same for a * inv(b).
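
In Stan code that left division is, if I remember right, mdivide_left_spd() for a symmetric positive-definite a, e.g.:

  x <- inverse_spd(a) * b;      # slow: builds the explicit inverse
  x <- mdivide_left_spd(a, b);  # faster: solves a \ b directly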

2. That's a lot of questions about the underlying code.
The best bet's to use a Cholesky-based parameterization of
the covariance matrix (usually by scaling a Cholesky-based
parameterization of the correlation matrix in our applications),
because then the solving required for the determinant is fast.

The other big speedup comes from grouping observations and/or
means into an array so that the matrix is only factored once.
This isn't possible if the covariance matrix varies with each
observation, but if it does, that's probably going to be a very
hard model to fit.
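
A rough sketch of that grouping, assuming y is declared as an array of vectors that share one mean mu and one N x N covariance Sigma:

model {
  matrix[N, N] L;
  L <- cholesky_decompose(Sigma);    # factor once
  y ~ multi_normal_cholesky(mu, L);  # vectorized over the array, so Sigma is only factored once
}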

- Bob