Stan is too slow with my code compared to WinBUGS; is there some big mistake in my code?


Jiang Du

unread,
Oct 7, 2015, 8:01:28 PM10/7/15
to stan-...@googlegroups.com
data {
  int<lower=1> I;                    // counties
  int<lower=1> T;                    // time periods
  int<lower=1> K;                    // genders
  int<lower=1> J;                    // age groups
  int<lower=0> y[I*T*K*J];           // observed data
  vector[I*T*K*J] n;                 // observed population
  matrix[I, I] C;                    // adjacency matrix
  real max_rho;                      // upper bound of rho
}
transformed data {
  vector[1] zero;
  zero <- rep_vector(0, 1);
}
parameters {
  vector[I] gamma;                   // spatial effects
  vector[T-1] theta_raw;             // time
  real alpha_raw;                    // gender
  vector[J-1] beta_raw;              // age
  vector[I*T*K*J] e;                 // error term
  vector[I] mu;                      // spatial mean
  real<lower=0> delta_1;             // variance component of B
  real<lower=0, upper=max_rho> rho;
}
transformed parameters {
  vector[T] theta;                   // time
  vector[K] alpha;                   // gender
  vector[J] beta;                    // age
  matrix[I, I] delta_1_inv_B;
  vector[I*T*K*J] Z;
  vector[I*T*K*J] poisson_mean;
  theta <- append_row(zero, theta_raw);
  alpha[1] <- 0;
  alpha[2] <- alpha_raw;
  beta <- append_row(zero, beta_raw);
  delta_1_inv_B <- delta_1 * inverse_spd(diag_matrix(rep_vector(1, I)) - rho * C);
  for (j in 1:J) {
    for (k in 1:K) {
      for (t in 1:T) {
        for (i in 1:I) {
          Z[i + (t-1)*I + (k-1)*I*T + (j-1)*I*T*K]
            <- gamma[i] + theta[t] + alpha[k] + beta[j];
        }
      }
    }
  }
  poisson_mean <- exp(Z + e) .* n;
  // print(poisson_mean);
}
model {
  // for gamma
  mu ~ normal(0, 100);
  delta_1 ~ inv_gamma(0.001, 0.001);
  rho ~ uniform(0, max_rho);
  gamma ~ multi_normal(mu, delta_1_inv_B);

  theta_raw ~ normal(0, 100);
  alpha_raw ~ normal(0, 100);
  beta_raw ~ normal(0, 100);
  e ~ normal(0, 100);
  y ~ poisson(poisson_mean);
}




Basically, I'm trying to build a simple hierarchical model:
y_itkj ~ Poisson(p_itkj * n_itkj)
log(p_itkj) = gamma_i + theta_t + alpha_k + beta_j + e
where gamma follows a CAR process: gamma ~ mvnorm(mu, delta1 * B^(-1)), with B = I - rho * C and C the adjacency matrix. For all the other parameters, I set the first element to 0 and let the rest follow univariate Normal(0, 100) priors. e is the error term.
I = 115, T = 8, K = 2 (which is why alpha has 2 dimensions), J = 4.

This model is a test model for me to make sure I understand the Stan language. I have fit this simple model in WinBUGS, where 10,000 iterations take about 5 minutes. However, after compiling in RStan, it has only reached the 8th iteration (warmup) after an hour....

So I suspect my problem is not just a matter of optimizing or vectorizing.... There must be some major mistake, but I can't find it. The only thing I can think of is to avoid the for loop that flattens the array Z into a vector, and instead just use it as an array.

I'm learning Stan these days because OpenBUGS can't fit my full model due to its 1.6 GB memory limit, and Stan clearly supports many more functions, which is very helpful for my full model.

I hope I have made my problem clear. Since this is my first time using this mailing list, please let me know if anything is unclear.


--
Regards,
Jiang Du
Additive_model.stan

Andrew Gelman

unread,
Oct 7, 2015, 8:15:07 PM10/7/15
to stan-...@googlegroups.com
The inv_gamma(0.001, 0.001) prior is bad news; you should get rid of that.  Beyond that, I have not looked at such models in a while so I can’t really say anything else.  Maybe try it on a low-dimensional example where it runs fast to check that you’re getting the right answer?  There has to be some mistake if 8 iterations take 1 hour!


Bob Carpenter

unread,
Oct 7, 2015, 8:21:28 PM10/7/15
to stan-...@googlegroups.com
This is more than just efficiency of evaluating functions,
though there's a lot you can do there. It sounds like Stan's
getting stuck trying to adapt.

For arithmetic stability, you want to replace

> poisson_mean <- exp(Z+e).*n;
...
> y ~ poisson(poisson_mean);


with

y ~ poisson_log((Z + e) + log(n));

That'll keep everything on the log scale. The poisson_log
takes a log-scale mean parameter.
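Concretely, the change could look like this sketch against the code above (it also precomputes log(n) once, since the population counts are data, and lets you drop poisson_mean and the exp entirely):

```stan
transformed data {
  vector[I*T*K*J] log_n;   // n is data, so log it once here, not every iteration
  log_n <- log(n);
}
...
model {
  // log-rate = log(p) + log(n) = Z + e + log(n)
  y ~ poisson_log(Z + e + log_n);
}
```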

I also don't think this will be very efficient:

delta_1_inv_B <- delta_1*inverse_spd(diag_matrix(rep_vector(1, I))-rho*C);

I'm not the best with linear algebra, but if at all possible,
you want to parameterize with Cholesky factors. I think that may
involve Cholesky factorizing that adjacency matrix and I don't even
know if that makes sense.

I'm also not sure about the inverses. Usually you want to use division,
because it's more stable:

a / b =def= a * inverse(b).

But I'm not sure that general rule holds here given that you can use the
spd version of inverse and we don't have a specialized division for
that.
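In Stan, that "division" is the mdivide family of solvers. For a symmetric positive-definite matrix the sketch below applies, with B and a as hypothetical names for the system matrix and right-hand side:

```stan
// instead of x <- inverse_spd(B) * a, solve the system directly:
vector[I] x;
x <- mdivide_left_spd(B, a);   // solves B * x = a without forming the inverse
```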

The real trouble may be either in the initialization being way
out in the tail where Stan can't adapt or in the priors:

> mu ~ normal(0, 100);
> delta_1 ~ inv_gamma(0.001, 0.001);
> rho ~ uniform(0, max_rho);
> gamma ~ multi_normal(mu, delta_1_inv_B);
> //
> theta_raw ~ normal(0, 100);
> alpha_raw ~ normal(0, 100);
> beta_raw ~ normal(0, 100);
> e ~ normal(0, 100);

We recommend applying stronger priors unless your data's really on
this scale. That is, if you don't think 200 is a reasonable value
for mu, then the prior shouldn't be normal(0, 100). We also
recommend against the interval constrained prior for rho and the very
diffuse inverse gammas --- see the manual chapter on regression for discussion
and references. Not just in Stan --- for statistical reasons.

I'm hoping it's something simpler, though. I don't see any reason why
this model wouldn't fit. How big is the data both in terms of number of
observations and the mean and sd of y?

- Bob




> On Oct 7, 2015, at 8:01 PM, Jiang Du <jiang...@gmail.com> wrote:
>

Bob Carpenter

unread,
Oct 7, 2015, 8:24:45 PM10/7/15
to stan-...@googlegroups.com
Does the BUGS model give you the right answer for simulated data?
Are estimates of parameters like rho near their boundaries (0 and max_rho)?

- Bob

> On Oct 7, 2015, at 8:01 PM, Jiang Du <jiang...@gmail.com> wrote:
>

Jiang Du

unread,
Oct 8, 2015, 12:22:04 PM10/8/15
to stan-...@googlegroups.com
I didn't try simulated data; however, the BUGS estimates for my real data set are almost the same as the ones from a frequentist method. Your guess about rho is correct: in my data set, max_rho is around 0.173, and the posterior is quite skewed toward the upper boundary, with the mode somehow around 0.17. Actually I also tried to code my full model with

log(p_itkj) = Z_itkj + error, where Z_itkj follows a partially informative normal prior, with precision matrix H being the Kronecker product of three precision matrices (115 × 8 × 2).

I implemented my full model in R and used a simple MH sampler to draw rho from its full conditional distribution, but due to the boundary issue, the acceptance rate is too low. That's why I want to learn Stan, to see if it can help me.

And maybe another question: in Stan, I can program my own Kronecker product function, but every matrix in my model is a precision matrix. If I want to use multi_normal, I need to supply the covariance matrix; does that mean I have to invert it? How can I avoid that?

And here is the BUGS code for the simple additive model I used:
model {
  # likelihood
  for (i in 1:I) {
    for (t in 1:T) {
      for (k in 1:K) {
        for (j in 1:J) {
          y[i, t, k, j] ~ dpois(lambda[i, t, k, j])
          lambda[i, t, k, j] <- n[i, t, k, j] * p[i, t, k, j]
          log(p[i, t, k, j]) <- z[i, t, k, j] + e[i, t, k, j]
          z[i, t, k, j] <- gamma[i] + theta[t] + alpha[k] + beta[j]
          # prior for e[i, t, k, j]
          e[i, t, k, j] ~ dnorm(0, e.tau)
          rate_per_hthousands[i, t, k, j] <- p[i, t, k, j] * 100000
        }
      }
    }
  }

  # K = 2
  alpha[1] <- 0
  alpha[2] ~ dnorm(0, 1.0E-04)

  beta[1] <- 0
  for (j in 2:J) {
    beta[j] ~ dnorm(0, 1.0E-04)
  }

  e.tau ~ dgamma(0.001, 0.001)

  # for spatial smoothing
  gamma[1:I] ~ car.proper(gamma_mu[1:I], C[], adj[], num[], M[], gamma_tau, gamma_rho)
  for (i in 1:I) {
    M[i] <- 1;
  }
  for (i in 1:sumNumNeigh) {
    C[i] <- 1;
  }

  # intercept
  mu ~ dnorm(0, 1.0E-04)
  for (i in 1:I) {
    gamma_mu[i] <- mu
  }
  gamma_tau ~ dgamma(0.001, 0.001)

  rho.min <- max(0, min.bound(C[], adj[], num[], M[]))  # we just want positive correlation, so start at 0, not rho.min
  rho.max <- max.bound(C[], adj[], num[], M[])
  gamma_rho ~ dunif(rho.min, rho.max)

  # for temporal smoothing, provide S and identity matrix
  theta[1] <- 0
  for (t in 2:T) {
    theta[t] ~ dnorm(0, 1.0E-04)
  }
}



On Wednesday, October 7, 2015 at 7:24:45 PM UTC-5, Bob Carpenter wrote:

Jiang Du

unread,
Oct 8, 2015, 12:34:19 PM10/8/15
to stan-...@googlegroups.com
SAMPLING FOR MODEL 'Additive_model' NOW (CHAIN 1).


Chain 1, Iteration:  1 / 20 [  5%]  (Warmup)
Chain 1, Iteration:  2 / 20 [ 10%]  (Warmup)
Chain 1, Iteration:  4 / 20 [ 20%]  (Warmup)
Chain 1, Iteration:  6 / 20 [ 30%]  (Warmup)
Chain 1, Iteration:  8 / 20 [ 40%]  (Warmup)
Chain 1, Iteration: 10 / 20 [ 50%]  (Warmup)
Chain 1, Iteration: 11 / 20 [ 55%]  (Sampling)
Chain 1, Iteration: 12 / 20 [ 60%]  (Sampling)
Chain 1, Iteration: 14 / 20 [ 70%]  (Sampling)
Chain 1, Iteration: 16 / 20 [ 80%]  (Sampling)
Chain 1, Iteration: 18 / 20 [ 90%]  (Sampling)
Chain 1, Iteration: 20 / 20 [100%]  (Sampling)
#  Elapsed Time: 1896.77 seconds (Warm-up)
#                3111.6 seconds (Sampling)
#                5008.37 seconds (Total)


[1] "The following numerical problems occured the indicated number of times on chain 1"
                                                                                                     count
Exception thrown at line 56: multi_normal_log: Covariance matrix is not symmetric. Covariance matrix     2
[1] "If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,"
[2] "but if this warning occurs often then your model may be either severely ill-conditioned or misspecified."                              
Warning messages:
1: There were 10 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth.
2: Examine the pairs() plot to diagnose sampling problems


The log above is what I got just now.

The summary of y:

 Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.000   0.000   1.000   3.453   5.000  22.000

The summary of n:

 Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 1065    2108    3036    5903    7639   16620

Both are vectors of length 115*8*2*4. (I used arrays in BUGS, but for vectorization I flattened them into vectors in Stan....)

Your suggestion about poisson_log is very good. I'll do it right away.

In terms of calculating the covariance matrix :
     
delta_1_inv_B <- delta_1*inverse_spd(diag_matrix(rep_vector(1, I))-rho*C);


Is there a way to avoid that, since I need a covariance matrix for multi_normal anyway? Also, C is not positive definite, so a Cholesky factorization won't work for me.

I will modify my priors to be stronger. For the initial values, I used frequentist estimates, the same as in OpenBUGS, which turned out to be almost the same as the posterior means.

Also, regarding "Exception thrown at line 56: multi_normal_log: Covariance matrix is not symmetric":
is that asymmetry caused by numerical inaccuracy? I checked in R and the matrix is symmetric.


On Wednesday, October 7, 2015 at 7:21:28 PM UTC-5, Bob Carpenter wrote:

Jiang Du

unread,
Oct 8, 2015, 12:38:31 PM10/8/15
to Stan users mailing list, gel...@stat.columbia.edu
That inv_gamma is a habit from BUGS.... I'll change it based on my estimates to get my model running first. I will simulate a small data set now. Thanks!

On Wednesday, October 7, 2015 at 7:15:07 PM UTC-5, Andrew Gelman wrote:

Bob Carpenter

unread,
Oct 8, 2015, 4:04:15 PM10/8/15
to stan-...@googlegroups.com

> On Oct 8, 2015, at 12:22 PM, Jiang Du <jiang...@gmail.com> wrote:
>
> I didn't try the simulated data, however, the BUGS estimates for my real data set is almost as the ones from a frequetist methods.

OK, that's a reasonable sanity check.

> Your guess about rho is correct. In my data set, max_rho is around 0.173, and the posterior is quite skewed to the upper boundary giving the mode around 0.17 somehow.

If the median of the posterior is around 0.17, you shouldn't set
the max at 0.173 --- being that close to the boundary is problematic
for the transform we use and may be causing the issue you're having
with tree depth and step size.

Our general advice is to provide plenty of room for the full posterior
in any constraints that you set.

> Actually I also tried to code my full model with
>
> log(p_itkj) = Z_itkj + error, where Z_itkj follows a partial informative normal prior, with precision matrix H being the Kronecker product of three precision matrix(115*8*2).
>
> I implemented my full model in R and use a simple MH to sample rho from the full conditional distribution, but due to the boundary issue, the acceptance rate is too low. That's why I want to learn Stan to see if it can help me.
>
> And maybe another question, like in Stan, I can program my function of Kronecker product, however, every matrix in my model is about precision matrix. But if I want to use multi_norm, I need to give the covariance matrix, does that mean I have to inverted it? How can I avoid that?

For statistical reasons of interpretability, we like to
parameterize with a scaled correlation matrix with an LKJ
prior. For efficiency reasons, it's best to code these
with Cholesky factors.

Having said that, we do allow the precision matrix formulation
of multi-normal. But it doesn't really save any work because you
still need to solve the system given by the matrix for the determinant.

We don't make any use of conjugacy in Stan.
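For reference, the precision-matrix formulation mentioned above would avoid the explicit inverse in this thread's model. A sketch: since the covariance of gamma is delta_1 * B^(-1), its precision is B / delta_1, which can be passed to multi_normal_prec directly:

```stan
transformed parameters {
  matrix[I, I] B_prec;   // precision of gamma: (I - rho*C) / delta_1
  B_prec <- (diag_matrix(rep_vector(1, I)) - rho * C) / delta_1;
}
model {
  gamma ~ multi_normal_prec(mu, B_prec);
}
```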

To get rid of your asymmetry problems, you should define half the
matrix in terms of the other half --- numerically it can be very
unstable even if algebraically you are applying operations that look
like they should yield symmetry. And I mean brute force

for (m in 1:(M-1))
  for (n in (m+1):M)
    Sigma[m, n] <- Sigma[n, m];
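Applied to the covariance matrix in this thread's model, that symmetrization would go right after the inverse is computed in transformed parameters (a sketch using the original variable names):

```stan
delta_1_inv_B <- delta_1 * inverse_spd(diag_matrix(rep_vector(1, I)) - rho * C);
// force exact symmetry: mirror the lower triangle onto the upper triangle
for (m in 1:(I-1))
  for (n in (m+1):I)
    delta_1_inv_B[m, n] <- delta_1_inv_B[n, m];
```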

Not knowing GeoBUGS, I don't know what this does:

> gamma[1:I] ~ car.proper(gamma_mu[1:I],C[], adj[], num[], M[], gamma_tau, gamma_rho)


But presumably they have something specialized for that
density.

- Bob


> And here is the BUGS code for the simple additive model I used:

Andrew Gelman

unread,
Oct 8, 2015, 7:27:31 PM10/8/15
to stan-...@googlegroups.com

> On Oct 8, 2015, at 4:04 PM, Bob Carpenter <ca...@alias-i.com> wrote:
>
> If the median of the posterior is around 0.17, you shouldn't set
> the max at 0.173 ---- it will be problematic with the transform
> we use to be that close to the boundary and may be causing the
> issue you're having with tree depth in step size.
>
> Our general advice is to provide plenty of room for the full posterior
> in any constraints that you set.

In general we prefer soft constraints. Use hard constraints when they are an intrinsic part of the model (for example, a correlation that has to be between -1 and 1). Use soft constraints to encode prior information and to induce regularization.


Jiang Du

unread,
Oct 20, 2015, 12:54:25 PM10/20/15
to Stan users mailing list, gel...@stat.columbia.edu
Hi Andrew, 
I'm rethinking your suggestion.

In my case, rho is part of a precision matrix, and for the matrix to be positive definite, rho needs to lie in the interval (0, max_rho).

Is that soft constraint or hard one?

Actually, in my model block I specify rho ~ uniform(0, max_rho). From your reply in another post, does that mean the uniform prior lets me declare rho as just "real rho;", without any bounds?

Thanks,
Jiang

On Thursday, October 8, 2015 at 6:27:31 PM UTC-5, Andrew Gelman wrote:

Bob Carpenter

unread,
Oct 20, 2015, 6:25:18 PM10/20/15
to stan-...@googlegroups.com
That's a hard constraint governed by mathematical requirements of
positive definiteness. So you need to declare

real<lower=0, upper=max_rho> rho;

to make sure rho is in (0, max_rho). It will get an implicit uniform
distribution over the valid values, so you don't need to include the uniform
distribution in the model block.

- Bob

Jiang Du

unread,
Oct 20, 2015, 6:43:45 PM10/20/15
to Stan users mailing list
Thank you Bob. I will revise my model right away.

Jiang

On Tuesday, October 20, 2015 at 5:25:18 PM UTC-5, Bob Carpenter wrote:

Andrew Gelman

unread,
Oct 20, 2015, 8:45:37 PM10/20/15
to Jiang Du, Stan users mailing list
Hi, it’s best to put your constraint on the covariance matrix as a whole (for example, using an LKJ prior), rather than trying to place constraints on one piece at a time.
A