Big differences in parameter estimates between JAGS and STAN for same model


atzap

Feb 23, 2013, 11:04:01 PM
to stan-...@googlegroups.com
Hi all

I'm rather perplexed by the following problem.  I am fitting a basic linear model in rstan to a normally distributed variable y, and trying to predict the mean of this normal (mu) given different levels of a factor (alpha) and a series of covariates (x.true) with their associated effects (beta's).  The model and data are available in the attached script.  I then re-run this exact same model in JAGS (see the link to the model file in the attached R file) and get perfectly correlated predictions for mu between STAN and JAGS.

The problem is that the beta and alpha estimates are nowhere near each other!  I realize that the model hasn't fully converged in JAGS (again showing the tremendous efficiency of HMC!), but it does eventually converge when everything is run about 10X longer, and the estimates are still nowhere near each other.  Looking at the posterior summaries, it looks like the taualpha estimate increases at the expense of taub in STAN as compared with JAGS, where the reverse seems to happen.  Entirely removing the hyper-parameter taub doesn't change the result in either JAGS or STAN, so they're still different.  What can possibly be the cause of this?  What am I missing?  Hopefully, there is a relatively straightforward explanation that will save me a few days of frustration.

The attached R script should demonstrate the "problem", though JAGS must be installed on your system.  I am using rstan 1.1 and JAGS 3.3.

Many, many thanks,
Andrew
Rscript.txt

Ben Goodrich

Feb 24, 2013, 12:25:30 AM
to stan-...@googlegroups.com
On Saturday, February 23, 2013 11:04:01 PM UTC-5, atzap wrote:
The problem is that the beta and alpha estimates are nowhere near each other!  I realize that the model hasn't fully converged in JAGS (again showing the tremendous efficiency of HMC!), but it does eventually converge when everything is run about 10X longer, and the estimates are still nowhere near each other.  Looking at the posterior summaries, it looks like the taualpha estimate increases at the expense of taub in STAN as compared with JAGS, where the reverse seems to happen.  Entirely removing the hyper-parameter taub doesn't change the result in either JAGS or STAN, so they're still different.  What can possibly be the cause of this?  What am I missing?  Hopefully, there is a relatively straightforward explanation that will save me a few days of frustration.

I think part of the issue is the different parameterizations of the normal distribution between Stan and JAGS. In Stan, the second parameter is a standard deviation. In the BUGS-family, the second parameter is a precision. So, the priors in your Stan model

taualpha ~ normal(0,100)T[0,];
taus ~ normal(0,100)T[0,];

are very different from the priors in your JAGS model

tau.alpha ~ dnorm(0.00000E+00, 100)I(0,)
taus ~ dnorm(0.00000E+00, 100)I(0,)
Also, it is not necessary to explicitly truncate the priors on taualpha and taus in Stan in this particular case because they are declared as non-negative and the amount of truncated mass is a constant.

Ben
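The mismatch Ben describes is easy to verify numerically. A minimal sketch in plain Python (the `normal_pdf` helper is just the Gaussian density; nothing here comes from the attached scripts):

```python
import math

def normal_pdf(x, mu, sd):
    """Gaussian density in Stan's parameterization (mean, standard deviation)."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# JAGS's dnorm(0, 100) takes a *precision* tau = 100, i.e. sd = tau**-0.5 = 0.1.
sd_from_precision = 100 ** -0.5
assert abs(sd_from_precision - 0.1) < 1e-12

# Stan's normal(0, 100) takes a *standard deviation* of 100 -- a vastly wider prior.
ratio = normal_pdf(0.05, 0, sd_from_precision) / normal_pdf(0.05, 0, 100)
assert ratio > 100  # the two "identical-looking" priors differ by orders of magnitude here
```

So a prior copied verbatim from a .bug file into a .stan file is a different prior unless the second argument is converted.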

atzap

Feb 24, 2013, 2:11:08 AM
to
Thanks for the quick reply Ben.  Yes, I suspected that the precision vs SD might be the issue, but I'm not sure I immediately follow how I haven't accounted for this...  For example, in JAGS, I then transform 
taus ~ dnorm(0.00000E+00, 100)I(0,)
pre.alpha <- pow(tau.alpha, -2)

And pre.alpha goes into the normal prior for alpha.  Isn't this the same thing then as what I've done in STAN?  Or do I need to throw a sqrt on the 100 in STAN?...  I think this is me being daft!

Again, thanks,
Andrew

EDIT:   So pg 249 of the STAN manual suggests that normal(1,2) in Stan = dnorm(1,0.25) in JAGS.  In which case, normal(1, 0.25) is the same as the following: 
normal(1,pre)
pre <- pow(tau, -2)
tau <- 2

Isn't this what I've done in my script?...

Ben Goodrich

Feb 24, 2013, 2:18:32 AM
to stan-...@googlegroups.com
On Sunday, February 24, 2013 1:54:50 AM UTC-5, atzap wrote:
Thanks for the quick reply Ben.  Yes, I suspected that the precision vs SD might be the issue, but I'm not sure I immediately follow how I haven't accounted for this...  For example, in JAGS, I then transform 
taus ~ dnorm(0.00000E+00, 100)I(0,)
pre.alpha <- pow(tau.alpha, -2)

And pre.alpha goes into the normal prior for alpha.  Isn't this the same thing then as what I've done in STAN?  Or do I need to throw a sqrt on the 100 in STAN?...  I think this is me being daft!

Again, thanks,
Andrew

EDIT:   So pg 249 suggests that normal(1,2) in Stan = dnorm(1,0.25) in JAGS.  In which case, normal(1, 0.25) is the same as the following: 
normal(1,pre)
pre <- pow(tau, -2)
tau <- 2

So, isn't this what I've done in my script, and so this isn't the problem?...

You're right, sorry. I didn't look closely enough at your .bug file. But a more likely thing is that in the .stan file you have

beta ~ normal(0.00000E+00, taub);

while in the .bug file you have

    for (i in 1:4) {
        beta[i] ~ dnorm(0.00000E+00, 1e-3)
    }

Although taub has a prior, it seems disconnected from the rest of the DAG.

Ben

Bob Carpenter

Feb 24, 2013, 3:09:24 AM
to stan-...@googlegroups.com
If you have the following in BUGS/JAGS, which
uses precision:

a ~ dnorm(b,c) I(0,)

you should write this in Stan, which uses std dev:

a ~ normal(b,pow(c,-0.5));

- Bob

On 2/24/13 1:54 AM, atzap wrote:
> Thanks for the quick reply Ben. Yes, I suspected that the precision vs SD might be the issue, but I'm not sure I
> immediately follow how I haven't accounted for this... For example, in JAGS, I then transform
> taus ~ dnorm(0.00000E+00, 100)I(0,)
> pre.alpha <- pow(tau.alpha, -2)
>
> And pre.alpha goes into the normal prior for alpha. Isn't this the same thing then as what I've done in STAN? Or do I
> need to throw a sqrt on the 100 in STAN?... I think this is me being daft!
>
> Again, thanks,
> Andrew
>
>
> On Sunday, February 24, 2013 12:25:30 AM UTC-5, Ben Goodrich wrote:
>

atzap

Feb 24, 2013, 3:16:23 AM
to stan-...@googlegroups.com

You're right, sorry. I didn't look closely enough at your .bug file. But a more likely thing is that in the .stan file you have

beta ~ normal(0.00000E+00, taub);

while in the .bug file you have

    for (i in 1:4) {
        beta[i] ~ dnorm(0.00000E+00, 1e-3)
    }

Although taub has a prior it seems disconnected from the rest of the DAG.

Ben


No worries, lots to read.  And now it is my turn to apologize.  I was testing a few things and accidentally saved that change to the .bug file.  The different answers should persist if the prior in the .bug file uses taub.  I've corrected the file.  You should be able to see now that this still leads to different parameter estimates...

Andrew 

atzap

Feb 24, 2013, 3:24:24 AM
to stan-...@googlegroups.com
Isn't this what I've done by doing the following in BUGS/JAGS:
a ~ dnorm(b, c)I(0,)
c <- pow(d, -2)
d ~ dunif(0, 100)

Which in STAN should be equal to:
a ~ normal(b,pow(pow(d, -2),-0.5))

In which case the powers cancel out and you're left with
a ~ normal(b, d)

Which is what I've done...?

Thanks
Andrew
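The cancellation in Andrew's algebra checks out; a quick illustrative check in Python (the variable `d` stands in for the uniform-prior standard deviation, not anything from the original script):

```python
import math

# pow(pow(d, -2), -0.5) collapses back to d, so feeding the JAGS precision
# c = d**-2 through Bob's c**-0.5 conversion recovers the standard deviation d.
for d in (0.5, 2.0, 100.0):
    assert math.isclose((d ** -2) ** -0.5, d)
```

In other words, when the JAGS precision is itself derived as pow(d, -2), the Stan model can simply use d as the scale directly.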

Bob Carpenter

Feb 24, 2013, 3:40:43 AM
to stan-...@googlegroups.com
I was just looking at the response from Ben, who said

--------------
taualpha ~ normal(0,100)T[0,];
taus ~ normal(0,100)T[0,];

are very different from the priors in your JAGS model

tau.alpha ~ dnorm(0.00000E+00, 100)I(0,)
taus ~ dnorm(0.00000E+00, 100)I(0,)
--------------------

The Stan sampling statement equivalent to the above JAGS is:

tau_alpha ~ normal(0,0.1) T[0,];

taus ~ normal(0,0.1) T[0,];

But as Ben said, you don't need the truncation in Stan because
the location and scale (i.e., mean and standard deviation) parameters are
constants, so you could just as well use:

tau_alpha ~ normal(0,0.1);

taus ~ normal(0,0.1);

- Bob

Ben Goodrich

Feb 24, 2013, 1:24:48 PM
to stan-...@googlegroups.com
On Sunday, February 24, 2013 3:16:23 AM UTC-5, atzap wrote:
You should be able to see now that this still leads to different parameter estimates...

Yes, now that I actually ran your code, although I shortened the Stan run, which might not have been the best idea in retrospect. Now, I am guessing that there may be multiple modes?

mu[i] <- beta[1] * xitrue[1,i] + beta[2] * xitrue[2,i] + rans[i] + beta[3]*xitrue[1,i]*xitrue[2,i] + alpha[PRED[i]];

It seems like having both rans[i] and alpha[PRED[i]] could induce multimodality that would leave the mus unaffected, but I couldn't investigate that because rans didn't get saved. But I could sort of see a problem by looking at the chains individually.

> round(t(colMeans(as.array(fit.stan))), 3)
          chains
parameters chain:1 chain:2 chain:3
  taus       0.178   0.229   0.398
  tau        0.335   0.333   0.333
  taub       0.349   0.346   0.301
  beta[1]   -0.124  -0.093  -0.192
  beta[2]    0.087   0.101  -0.245
  beta[3]    0.084   0.071   0.232
  beta[4]    0.054   0.055   0.071
  taualpha   2.183   2.046   1.814
  alpha[1]  -1.062  -1.058  -0.536
  alpha[2]  -0.978  -0.985  -1.111
  alpha[3]  -1.127  -1.122  -1.020
  mu[1]     -1.217  -1.217  -1.211
  mu[2]     -1.245  -1.245  -1.245
  mu[3]     -1.530  -1.527  -1.531
  mu[4]     -1.013  -1.012  -1.010
  mu[5]     -0.876  -0.874  -0.874
  mu[6]     -0.780  -0.781  -0.779
  mu[7]     -0.944  -0.947  -0.945
  mu[8]     -1.282  -1.282  -1.282
  lp__     492.389 490.340 479.517
> round(t(colMeans(jagsfit8$BUGSoutput$sims.array)), 3)
             [,1]    [,2]    [,3]
alpha[1]   -0.013  -0.031  -0.038
alpha[2]   -0.006  -0.020  -0.015
alpha[3]   -0.035  -0.034  -0.031
beta[1]     1.238   0.885   1.021
beta[2]     1.154   1.017   1.173
beta[3]     0.795   0.938   1.229
beta[4]     0.247   0.330  -0.045
deviance  514.308 514.954 514.549
mu[1]      -1.208  -1.218  -1.213
mu[2]      -1.244  -1.246  -1.243
mu[3]      -1.532  -1.526  -1.526
mu[4]      -1.010  -1.010  -1.012
mu[5]      -0.874  -0.876  -0.878
mu[6]      -0.775  -0.777  -0.775
mu[7]      -0.947  -0.943  -0.947
mu[8]      -1.285  -1.285  -1.288
tau         0.334   0.335   0.335
tau.alpha   0.084   0.091   0.084
taub        2.366   2.158   2.318
taus        0.092   0.076   0.078

Chain 3 in the Stan run definitely went to a different place than the first two chains. One plot I like to do if you have a television-sized monitor is

pairs(fit.stan, panel = function(...) smoothScatter(..., add = TRUE))

which is a two-dimensional kernel density estimate for each pair of parameters. As best I can tell from that plot on my laptop, Stan didn't sample the tau variables very well, but that may just be a function of me shortening the run.

Ben

atzap

Feb 24, 2013, 1:52:04 PM
to stan-...@googlegroups.com
On Sunday, February 24, 2013 1:24:48 PM UTC-5, Ben Goodrich wrote:
On Sunday, February 24, 2013 3:16:23 AM UTC-5, atzap wrote:
You should be able to see now that this still leads to different parameter estimates...

Yes, now that I actually ran your code, although I shortened the Stan run, which might not have been the best idea in retrospect. Now, I am guessing that there may be multiple modes?

mu[i] <- beta[1] * xitrue[1,i] + beta[2] * xitrue[2,i] + rans[i] + beta[3]*xitrue[1,i]*xitrue[2,i] + alpha[PRED[i]];

It seems like having both rans[i] and alpha[PRED[i]] could induce multimodality that would leave the mus unaffected, but I couldn't investigate that because rans didn't get saved. But I could sort of see a problem by looking at the chains individually.



Thanks for doing that Ben.  I hadn't thought of multimodality.  But if that were the case, shouldn't taus and taualpha be correlated (which they're not)?

Anyway, if you do run for longer, you can get STAN to converge.  It does so in a consistently different place than where JAGS converges.  Without comparing the two programs, I would have assumed that the model had fit fine.  Now, I don't know whether to trust where it ends up or whether there is in fact some multimodality that cannot be detected because, within a single program (i.e. STAN or JAGS), it always converges to the same spot.  I guess that in addition to asking what is going on here, I'm also asking how I can trust the results...  Again, the same mu's are predicted between the two programs.  

Thanks for bearing with this,
Andrew

Ben Goodrich

Feb 24, 2013, 2:27:59 PM
to stan-...@googlegroups.com
On Sunday, February 24, 2013 1:52:04 PM UTC-5, atzap wrote:
On Sunday, February 24, 2013 1:24:48 PM UTC-5, Ben Goodrich wrote:
On Sunday, February 24, 2013 3:16:23 AM UTC-5, atzap wrote:
You should be able to see now that this still leads to different parameter estimates...

Yes, now that I actually ran your code, although I shortened the Stan run, which might not have been the best idea in retrospect. Now, I am guessing that there may be multiple modes?

mu[i] <- beta[1] * xitrue[1,i] + beta[2] * xitrue[2,i] + rans[i] + beta[3]*xitrue[1,i]*xitrue[2,i] + alpha[PRED[i]];

It seems like having both rans[i] and alpha[PRED[i]] could induce multimodality that would leave the mus unaffected, but I couldn't investigate that because rans didn't get saved. But I could sort of see a problem by looking at the chains individually.



Thanks for doing that Ben.  I hadn't thought of multimodality.  But if this was the case shouldn't taus and taualpha be correlated, which they're not?

Maybe, but I would suspect correlation in the levels rather than the standard deviations. In other words, you could add a constant to all the rans and subtract a constant from all the alphas and get the same mus without affecting any of the variances. One thing you might do is restrict the alpha vector to have a sum of zero by, say, making the third element be the additive inverse of the sum of the first two.
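Ben's point about the levels trading off can be seen in a toy version of the linear predictor. The sketch below uses made-up Python stand-ins (`fixed`, `pred`, etc.), not the variables from the attached script:

```python
import random

# Toy version of the thread's linear predictor:
#   mu[i] = (fixed effects) + rans[i] + alpha[pred[i]]
random.seed(0)
n, n_groups = 8, 3
pred = [i % n_groups for i in range(n)]            # group index, like PRED[i]
fixed = [random.gauss(0, 1) for _ in range(n)]     # stands in for the beta * x terms
rans = [random.gauss(0, 1) for _ in range(n)]
alpha = [random.gauss(0, 1) for _ in range(n_groups)]

def mus(rans, alpha):
    return [fixed[i] + rans[i] + alpha[pred[i]] for i in range(n)]

# Shift every rans up by a constant and every alpha down by the same constant:
c = 0.7
shifted = mus([r + c for r in rans], [a - c for a in alpha])
assert all(abs(a - b) < 1e-12 for a, b in zip(mus(rans, alpha), shifted))
```

The mus are identical under the shift, which is exactly why the two samplers can agree on mu while reporting different rans and alpha levels.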
 

Anyway, if you do run for longer, you can get STAN to converge.  It does so in a consistently different place than where JAGS converges.  Without comparing the different software, I would have assumed that the model fitted.  Now, I don't know whether to trust where it ends up or whether there is in fact some multimodality that cannot be detected because within a single program (i.e. STAN or JAGS), it always converges to the same spot.  I guess that in addition to what is going on here, I'm also asking how can I trust the results?...  Again, the same mu's are predicted between the two programs.  

This is a limitation of the between-chain variance vs. within-chain variance approach to convergence diagnostics (and effective sample size calculations). If there is a lack of identification of some parameters or multimodality then some parameters may look good and others bad. Then again, genuine multimodality is a hard problem to deal with using MCMC. For your longer runs, can you check whether some of the purely within-chain convergence diagnostics in the coda package indicate lack of convergence for the chains individually?

Ben
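The within-chain check Ben suggests can be sketched in a few lines. This is a simplified, illustrative version of the Geweke diagnostic, not coda's geweke.diag (which estimates the variances from spectral densities):

```python
import math, random

def geweke_z(chain, first=0.1, last=0.5):
    """Geweke-style z-score: compare the mean of the first 10% of a chain to
    the mean of the last 50%. Plain sample variances are used here purely
    for illustration; coda uses spectral density estimates."""
    a = chain[: int(len(chain) * first)]
    b = chain[-int(len(chain) * last):]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

random.seed(1)
stationary = [random.gauss(0, 1) for _ in range(2000)]
drifting = [random.gauss(i / 500, 1) for i in range(2000)]  # mean is still moving

assert abs(geweke_z(stationary)) < 5   # well within sampling noise
assert abs(geweke_z(drifting)) > 5     # flagged: start and end disagree
```

A large |z| says the chain's beginning and end are sampling different distributions, i.e. the chain has not reached stationarity even if between-chain comparisons look fine.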

atzap

Feb 27, 2013, 5:38:39 PM
to
Sorry for the delay in replying.  I’ve been testing things further.



Thanks for doing that Ben.  I hadn't thought of multimodality.  But if this was the case shouldn't taus and taualpha be correlated, which they're not?

Maybe, but I would suspect correlation in the levels rather than the standard deviations. In other words, you could add a constant to all the rans and subtract a constant from all the alphas and get the same mus without affecting any of the variances. One thing you might do is restrict the alpha vector to have a sum of zero by say, making the third element be the additive inverse of the sum of the first two.

A fair point!  I've plotted out the rans's and alpha's and they don't look particularly correlated.  However, running for longer and using the within-chain diagnostics (Geweke; Heidelberg and Welch) shows that the chains aren't really converging, despite what neff, Rhat, and visual inspection of the chain traces say...  I've also tried making the level of alpha with only 1 observation the additive inverse of the other two.  The result is that JAGS continues to go to a different place in parameter space than STAN, i.e. in STAN the beta's exclude zero while the alpha's overlap it.   Moreover, the model is still not converging even after very long runs (millions of iterations).

I then accepted that I am probably over-fitting this model, and had a long, deep think about what I was doing.  This was partly motivated by the fact that I kept observing STAN diverging from JAGS in all sorts of other formulations of my model; though the ran’s were not really converging.  The good news is that I now think the ran’s are part of the problem.  Essentially, I am modelling i observations of some variable y_ij that was recorded in j groups.  I am assuming that y_ij is normal with an estimated mean mu_j and standard deviation theta.  I have then been modelling mu_j given two covariates x_j and z_j and accounting for random variation among the j groups using eta_j that is ~N(0, theta2).  This then leads to mu_j being described as:
mu_j = beta0 + beta1*x_j + beta2*z_j + beta3*x_j*z_j + eta_j.
This fits fine.  

Where it turns to custard is that I've been assuming that x_j and z_j are in fact unobserved and come from a multivariate normal with mean equal to the observed x'_j and z'_j.  In other words, x_j = x'_j + e_j and z_j = z'_j + e'_j, where e_j and e'_j are estimated errors from the multivariate.  Essentially, what I think I end up doing here is throwing in three levels of j-level variance.  This introduces correlations that then prevent the estimates of the ran's from converging, and STAN (for some reason) consistently wanders off into a different part of parameter space than JAGS.  I think this explains why an errors-in-variables model with the ran's but without the alpha's still differs between STAN and JAGS and has difficulty converging.  The estimates of e_j and e'_j can converge when I remove eta_j because they are jointly derived from the VCV of the multivariate.

Hopefully this makes sense up to here?  So now that I managed to fit the above equation for mu_j with the errors-in-variables approach, I decided to try and replace beta0 with a three-class factor observed at the level of the j's (the alpha from previously).  But again, JAGS and STAN have difficulty converging and each are wandering off into different parts of parameter space.  Is this because there is only one observation at the level of alpha[3]?  I realize this may be more of a statistical than STAN issue, but bear with me so I can tell you why I think it might be both.    

If I remove the whole errors-in-variables approach and throw the ran's back in along with the alpha's (equivalent to the function written above but replacing beta0 with alpha), STAN has different parameter estimates than JAGS.  I've saved the runs in the following workspace (https://dl.dropbox.com/u/7489119/Workspace3.RData), and the attached script describes the model and diagnostics.  The models look pretty close to converging, though there are issues with some of the Heidelberg and Welch tests.  The most concerning bit here is that the maximum likelihood solution fitted with the lmer function in R closely resembles STAN not JAGS.  I do appreciate that the parameter estimates for the beta’s are massively correlated (the errors-in-variables approach was used to account for covariation between the xi's), but I still don't understand why JAGS goes somewhere else in parameter space than STAN?  Presumably it must be because, even after millions of iterations, JAGS hasn't converged on the ML solution, while HMC manages to do so?  The "problem" in JAGS may also be due to covariation, because the alpha’s and beta’s are positively correlated, so as the beta's wander away from the ML estimates, the alpha's follow.

To summarize, it looks like STAN is doing a better job of converging than JAGS, and this is likely the source of the original differences.  Within-chain convergence diagnostics are important for helping to identify this problem.  The one lingering question is why my errors-in-variables approach has difficulty dealing with only one observation at the level of alpha[3] when the ran's are removed (see model in STAN_norans.txt)?  The one observation isn't a problem when I don't estimate x_j and z_j, and use their true observed values.

Thanks heaps for reading all this.  I know that it is a lot to ask.
Andrew
Rscript3.txt
STAN_norans.txt

Andrew Gelman

Feb 27, 2013, 5:39:35 PM
to stan-...@googlegroups.com
Hi, Andrew T.  I'm glad that Stan is helpful.  Just quickly:  the R-hat we use now is a split R-hat, which includes within as well as between-chain information.  So I don't think it should happen that the chains aren't really converging despite what neff, Rhat, and visual inspection of the chain traces say.  If this _is_ happening, I'd like to understand what's going on.  So if you could send the example of the results that weren't converging despite Rhat etc., please do so and I can take a look.  Thanks.
AG
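The split R-hat idea Gelman describes can be sketched as follows. This is a simplified illustration of the statistic, not Stan's exact implementation:

```python
import math, random

def split_rhat(chains):
    """Split R-hat: cut each chain in half and treat the halves as separate
    chains, so a slow trend *within* one chain inflates the between-'chain'
    variance B even when the chains agree with each other."""
    halves = []
    for c in chains:
        n = len(c) // 2
        halves += [c[:n], c[n:2 * n]]
    n, m = len(halves[0]), len(halves)
    means = [sum(h) / n for h in halves]
    grand = sum(means) / m
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    W = sum(sum((x - mu) ** 2 for x in h) / (n - 1)
            for h, mu in zip(halves, means)) / m
    return math.sqrt(((n - 1) / n * W + B / n) / W)

random.seed(0)
mixed = [[random.gauss(0, 1) for _ in range(1000)] for _ in range(3)]
# Three chains that all drift upward together: a purely between-chain R-hat
# sees nothing wrong, but split R-hat catches the within-chain trend.
trending = [[i / 1000 + random.gauss(0, 0.05) for i in range(1000)] for _ in range(3)]

assert split_rhat(mixed) < 1.05
assert split_rhat(trending) > 1.5
```

This is why the old (non-split) R-hat reported by coda can say "converged" for chains that a within-chain diagnostic would flag.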

atzap

Feb 27, 2013, 6:18:18 PM
to stan-...@googlegroups.com, gel...@stat.columbia.edu
Thanks for the quick reply Andrew G. and sorry again for the dense rambling post.

Please download the workspace linked above (https://dl.dropbox.com/u/7489119/Workspace3.RData) and then open up the Rscript3.txt file.  You will see that the output in the object fit.stan looks like it has converged but the heidel.diag and geweke.diag tests suggest otherwise for some of the parameters (e.g. ran's and taualpha).  More importantly, the following quote:
despite what neff, Rhat, and visual inspection of the chain traces
may be more relevant for the JAGS result.  Again, please see the jagsfit8 object in the linked workspace.  The within-chain diagnostics from coda suggest that some parameters in some of the chains haven't converged, and this is especially true when we compare the result to STAN and the lmer function.

Again, thanks for STAN!  Its performance is continuing to impress when compared to the other tools that are out there.
Andrew

Andrew Gelman

Feb 28, 2013, 10:30:35 PM
to stan-...@googlegroups.com
Hi, Andrew T.  With the Jags fits, are you doing diagnostics using coda?  The R-hat in Coda is the old R-hat that does not fully use within-chain information.  We now use split R-hat which catches these problems.
Or, to put it another way, I looked at the Stan result and saw this:

Inference for Stan model: mSEM.
3 chains: each with iter=4e+06; warmup=2e+06; thin=6250; 640 iterations saved.

          mean se_mean  sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
taus       0.1     0.0 0.2   0.0   0.0   0.1   0.2   0.6   881    1
tau        0.3     0.0 0.0   0.3   0.3   0.3   0.3   0.4   960    1
taub       0.8     0.0 1.4   0.1   0.3   0.5   0.8   3.1   960    1
beta[1]    0.1     0.0 0.2  -0.2   0.0   0.1   0.1   0.5   960    1
beta[2]    0.4     0.0 0.2  -0.1   0.2   0.4   0.5   1.0   960    1
beta[3]    0.4     0.0 0.3  -0.1   0.3   0.5   0.6   1.2   960    1
beta[4]    0.1     0.1 1.9  -1.9  -0.3   0.0   0.3   2.0   921    1
taualpha   2.0     0.1 3.0   0.5   0.9   1.3   2.1   7.3   960    1
alpha[1]  -0.9     0.0 0.2  -1.2  -0.9  -0.9  -0.8  -0.4   960    1
alpha[2]  -0.8     0.0 0.2  -1.2  -0.9  -0.8  -0.8  -0.3   960    1
alpha[3]  -0.9     0.0 0.3  -1.5  -1.0  -0.9  -0.8  -0.2   960    1
mu[1]     -1.2     0.0 0.0  -1.3  -1.2  -1.2  -1.2  -1.2   960    1
mu[2]     -1.2     0.0 0.0  -1.3  -1.3  -1.2  -1.2  -1.2   960    1
mu[3]     -1.5     0.0 0.0  -1.6  -1.6  -1.5  -1.5  -1.5   960    1
mu[4]     -1.0     0.0 0.0  -1.1  -1.0  -1.0  -1.0  -0.9   921    1
mu[5]     -0.9     0.0 0.0  -1.0  -0.9  -0.9  -0.9  -0.8   828    1
mu[6]     -0.8     0.0 0.0  -0.8  -0.8  -0.8  -0.8  -0.7   946    1
mu[7]     -0.9     0.0 0.0  -1.0  -1.0  -0.9  -0.9  -0.9   925    1
mu[8]     -1.3     0.0 0.0  -1.3  -1.3  -1.3  -1.3  -1.2   960    1
rans[1]    0.0     0.0 0.1  -0.3   0.0   0.0   0.0   0.2   960    1
rans[2]    0.0     0.0 0.2  -0.5  -0.1   0.0   0.0   0.3   743    1
rans[3]    0.0     0.0 0.2  -0.4  -0.1   0.0   0.0   0.3   893    1
rans[4]    0.0     0.0 0.2  -0.4  -0.1   0.0   0.0   0.3   924    1
rans[5]    0.0     0.0 0.1  -0.2   0.0   0.0   0.1   0.3   836    1
rans[6]    0.0     0.0 0.2  -0.5   0.0   0.0   0.1   0.4   960    1
rans[7]    0.0     0.0 0.1  -0.3  -0.1   0.0   0.0   0.2   869    1
rans[8]   -0.1     0.0 0.2  -0.6  -0.1   0.0   0.0   0.3   818    1
lp__     489.5     0.2 7.0 475.5 485.2 489.4 493.8 503.8   904    1

My first thought was that you must be running it too long!  4 million iterations looks to be overkill!  Maybe 1000 iterations or 10,000 would suffice?  But are you saying that some of these parameters (e.g., taualpha and rans[1]) did not actually converge even in the simulations shown above?

AG

atzap

Feb 28, 2013, 11:03:37 PM
to

On Thursday, February 28, 2013 10:30:35 PM UTC-5, Andrew Gelman wrote:
My first thought was that you must be running it too long!  4 million iterations looks to be overkill!  Maybe 1000 iterations or 10,000 would suffice?  But are you saying that some of these parameters (e.g., taualpha and rans[1]) did not actually converge even in the simulations shown above?


Thanks for looking at it Andrew.  I agree that it probably is overkill, though there were problems on the shorter runs.  

I am saying that: 
1) the output you printed looks like the model has converged - correct?
2) but the other diagnostics that Ben mentioned in this thread suggest otherwise (see below for just the first chain)

> fit.stan2 <- as.array(fit.stan)
> fit.stan2 <- as.mcmc.list(list(as.mcmc(fit.stan2[,1,]), as.mcmc(fit.stan2[,2,]), as.mcmc(fit.stan2[,3,])))
> heidel.diag(fit.stan2)
[[1]]
                                       
         Stationarity start     p-value
         test         iteration        
taus     passed       1         0.7712 
tau      passed       1         0.6906 
taub     passed       1         0.5310 
beta[1]  passed       1         0.2641 
beta[2]  passed       1         0.4617 
beta[3]  passed       1         0.5898 
beta[4]  passed       1         0.6979 
taualpha passed       1         0.3349 
alpha[1] passed       1         0.9918 
alpha[2] passed       1         0.3374 
alpha[3] passed       1         0.2789 
mu[1]    passed       1         0.2417 
mu[2]    passed       1         0.3070 
mu[3]    passed       1         0.2137 
mu[4]    passed       1         0.5156 
mu[5]    passed       1         0.8390 
mu[6]    passed       1         0.0533 
mu[7]    passed       1         0.9699 
mu[8]    passed       1         0.6637 
rans[1]  passed       1         0.9138 
rans[2]  passed       1         0.5671 
rans[3]  passed       1         0.9390 
rans[4]  passed       1         0.4225 
rans[5]  passed       1         0.5694 
rans[6]  passed       1         0.2915 
rans[7]  passed       1         0.4738 
rans[8]  passed       1         0.4867 
lp__     passed       1         0.6548 
                                      
         Halfwidth Mean      Halfwidth
         test                         
taus     failed      0.14290 0.016795 
tau      passed      0.33356 0.000961 
taub     failed      0.80242 0.179632 
beta[1]  failed      0.07594 0.019204 
beta[2]  passed      0.36254 0.026755 
beta[3]  passed      0.44263 0.033425 
beta[4]  failed     -0.02191 0.158124 
taualpha failed      2.19381 0.394462 
alpha[1] passed     -0.87442 0.014472 
alpha[2] passed     -0.83753 0.021542 
alpha[3] passed     -0.92614 0.032429 
mu[1]    passed     -1.21609 0.002869 
mu[2]    passed     -1.24532 0.003563 
mu[3]    passed     -1.52935 0.003460 
mu[4]    passed     -1.01090 0.003671 
mu[5]    passed     -0.88602 0.003575 
mu[6]    passed     -0.77985 0.003560 
mu[7]    passed     -0.93675 0.003846 
mu[8]    passed     -1.28404 0.002958 
rans[1]  failed     -0.00263 0.014489 
rans[2]  failed     -0.03230 0.021647 
rans[3]  failed     -0.02768 0.020586 
rans[4]  failed     -0.00127 0.021847 
rans[5]  failed      0.04382 0.013419 
rans[6]  failed      0.01564 0.019822 
rans[7]  failed     -0.02153 0.012006 
rans[8]  failed     -0.05108 0.022046 
lp__     passed    489.43918 0.788791 

3)  Irrespective of STAN, the JAGS fit suggests convergence (though with the old Rhat).  As you can see in the print out below, it converges at different values than STAN and the ML estimates.

Inference for Bugs model at "http://dl.dropbox.com/u/7489119/file6af85c783c103.bug", fit using jags,
 3 chains, each with 4e+06 iterations (first 2e+06 discarded), n.thin = 12500
 n.sims = 480 iterations saved
          mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
alpha[1]   -0.251   0.275  -0.847  -0.414  -0.166  -0.021   0.076 1.006   390
alpha[2]   -0.210   0.268  -0.832  -0.369  -0.103  -0.003   0.093 1.011   240
alpha[3]   -0.164   0.295  -0.914  -0.230  -0.046   0.009   0.172 1.018   180
beta[1]     0.693   0.285   0.098   0.512   0.737   0.904   1.148 1.013   330
beta[2]     1.183   0.362   0.407   0.971   1.259   1.438   1.733 1.009   270
beta[3]     1.501   0.452   0.511   1.266   1.604   1.837   2.124 1.017   210
mu[1]      -1.209   0.032  -1.269  -1.232  -1.211  -1.188  -1.149 1.000   480
mu[2]      -1.244   0.035  -1.314  -1.266  -1.246  -1.221  -1.175 1.001   480
mu[3]      -1.531   0.032  -1.595  -1.554  -1.530  -1.508  -1.470 1.001   480
mu[4]      -1.019   0.035  -1.089  -1.042  -1.018  -0.997  -0.947 1.004   340
mu[5]      -0.872   0.031  -0.931  -0.894  -0.872  -0.850  -0.810 0.999   480
mu[6]      -0.765   0.034  -0.831  -0.788  -0.766  -0.743  -0.699 1.008   230
mu[7]      -0.938   0.033  -1.000  -0.961  -0.937  -0.916  -0.876 1.011   260
mu[8]      -1.288   0.033  -1.351  -1.310  -1.288  -1.264  -1.225 1.001   480
rans[1]    -0.300   0.185  -0.641  -0.435  -0.314  -0.154   0.017 1.000   480
rans[2]    -0.075   0.199  -0.492  -0.208  -0.055   0.048   0.298 1.011   480
rans[3]    -0.009   0.197  -0.404  -0.118  -0.007   0.105   0.375 1.010   220
rans[4]     0.173   0.243  -0.271   0.011   0.146   0.328   0.693 1.012   480
rans[5]    -0.094   0.152  -0.433  -0.190  -0.084   0.012   0.176 1.006   480
rans[6]    -0.437   0.213  -0.751  -0.593  -0.498  -0.303   0.018 1.009   220
rans[7]    -0.217   0.149  -0.514  -0.324  -0.218  -0.107   0.030 1.006   270
rans[8]     0.166   0.196  -0.179   0.014   0.164   0.293   0.539 1.003   480
tau         0.334   0.008   0.320   0.329   0.334   0.339   0.350 0.999   480
tau.alpha   0.160   0.115   0.006   0.066   0.139   0.232   0.401 1.003   480
taub        2.277   2.350   0.465   1.120   1.605   2.551   8.711 1.006   230
taus        0.224   0.081   0.025   0.191   0.238   0.276   0.358 1.056   120
deviance  514.223   3.973 508.306 511.243 513.528 516.609 523.373 1.002   480

Michael Betancourt

Feb 28, 2013, 11:11:46 PM
to stan-...@googlegroups.com
Andrew,

Can you try running again with a smaller step size?  In particular, run with a constant step size set to 1/10th the value to which adaptation converges and see if that changes the convergence of the chain.

-Mike

On Feb 28, 2013, at 11:02 PM, atzap <ata...@gmail.com> wrote:


On Thursday, February 28, 2013 10:30:35 PM UTC-5, Andrew Gelman wrote:
My first thought was that you must be running it too long!  4 million iterations looks to be overkill!  Maybe 1000 iterations or 10,000 would suffice?  But are you saying that some of these parameters (e.g., taualpha and rans[1]) did not actually converge even in the simulations shown above?


3)  Irrespective of STAN, the JAGS fit suggests convergence (though with the old Rhat).  As you can see in the printout below, it converges at different values than STAN and the ML estimates.
--

atzap

unread,
Feb 28, 2013, 11:47:25 PM2/28/13
to stan-...@googlegroups.com


On Thursday, February 28, 2013 11:11:46 PM UTC-5, Michael Betancourt wrote:
Can you try running again with a smaller step size?  In particular, run with a constant step size set to 1/10th the value to which adaptation converges and see if that changes the convergence of the chain.

Mike
Are you suggesting something like this:
library(rjags)
bdat5$Omega <- NULL
bdat5$xi <- t(bdat5$xi)
params <- c("taus", "tau", "taub", "beta", "tau.alpha", "alpha", "mu", "rans")
its <- function() list(taub = runif(1), tau.alpha = runif(1), taus = runif(1), tau = runif(1))
m1 <- jags.model("JAGSmodel.txt",  # the model file attached later in this thread
                 data = bdat5, inits = its, n.chains = 3, n.adapt = 1, quiet = FALSE)
adapt(m1, n.iter = 100, end.adaptation = FALSE)
update(m1, n.iter = 100 * .1)
m1 <- coda.samples(m1, params, 200)
summary(m1)

The adapt function is returning TRUE after 100 iterations, but clearly it needs to update for longer before sampling...

Thanks
Andrew

Michael Betancourt

unread,
Feb 28, 2013, 11:54:22 PM2/28/13
to stan-...@googlegroups.com
Sorry, I meant with Stan.  If your model has wild spatial variations in curvature, then the step-size adaptation during warm-up can converge to a value so large that some regions of the typical probability mass cannot be reached by the Hamiltonian trajectories.

atzap

unread,
Mar 1, 2013, 12:44:18 AM3/1/13
to stan-...@googlegroups.com
Hopefully I've understood your instructions correctly this time!  

I've called attr(fit.stan@sim$samples[[1]], "sampler_params")$stepsize in my original workspace and it seems like the different chains are converging at different step-sizes (0.3 - 0.5).  I've selected the mean and am now adding equal_step_sizes = T, epsilon = .004 in the call to stan.  Is that correct?  From the manual, it looks like epsilon sets the step size?...

Michael Betancourt

unread,
Mar 1, 2013, 12:57:17 AM3/1/13
to stan-...@googlegroups.com
Yup, epsilon is correct.  0.04 should be sufficient; 0.004 is even more sensitive to possible issues but will run 10 times as long.  Go ahead and run and compare to your previous Stan results.

On Mar 1, 2013, at 12:44 AM, atzap <ata...@gmail.com> wrote:

Hopefully I've understood your instructions correctly this time!  

I've called attr(fit.stan@sim$samples[[1]], "sampler_params")$stepsize in my original workspace and it seems like the different chains are converging at different step-sizes (0.3 - 0.5).  I've selected the mean and am now adding equal_step_sizes = T, epsilon = .004 in the call to stan.  Is that correct?  From the manual, it looks like epsilon sets the step size?...

--

atzap

unread,
Mar 1, 2013, 12:59:56 AM3/1/13
to stan-...@googlegroups.com
I ran 10,000 iterations and n_eff was 2 on some of the parameters!  Will fiddle with the epsilon a bit on longer runs and see what happens.

Michael Betancourt

unread,
Mar 1, 2013, 1:20:30 AM3/1/13
to stan-...@googlegroups.com
That may be a sign that the lower step size is allowing transitions between modes.  Keep us informed.

atzap

unread,
Mar 1, 2013, 6:50:46 PM3/1/13
to stan-...@googlegroups.com
Hi Mike

It was late when I was writing my last message.  The step sizes in the original workspace were 0.03 - 0.05.  Going to an epsilon of 0.04, the four chains are entirely static and do not mix.

Running four chains with 120,000 iterations, a warmup of 100,000 and using an epsilon of 0.004, the n_eff values are still <10 on some of the parameters.  Inspecting the chain traces, there is one chain that becomes stuck at some initial values and doesn't move.  If we discard that, then the chains look like they are rapidly converging early on (much closer to Andrew's 10,000 mark!), and appear to mix well.  

I then went further down to an epsilon of 0.001 and ran for 60,000 iterations with a warmup of 40,000.  This led to all four chains mixing, with large n_eff values and Rhat near 1.  That said, the same problems arise as before when running the coda diagnostics that Ben mentioned, especially for some of the rans and betas.  I've copied the geweke.diag p-values for each chain and the heidel.diag output for one chain below.

> round(2*pnorm(-abs(sapply(geweke.diag(fit.stan2), function(x){x$z}))), 3)
          [,1]  [,2]  [,3]
taus     0.839 0.513 0.887
tau      0.754 0.814 0.482
taub     0.733 0.055 0.001
beta[1]  0.011 0.833 0.077
beta[2]  0.806 0.970 0.133
beta[3]  0.952 0.340 0.095
beta[4]  0.531 0.962 0.060
taualpha 0.423 0.721 0.967
alpha[1] 0.834 0.442 0.020
alpha[2] 0.632 0.535 0.192
alpha[3] 0.273 0.881 0.217
mu[1]    0.228 0.234 0.089
mu[2]    0.127 0.361 0.534
mu[3]    0.011 0.736 0.903
mu[4]    0.481 0.020 0.767
mu[5]    0.478 0.092 0.751
mu[6]    0.232 0.488 0.143
mu[7]    0.975 0.880 0.871
mu[8]    0.749 0.866 0.926
rans[1]  0.589 0.600 0.004
rans[2]  0.279 0.826 0.937
rans[3]  0.526 0.650 0.623
rans[4]  0.022 0.216 0.751
rans[5]  0.137 0.285 0.086
rans[6]  0.567 0.124 0.391
rans[7]  0.093 0.308 0.030
rans[8]  0.516 0.279 0.588
lp__     0.568 0.339 0.760



> heidel.diag(fit.stan2)
[[1]]
                                       
         Stationarity start     p-value
         test         iteration        
taus     passed        1        0.2079 
tau      passed        1        0.6252 
taub     passed        1        0.8027 
beta[1]  passed        1        0.4335 
beta[2]  passed        1        0.6125 
beta[3]  passed        1        0.9727 
beta[4]  passed        1        0.6211 
taualpha passed        1        0.8490 
alpha[1] passed        1        0.8295 
alpha[2] passed       41        0.2104 
alpha[3] passed        1        0.8825 
mu[1]    passed        1        0.5227 
mu[2]    passed        1        0.3857 
mu[3]    passed       21        0.3802 
mu[4]    passed        1        0.6796 
mu[5]    passed        1        0.9085 
mu[6]    passed        1        0.0950 
mu[7]    passed        1        0.8293 
mu[8]    passed        1        0.1200 
rans[1]  passed        1        0.7570 
rans[2]  passed        1        0.6307 
rans[3]  passed        1        0.7712 
rans[4]  passed        1        0.2270 
rans[5]  passed        1        0.1821 
rans[6]  passed       41        0.0802 
rans[7]  passed        1        0.1297 
rans[8]  passed        1        0.7192 
lp__     passed        1        0.1413 
                                      
         Halfwidth Mean      Halfwidth
         test                         
taus     failed      0.14275 0.02054  
tau      passed      0.33489 0.00117  
taub     failed      0.61523 0.07746  
beta[1]  failed      0.05541 0.02047  
beta[2]  failed      0.32907 0.03735  
beta[3]  passed      0.40136 0.03077  
beta[4]  failed      0.04811 0.10627  
taualpha failed      1.87726 0.20707  
alpha[1] passed     -0.89789 0.01982  
alpha[2] passed     -0.86165 0.02106  
alpha[3] passed     -0.96558 0.04503  
mu[1]    passed     -1.21366 0.00573  
mu[2]    passed     -1.24893 0.00444  
mu[3]    passed     -1.53547 0.00300  
mu[4]    passed     -1.01635 0.00462  
mu[5]    passed     -0.88770 0.00322  
mu[6]    passed     -0.77795 0.00493  
mu[7]    passed     -0.93394 0.00443  
mu[8]    passed     -1.28491 0.00540  
rans[1]  failed      0.00628 0.02295  
rans[2]  failed     -0.02526 0.02620  
rans[3]  failed     -0.04121 0.03366  
rans[4]  failed     -0.00858 0.02356  
rans[5]  failed      0.05011 0.01435  
rans[6]  failed      0.03825 0.02267  
rans[7]  failed     -0.00893 0.01398  
rans[8]  failed     -0.05312 0.02491  
lp__     passed    490.48500 1.09257 

Michael Betancourt

unread,
Mar 2, 2013, 12:14:37 AM3/2/13
to stan-...@googlegroups.com
Andrew,

What does the tree depth per sample look like when you lower the step size? By default NUTS has a maximum tree depth and for sufficiently small step sizes that might limit the motion of the chain (especially for certain variables).
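One way to inspect this, mirroring the `sampler_params` attribute access used earlier in the thread for the step size (note the `treedepth` element name here is an assumption and may differ by rstan version):

```r
# Pull per-iteration tree depths for chain 1 and check how often
# the sampler is hitting the maximum depth:
td <- attr(fit.stan@sim$samples[[1]], "sampler_params")$treedepth
table(td)
hist(td, main = "Tree depth per iteration", xlab = "tree depth")
```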
--

atzap

unread,
Mar 2, 2013, 1:09:12 AM3/2/13
to stan-...@googlegroups.com
A fair number of iterations are hitting up against that maximum.  Please see the attached PDFs.  

Should I try raising it slightly?  How high would you suggest?

Thanks again, this is becoming so very informative!
Andrew
treedep_hists.pdf
treedep_its.pdf

atzap

unread,
Mar 2, 2013, 4:11:04 AM3/2/13
to stan-...@googlegroups.com

This is a limitation of the between-chain variance vs. within-chain variance approach to convergence diagnostics (and effective sample size calculations). If there is a lack of identification of some parameters or multimodality then some parameters may look good and others bad. Then again, genuine multimodality is a hard problem to deal with using MCMC. For your longer runs, can you check whether some of the purely within-chain convergence diagnostics in the coda package indicate lack of convergence for the chains individually?

So I'm getting a funny feeling about these within-chain convergence diagnostics in coda.  If I run the two that I've been using on the schools example, they suggest that the individual chains haven't converged!

# run schools example
fit.stan <- stan(model_code = schools_code, data = schools_dat,
                 iter = 10000, warmup = 8000, thin = 10, chains = 4)

# process into MCMC list
fit.stan2 <- as.array(fit.stan)
fit.stan2 <- as.mcmc.list(list(as.mcmc(fit.stan2[, 1, ]), as.mcmc(fit.stan2[, 2, ]),
                               as.mcmc(fit.stan2[, 3, ]), as.mcmc(fit.stan2[, 4, ])))

# calculate p-values (if you believe in such things) associated with Z-scores from Geweke on each chain
round(2 * pnorm(-abs(sapply(geweke.diag(fit.stan2), function(x) {x$z}))), 3)

# calculate Heidel on each chain
heidel.diag(fit.stan2)

Clearly, this is nonsense, isn't it?  So does that mean we shouldn't be overly concerned about these specific diagnostics, especially the half-width mean test?  In which case, we return to my original issue, which is that assuming the models have converged, JAGS throws up different parameter estimates than STAN...  Sorry if I'm flogging a dead horse.

Daniel Lee

unread,
Mar 2, 2013, 6:50:19 AM3/2/13
to stan-...@googlegroups.com, stan-...@googlegroups.com
Just so we are all on the same page, can you send the
1) JAGS model
2) JAGS data
3) Stan model
4) Stan data

My gut is that there's still something inconsistently specified between the model in both languages. We have coded most of the BUGS example models (in src/stan/models/bugs_examples) and the parameter estimates match. 
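For what it's worth, one classic source of such a mismatch, worth ruling out first, is the normal parameterization: JAGS's `dnorm(mu, tau)` takes a precision, while Stan's `normal(mu, sigma)` takes a standard deviation. A quick sanity check of the translation:

```r
# JAGS dnorm(mu, tau) is N(mu, sd = 1/sqrt(tau)), so e.g. a JAGS
# prior dnorm(0, 0.01) is N(0, sd = 10), not N(0, sd = 0.01):
tau   <- 0.01
sigma <- 1 / sqrt(tau)
stopifnot(isTRUE(all.equal(sigma, 10)))
```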



Daniel
--

atzap

unread,
Mar 2, 2013, 3:23:55 PM3/2/13
to stan-...@googlegroups.com
I attached a self-contained script above that runs and compares both models.  But here are just the data and model files.  The data files are just dumps of lists from R, but renamed to *.txt so that they upload here.

I'm somewhat hesitant to follow up about my query regarding the coda chain diagnostics on the schools example (please see my most recent post above).  Perhaps it is better placed in a separate thread?  Sorry I'm asking so many convoluted questions!

Andrew
JAGSmodel.txt
STANmodel.txt
JAGSdata.txt
STANdata.txt

Michael Betancourt

unread,
Mar 2, 2013, 4:31:12 PM3/2/13
to stan-...@googlegroups.com
The idea is to compare the neighborhoods to which the chains converge as the stepsize varies but stepsize*treedepth is held constant. NUTS should keep stepsize*treedepth roughly constant provided you don't hit the maximum tree depth, so I would make sure to raise the maximum tree depth for these studies.

If there are weird geometric issues, then the chains with smaller stepsizes will behave differently than the chains with larger stepsizes. If they are consistent, then there's no obvious issue on the HMC/NUTS side.

