Big differences in parameter estimates between JAGS and STAN for same model


atzap

Feb 23, 2013, 11:04:01 PM
to stan-...@googlegroups.com
Hi all

I'm rather perplexed by the following problem.  I am fitting a basic linear model in rstan to a normally distributed variable y, and trying to predict the mean of this normal (mu) given different levels of a factor (alpha) and a series of covariates (x.true) with their associated effects (beta's).  The model and data are available in the attached script.  I then re-run this exact same model in JAGS (see the link to the model file in the attached R file) and get perfectly correlated predictions for mu between STAN and JAGS.

The problem is that the beta and alpha estimates are nowhere near each other!  I realize that the model hasn't fully converged in JAGS (again showing the tremendous efficiency of HMC!), but it does eventually converge when everything is run about 10X longer, and the estimates are still nowhere near each other.  Looking at the posterior summaries, it looks like the taualpha estimate increases at the expense of taub in STAN as compared with JAGS, where the reverse seems to happen.  Entirely removing the hyper-parameter taub doesn't change the result in either JAGS or STAN, so they're still different.  What can possibly be the cause of this?  What am I missing?  Hopefully, there is a relatively straightforward explanation that will save me a few days of frustration.

The attached R script should demonstrate the "problem", though JAGS must be installed on your system.  I am using rstan 1.1 and JAGS 3.3.

Many, many thanks,
Andrew
Rscript.txt

Ben Goodrich

Feb 24, 2013, 12:25:30 AM
to stan-...@googlegroups.com
On Saturday, February 23, 2013 11:04:01 PM UTC-5, atzap wrote:
The problem is that the beta and alpha estimates are nowhere near each other!  I realize that the model hasn't fully converged in JAGS (again showing the tremendous efficiency of HMC!), but it does eventually converge when everything is run about 10X longer, and the estimates are still nowhere near each other.  Looking at the posterior summaries, it looks like the taualpha estimate increases at the expense of taub in STAN as compared with JAGS, where the reverse seems to happen.  Entirely removing the hyper-parameter taub doesn't change the result in either JAGS or STAN, so they're still different.  What can possibly be the cause of this?  What am I missing?  Hopefully, there is a relatively straightforward explanation that will save me a few days of frustration.

I think part of the issue is the different parameterizations of the normal distribution between Stan and JAGS. In Stan, the second parameter is a standard deviation. In the BUGS-family, the second parameter is a precision. So, the priors in your Stan model

taualpha ~ normal(0,100)T[0,];
taus ~ normal(0,100)T[0,];

are very different from the priors in your JAGS model

tau.alpha ~ dnorm(0.00000E+00, 100)I(0,)
taus ~ dnorm(0.00000E+00, 100)I(0,)
Also, it is not necessary to explicitly truncate the priors on taualpha and taus in Stan in this particular case because they are declared as non-negative and the amount of truncated mass is a constant.

Ben
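The mismatch Ben describes is easy to verify numerically. A minimal sketch in plain Python (the `normal_pdf` helper is just the Gaussian density; nothing here comes from the attached scripts):

```python
import math

def normal_pdf(x, mu, sd):
    """Gaussian density in Stan's parameterization (mean, standard deviation)."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# JAGS's dnorm(0, 100) takes a *precision* tau = 100, i.e. sd = tau**-0.5 = 0.1.
sd_from_precision = 100 ** -0.5
assert abs(sd_from_precision - 0.1) < 1e-12

# Stan's normal(0, 100) takes a *standard deviation* of 100 -- a vastly wider prior.
ratio = normal_pdf(0.05, 0, sd_from_precision) / normal_pdf(0.05, 0, 100)
assert ratio > 100  # the two "identical-looking" priors differ by orders of magnitude here
```

So a prior copied verbatim from a .bug file into a .stan file is a different prior unless the second argument is converted.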

atzap

Feb 24, 2013, 2:11:08 AM
to
Thanks for the quick reply Ben.  Yes, I suspected that the precision vs SD might be the issue, but I'm not sure I immediately follow how I haven't accounted for this...  For example, in JAGS, I then transform 
taus ~ dnorm(0.00000E+00, 100)I(0,)
pre.alpha <- pow(tau.alpha, -2)

And pre.alpha goes into the normal prior for alpha.  Isn't this the same thing then as what I've done in STAN?  Or do I need to throw a sqrt on the 100 in STAN?...  I think this is me being daft!

Again, thanks,
Andrew

EDIT:   So pg 249 of the STAN manual suggests that normal(1,2) in Stan = dnorm(1,0.25) in JAGS.  In which case, normal(1, 0.25) is the same as the following: 
normal(1,pre)
pre <- pow(tau, -2)
tau <- 2

Isn't this what I've done in my script?...

Ben Goodrich

Feb 24, 2013, 2:18:32 AM
to stan-...@googlegroups.com
On Sunday, February 24, 2013 1:54:50 AM UTC-5, atzap wrote:
Thanks for the quick reply Ben.  Yes, I suspected that the precision vs SD might be the issue, but I'm not sure I immediately follow how I haven't accounted for this...  For example, in JAGS, I then transform 
taus ~ dnorm(0.00000E+00, 100)I(0,)
pre.alpha <- pow(tau.alpha, -2)

And pre.alpha goes into the normal prior for alpha.  Isn't this the same thing then as what I've done in STAN?  Or do I need to throw a sqrt on the 100 in STAN?...  I think this is me being daft!

Again, thanks,
Andrew

EDIT:   So pg 249 suggests that normal(1,2) in Stan = dnorm(1,0.25) in JAGS.  In which case, normal(1, 0.25) is the same as the following: 
normal(1,pre)
pre <- pow(tau, -2)
tau <- 2

So, isn't this what I've done in my script, and so this isn't the problem?...

You're right, sorry. I didn't look closely enough at your .bug file. But a more likely thing is that in the .stan file you have

beta ~ normal(0.00000E+00, taub);

while in the .bug file you have

    for (i in 1:4) {
        beta[i] ~ dnorm(0.00000E+00, 1e-3)
    }

Although taub has a prior, it seems disconnected from the rest of the DAG.

Ben

Bob Carpenter

Feb 24, 2013, 3:09:24 AM
to stan-...@googlegroups.com
If you have the following in BUGS/JAGS, which
uses precision:

a ~ dnorm(b,c) I(0,)

you should write this in Stan, which uses std dev:

a ~ normal(b,pow(c,-0.5));

- Bob

On 2/24/13 1:54 AM, atzap wrote:
> Thanks for the quick reply Ben. Yes, I suspected that the precision vs SD might be the issue, but I'm not sure I
> immediately follow how I haven't accounted for this... For example, in JAGS, I then transform
> taus ~ dnorm(0.00000E+00, 100)I(0,)
> pre.alpha <- pow(tau.alpha, -2)
>
> And pre.alpha goes into the normal prior for alpha. Isn't this the same thing then as what I've done in STAN? Or do I
> need to throw a sqrt on the 100 in STAN?... I think this is me being daft!
>
> Again, thanks,
> Andrew
>
>
> On Sunday, February 24, 2013 12:25:30 AM UTC-5, Ben Goodrich wrote:
>

atzap

Feb 24, 2013, 3:16:23 AM
to stan-...@googlegroups.com

You're right, sorry. I didn't look closely enough at your .bug file. But a more likely thing is that in the .stan file you have

beta ~ normal(0.00000E+00, taub);

while in the .bug file you have

    for (i in 1:4) {
        beta[i] ~ dnorm(0.00000E+00, 1e-3)
    }

Although taub has a prior it seems disconnected from the rest of the DAG.

Ben


No worries, lots to read.  And now it is my turn to apologize.  I was testing a few things and accidentally saved that change to the .bug file.  The different answers should persist if the prior in the .bug file uses taub.  I've corrected the file.  You should be able to see now that this still leads to different parameter estimates...

Andrew 

atzap

Feb 24, 2013, 3:24:24 AM
to stan-...@googlegroups.com
Isn't this what I've done by doing the following in BUGS/JAGS:
a ~ dnorm(b, c)I(0,)
c <- pow(d, -2)
d ~ dunif(0, 100)

Which in STAN should be equal to:
a ~ normal(b,pow(pow(d, -2),-0.5))

In which case the powers cancel out and you're left with
a ~ normal(b, d)

Which is what I've done...?

Thanks
Andrew
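The cancellation in Andrew's algebra checks out; a quick illustrative check in Python (the variable `d` stands in for the uniform-prior standard deviation, not anything from the original script):

```python
import math

# pow(pow(d, -2), -0.5) collapses back to d, so feeding the JAGS precision
# c = d**-2 through Bob's c**-0.5 conversion recovers the standard deviation d.
for d in (0.5, 2.0, 100.0):
    assert math.isclose((d ** -2) ** -0.5, d)
```

In other words, when the JAGS precision is itself derived as pow(d, -2), the Stan model can simply use d as the scale directly.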

Bob Carpenter

Feb 24, 2013, 3:40:43 AM
to stan-...@googlegroups.com
I was just looking at the response from Ben, who said

--------------
taualpha ~ normal(0,100)T[0,];
taus ~ normal(0,100)T[0,];

are very different from the priors in your JAGS model

tau.alpha ~ dnorm(0.00000E+00, 100)I(0,)
taus ~ dnorm(0.00000E+00, 100)I(0,)
--------------------

The Stan sampling statement equivalent to the above JAGS is:

tau_alpha ~ normal(0,0.1) T[0,];

taus ~ normal(0,0.1) T[0,];

But as Ben said, you don't need the truncation in Stan because
the location and scale (i.e., mean and standard deviation) parameters are
constants, so you could just as well use:

tau_alpha ~ normal(0,0.1);

taus ~ normal(0,0.1);

- Bob

Ben Goodrich

Feb 24, 2013, 1:24:48 PM
to stan-...@googlegroups.com
On Sunday, February 24, 2013 3:16:23 AM UTC-5, atzap wrote:
You should be able to see now that this still leads to different parameter estimates...

Yes, now that I actually ran your code, although I shortened the Stan run, which might not have been the best idea in retrospect. Now, I am guessing that there may be multiple modes?

mu[i] <- beta[1] * xitrue[1,i] + beta[2] * xitrue[2,i] + rans[i] + beta[3]*xitrue[1,i]*xitrue[2,i] + alpha[PRED[i]];

It seems like having both rans[i] and alpha[PRED[i]] could induce multimodality that would leave the mus unaffected, but I couldn't investigate that because rans didn't get saved. But I could sort of see a problem by looking at the chains individually.

> round(t(colMeans(as.array(fit.stan))), 3)
          chains
parameters chain:1 chain:2 chain:3
  taus       0.178   0.229   0.398
  tau        0.335   0.333   0.333
  taub       0.349   0.346   0.301
  beta[1]   -0.124  -0.093  -0.192
  beta[2]    0.087   0.101  -0.245
  beta[3]    0.084   0.071   0.232
  beta[4]    0.054   0.055   0.071
  taualpha   2.183   2.046   1.814
  alpha[1]  -1.062  -1.058  -0.536
  alpha[2]  -0.978  -0.985  -1.111
  alpha[3]  -1.127  -1.122  -1.020
  mu[1]     -1.217  -1.217  -1.211
  mu[2]     -1.245  -1.245  -1.245
  mu[3]     -1.530  -1.527  -1.531
  mu[4]     -1.013  -1.012  -1.010
  mu[5]     -0.876  -0.874  -0.874
  mu[6]     -0.780  -0.781  -0.779
  mu[7]     -0.944  -0.947  -0.945
  mu[8]     -1.282  -1.282  -1.282
  lp__     492.389 490.340 479.517
> round(t(colMeans(jagsfit8$BUGSoutput$sims.array)), 3)
             [,1]    [,2]    [,3]
alpha[1]   -0.013  -0.031  -0.038
alpha[2]   -0.006  -0.020  -0.015
alpha[3]   -0.035  -0.034  -0.031
beta[1]     1.238   0.885   1.021
beta[2]     1.154   1.017   1.173
beta[3]     0.795   0.938   1.229
beta[4]     0.247   0.330  -0.045
deviance  514.308 514.954 514.549
mu[1]      -1.208  -1.218  -1.213
mu[2]      -1.244  -1.246  -1.243
mu[3]      -1.532  -1.526  -1.526
mu[4]      -1.010  -1.010  -1.012
mu[5]      -0.874  -0.876  -0.878
mu[6]      -0.775  -0.777  -0.775
mu[7]      -0.947  -0.943  -0.947
mu[8]      -1.285  -1.285  -1.288
tau         0.334   0.335   0.335
tau.alpha   0.084   0.091   0.084
taub        2.366   2.158   2.318
taus        0.092   0.076   0.078

Chain 3 in the Stan run definitely went to a different place than the first two chains. One plot I like to do if you have a television-sized monitor is

pairs(fit.stan, panel = function(...) smoothScatter(..., add = TRUE))

which is a two-dimensional kernel density estimate for each pair of parameters. As best I can tell from that plot on my laptop, Stan didn't sample the tau variables very well, but that may just be a function of me shortening the run.

Ben

atzap

Feb 24, 2013, 1:52:04 PM
to stan-...@googlegroups.com
On Sunday, February 24, 2013 1:24:48 PM UTC-5, Ben Goodrich wrote:
On Sunday, February 24, 2013 3:16:23 AM UTC-5, atzap wrote:
You should be able to see now that this still leads to different parameter estimates...

Yes, now that I actually ran your code, although I shortened the Stan run, which might not have been the best idea in retrospect. Now, I am guessing that there may be multiple modes?

mu[i] <- beta[1] * xitrue[1,i] + beta[2] * xitrue[2,i] + rans[i] + beta[3]*xitrue[1,i]*xitrue[2,i] + alpha[PRED[i]];

It seems like having both rans[i] and alpha[PRED[i]] could induce multimodality that would leave the mus unaffected, but I couldn't investigate that because rans didn't get saved. But I could sort of see a problem by looking at the chains individually.



Thanks for doing that Ben.  I hadn't thought of multimodality.  But if that were the case, shouldn't taus and taualpha be correlated (which they're not)?

Anyway, if you do run for longer, you can get STAN to converge.  It does so in a consistently different place than where JAGS converges.  Without comparing the two programs, I would have assumed that the model had fit fine.  Now, I don't know whether to trust where it ends up or whether there is in fact some multimodality that cannot be detected because, within a single program (i.e. STAN or JAGS), it always converges to the same spot.  I guess that in addition to asking what is going on here, I'm also asking how I can trust the results...  Again, the same mu's are predicted between the two programs.  

Thanks for bearing with this,
Andrew

Ben Goodrich

Feb 24, 2013, 2:27:59 PM
to stan-...@googlegroups.com
On Sunday, February 24, 2013 1:52:04 PM UTC-5, atzap wrote:
On Sunday, February 24, 2013 1:24:48 PM UTC-5, Ben Goodrich wrote:
On Sunday, February 24, 2013 3:16:23 AM UTC-5, atzap wrote:
You should be able to see now that this still leads to different parameter estimates...

Yes, now that I actually ran your code, although I shortened the Stan run, which might not have been the best idea in retrospect. Now, I am guessing that there may be multiple modes?

mu[i] <- beta[1] * xitrue[1,i] + beta[2] * xitrue[2,i] + rans[i] + beta[3]*xitrue[1,i]*xitrue[2,i] + alpha[PRED[i]];

It seems like having both rans[i] and alpha[PRED[i]] could induce multimodality that would leave the mus unaffected, but I couldn't investigate that because rans didn't get saved. But I could sort of see a problem by looking at the chains individually.



Thanks for doing that Ben.  I hadn't thought of multimodality.  But if this was the case shouldn't taus and taualpha be correlated, which they're not?

Maybe, but I would suspect correlation in the levels rather than the standard deviations. In other words, you could add a constant to all the rans and subtract a constant from all the alphas and get the same mus without affecting any of the variances. One thing you might do is restrict the alpha vector to have a sum of zero by, say, making the third element be the additive inverse of the sum of the first two.
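Ben's point about the levels trading off can be seen in a toy version of the linear predictor. The sketch below uses made-up Python stand-ins (`fixed`, `pred`, etc.), not the variables from the attached script:

```python
import random

# Toy version of the thread's linear predictor:
#   mu[i] = (fixed effects) + rans[i] + alpha[pred[i]]
random.seed(0)
n, n_groups = 8, 3
pred = [i % n_groups for i in range(n)]            # group index, like PRED[i]
fixed = [random.gauss(0, 1) for _ in range(n)]     # stands in for the beta * x terms
rans = [random.gauss(0, 1) for _ in range(n)]
alpha = [random.gauss(0, 1) for _ in range(n_groups)]

def mus(rans, alpha):
    return [fixed[i] + rans[i] + alpha[pred[i]] for i in range(n)]

# Shift every rans up by a constant and every alpha down by the same constant:
c = 0.7
shifted = mus([r + c for r in rans], [a - c for a in alpha])
assert all(abs(a - b) < 1e-12 for a, b in zip(mus(rans, alpha), shifted))
```

The mus are identical under the shift, which is exactly why the two samplers can agree on mu while reporting different rans and alpha levels.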
 

Anyway, if you do run for longer, you can get STAN to converge.  It does so in a consistently different place than where JAGS converges.  Without comparing the different software, I would have assumed that the model fitted.  Now, I don't know whether to trust where it ends up or whether there is in fact some multimodality that cannot be detected because within a single program (i.e. STAN or JAGS), it always converges to the same spot.  I guess that in addition to what is going on here, I'm also asking how can I trust the results?...  Again, the same mu's are predicted between the two programs.  

This is a limitation of the between-chain variance vs. within-chain variance approach to convergence diagnostics (and effective sample size calculations). If there is a lack of identification of some parameters or multimodality then some parameters may look good and others bad. Then again, genuine multimodality is a hard problem to deal with using MCMC. For your longer runs, can you check whether some of the purely within-chain convergence diagnostics in the coda package indicate lack of convergence for the chains individually?

Ben
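The within-chain check Ben suggests can be sketched in a few lines. This is a simplified, illustrative version of the Geweke diagnostic, not coda's geweke.diag (which estimates the variances from spectral densities):

```python
import math, random

def geweke_z(chain, first=0.1, last=0.5):
    """Geweke-style z-score: compare the mean of the first 10% of a chain to
    the mean of the last 50%. Plain sample variances are used here purely
    for illustration; coda uses spectral density estimates."""
    a = chain[: int(len(chain) * first)]
    b = chain[-int(len(chain) * last):]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

random.seed(1)
stationary = [random.gauss(0, 1) for _ in range(2000)]
drifting = [random.gauss(i / 500, 1) for i in range(2000)]  # mean is still moving

assert abs(geweke_z(stationary)) < 5   # well within sampling noise
assert abs(geweke_z(drifting)) > 5     # flagged: start and end disagree
```

A large |z| says the chain's beginning and end are sampling different distributions, i.e. the chain has not reached stationarity even if between-chain comparisons look fine.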

atzap

Feb 27, 2013, 5:38:39 PM
to
Sorry for the delay in replying.  I’ve been testing things further.



Thanks for doing that Ben.  I hadn't thought of multimodality.  But if this was the case shouldn't taus and taualpha be correlated, which they're not?

Maybe, but I would suspect correlation in the levels rather than the standard deviations. In other words, you could add a constant to all the rans and subtract a constant from all the alphas and get the same mus without affecting any of the variances. One thing you might do is restrict the alpha vector to have a sum of zero by say, making the third element be the additive inverse of the sum of the first two.

A fair point!  I've plotted out the rans's and alpha's and they don't look particularly correlated.  However, running for longer and using the within-chain diagnostics (Geweke; Heidelberg and Welch) shows that the chains aren't really converging, despite what neff, Rhat, and visual inspection of the chain traces say...  I've also tried making the level of alpha with only 1 observation the additive inverse of the other two.  The result is that JAGS continues to go to a different place in parameter space than STAN, i.e. in STAN the beta's exclude zero while the alpha's overlap it.   Moreover, the model is still not converging even after very long runs (millions of iterations).

I then accepted that I am probably over-fitting this model, and had a long, deep think about what I was doing.  This was partly motivated by the fact that I kept observing STAN diverging from JAGS in all sorts of other formulations of my model; though the ran’s were not really converging.  The good news is that I now think the ran’s are part of the problem.  Essentially, I am modelling i observations of some variable y_ij that was recorded in j groups.  I am assuming that y_ij is normal with an estimated mean mu_j and standard deviation theta.  I have then been modelling mu_j given two covariates x_j and z_j and accounting for random variation among the j groups using eta_j that is ~N(0, theta2).  This then leads to mu_j being described as:
mu_j = beta0 + beta1*x_j + beta2*z_j + beta3*x_j*z_j + eta_j.
This fits fine.  

Where it turns to custard is that I've been assuming that x_j and z_j are in fact unobserved and come from a multivariate normal with mean equal to the observed x'_j and z'_j.  In other words, x_j = x'_j + e_j and z_j = z'_j + e'_j, where e_j and e'_j are estimated errors from the multivariate.  Essentially, what I think I end up doing here is throwing in three levels of j-level variance.  This introduces correlations that then prevent the estimates of the ran's from converging, and STAN (for some reason) consistently wanders off into a different part of parameter space than JAGS.  I think this explains why an errors-in-variables model with the ran's but without the alpha's still differs between STAN and JAGS and has difficulty converging.  The estimates of e_j and e'_j can converge when I remove eta_j because they are jointly derived from the VCV of the multivariate.

Hopefully this makes sense up to here?  So now that I managed to fit the above equation for mu_j with the errors-in-variables approach, I decided to try and replace beta0 with a three-class factor observed at the level of the j's (the alpha from previously).  But again, JAGS and STAN have difficulty converging and each are wandering off into different parts of parameter space.  Is this because there is only one observation at the level of alpha[3]?  I realize this may be more of a statistical than STAN issue, but bear with me so I can tell you why I think it might be both.    

If I remove the whole errors-in-variables approach and throw the ran's back in along with the alpha's (equivalent to the function written above but replacing beta0 with alpha), STAN has different parameter estimates than JAGS.  I've saved the runs in the following workspace (https://dl.dropbox.com/u/7489119/Workspace3.RData), and the attached script describes the model and diagnostics.  The models look pretty close to converging, though there are issues with some of the Heidelberg and Welch tests.  The most concerning bit here is that the maximum likelihood solution fitted with the lmer function in R closely resembles STAN not JAGS.  I do appreciate that the parameter estimates for the beta’s are massively correlated (the errors-in-variables approach was used to account for covariation between the xi's), but I still don't understand why JAGS goes somewhere else in parameter space than STAN?  Presumably it must be because, even after millions of iterations, JAGS hasn't converged on the ML solution, while HMC manages to do so?  The "problem" in JAGS may also be due to covariation, because the alpha’s and beta’s are positively correlated, so as the beta's wander away from the ML estimates, the alpha's follow.

To summarize, it looks like STAN is doing a better job of converging than JAGS, and this is likely the source of the original differences.  Within-chain convergence diagnostics are important for helping to identify this problem.  The one lingering question is why my errors-in-variables approach has difficulty dealing with only one observation at the level of alpha[3] when the ran's are removed (see model in STAN_norans.txt)?  The one observation isn't a problem when I don't estimate x_j and z_j, and use their true observed values.

Thanks heaps for reading all this.  I know that it is a lot to ask.
Andrew
Rscript3.txt
STAN_norans.txt

Andrew Gelman

Feb 27, 2013, 5:39:35 PM
to stan-...@googlegroups.com
Hi, Andrew T.  I'm glad that Stan is helpful.  Just quickly:  the R-hat we use now is a split R-hat, which includes within as well as between-chain information.  So I don't think it should happen that the chains aren't really converging despite what neff, Rhat, and visual inspection of the chain traces say.  If this _is_ happening, I'd like to understand what's going on.  So if you could send the example of the results that weren't converging despite Rhat etc., please do so and I can take a look.  Thanks.
AG
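The split R-hat idea Gelman describes can be sketched as follows. This is a simplified illustration of the statistic, not Stan's exact implementation:

```python
import math, random

def split_rhat(chains):
    """Split R-hat: cut each chain in half and treat the halves as separate
    chains, so a slow trend *within* one chain inflates the between-'chain'
    variance B even when the chains agree with each other."""
    halves = []
    for c in chains:
        n = len(c) // 2
        halves += [c[:n], c[n:2 * n]]
    n, m = len(halves[0]), len(halves)
    means = [sum(h) / n for h in halves]
    grand = sum(means) / m
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    W = sum(sum((x - mu) ** 2 for x in h) / (n - 1)
            for h, mu in zip(halves, means)) / m
    return math.sqrt(((n - 1) / n * W + B / n) / W)

random.seed(0)
mixed = [[random.gauss(0, 1) for _ in range(1000)] for _ in range(3)]
# Three chains that all drift upward together: a purely between-chain R-hat
# sees nothing wrong, but split R-hat catches the within-chain trend.
trending = [[i / 1000 + random.gauss(0, 0.05) for i in range(1000)] for _ in range(3)]

assert split_rhat(mixed) < 1.05
assert split_rhat(trending) > 1.5
```

This is why the old (non-split) R-hat reported by coda can say "converged" for chains that a within-chain diagnostic would flag.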

atzap

Feb 27, 2013, 6:18:18 PM
to stan-...@googlegroups.com, gel...@stat.columbia.edu
Thanks for the quick reply Andrew G. and sorry again for the dense rambling post.

Please download the workspace linked above (https://dl.dropbox.com/u/7489119/Workspace3.RData) and then open up the Rscript3.txt file.  You will see that the output in the object fit.stan looks like it has converged but the heidel.diag and geweke.diag tests suggest otherwise for some of the parameters (e.g. ran's and taualpha).  More importantly, the following quote:
despite what neff, Rhat, and visual inspection of the chain traces
may be more relevant for the JAGS result.  Again, please see the jagsfit8 object in the linked workspace.  The within-chain diagnostics from coda suggest that some parameters in some of the chains haven't converged, and this is especially true when we compare the result to STAN and the lmer function.

Again, thanks for STAN!  Its performance is continuing to impress when compared to the other tools that are out there.
Andrew

Andrew Gelman

Feb 28, 2013, 10:30:35 PM
to stan-...@googlegroups.com
Hi, Andrew T.  With the Jags fits, are you doing diagnostics using coda?  The R-hat in Coda is the old R-hat that does not fully use within-chain information.  We now use split R-hat which catches these problems.
Or, to put it another way, I looked at the Stan result and saw this:

Inference for Stan model: mSEM.
3 chains: each with iter=4e+06; warmup=2e+06; thin=6250; 640 iterations saved.

          mean se_mean  sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
taus       0.1     0.0 0.2   0.0   0.0   0.1   0.2   0.6   881    1
tau        0.3     0.0 0.0   0.3   0.3   0.3   0.3   0.4   960    1
taub       0.8     0.0 1.4   0.1   0.3   0.5   0.8   3.1   960    1
beta[1]    0.1     0.0 0.2  -0.2   0.0   0.1   0.1   0.5   960    1
beta[2]    0.4     0.0 0.2  -0.1   0.2   0.4   0.5   1.0   960    1
beta[3]    0.4     0.0 0.3  -0.1   0.3   0.5   0.6   1.2   960    1
beta[4]    0.1     0.1 1.9  -1.9  -0.3   0.0   0.3   2.0   921    1
taualpha   2.0     0.1 3.0   0.5   0.9   1.3   2.1   7.3   960    1
alpha[1]  -0.9     0.0 0.2  -1.2  -0.9  -0.9  -0.8  -0.4   960    1
alpha[2]  -0.8     0.0 0.2  -1.2  -0.9  -0.8  -0.8  -0.3   960    1
alpha[3]  -0.9     0.0 0.3  -1.5  -1.0  -0.9  -0.8  -0.2   960    1
mu[1]     -1.2     0.0 0.0  -1.3  -1.2  -1.2  -1.2  -1.2   960    1
mu[2]     -1.2     0.0 0.0  -1.3  -1.3  -1.2  -1.2  -1.2   960    1
mu[3]     -1.5     0.0 0.0  -1.6  -1.6  -1.5  -1.5  -1.5   960    1
mu[4]     -1.0     0.0 0.0  -1.1  -1.0  -1.0  -1.0  -0.9   921    1
mu[5]     -0.9     0.0 0.0  -1.0  -0.9  -0.9  -0.9  -0.8   828    1
mu[6]     -0.8     0.0 0.0  -0.8  -0.8  -0.8  -0.8  -0.7   946    1
mu[7]     -0.9     0.0 0.0  -1.0  -1.0  -0.9  -0.9  -0.9   925    1
mu[8]     -1.3     0.0 0.0  -1.3  -1.3  -1.3  -1.3  -1.2   960    1
rans[1]    0.0     0.0 0.1  -0.3   0.0   0.0   0.0   0.2   960    1
rans[2]    0.0     0.0 0.2  -0.5  -0.1   0.0   0.0   0.3   743    1
rans[3]    0.0     0.0 0.2  -0.4  -0.1   0.0   0.0   0.3   893    1
rans[4]    0.0     0.0 0.2  -0.4  -0.1   0.0   0.0   0.3   924    1
rans[5]    0.0     0.0 0.1  -0.2   0.0   0.0   0.1   0.3   836    1
rans[6]    0.0     0.0 0.2  -0.5   0.0   0.0   0.1   0.4   960    1
rans[7]    0.0     0.0 0.1  -0.3  -0.1   0.0   0.0   0.2   869    1
rans[8]   -0.1     0.0 0.2  -0.6  -0.1   0.0   0.0   0.3   818    1
lp__     489.5     0.2 7.0 475.5 485.2 489.4 493.8 503.8   904    1

My first thought was that you must be running it too long!  4 million iterations looks to be overkill!  Maybe 1000 iterations or 10,000 would suffice?  But are you saying that some of these parameters (e.g., taualpha and rans[1]) did not actually converge even in the simulations shown above?

AG

atzap

Feb 28, 2013, 11:03:37 PM
to

On Thursday, February 28, 2013 10:30:35 PM UTC-5, Andrew Gelman wrote:
My first thought was that you must be running it too long!  4 million iterations looks to be overkill!  Maybe 1000 iterations or 10,000 would suffice?  But are you saying that some of these parameters (e.g., taualpha and rans[1]) did not actually converge even in the simulations shown above?


Thanks for looking at it Andrew.  I agree that it probably is overkill, though there were problems on the shorter runs.  

I am saying that: 
1) the output you printed looks like the model has converged - correct?
2) but the other diagnostics that Ben mentioned in this thread suggest otherwise (see below for just the first chain)

> fit.stan2 <- as.array(fit.stan)
> fit.stan2 <- as.mcmc.list(list(as.mcmc(fit.stan2[,1,]), as.mcmc(fit.stan2[,2,]), as.mcmc(fit.stan2[,3,])))
> heidel.diag(fit.stan2)
[[1]]
                                       
         Stationarity start     p-value
         test         iteration        
taus     passed       1         0.7712 
tau      passed       1         0.6906 
taub     passed       1         0.5310 
beta[1]  passed       1         0.2641 
beta[2]  passed       1         0.4617 
beta[3]  passed       1         0.5898 
beta[4]  passed       1         0.6979 
taualpha passed       1         0.3349 
alpha[1] passed       1         0.9918 
alpha[2] passed       1         0.3374 
alpha[3] passed       1         0.2789 
mu[1]    passed       1         0.2417 
mu[2]    passed       1         0.3070 
mu[3]    passed       1         0.2137 
mu[4]    passed       1         0.5156 
mu[5]    passed       1         0.8390 
mu[6]    passed       1         0.0533 
mu[7]    passed       1         0.9699 
mu[8]    passed       1         0.6637 
rans[1]  passed       1         0.9138 
rans[2]  passed       1         0.5671 
rans[3]  passed       1         0.9390 
rans[4]  passed       1         0.4225 
rans[5]  passed       1         0.5694 
rans[6]  passed       1         0.2915 
rans[7]  passed       1         0.4738 
rans[8]  passed       1         0.4867 
lp__     passed       1         0.6548 
                                      
         Halfwidth Mean      Halfwidth
         test                         
taus     failed      0.14290 0.016795 
tau      passed      0.33356 0.000961 
taub     failed      0.80242 0.179632 
beta[1]  failed      0.07594 0.019204 
beta[2]  passed      0.36254 0.026755 
beta[3]  passed      0.44263 0.033425 
beta[4]  failed     -0.02191 0.158124 
taualpha failed      2.19381 0.394462 
alpha[1] passed     -0.87442 0.014472 
alpha[2] passed     -0.83753 0.021542 
alpha[3] passed     -0.92614 0.032429 
mu[1]    passed     -1.21609 0.002869 
mu[2]    passed     -1.24532 0.003563 
mu[3]    passed     -1.52935 0.003460 
mu[4]    passed     -1.01090 0.003671 
mu[5]    passed     -0.88602 0.003575 
mu[6]    passed     -0.77985 0.003560 
mu[7]    passed     -0.93675 0.003846 
mu[8]    passed     -1.28404 0.002958 
rans[1]  failed     -0.00263 0.014489 
rans[2]  failed     -0.03230 0.021647 
rans[3]  failed     -0.02768 0.020586 
rans[4]  failed     -0.00127 0.021847 
rans[5]  failed      0.04382 0.013419 
rans[6]  failed      0.01564 0.019822 
rans[7]  failed     -0.02153 0.012006 
rans[8]  failed     -0.05108 0.022046 
lp__     passed    489.43918 0.788791 

3)  Irrespective of STAN, the JAGS fit suggests convergence (though with the old Rhat).  As you can see in the print out below, it converges at different values than STAN and the ML estimates.

Inference for Bugs model at "http://dl.dropbox.com/u/7489119/file6af85c783c103.bug", fit using jags,
 3 chains, each with 4e+06 iterations (first 2e+06 discarded), n.thin = 12500
 n.sims = 480 iterations saved
          mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
alpha[1]   -0.251   0.275  -0.847  -0.414  -0.166  -0.021   0.076 1.006   390
alpha[2]   -0.210   0.268  -0.832  -0.369  -0.103  -0.003   0.093 1.011   240
alpha[3]   -0.164   0.295  -0.914  -0.230  -0.046   0.009   0.172 1.018   180
beta[1]     0.693   0.285   0.098   0.512   0.737   0.904   1.148 1.013   330
beta[2]     1.183   0.362   0.407   0.971   1.259   1.438   1.733 1.009   270
beta[3]     1.501   0.452   0.511   1.266   1.604   1.837   2.124 1.017   210
mu[1]      -1.209   0.032  -1.269  -1.232  -1.211  -1.188  -1.149 1.000   480
mu[2]      -1.244   0.035  -1.314  -1.266  -1.246  -1.221  -1.175 1.001   480
mu[3]      -1.531   0.032  -1.595  -1.554  -1.530  -1.508  -1.470 1.001   480
mu[4]      -1.019   0.035  -1.089  -1.042  -1.018  -0.997  -0.947 1.004   340
mu[5]      -0.872   0.031  -0.931  -0.894  -0.872  -0.850  -0.810 0.999   480
mu[6]      -0.765   0.034  -0.831  -0.788  -0.766  -0.743  -0.699 1.008   230
mu[7]      -0.938   0.033  -1.000  -0.961  -0.937  -0.916  -0.876 1.011   260
mu[8]      -1.288   0.033  -1.351  -1.310  -1.288  -1.264  -1.225 1.001   480
rans[1]    -0.300   0.185  -0.641  -0.435  -0.314  -0.154   0.017 1.000   480
rans[2]    -0.075   0.199  -0.492  -0.208  -0.055   0.048   0.298 1.011   480
rans[3]    -0.009   0.197  -0.404  -0.118  -0.007   0.105   0.375 1.010   220
rans[4]     0.173   0.243  -0.271   0.011   0.146   0.328   0.693 1.012   480
rans[5]    -0.094   0.152  -0.433  -0.190  -0.084   0.012   0.176 1.006   480
rans[6]    -0.437   0.213  -0.751  -0.593  -0.498  -0.303   0.018 1.009   220
rans[7]    -0.217   0.149  -0.514  -0.324  -0.218  -0.107   0.030 1.006   270
rans[8]     0.166   0.196  -0.179   0.014   0.164   0.293   0.539 1.003   480
tau         0.334   0.008   0.320   0.329   0.334   0.339   0.350 0.999   480
tau.alpha   0.160   0.115   0.006   0.066   0.139   0.232   0.401 1.003   480
taub        2.277   2.350   0.465   1.120   1.605   2.551   8.711 1.006   230
taus        0.224   0.081   0.025   0.191   0.238   0.276   0.358 1.056   120
deviance  514.223   3.973 508.306 511.243 513.528 516.609 523.373 1.002   480

Michael Betancourt

Feb 28, 2013, 11:11:46 PM
to stan-...@googlegroups.com
Andrew,

Can you try running again with a smaller step size?  In particular, run with a constant step size set to 1/10th the value to which adaptation converges and see if that changes the convergence of the chain.

-Mike

On Feb 28, 2013, at 11:02 PM, atzap <ata...@gmail.com> wrote:


On Thursday, February 28, 2013 10:30:35 PM UTC-5, Andrew Gelman wrote:
My first thought was that you must be running it too long!  4 million iterations looks to be overkill!  Maybe 1000 iterations or 10,000 would suffice?  But are you saying that some of these parameters (e.g., taualpha and rans[1]) did not actually converge even in the simulations shown above?


3)  Irrespective of STAN, the JAGS fit suggests convergence (though with the old Rhat).  As you can see in the printout below, it converges at different values than STAN and the ML estimates.
--

atzap

unread,
Feb 28, 2013, 11:47:25 PM2/28/13
to stan-...@googlegroups.com


On Thursday, February 28, 2013 11:11:46 PM UTC-5, Michael Betancourt wrote:
Can you try running again with a smaller step size?  In particular, run with a constant step size set to 1/10th the value to which adaptation converges and see if that changes the convergence of the chain.

Mike
Are you suggesting something like this:
library(rjags)
bdat5$Omega <- NULL
bdat5$xi <- t(bdat5$xi)
params <- c("taus", "tau", "taub", "beta", "tau.alpha", "alpha", "mu", "rans")
its <- function() list(taub = runif(1), tau.alpha = runif(1), taus = runif(1), tau = runif(1))
m1 <- jags.model("JAGSmodel.txt",  # the model file attached later in this thread
                 data = bdat5, inits = its, n.chains = 3, n.adapt = 1, quiet = FALSE)
adapt(m1, n.iter = 100, end.adaptation = FALSE)
update(m1, n.iter = 100 * .1)
m1 <- coda.samples(m1, params, 200)
summary(m1)

The adapt function is returning TRUE after 100 iterations, but clearly it needs to update for longer before sampling...

Thanks
Andrew

Michael Betancourt

unread,
Feb 28, 2013, 11:54:22 PM2/28/13
to stan-...@googlegroups.com
Sorry, I meant with Stan.  If your model has wild spatial variations in curvature, then the step-size adaptation during warm-up can converge to a value so large that some regions of the typical probability mass cannot be reached by the Hamiltonian trajectories.

atzap

unread,
Mar 1, 2013, 12:44:18 AM3/1/13
to stan-...@googlegroups.com
Hopefully I've understood your instructions correctly this time!  

I've called attr(fit.stan@sim$samples[[1]], "sampler_params")$stepsize in my original workspace and it seems like the different chains are converging at different step-sizes (0.3 - 0.5).  I've selected the mean and am now adding equal_step_sizes = T, epsilon = .004 in the call to stan.  Is that correct?  From the manual, it looks like epsilon sets the step size?...

Michael Betancourt

unread,
Mar 1, 2013, 12:57:17 AM3/1/13
to stan-...@googlegroups.com
Yup, epsilon is correct.  0.04 should be sufficient; 0.004 is even more sensitive to possible issues but will run 10 times as long.  Go ahead and run and compare to your previous Stan results.

On Mar 1, 2013, at 12:44 AM, atzap <ata...@gmail.com> wrote:

Hopefully I've understood your instructions correctly this time!  

I've called attr(fit.stan@sim$samples[[1]], "sampler_params")$stepsize in my original workspace and it seems like the different chains are converging at different step-sizes (0.3 - 0.5).  I've selected the mean and am now adding equal_step_sizes = T, epsilon = .004 in the call to stan.  Is that correct?  From the manual, it looks like epsilon sets the step size?...

--

atzap

unread,
Mar 1, 2013, 12:59:56 AM3/1/13
to stan-...@googlegroups.com
I ran 10,000 iterations and n_eff was 2 on some of the parameters!  Will fiddle with the epsilon a bit on longer runs and see what happens.

Michael Betancourt

unread,
Mar 1, 2013, 1:20:30 AM3/1/13
to stan-...@googlegroups.com
That may be a sign that the lower step size is allowing transitions between modes.  Keep us informed.

atzap

unread,
Mar 1, 2013, 6:50:46 PM3/1/13
to stan-...@googlegroups.com
Hi Mike

It was late when I was writing my last message.  The step sizes in the original workspace were 0.03 - 0.05.  Going to an epsilon of 0.04, the four chains are entirely static and do not mix.

Running four chains with 120,000 iterations, a warmup of 100,000 and using an epsilon of 0.004, the n_eff values are still <10 on some of the parameters.  Inspecting the chain traces, there is one chain that becomes stuck at some initial values and doesn't move.  If we discard that, then the chains look like they are rapidly converging early on (much closer to Andrew's 10,000 mark!), and appear to mix well.  

I then went further down to an epsilon of 0.001 and ran for 60,000 iterations with a warmup of 40,000.  This led to all four chains mixing, with large n_eff values and Rhat near 1.  That said, the same problems arise as before when running the coda diagnostics that Ben mentioned, especially for some of the rans and betas.  I've copied the geweke.diag p-values for each chain and the heidel.diag output for one chain below.

> round(2*pnorm(-abs(sapply(geweke.diag(fit.stan2), function(x){x$z}))), 3)
          [,1]  [,2]  [,3]
taus     0.839 0.513 0.887
tau      0.754 0.814 0.482
taub     0.733 0.055 0.001
beta[1]  0.011 0.833 0.077
beta[2]  0.806 0.970 0.133
beta[3]  0.952 0.340 0.095
beta[4]  0.531 0.962 0.060
taualpha 0.423 0.721 0.967
alpha[1] 0.834 0.442 0.020
alpha[2] 0.632 0.535 0.192
alpha[3] 0.273 0.881 0.217
mu[1]    0.228 0.234 0.089
mu[2]    0.127 0.361 0.534
mu[3]    0.011 0.736 0.903
mu[4]    0.481 0.020 0.767
mu[5]    0.478 0.092 0.751
mu[6]    0.232 0.488 0.143
mu[7]    0.975 0.880 0.871
mu[8]    0.749 0.866 0.926
rans[1]  0.589 0.600 0.004
rans[2]  0.279 0.826 0.937
rans[3]  0.526 0.650 0.623
rans[4]  0.022 0.216 0.751
rans[5]  0.137 0.285 0.086
rans[6]  0.567 0.124 0.391
rans[7]  0.093 0.308 0.030
rans[8]  0.516 0.279 0.588
lp__     0.568 0.339 0.760



> heidel.diag(fit.stan2)
[[1]]
                                       
         Stationarity start     p-value
         test         iteration        
taus     passed        1        0.2079 
tau      passed        1        0.6252 
taub     passed        1        0.8027 
beta[1]  passed        1        0.4335 
beta[2]  passed        1        0.6125 
beta[3]  passed        1        0.9727 
beta[4]  passed        1        0.6211 
taualpha passed        1        0.8490 
alpha[1] passed        1        0.8295 
alpha[2] passed       41        0.2104 
alpha[3] passed        1        0.8825 
mu[1]    passed        1        0.5227 
mu[2]    passed        1        0.3857 
mu[3]    passed       21        0.3802 
mu[4]    passed        1        0.6796 
mu[5]    passed        1        0.9085 
mu[6]    passed        1        0.0950 
mu[7]    passed        1        0.8293 
mu[8]    passed        1        0.1200 
rans[1]  passed        1        0.7570 
rans[2]  passed        1        0.6307 
rans[3]  passed        1        0.7712 
rans[4]  passed        1        0.2270 
rans[5]  passed        1        0.1821 
rans[6]  passed       41        0.0802 
rans[7]  passed        1        0.1297 
rans[8]  passed        1        0.7192 
lp__     passed        1        0.1413 
                                      
         Halfwidth Mean      Halfwidth
         test                         
taus     failed      0.14275 0.02054  
tau      passed      0.33489 0.00117  
taub     failed      0.61523 0.07746  
beta[1]  failed      0.05541 0.02047  
beta[2]  failed      0.32907 0.03735  
beta[3]  passed      0.40136 0.03077  
beta[4]  failed      0.04811 0.10627  
taualpha failed      1.87726 0.20707  
alpha[1] passed     -0.89789 0.01982  
alpha[2] passed     -0.86165 0.02106  
alpha[3] passed     -0.96558 0.04503  
mu[1]    passed     -1.21366 0.00573  
mu[2]    passed     -1.24893 0.00444  
mu[3]    passed     -1.53547 0.00300  
mu[4]    passed     -1.01635 0.00462  
mu[5]    passed     -0.88770 0.00322  
mu[6]    passed     -0.77795 0.00493  
mu[7]    passed     -0.93394 0.00443  
mu[8]    passed     -1.28491 0.00540  
rans[1]  failed      0.00628 0.02295  
rans[2]  failed     -0.02526 0.02620  
rans[3]  failed     -0.04121 0.03366  
rans[4]  failed     -0.00858 0.02356  
rans[5]  failed      0.05011 0.01435  
rans[6]  failed      0.03825 0.02267  
rans[7]  failed     -0.00893 0.01398  
rans[8]  failed     -0.05312 0.02491  
lp__     passed    490.48500 1.09257 

Michael Betancourt

unread,
Mar 2, 2013, 12:14:37 AM3/2/13
to stan-...@googlegroups.com
Andrew,

What does the tree depth per sample look like when you lower the step size? By default NUTS has a maximum tree depth and for sufficiently small step sizes that might limit the motion of the chain (especially for certain variables).
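One way to inspect this, mirroring the `sampler_params` attribute access used earlier in the thread for the step size (note the `treedepth` element name here is an assumption and may differ by rstan version):

```r
# Pull per-iteration tree depths for chain 1 and check how often
# the sampler is hitting the maximum depth:
td <- attr(fit.stan@sim$samples[[1]], "sampler_params")$treedepth
table(td)
hist(td, main = "Tree depth per iteration", xlab = "tree depth")
```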
--

atzap

unread,
Mar 2, 2013, 1:09:12 AM3/2/13
to stan-...@googlegroups.com
A fair number of iterations are hitting up against that maximum.  Please see the attached PDFs.  

Should I try raising it slightly?  How high would you suggest?

Thanks again, this is becoming so very informative!
Andrew
treedep_hists.pdf
treedep_its.pdf

atzap

unread,
Mar 2, 2013, 4:11:04 AM3/2/13
to stan-...@googlegroups.com

This is a limitation of the between-chain variance vs. within-chain variance approach to convergence diagnostics (and effective sample size calculations). If there is a lack of identification of some parameters or multimodality then some parameters may look good and others bad. Then again, genuine multimodality is a hard problem to deal with using MCMC. For your longer runs, can you check whether some of the purely within-chain convergence diagnostics in the coda package indicate lack of convergence for the chains individually?

So I'm getting a funny feeling about these within-chain convergence diagnostics in coda.  If I run the two that I've been using on the schools example, they suggest that the individual chains haven't converged!

# run schools example
fit.stan <- stan(model_code = schools_code, data = schools_dat,
                 iter = 10000, warmup = 8000, thin = 10, chains = 4)

# process into MCMC list
fit.stan2 <- as.array(fit.stan)
fit.stan2 <- as.mcmc.list(list(as.mcmc(fit.stan2[, 1, ]), as.mcmc(fit.stan2[, 2, ]),
                               as.mcmc(fit.stan2[, 3, ]), as.mcmc(fit.stan2[, 4, ])))

# calculate p-values (if you believe in such things) associated with Z-scores from Geweke on each chain
round(2 * pnorm(-abs(sapply(geweke.diag(fit.stan2), function(x) {x$z}))), 3)

# calculate Heidel on each chain
heidel.diag(fit.stan2)

Clearly, this is nonsense, isn't it?  So does that mean we shouldn't be overly concerned about these specific diagnostics, especially the half-width mean test?  In which case, we return to my original issue, which is that assuming the models have converged, JAGS throws up different parameter estimates than STAN...  Sorry if I'm flogging a dead horse.

Daniel Lee

unread,
Mar 2, 2013, 6:50:19 AM3/2/13
to stan-...@googlegroups.com, stan-...@googlegroups.com
Just so we are all on the same page, can you send the
1) JAGS model
2) JAGS data
3) Stan model
4) Stan data

My gut is that there's still something inconsistently specified between the model in both languages. We have coded most of the BUGS example models (in src/stan/models/bugs_examples) and the parameter estimates match. 
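For what it's worth, one classic source of such a mismatch, worth ruling out first, is the normal parameterization: JAGS's `dnorm(mu, tau)` takes a precision, while Stan's `normal(mu, sigma)` takes a standard deviation. A quick sanity check of the translation:

```r
# JAGS dnorm(mu, tau) is N(mu, sd = 1/sqrt(tau)), so e.g. a JAGS
# prior dnorm(0, 0.01) is N(0, sd = 10), not N(0, sd = 0.01):
tau   <- 0.01
sigma <- 1 / sqrt(tau)
stopifnot(isTRUE(all.equal(sigma, 10)))
```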



Daniel
--

atzap

unread,
Mar 2, 2013, 3:23:55 PM3/2/13
to stan-...@googlegroups.com
I attached a self-contained script above that runs and compares both models.  But here are just the data and model files.  The data files are just dumps of lists from R, but renamed to *.txt so that they upload here.

I'm somewhat hesitant to follow up about my query regarding the coda chain diagnostics on the schools example (please see my most recent post above).  Perhaps it is better placed in a separate thread?  Sorry I'm asking so many convoluted questions!

Andrew
JAGSmodel.txt
STANmodel.txt
JAGSdata.txt
STANdata.txt

Michael Betancourt

unread,
Mar 2, 2013, 4:31:12 PM3/2/13
to stan-...@googlegroups.com
The idea is to compare the neighborhoods to which the chains converge as the stepsize varies but stepsize*treedepth is held constant. NUTS should keep stepsize*treedepth roughly constant provided you don't hit the maximum tree depth, so I would make sure to raise the maximum tree depth for these studies.

If there are weird geometric issues, then the chains with smaller stepsizes will behave differently than the chains with larger stepsizes. If they are consistent, then there's no obvious issue on the HMC/NUTS side.

