Here is a proposal I have been playing with for testing convergence. Let me know what you think.
As I understand it, the objections in BDA to testing a null hypothesis of convergence are
> convergence(leuk, thin = 4, R = 100)
[1] "Multivariate loss = 1"
disco(x = mat, factors = chain_id, distance = FALSE, R = R)
Distance Components: index 1.00
Source Df Sum Dist Mean Dist F-ratio p-value
factors 20 36.90887 1.84544 0.956 0.54455
Within 5229 10095.38503 1.93065
Total 5249 10132.29390
> leukfr
Inference for Stan model: leukfr.
21 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=21000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta 1.6 0.0 0.4 0.8 1.3 1.6 1.9 2.5 5954 1.0
tau 166.7 22.3 337.4 1.5 10.5 38.5 153.8 1299.9 229 1.1
dL0[1] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 9596 1.0
dL0[2] 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.1 10712 1.0
dL0[3] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 21000 1.0
dL0[4] 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.1 12603 1.0
dL0[5] 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.1 21000 1.0
dL0[6] 0.1 0.0 0.0 0.0 0.0 0.1 0.1 0.2 15104 1.0
dL0[7] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 21000 1.0
dL0[8] 0.1 0.0 0.1 0.0 0.1 0.1 0.2 0.3 10490 1.0
dL0[9] 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.2 21000 1.0
dL0[10] 0.1 0.0 0.1 0.0 0.0 0.1 0.1 0.2 21000 1.0
dL0[11] 0.1 0.0 0.1 0.0 0.1 0.1 0.1 0.3 21000 1.0
dL0[12] 0.1 0.0 0.1 0.0 0.0 0.0 0.1 0.3 21000 1.0
dL0[13] 0.1 0.0 0.1 0.0 0.0 0.1 0.1 0.3 21000 1.0
dL0[14] 0.1 0.0 0.1 0.0 0.0 0.1 0.1 0.3 21000 1.0
dL0[15] 0.1 0.0 0.1 0.0 0.0 0.1 0.1 0.3 21000 1.0
dL0[16] 0.3 0.0 0.2 0.0 0.1 0.2 0.4 0.8 21000 1.0
dL0[17] 0.5 0.0 0.4 0.1 0.2 0.4 0.6 1.3 7337 1.0
b[1] 0.1 0.0 0.3 -0.5 -0.1 0.0 0.1 0.9 3712 1.0
b[2] 0.1 0.0 0.4 -0.3 0.0 0.0 0.2 1.2 1486 1.0
b[3] 0.1 0.0 0.3 -0.5 -0.1 0.0 0.1 0.8 6507 1.0
b[4] 0.1 0.0 0.3 -0.3 0.0 0.0 0.2 1.0 1878 1.0
b[5] 0.0 0.0 0.3 -0.6 -0.1 0.0 0.1 0.6 17331 1.0
b[6] 0.0 0.0 0.3 -0.5 -0.1 0.0 0.1 0.7 13197 1.0
b[7] 0.0 0.0 0.3 -0.5 -0.1 0.0 0.1 0.7 21000 1.0
b[8] 0.1 0.0 0.3 -0.3 0.0 0.0 0.2 1.0 2022 1.0
b[9] 0.0 0.0 0.3 -0.6 -0.1 0.0 0.1 0.6 21000 1.0
b[10] 0.0 0.0 0.3 -0.6 -0.1 0.0 0.1 0.5 21000 1.0
b[11] 0.0 0.0 0.3 -0.7 -0.1 0.0 0.1 0.5 9825 1.0
b[12] 0.0 0.0 0.3 -0.5 -0.1 0.0 0.1 0.6 21000 1.0
b[13] -0.1 0.0 0.3 -0.8 -0.1 0.0 0.1 0.4 5327 1.0
b[14] 0.0 0.0 0.3 -0.5 -0.1 0.0 0.1 0.6 17650 1.0
b[15] -0.1 0.0 0.3 -0.8 -0.1 0.0 0.1 0.4 3997 1.0
b[16] 0.0 0.0 0.2 -0.6 -0.1 0.0 0.1 0.5 14528 1.0
b[17] -0.1 0.0 0.3 -0.8 -0.2 0.0 0.0 0.3 2826 1.0
b[18] 0.0 0.0 0.2 -0.6 -0.1 0.0 0.1 0.5 21000 1.0
b[19] 0.0 0.0 0.3 -0.7 -0.1 0.0 0.1 0.5 9137 1.0
b[20] -0.1 0.0 0.3 -0.8 -0.1 0.0 0.1 0.4 3914 1.0
b[21] -0.2 0.0 0.4 -1.3 -0.3 -0.1 0.0 0.2 955 1.0
sigma 0.2 0.0 0.2 0.0 0.1 0.2 0.3 0.8 472 1.0
lp__ -77.5 1.1 18.3 -109.9 -91.2 -78.2 -64.2 -41.4 292 1.1
> convergence(leukfr, R = 100, thin = 10)
[1] "Multivariate loss = 1.02"
disco(x = mat, factors = chain_id, distance = FALSE, R = R)
Distance Components: index 1.00
Source Df Sum Dist Mean Dist F-ratio p-value
factors 20 11927.45170 596.37258 5.045 0.009901
Within 2079 245741.18271 118.20163
Total 2099 257668.63440

library(rstan)
source("dependence.R")
source("convergence.R")
model_nums <- 21:82
models <- basename(stan_demo(0))[model_nums]
leuk <- stan_demo(model_nums[10], seed = 12345, chains = 21, thin = 1)
d <- dependence(leuk)
convergence(leuk, thin = 4, R = 100)
leukfr <- pstan_demo(model_nums[9], seed = 12345, chains = 21, thin = 1)
d <- dependence(leukfr)
convergence(leukfr, thin = 10, R = 100)
> Here is a proposal I have been playing with for testing convergence. Let me know what you think.
I think you need feedback from people who know more stats than me!
You should write all these contributions up as papers. I think a wider
audience than the stan-dev list would be interested.
> With Stan, I can get about 75% of the BUGS models to fail to reject a null hypothesis of convergence.
If I'm counting negations properly, that means 25% of the BUGS models
do reject the null hypothesis of convergence? That's bad, right?
> This test does not seem to depend heavily on the number of parameters (and other monitored quantities) --- except in the time and RAM needed to conduct it --- because while the absolute Euclidean distance between draws tends to increase as the dimensionality increases, the ratio of average between-chain distance to the average within-chain distance should be about 1.0 regardless of the dimensionality. The leuk model has 53 parameters.
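As an aside, the dimension-invariance claim is easy to check numerically. A minimal sketch in R (not from the thread; the chain count, draw counts, and dimensions are made up for illustration):

```r
# Sketch: when all "chains" are draws from the same distribution, the ratio
# of average between-chain to average within-chain Euclidean distance
# should be close to 1 regardless of the dimensionality.
set.seed(12345)
dist_ratio <- function(m = 4, n = 250, d = 2) {
  # m chains of n draws each from the same d-dimensional standard normal
  draws <- lapply(seq_len(m), function(i) matrix(rnorm(n * d), n, d))
  within <- mean(sapply(draws, function(x) mean(dist(x))))
  pairs <- combn(m, 2)
  between <- mean(apply(pairs, 2, function(ij) {
    D <- as.matrix(dist(rbind(draws[[ij[1]]], draws[[ij[2]]])))
    mean(D[1:n, n + 1:n])  # cross-chain block of the distance matrix
  }))
  between / within
}
# each ratio should be close to 1, even as d grows
round(sapply(c(2, 10, 53, 200), function(d) dist_ratio(d = d)), 3)
```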
>
> I think this procedure compares favorably given the limitations of the marginal Rhat and n_eff, which
> • presume adequate mixing
> • look only at mean differences
> • assume the posterior distribution is approximately normal
> • yield a different n_eff for each margin
> • make it hard to reach a conclusion about convergence when a few Rhats are not <= 1.1
I thought Andrew's advice was that all Rhats should be below 1.05?
> The last point is emphasized in a (AFAIK) largely unknown Political Analysis paper by Jeff Gill
>
> http://www.artsci.wustl.edu/~jgill/papers/converge_gill.pdf
>
> that shows that, due to the dependence among parameters, evidence of nonconvergence on any margin is evidence of joint nonconvergence in MH and Gibbs samplers. Thus, you really want to make a decision about whether to declare convergence based on some other criterion --- like the one I am outlining, which I think overcomes Gill's criticisms too --- and then use the marginal Rhat for the quantity of interest of the model.
>
If you've already declared convergence, why bother with Rhat?
N_eff is another matter, of course.
> FWIW, the dependence plot for the similar leukfr model illustrates some problem:
> [dependence plot for leukfr]
> Here, it looks as if something went wrong in the warmup phase, which resulted in an increase in the dependence as soon as we started retaining the draws. And the dependence fades slowly, so the chains would have to be heavily thinned to test the null hypothesis of convergence. Nevertheless, the print output does not look that bad from the typical user's perspective.
This is worrisome. Though I'd say tau looks rather problematic in
this model, whatever it is.
I wouldn’t be surprised if lots of the BUGS models don’t converge, but we have to be really careful with multivariate tests like this. There is not a lot of literature on multivariate MC-CLTs (i.e. none at all), but best case, if we extrapolate the linear results, then you’d have to take into account the differing ESS for the covariance estimates in the ANOVA, which is going to be a nightmare.
Almost. The biggest question is what is the asymptotic distribution of a multivariate Monte Carlo estimator? The marginal variances (and diagonal elements of the covariance matrix) will be the usual, true marginal variance / ESS, but it’s less clear what the off-diagonal elements will be. Even if the off-diagonals scale as true covariance / ESS, we have n(n - 1) / 2 ESSs to contend with and there’s no strong theoretical argument (as far as I know) that motivates why they should be the same. And if they’re not the same then there’s no consistent way to define a joint ESS for tests like yours.

In particular, if you’re going to claim a joint convergence diagnostic then it should presumably be one that is invariant to parameter transformations.
> convergence(leuk, thin = 4, R = 100)
[1] "Multivariate loss = 1"
disco(x = mat, factors = chain_id, distance = FALSE, R = R)
Distance Components: index 1.00
Source Df Sum Dist Mean Dist F-ratio p-value
factors 20 36.90887 1.84544 0.956 0.54455
Within 5229 10095.38503 1.93065
Total 5249 10132.29390
> So 10095.38503 / 5229 = 1.93065, which in turn is the denominator for 1.84544 / 1.93065 = 0.956. But is 5229 the right Df if the draws are not independent?
If so, then I think there are two things to consider. First, the 5229 is a constant when they do the permutations to get the p-value, so I am not sure it makes a difference as to whether 5229 is really the ESS. Second, if you thin to the point that you think the retained draws are approximately independent, then I would guess that the ESS is approximately equal to the number of retained draws.

Alternatively, one could thin the chains so that approximate independence of the retained draws is a reasonable assumption and obtain the p-value for a test of the null hypothesis that all the chains have the same distribution. To do this, we can use the distance correlation concept of Székely, Rizzo, and Bakirov (2007)
https://projecteuclid.org/download/pdfview_1/euclid.aos/1201012979
which has the property that two possibly multivariate random variables A and B are independent iff the distance correlation between them is zero. We can estimate the distance correlation at time t by rowwise stacking the m draws at time t into the matrix A and rowwise stacking the m draws at time t - j into the matrix B for some j > 0 lags (the columns of both A and B are the posterior parameters). There is a bias-corrected version of the distance correlation estimator in the bcdcor() function in the energy package for R. We can then plot nonparametric curves of these distance correlations over time by j to see what lag length renders the draws essentially independent. This particular graph is fairly easy to interpret, such as in the case of the leuk model from the BUGS examples:
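A rough sketch of that lag plot in R (the function and variable names here are mine, not from the thread; it assumes a draws array with dimensions iterations x chains x parameters, as returned by rstan's extract(fit, permuted = FALSE), and it averages over t for simplicity rather than plotting nonparametric curves over time):

```r
library(energy)  # for bcdcor(), the bias-corrected distance correlation

# For each lag j, stack the m chains' draws at time t into A and at time
# t - j into B, estimate bcdcor(A, B), and average over a subsample of
# times t (subsampling keeps the sketch cheap).
dcor_by_lag <- function(draws, lags = 1:20, times = NULL) {
  n_iter <- dim(draws)[1]
  if (is.null(times)) times <- seq(max(lags) + 1, n_iter, by = 10)
  sapply(lags, function(j) {
    mean(sapply(times, function(t) {
      A <- draws[t, , ]      # m x p: one row per chain at time t
      B <- draws[t - j, , ]  # m x p: one row per chain at time t - j
      bcdcor(A, B)
    }))
  })
}

# e.g., with a stanfit object fit:
# dc <- dcor_by_lag(rstan::extract(fit, permuted = FALSE))
# plot(seq_along(dc), dc, type = "b", xlab = "lag j",
#      ylab = "bias-corrected distance correlation")
```

The lag at which the curve flattens near zero suggests how heavily to thin before treating the retained draws as approximately independent.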