[RFC] Convergence test


Ben Goodrich

Feb 24, 2014, 5:54:33 PM
to stan...@googlegroups.com

Here is a proposal I have been playing with for testing convergence. Let me know what you think.

As I understand it, the objections in BDA to testing a null hypothesis of convergence are

  1. This null hypothesis is presumably false so there is no point in testing it
  2. (corollary of 1) Even if such a test is statistically significant, the difference between the simulated distribution of the quantity of interest and its theoretical distribution may be substantively negligible, so what you really need is a loss function rather than a hypothesis test
  3. Almost all of the tests in the coda and boa R packages operate on the margins, which are not independent, so you aren't testing a null hypothesis of joint convergence anyway
I think each of these objections (plus a few more) can be overcome, at least for Stan. This paper by Rizzo and Székely (2010)

https://projecteuclid.org/download/pdfview_1/euclid.aoas/1280842151

generalizes the (M)ANOVA test with an index parameter alpha on the (0,2] interval. When alpha = 2, the test simplifies to a (M)ANOVA test of equality of means across groups (or chains in the MCMC context). When alpha is on the (0,2) interval, the test is consistent for the more general null hypothesis that the groups (chains) have the same distribution (not required to be multivariate normal). It produces average between-group (B) and within-group (W) distances that are similar to the average between-group and within-group sums of squares in an ANOVA decomposition (or Rhat), although the resulting "F-statistic" is not exactly distributed F under the null hypothesis of no group-wise differences in distributions, so you have to do permutations to get p-values. In any event, this test is implemented in the disco() function in the energy R package.
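For intuition about what disco() is computing, here is a rough sketch of the distance-components decomposition and its permutation p-value, in Python rather than R for illustration only (alpha_dist, disco_f, and disco_pvalue are made-up names, and the energy package's actual implementation differs in details):

```python
import numpy as np

def alpha_dist(x, alpha=1.0):
    """All-pairs Euclidean distances raised to the index alpha in (0, 2]."""
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1) ** alpha

def disco_f(x, labels, alpha=1.0):
    """Distance-components decomposition: returns (between, within, F-ratio)."""
    n, k = len(x), len(np.unique(labels))
    total = alpha_dist(x, alpha).sum() / (2 * n)
    within = sum(alpha_dist(x[labels == g], alpha).sum() / (2 * np.sum(labels == g))
                 for g in np.unique(labels))
    between = total - within  # nonnegative for alpha in (0, 2]
    return between, within, (between / (k - 1)) / (within / (n - k))

def disco_pvalue(x, labels, alpha=1.0, R=199, seed=None):
    """Permutation p-value: scramble the chain labels, count larger F-ratios."""
    rng = np.random.default_rng(seed)
    f_obs = disco_f(x, labels, alpha)[2]
    exceed = sum(disco_f(x, rng.permutation(labels), alpha)[2] >= f_obs
                 for _ in range(R))
    return (1 + exceed) / (1 + R)
```

Here labels would be the chain ids and the rows of x the (thinned) draws, so a small p-value is evidence that the chains have not converged to a common distribution.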

However, this test presumes the realizations in each group are independent, which is not true for realizations from the same Markov chain. One could just ignore this problem and construct something similar to the multivariate Brooks-Gelman-Rubin statistic with sqrt(T / W), where T = (n - 1) / n * W + 1 / n * B, to obtain a conservative loss function, but that suffers from the same problem as the actual multivariate Brooks-Gelman-Rubin statistic in that it is hard to interpret as a scale factor of anything in particular.
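To make the arithmetic concrete, a minimal sketch (the scale_factor name is hypothetical; W and B are the mean within-chain and between-chain distances, n the number of retained draws per chain):

```python
def scale_factor(W, B, n):
    """BGR-style loss sqrt(T / W) with T = (n - 1) / n * W + 1 / n * B."""
    T = (n - 1) / n * W + B / n
    return (T / W) ** 0.5

# With the mean distances from the leuk disco output below (W = 1.93065,
# B = 1.84544) and n = 250 retained draws per chain, the factor is ~1.
```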

Alternatively, one could thin the chains so that approximate independence of retained draws is a reasonable assumption and obtain the p-value for a test of the null hypothesis that all the chains have the same distribution. To do this, we can use the distance correlation concept of Székely, Rizzo, and Bakirov (2007)

https://projecteuclid.org/download/pdfview_1/euclid.aos/1201012979

which has the property that two possibly multivariate random variables A and B are independent iff the distance correlation between them is zero. We can estimate the distance correlation at time t by rowwise stacking the m draws at time t into the matrix A and rowwise stacking the m draws at time t - j into the matrix B for some j > 0 lags (the columns of both A and B are the posterior parameters). There is a bias-corrected version of the distance correlation estimator in the bcdcor() function in the energy package for R. We can then plot nonparametric curves of these distance correlations over time, one for each lag j, to see what lag length renders the draws essentially independent. This particular graph is fairly easy to interpret, such as in the case of the leuk model from the BUGS examples:

This plot is pretty typical of those in the BUGS corpus when the MCMC goes well in that it
  • exhibits declining dependence during the warmup period
  • exhibits fairly constant dependence after the warmup period
  • exhibits essential independence after a handful of lags (and this might improve once we bump the target acceptance rate up)
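The lag-j estimator described above can be sketched as follows (a plain, uncorrected sample distance correlation in Python for illustration; the actual analysis uses the bias-corrected bcdcor() from the energy package):

```python
import numpy as np

def dcor(A, B):
    """Sample distance correlation between row-matched samples A and B
    (uncorrected V-statistic version; always in [0, 1])."""
    def centered(X):
        # double-centered Euclidean distance matrix
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        return d - d.mean(0, keepdims=True) - d.mean(1, keepdims=True) + d.mean()
    a, b = centered(A), centered(B)
    dcov2 = (a * b).mean()                      # squared distance covariance
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    return np.sqrt(dcov2 / denom) if denom > 0 else 0.0
```

To build the plot, A would hold the m chains' draws at iteration t and B the draws at iteration t - j, giving one point per (t, j) pair.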
In this case, it looks like we need to thin by retaining every third or fourth draw. If so, the resulting convergence test is

> convergence(leuk, thin = 4, R = 100)
[1] "Multivariate loss = 1"
disco(x = mat, factors = chain_id, distance = FALSE, R = R)

Distance Components: index  1.00
Source                 Df      Sum Dist  Mean Dist  F-ratio  p-value
factors                20      36.90887    1.84544    0.956  0.54455
Within               5229   10095.38503    1.93065
Total                5249   10132.29390

With Stan, I can get about 75% of the BUGS models to fail to reject a null hypothesis of convergence. This test does not seem to depend heavily on the number of parameters (and other monitored quantities) --- except in the time and RAM needed to conduct it --- because while the absolute Euclidean distance between draws tends to increase as the dimensionality increases, the ratio of average between-chain distance to the average within-chain distance should be about 1.0 regardless of the dimensionality. The leuk model has 53 parameters.
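That dimension-invariance claim is easy to check with a toy simulation (between_within_ratio is a hypothetical helper; iid normal "chains" stand in for converged, well-mixed output):

```python
import numpy as np

def between_within_ratio(draws):
    """draws has shape (chains, n, dim); returns the mean between-chain
    Euclidean distance divided by the mean within-chain distance."""
    m, n, _ = draws.shape
    within = np.mean([np.linalg.norm(c[:, None] - c[None, :], axis=-1).sum()
                      / (n * (n - 1)) for c in draws])
    between = np.mean([np.linalg.norm(draws[i][:, None] - draws[j][None, :],
                                      axis=-1).mean()
                       for i in range(m) for j in range(i + 1, m)])
    return between / within

rng = np.random.default_rng(7)
for dim in (2, 53, 500):
    r = between_within_ratio(rng.normal(size=(4, 100, dim)))
    # the ratio stays near 1.0 even as the absolute distances grow with dim
```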

I think this procedure is preferable when considering the limitations of the marginal Rhat and n_eff, which
  • presume adequate mixing
  • look only at mean differences
  • assume the posterior distribution is approximately normal
  • yield a different n_eff for each margin
  • make it hard to reach a conclusion about convergence when a few Rhats are not <= 1.1
The last point is emphasized in an (AFAIK largely unknown) Political Analysis paper by Jeff Gill

http://www.artsci.wustl.edu/~jgill/papers/converge_gill.pdf

which shows that, due to the dependence among parameters, evidence of nonconvergence on any margin is evidence of joint nonconvergence in MH and Gibbs samplers. Thus, you really want to make a decision about whether to declare convergence based on some other criterion --- like the one I am outlining, which I think overcomes Gill's criticisms too --- and then use the marginal Rhat for the quantity of interest of the model.

FWIW, the dependence plot for the similar leukfr model illustrates a problem:

Here, it looks as if something went wrong in the warmup phase, which resulted in an increase in the dependence as soon as we started retaining the draws. And the dependence is fading slowly, so the chains would have to be heavily thinned to test the null hypothesis of convergence. Nevertheless, the print output does not look that bad from the typical user's perspective:

> leukfr
Inference for Stan model: leukfr.
21 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=21000.

         mean se_mean    sd   2.5%   25%   50%   75%  97.5% n_eff Rhat
beta      1.6     0.0   0.4    0.8   1.3   1.6   1.9    2.5  5954  1.0
tau     166.7    22.3 337.4    1.5  10.5  38.5 153.8 1299.9   229  1.1
dL0[1]    0.0     0.0   0.0    0.0   0.0   0.0   0.0    0.1  9596  1.0
dL0[2]    0.0     0.0   0.0    0.0   0.0   0.0   0.1    0.1 10712  1.0
dL0[3]    0.0     0.0   0.0    0.0   0.0   0.0   0.0    0.1 21000  1.0
dL0[4]    0.0     0.0   0.0    0.0   0.0   0.0   0.1    0.1 12603  1.0
dL0[5]    0.0     0.0   0.0    0.0   0.0   0.0   0.1    0.1 21000  1.0
dL0[6]    0.1     0.0   0.0    0.0   0.0   0.1   0.1    0.2 15104  1.0
dL0[7]    0.0     0.0   0.0    0.0   0.0   0.0   0.0    0.1 21000  1.0
dL0[8]    0.1     0.0   0.1    0.0   0.1   0.1   0.2    0.3 10490  1.0
dL0[9]    0.0     0.0   0.0    0.0   0.0   0.0   0.1    0.2 21000  1.0
dL0[10]   0.1     0.0   0.1    0.0   0.0   0.1   0.1    0.2 21000  1.0
dL0[11]   0.1     0.0   0.1    0.0   0.1   0.1   0.1    0.3 21000  1.0
dL0[12]   0.1     0.0   0.1    0.0   0.0   0.0   0.1    0.3 21000  1.0
dL0[13]   0.1     0.0   0.1    0.0   0.0   0.1   0.1    0.3 21000  1.0
dL0[14]   0.1     0.0   0.1    0.0   0.0   0.1   0.1    0.3 21000  1.0
dL0[15]   0.1     0.0   0.1    0.0   0.0   0.1   0.1    0.3 21000  1.0
dL0[16]   0.3     0.0   0.2    0.0   0.1   0.2   0.4    0.8 21000  1.0
dL0[17]   0.5     0.0   0.4    0.1   0.2   0.4   0.6    1.3  7337  1.0
b[1]      0.1     0.0   0.3   -0.5  -0.1   0.0   0.1    0.9  3712  1.0
b[2]      0.1     0.0   0.4   -0.3   0.0   0.0   0.2    1.2  1486  1.0
b[3]      0.1     0.0   0.3   -0.5  -0.1   0.0   0.1    0.8  6507  1.0
b[4]      0.1     0.0   0.3   -0.3   0.0   0.0   0.2    1.0  1878  1.0
b[5]      0.0     0.0   0.3   -0.6  -0.1   0.0   0.1    0.6 17331  1.0
b[6]      0.0     0.0   0.3   -0.5  -0.1   0.0   0.1    0.7 13197  1.0
b[7]      0.0     0.0   0.3   -0.5  -0.1   0.0   0.1    0.7 21000  1.0
b[8]      0.1     0.0   0.3   -0.3   0.0   0.0   0.2    1.0  2022  1.0
b[9]      0.0     0.0   0.3   -0.6  -0.1   0.0   0.1    0.6 21000  1.0
b[10]     0.0     0.0   0.3   -0.6  -0.1   0.0   0.1    0.5 21000  1.0
b[11]     0.0     0.0   0.3   -0.7  -0.1   0.0   0.1    0.5  9825  1.0
b[12]     0.0     0.0   0.3   -0.5  -0.1   0.0   0.1    0.6 21000  1.0
b[13]    -0.1     0.0   0.3   -0.8  -0.1   0.0   0.1    0.4  5327  1.0
b[14]     0.0     0.0   0.3   -0.5  -0.1   0.0   0.1    0.6 17650  1.0
b[15]    -0.1     0.0   0.3   -0.8  -0.1   0.0   0.1    0.4  3997  1.0
b[16]     0.0     0.0   0.2   -0.6  -0.1   0.0   0.1    0.5 14528  1.0
b[17]    -0.1     0.0   0.3   -0.8  -0.2   0.0   0.0    0.3  2826  1.0
b[18]     0.0     0.0   0.2   -0.6  -0.1   0.0   0.1    0.5 21000  1.0
b[19]     0.0     0.0   0.3   -0.7  -0.1   0.0   0.1    0.5  9137  1.0
b[20]    -0.1     0.0   0.3   -0.8  -0.1   0.0   0.1    0.4  3914  1.0
b[21]    -0.2     0.0   0.4   -1.3  -0.3  -0.1   0.0    0.2   955  1.0
sigma     0.2     0.0   0.2    0.0   0.1   0.2   0.3    0.8   472  1.0
lp__    -77.5     1.1  18.3 -109.9 -91.2 -78.2 -64.2  -41.4   292  1.1

although the ratio of n_eff to nm (the total number of post-warmup draws) on some margins suggests that there is a mixing problem, in which case you would want to be skeptical of the output. With a thin interval of 10 (which is probably too small to achieve independence), you would reject the null hypothesis of convergence

> convergence(leukfr, R = 100, thin = 10)
[1] "Multivariate loss = 1.02"
disco(x = mat, factors = chain_id, distance = FALSE, R = R)

Distance Components: index  1.00
Source                 Df      Sum Dist  Mean Dist  F-ratio   p-value
factors                20   11927.45170  596.37258    5.045  0.009901
Within               2079  245741.18271  118.20163
Total                2099  257668.63440

even though the multivariate loss of 1.02 is small based on the criteria usually used to evaluate an Rhat.

The commands needed to do this stuff in RStan (it should also work with an mcmc.list from coda) are below; they source the attached files.

library(rstan)
source("dependence.R")
source("convergence.R")
model_nums <- 21:82
models <- basename(stan_demo(0))[model_nums]
leuk <- stan_demo(model_nums[10], seed = 12345, chains = 21, thin = 1)
d <- dependence(leuk)
convergence(leuk, thin = 4, R = 100)
leukfr <- stan_demo(model_nums[9], seed = 12345, chains = 21, thin = 1)
d <- dependence(leukfr)
convergence(leukfr, thin = 10, R = 100)

Ben

dependence.R
convergence.R

Bob Carpenter

Feb 25, 2014, 2:21:28 PM
to stan...@googlegroups.com

On Feb 24, 2014, at 10:54 PM, Ben Goodrich <goodri...@gmail.com> wrote:

> Here is a proposal I have been playing with for testing convergence. Let me know what you think.

I think you need feedback from people who know more stats than me!

You should write all these contributions up as papers. I think a wider
audience than the stan-dev list would be interested.

> With Stan, I can get about 75% of the BUGS models to fail to reject a null hypothesis of convergence.

If I'm counting negations properly, that means 25% of the BUGS models
do reject the null hypothesis of convergence? That's bad, right?

> This test does not seem to depend heavily on the number of parameters (and other monitored quantities) --- except in the time and RAM needed to conduct it --- because while the absolute Euclidean distance between draws tends to increase as the dimensionality increases, the ratio of average between-chain distance to the average within-chain distance should be about 1.0 regardless of the dimensionality. The leuk model has 53 parameters.
>
> I think this procedure is favorable when considering the limitations of the marginal Rhat and n_eff, which
> • presumes adequate mixing
> • looks only at mean differences
> • assumes the posterior distribution is approximately normal
> • yields a different n_eff for each margin
> • makes it hard to make a conclusion about convergence when a few Rhats are not <= 1.1

I thought Andrew's advice was that all Rhats should be below 1.05?

> The last point is emphasized in a largely unknown AFAIK Political Analysis paper by Jeff Gill
>
> http://www.artsci.wustl.edu/~jgill/papers/converge_gill.pdf
>
> that shows due to the parameter dependence that evidence of nonconvergence on any margin is evidence of nonconvergence jointly in MH and Gibbs samplers. Thus, you really want to make a decision about whether to declare convergence based on some other criterion --- like the one I am outlining which I think overcomes Gill's criticisms too --- and then use the marginal Rhat for the quantity of interest of the model.
>

If you've already declared convergence, why bother with Rhat?

N_eff is another matter, of course.

> FWIW, the dependence plot for the similar leukfr model illustrates some problem:
>
>
>
> Here, it looks as if something went wrong in the warmup phase, which resulted in an increase in the dependence as soon as we started retaining the draws. And the dependence is fading slowly so that the chains would have to be heavily thinned to test the null hypothesis of convergence. Nevertheless, the print output does not look that bad from the typical user's perspective

This is worrisome. Though I'd say tau looks rather problematic in
this model, whatever it is.

- Bob

Ben Goodrich

Feb 25, 2014, 2:40:16 PM
to stan...@googlegroups.com
On Tuesday, February 25, 2014 2:21:28 PM UTC-5, Bob Carpenter wrote:

> Here is a proposal I have been playing with for testing convergence. Let me know what you think.

I think you need feedback from people who know more stats than me!

You should write all these contributions up as papers.  I think a wider
audience than the stan-dev list would be interested.

I was planning to dump it on my QMSS students Thursday :)
 
> With Stan, I can get about 75% of the BUGS models to fail to reject a null hypothesis of convergence.

If I'm counting negations properly, that means 25% of the BUGS models
do reject the null hypothesis of convergence?  That's bad, right?

Yes, but the exceptions include cases where the model doesn't parse (fire), or is messed up and thereby excluded from test-models for other reasons. In addition, there are some models that probably would converge with saner priors than inv_gamma(0,0) and whatnot. There weren't many cases where I tried hard to get a seemingly reasonable BUGS model to converge and it wasn't happening.

>  This test does not seem to depend heavily on the number of parameters (and other monitored quantities) --- except in the time and RAM needed to conduct it --- because while the absolute Euclidean distance between draws tends to increase as the dimensionality increases, the ratio of average between-chain distance to the average within-chain distance should be about 1.0 regardless of the dimensionality. The leuk model has 53 parameters.
>
> I think this procedure is favorable when considering the limitations of the marginal Rhat and n_eff, which
>         • presumes adequate mixing
>         • looks only at mean differences
>         • assumes the posterior distribution is approximately normal
>         • yields a different n_eff for each margin
>         • makes it hard to make a conclusion about convergence when a few Rhats are not <= 1.1

I thought Andrew's advice was that all Rhats should be below 1.05?

Andrew is a random variable. BDA III p. 287 was my source for 1.1.
 
> The last point is emphasized in a largely unknown AFAIK Political Analysis paper by Jeff Gill
>
> http://www.artsci.wustl.edu/~jgill/papers/converge_gill.pdf
>
> that shows due to the parameter dependence that evidence of nonconvergence on any margin is evidence of nonconvergence jointly in MH and Gibbs samplers. Thus, you really want to make a decision about whether to declare convergence based on some other criterion --- like the one I am outlining which I think overcomes Gill's criticisms too --- and then use the marginal Rhat for the quantity of interest of the model.
>

If you've already declared convergence, why bother with Rhat?

N_eff is another matter, of course.

Yeah.
 

> FWIW, the dependence plot for the similar leukfr model illustrates some problem:
>
>
>
> Here, it looks as if something went wrong in the warmup phase, which resulted in an increase in the dependence as soon as we started retaining the draws. And the dependence is fading slowly so that the chains would have to be heavily thinned to test the null hypothesis of convergence. Nevertheless, the print output does not look that bad from the typical user's perspective

This is worrisome.  Though I'd say tau looks rather problematic in
this model, whatever it is.  

It's a precision with a gamma(.001, .001) prior. This is among the likely sources of the bad-looking graphs I mentioned above.

Ben

Ben Goodrich

Feb 25, 2014, 3:01:25 PM
to stan...@googlegroups.com
On Tuesday, February 25, 2014 2:21:28 PM UTC-5, Bob Carpenter wrote:

> Here is a proposal I have been playing with for testing convergence. Let me know what you think.

I think you need feedback from people who know more stats than me!

You should write all these contributions up as papers.  I think a wider
audience than the stan-dev list would be interested.

Let's see how it goes first. The stats can be worked out, or really they already have been by Rizzo and Székely (albeit not in reference to MCMC). What I think has happened is this:

In the beginning, there was a lot of FUD perpetuated by maximum likelihoodists about convergence of Bayesian models; e.g., what if you think your chains have converged and you publish the results, but really your chains haven't converged yet, and if you ran them a little longer they would converge to something that caused you to reach the opposite conclusion about your quantity of interest? And then there was BUGS, where you can burn in for hundreds of thousands of iterations and still maybe not feel confident about declaring convergence. And then there was a litany of convergence diagnostics that individually weren't conclusive and collectively could be contradictory. But BDA doesn't talk about many of them, so there is a generation of people that have basically come to an equilibrium where you verify that all the Rhats are sufficiently small and then decide that the chains have converged. And then you have the problem that Jeff Gill talks about, where convergence is a joint concept but is judged marginally, and you can easily get into a situation in hierarchical models where a minority of Rhats are not so small: can you rely on (functions of) the other parameters to make inferences?

So, what I think is needed is something automatable ("do this, then that, and decide on convergence") that is backed by good statistical theory (i.e., it's joint rather than marginal, it tests for differences in distributions rather than merely differences in means, it doesn't require multivariate normality, it exposes poor mixing rather than assuming adequate mixing, etc.).

Ben



Michael Betancourt

Feb 25, 2014, 3:06:42 PM
to stan...@googlegroups.com
I wouldn’t be surprised if lots of the BUGS models don’t converge, but
we have to be really careful with multivariate tests like this.  There is
not a lot of literature on multivariate MC-CLTs (i.e. none at all), but in the
best case, if we extrapolate the linear results, you’d have to feed the
differing ESSs for the covariance estimates into the ANOVA,
which is going to be a nightmare.

Bob Carpenter

Feb 25, 2014, 4:55:17 PM
to stan...@googlegroups.com

On Feb 25, 2014, at 7:40 PM, Ben Goodrich <goodri...@gmail.com> wrote:

> On Tuesday, February 25, 2014 2:21:28 PM UTC-5, Bob Carpenter wrote:
>
> I thought Andrew's advice was that all Rhats should be below 1.05?
>
> Andrew is a random variable. BDA III p. 287 was my source for 1.1.

Andrew? Want to commit to a value for advice we should give people?

And thanks for all the other clarifications, Ben. I don't feel so bad.
I think Stan's just doing a better job of exploring the true posteriors
for those super-diffuse priors.

We should write that up, too!

- Bob

Jiqiang Guo

Feb 25, 2014, 5:18:22 PM
to stan...@googlegroups.com
What if we run the tests on the simulations produced by JAGS?

Jiqiang 


Ben Goodrich

Feb 25, 2014, 6:34:49 PM
to stan...@googlegroups.com
On Tuesday, February 25, 2014 3:06:42 PM UTC-5, Michael Betancourt wrote:
I wouldn’t be surprised if lots of the BUGS models don’t converge, but
we have to be really careful with multivariate tests like this.  There is
not a lot of literature on multivariate MC-CLTs (i.e. none at all) but best
case if we extrapolate the linear results then you’d have to take into 
account the differing ESS for the covariance estimates into the ANOVA
which is going to be a nightmare.

Granted, the original paper was focused on the usual situation where someone is trying to decide if there are between-group differences in data, and it doesn't explicitly talk about MCMC where the groups are chains. But aside from the fact that the test assumes independent observations (which isn't the case for a Markov chain, so you have to thin enough for approximate independence), I don't see anything about the paper that would make it inapplicable to Markov chain output.

For alpha = 2 and one variable, the test is equivalent to the ANOVA test, which uses CLT stuff to get the null distribution. But the main point is to use alpha < 2, so that you are testing differences in distribution rather than merely differences in means. Since the test explicitly doesn't require finite second moments, I'm not sure where the CLT would enter the consideration. In section 3.3, they do show that the asymptotic distribution of the test statistic is a weighted sum of squared Gaussians, but they don't do anything with that except show that the sum diverges under the alternative hypothesis. To simulate the p-values, they do permutation tests, basically scrambling the group structure and counting the proportion of times you get a more extreme statistic. So I'm not sure what would need to converge in distribution to a Gaussian in order for this to be a viable test.

Maybe you were saying that the multivariate MC-CLT enters somehow when deciding how much to thin the output before doing the test? I guess I'm distorting the size of the test a little bit by looking at the output to decide how much to thin, but I doubt that would be a major thing.

Ben

Michael Betancourt

Feb 25, 2014, 7:08:25 PM
to stan...@googlegroups.com
Almost.  The biggest question is what is the asymptotic distribution of a multivariate
Monte Carlo estimator?  The marginal variances (and diagonal elements of the
covariance matrix) will be the usual, true marginal variance / ESS, but it’s less
clear what the off-diagonal elements will be.  Even if the off-diagonals scale
as true covariance / ESS, we have n(n - 1) / 2 ESSs to contend with and there’s
no strong theoretical argument (as far as I know) that motivates why they should
be the same.  And if they’re not the same then there’s no consistent way to define
a joint ESS for tests like yours.

In particular, if you’re going to claim a joint convergence diagnostic then it should
presumably be one that is invariant to parameter transformations.

Ben Goodrich

Feb 25, 2014, 7:54:39 PM
to stan...@googlegroups.com
On Tuesday, February 25, 2014 7:08:25 PM UTC-5, Michael Betancourt wrote:
Almost.  The biggest question is what is the asymptotic distribution of a multivariate
Monte Carlo estimator?  The marginal variances (and diagonal elements of the
covariance matrix) will be the usual, true marginal variance / ESS, but it’s less
clear what the off-diagonal elements will be.  Even if the off-diagonals scale
as true covariance / ESS, we have n(n -1) / 2 ESSs to contend with and there’s
no strong theoretical argument (as far as I know) that motivates why they should
be the same.  And if they’re not the same then there’s no consistent way to define
a joint ESS for tests like yours.

In particular, if you’re going to claim a joint convergence diagnostic then it should
presumably be one that is invariant to parameter transformations.

The distance correlation measure of dependence is invariant to scale and rotation. But for other kinds of transformation, I guess it is up to the user as to what functions to monitor for between-chain differences when doing the tests.

Correct me if I am still misunderstanding, but I think you are saying that the within-chain degrees of freedom that the test is using should be equal to the unknown ESS. So, like in the original example I gave

> convergence(leuk, thin = 4, R = 100)
[1] "Multivariate loss = 1"

disco(x = mat, factors = chain_id, distance = FALSE, R = R)

Distance Components: index  1.00
Source                 Df      Sum Dist  Mean Dist  F-ratio  p-value
factors                20      36.90887    1.84544    0.956  0.54455
Within               5229   10095.38503    1.93065
Total                5249   10132.29390

you are saying that 5229 (21 * 1000 / 4 - 21) is not the right denominator for 10095.38503 / 5229 = 1.93065, which in turn is not the right denominator for 1.84544 / 1.93065 = 0.956 ?
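(For reference, the arithmetic behind those numbers can be checked directly; nothing here beyond the printed table:)

```python
# Degrees of freedom and ratios implied by the disco output for leuk:
# 21 chains x (1000 draws / thin of 4) = 5250 retained draws in total.
n, k = 21 * 1000 // 4, 21
within_df = n - k                         # 5229
mean_within = 10095.38503 / within_df     # ~1.93065 (Mean Dist, Within)
mean_between = 36.90887 / (k - 1)         # ~1.84544 (Mean Dist, factors)
f_ratio = mean_between / mean_within      # ~0.956  (F-ratio)
```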

If so, then I think there are two things to consider. First, the 5229 is a constant when they do the permutations to get the p-value, so I am not sure it makes a difference as to whether 5229 is really the ESS. Second, if you thin to the point that you think the retained draws are approximately independent, then I would guess that the ESS is approximately equal to the number of retained draws.

Then there is the question of whether what you are saying adversely affects the distance correlation vs. iteration plot used to decide the thin interval. I just did a lowess curve through the invisible points that represent the distance correlation at time t for lag length j. So that is not super-formal, reuses the data multiple times, depends on the bandwidth, etc. But I wasn't too worried about it initially. If you don't thin enough, then the within-chain distances are too small and the F-ratio is too big and you are more likely to make a Type I error. In which case, you would run it longer, thin it more, and likely land on your feet eventually.

Ben
 

Ben Goodrich

Mar 2, 2014, 6:43:40 PM
to stan...@googlegroups.com
On Monday, February 24, 2014 5:54:33 PM UTC-5, Ben Goodrich wrote:
Alternatively, one could thin the chains so that approximate independence of retained draws is a reasonable assumption and obtain the p-value for a test of the null hypothesis that all the chains have the same distribution. To do this, we can use the distance correlation concept of Székely, Rizzo, and Bakirov (2007)

https://projecteuclid.org/download/pdfview_1/euclid.aos/1201012979

which has the property that two possibly multivariate random variables A and B are independent iff the distance correlation between them is zero. We can estimate the distance correlation at time t by rowwise stacking the m draws at time t into the matrix A and rowwise stacking the m draws at time t - j into the matrix B for some j > 0 lags (the columns of both A and B are the posterior parameters). There is a bias-corrected version of the distance correlation estimator in the bcdcor() function in the energy package for R. We can then plot nonparametric curves of these distance correlations over time by j to see what lag length renders the draws essentially independent. This particular graph is fairly easy to interpret, such as in the case of the leuk model from the BUGS examples:

I made the plot below to compare the posterior dependence of MH, NUTS, and Gibbs. Again, the estimated dependence at a particular iteration is consistent as the number of chains increases.

The example is low birthweight as a function of age, race, and smoking from MCMCpack. MCMClogit is a MH sampler with scaled Cauchy priors on the coefficients that jumps from a multivariate normal with covariance matrix proportional to the estimated variance-covariance matrix of the MLE. It has about a 25% acceptance rate. The same model in Stan with an 80% target acceptance rate has much less dependence. Stan is almost as good as the MCMCprobit implementation, which has a noninformative multivariate normal prior on the coefficients and updates them in one batch from their full-conditional distribution given the data-augmented latent variable.

Whether you go on to say that the Stan draws are essentially independent after three or four lags and then test the null hypothesis of convergence to the same distribution is another question (although you would fail to reject in this case). I still don't see why whether or not the covariances among thinned MCMC draws converge to anything matters if the test only requires the existence of first moments.

Ben
