Issue One: What is Convergence?
There is no hard cutoff between “transience” and “equilibrium”. What happens is that as N → infinity the distribution of possible states in the Markov chain approaches the target distribution, and _in that limit_ the expected value of the Monte Carlo estimator of _any integrable function_ converges to the true expectation. There is no notion of warmup here because in the N → infinity limit the effects of the initial state are completely washed out.
The problem is what happens for finite N. If we can prove a strong _geometric ergodicity_ property (which depends on the sampler and the target distribution) then one can show that there exists a finite time after which the chain forgets its initial state with high probability. This is both the autocorrelation time and the warmup time — but even if you can show it exists and is finite (which is nigh impossible in practice) you can’t compute an actual value analytically.
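To make “forgetting the initial state at a geometric rate” concrete, here is a minimal Python sketch; the AR(1) chain standing in for a geometrically ergodic sampler is my own toy example, not from this thread. For x_{t+1} = rho * x_t + noise the conditional mean satisfies E[x_t | x_0] = rho^t * x_0, so the influence of the start decays at a known geometric rate and a “forgetting time” can be read off directly.

```python
# A sketch under an assumed model: a stationary AR(1) chain
#   x_{t+1} = rho * x_t + e_t
# has conditional mean E[x_t | x_0] = rho**t * x_0, so the memory of
# the initial state decays geometrically at a known, computable rate.
rho = 0.9
x0 = 100.0  # deliberately extreme starting point
means = [x0 * rho**t for t in range(200)]

# first iteration at which the remembered initial state drops below 0.01
t_forget = next(t for t, m in enumerate(means) if abs(m) < 0.01)
print(t_forget)
```

The point of the sketch is that the forgetting time is only computable because the geometric rate rho is known here; for a real sampler and target, proving that such a rate even exists is the hard part.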
--
You received this message because you are subscribed to the Google Groups "stan development mailing list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stan-dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
No. No, no, no. No, no, no, no, no, no.
The problem with Rhat is that it is _necessary_ but not _sufficient_. The ridiculously dangerous thing about simple MCMC algorithms is that they can fail with absolutely no diagnostics. For example, consider a Cauchy seeded in the bulk — with high probability it never reaches the tails and produces under-dispersed samples that look fine under any check. This is essentially why Geyer was so fervently against Rhat back in the day — if you can’t guarantee that the sampler will work okay then diagnostics like Rhat will never be enough to ensure that everything is working. There may be hope for HMC but I can’t promise anything yet — hopefully we’ll have something within the next few months if the mathematics conforms to my intuition…
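As a hedged illustration of “necessary but not sufficient”, this Python sketch (the `rhat` helper is mine, following the classic Gelman-Rubin potential scale reduction formula) computes Rhat for chains that all sample the same wrong distribution, for instance only the bulk of a heavy-tailed target: the diagnostic comes out indistinguishable from 1.

```python
import numpy as np

# A sketch of the failure mode: if every chain samples the same WRONG
# distribution (e.g. only the bulk of a heavy-tailed target, never its
# tails), Rhat still looks perfect. The rhat function below follows the
# classic Gelman-Rubin formula; the setup is an illustration, not the
# thread's actual sampler.
def rhat(chains):                              # chains: (m, n) array of draws
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    var_plus = (n - 1) / n * W + B / n
    return float(np.sqrt(var_plus / W))

rng = np.random.default_rng(0)
# four chains that each explore only the bulk: standard normal draws
# standing in for a sampler that never reaches the Cauchy tails
bulk_only = rng.normal(size=(4, 1000))
print(rhat(bulk_only))  # indistinguishable from 1 despite the missing tails
```

Because every chain is under-dispersed in exactly the same way, the between-chain variance matches the within-chain variance and nothing flags the missing tails.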
The problem is statements like "the dependence between draws is essentially zero after 5 to 7 lags, in which case a strong form of ergodicity presumably holds, in which case convergence shouldn't be a problem”. Ultimately the issue is that when geometric ergodicity does not hold everything can look fine and there are no numerical diagnostics to indicate that something is wrong.
Just because Gibbs, Random Walk Metro, and HMC all converge to the same distribution doesn’t mean that it’s the correct equilibrium distribution — indeed when algorithms fail they tend to fail in the same way.
On Saturday, August 2, 2014 3:15:21 PM UTC-4, Michael Betancourt wrote:
> No. No, no, no. No, no, no, no, no, no.
No, what? I've been posting examples for the last few months where the Rhats are close to 1.0 and the chain has not converged, so I'm not sure which part of what I said you are disagreeing with. In the most recent example with the Gibbs sampler, I noted that despite the low Rhats the draws are still very dependent out to 640 lags, so presumably no form of strong ergodicity holds and it is hard to justify a conclusion that the chains have converged (implicitly: even though that is what readers of BDA have been taught to do for three editions).
The only reason that I think the Gibbs sampler is okay in this situation is that the results are similar to what I get with Stan, where the dependence between draws is essentially zero after 5 to 7 lags, in which case a strong form of ergodicity presumably holds, in which case convergence shouldn't be a problem and indeed we fail to reject the null that all the chains have the same distribution (and FWIW the Rhats are all 1.0).
> The problem with Rhat is that it is _necessary_ but not _sufficient_. The ridiculously dangerous thing about simple MCMC algorithms is that they can fail with absolutely no diagnostics. For example, consider a Cauchy seeded in the bulk — with high probability it never reaches the tails and produces under-dispersed samples that look fine under any check. This is essentially why Geyer was so fervently against Rhat back in the day — if you can’t guarantee that the sampler will work okay then diagnostics like Rhat will never be enough to ensure that everything is working. There may be hope for HMC but I can’t promise anything yet — hopefully we’ll have something within the next few months if the mathematics conforms to my intuition…
In what way does the dependence plot I posted not conform to exactly that intuition?
Ben
For even the weak form of ergodicity, we know that if you let that Markov chain run for an infinite number of iterations and let the lag between any two draws go to infinity, then those two draws will approach independence. In practice with n draws and m chains, we would like to see negligible dependence for all post-warmup draws that are separated by s lags. And there are plenty of examples where we see that for some manageable s and plenty of examples where we do not.
So, the question seems to be: can the dependence seem negligible when, in reality, it is non-negligible? That could happen if A) the unbiased distance correlation estimator systematically underestimated the dependence at s lags, or if B) the distance correlation were zero at s lags for m chains and n iterations but would become non-zero if n or m were larger. B) can't be ruled out from the iterations that you have, but those concerns hold for any empirical diagnostic.
On A), the distance correlation requires finite first moments, so I don't know what would happen to the estimator if that condition didn't hold, but otherwise I have a hard time seeing what could go badly wrong since the estimator utilizes the m * (m - 1) / 2 distances between chains and two fixed time points (t and t - s), rather than the within-chain information that makes estimation complicated. To be on the safe side, I guess you could use the biased distance correlation estimator that overestimates when there is independence, but then you would need a much larger m.
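For concreteness, here is a small Python sketch of the biased (V-statistic) sample distance correlation; this is my own illustration, not the `energy` package estimator being discussed, and the unbiased version mentioned above uses a different centering.

```python
import numpy as np

# A sketch of the biased V-statistic distance correlation (Szekely-Rizzo
# style double-centering). Applied to MCMC it would take the states at
# time t - s and time t across the m chains as the two samples; here we
# just demonstrate the estimator itself on toy vectors.
def dcor(x, y):
    x = np.asarray(x, float).reshape(len(x), -1)
    y = np.asarray(y, float).reshape(len(y), -1)

    def centered(z):
        # pairwise Euclidean distances, then double-centering
        d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

    A, B = centered(x), centered(y)
    dcov2 = (A * B).mean()                       # squared distance covariance
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    return float(np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y)))

rng = np.random.default_rng(1)
x = rng.normal(size=100)
print(dcor(x, x))                                  # a series against itself: 1
print(dcor(x, x) > dcor(x, rng.normal(size=100)))  # independent draws score lower
```

Note the positive bias this version has under independence for modest n, which is why the unbiased estimator (or a larger m) matters for the hypothesis test being described.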
Your intuition here holds only for geometrically (and uniformly) ergodic chains, which is the point I keep trying to make.
Supposing that your eyes trick you, then you can test the null hypothesis that the thinned, split chains all have the same distribution. If you don't thin enough, then you tend to reject the null for the same reason. Otherwise, the test usually seems to result in the same conclusion that you would arrive at visually. So, you would need a failure to converge, a misleading plot, and a type II error to get into trouble.

For example, a hyperparameter might get stuck and the chain ends up sampling from the distribution conditioned on the initial value of that hyperparameter. When the hyperparameter does move it may tend to jump to other extreme values and the chain continues sampling from the new conditional distribution. Between any jumps everything looks fine and any lag-based diagnostic is fine.
The big problem with MCMC is that if your target distribution doesn’t satisfy the geometric ergodicity conditions for your Markov transition then you are likely to get meaningless answers. And nobody ever checks those conditions (which is why we have to be careful with the BUGS models). HMC might actually be different (it looks like numerical divergences actually indicate geometric ergodicity failing) but it’ll be a while before we can nail down all the theoretical details.
a) Polynomial ergodicity is not good enough — geometric or uniform is required, and you don’t get uniform ergodicity on non-compact spaces.
b) EVERYTHING is an integral and you have to treat it as such. The appropriate question is not whether the function whose expectation you’re taking diverges, or whether the measure you’re taking an expectation with respect to has finite moments; the question is whether or not the integral \int f(x) dp(x) is finite (mathily, whether f \in L^{1} \left( p \right) or not). This has to be verified on a case-by-case basis.
To go further we have to be careful with the notation. A Markov chain is generated by a transition operator, T, that maps measures to measures. We start with a delta function at the initial state, \delta_{q}, and after time n get the measure \delta_{q} T^{n} (T acts on the right because statisticians hate good notation). The distribution of the Markov chain is then a semi-direct product measure,
\delta_{q} T^{0} \ltimes \delta_{q} T^{1} \ltimes \ldots \ltimes \delta_{q} T^{n}.
The actual Markov chain is a sample from this distribution, and the semi-direct product just means that if we sampled the state q_{i} then q_{i+1} ~ \delta_{q_{i}} T.
c) One of the really important properties of geometric ergodicity is that we can constrain _how_ the \delta_{q} T^{i} converge to the target distribution as i \rightarrow \infty. This allows us to ensure that correlations between different \delta_{q} T^{i}, and even the variance of different samples from \delta_{q} T^{i}, are well-defined. In this case you can calculate all the lags you want and argue for stationarity. But theoretically the time to stationarity should be approximately the same as the autocorrelation time, so if you rely on these diagnostics entirely you’re throwing away good information (comparing the time for Rhat or any distance diagnostic to converge against the autocorrelation time would be a good project).
d) Without geometric ergodicity we are boned, because we don’t even know how the intermediate distributions, \delta_{q} T^{i}, relate to the asymptotic target measure. This means that within-chain and between-chain expectations may be ill-defined and not close to the asymptotic values for any finite i.
Also, the log scale on your plots is hugely misleading. The variations in the retained parts are on the same time scale as all of the warmup, indicating that warmup was just nowhere near long enough.
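Point c) suggests comparing how long diagnostics take to settle against the autocorrelation time, so here is a rough Python sketch of an integrated autocorrelation time estimator with a crude truncation rule; Geyer's initial-sequence estimators are the careful version, and the AR(1) test case is my own, chosen because its exact IAT is (1 + rho)/(1 - rho).

```python
import numpy as np

# A rough sketch of the integrated autocorrelation time
#   tau = 1 + 2 * sum_s rho(s),
# truncated at the first nonpositive sample autocorrelation (a crude
# rule; Geyer's initial-sequence estimators are the careful version).
def iat(x, max_lag=200):
    x = np.asarray(x, float)
    x = x - x.mean()
    n = len(x)
    c0 = np.dot(x, x) / n                   # lag-0 autocovariance
    tau = 1.0
    for s in range(1, max_lag):
        rho_s = np.dot(x[: n - s], x[s:]) / n / c0
        if rho_s <= 0.0:                    # stop at the noise floor
            break
        tau += 2.0 * rho_s
    return tau

# sanity check on an AR(1) chain, whose exact IAT is (1 + rho) / (1 - rho)
rho, n = 0.5, 20_000
rng = np.random.default_rng(2)
eps = rng.normal(size=n)
x = np.empty(n)
x[0] = 0.0
for t in range(1, n):
    x[t] = rho * x[t - 1] + eps[t]
print(iat(x))  # should land near the exact value 3.0
```

The caveat from c) applies: this number is only meaningful when geometric ergodicity holds, since otherwise the autocorrelations themselves need not be well-defined.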
For example, in the case of “jumping” hyperparameters the diagnostics will catch stuck periods that are short compared to the run, but remember that the centered parameterization of the eight schools model got stuck only after hundreds of thousands of iterations, which is far more than anyone would run.
> convergence(eight_schools, thin = 25, R = 199)
[1] "Multivariate loss = 1.008"
disco(x = mat, factors = chain_id, distance = FALSE, R = R)
Distance Components: index 1.00
Source Df Sum Dist Mean Dist F-ratio p-value
factors 31 596.90096 19.25487 1.315 0.005
Within 608 8905.04746 14.64646
Total 639 9501.94843
Firstly, the “convergence” time will depend on each marginal because it depends on the function of interest. Because Rhat and its ilk cannot disentangle lack of stationarity due to transient effects from lack of geometric ergodicity, the diagnostics will always have residual marginal dependence. Even a joint test will depend on which functions are used.
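A small Python sketch (my own construction, not from the thread) of that marginal dependence: the same chains can give Rhat near 1 for one function of interest and a clearly elevated Rhat for another, here x versus x^2 when two of four chains have the wrong scale.

```python
import numpy as np

# The same set of chains, two different functions of interest, two very
# different Rhat values. The rhat helper follows the classic
# Gelman-Rubin formula; the mismatched-scale chains are an illustration.
def rhat(chains):                              # chains: (m, n) array of draws
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    return float(np.sqrt(((n - 1) / n * W + B / n) / W))

rng = np.random.default_rng(3)
# four chains with identical means but mismatched scales: two explore
# the target, two are stuck in a conditional with twice the spread
sds = np.array([1.0, 1.0, 2.0, 2.0])
chains = rng.normal(size=(4, 1000)) * sds[:, None]

print(rhat(chains))        # Rhat on x itself: looks fine
print(rhat(chains ** 2))   # Rhat on x^2: flags the scale mismatch
```

This is the same phenomenon visible in the output below, where the theta margins all report Rhat = 1.00 while the squared summary SS does not.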
> The distance correlation is a scalar function of all the parameters in the posterior distribution. And the alternative hypothesis is that at least one of the chains has a different distribution than the rest, which is pretty general compared to a more limited thing like the means of one of the chains are too different relative to the other chains.
But we can appeal to stationarity for the distribution of the expectations — we can’t for the limiting distribution itself, which means strong Gaussianity assumptions will lead to poor power estimates.
> I'm still not understanding --- for the purpose of estimating the distance correlation between two states, t - s and t --- why you are doing the asymptotics with respect to the number of iterations instead of with respect to the number of chains.
There is absolutely no guarantee of anything by adding more Markov chains. None.
All more Markov chains give you is more realizations from the joint distribution over the entire chain, but you have to be very careful here because that joint distribution is not a product of distributions at each iteration but rather a semi-direct product. For example, if you run m Markov chains to iteration n you’re not drawing m independent samples from the common distribution,
\delta_{q} T^{n},
you’re drawing samples from the individual conditional distributions,
\delta_{q_{m(n-1)}} T.
When you have geometric ergodicity these distributions converge in a well-defined way to the target distribution and you can start considering auto- and cross-correlations as well-defined objects. But without that there’s no strong information on how the \delta_{q_{m(n-1)}} T are related, so calculations like the cross-correlation between chains are not well-defined.
Yes running more chains tends to catch more pathologies empirically (including for Rhat), and until we understand ergodicity better there’s no reason to not run more chains if the resources are there, but there is no theoretical guarantee of anything and anything that might suggest otherwise to the users should be unwelcome. I don’t care if it scares off users who require a single number to feel good about their analysis — great power, great responsibility, blah, blah, blah.
As you probably know, Kenny Shirley and I never wrote that split R-hat paper, and I’d still like to write this up as a paper. Currently it’s in the Stan manual and BDA3 and that’s it. So if you have other ideas I’d be happy to work with you and we could write all this up. I think all of this is an important advance beyond regular R-hat (not to mention various single-chain crap that people do).
mc <-
'
data {
  int<lower=1> K;
  real<lower=0> nu;
}
transformed data {
  matrix[K,K] I;
  vector[K] zero;
  zero <- rep_vector(0.0, K);
  I <- rep_matrix(0.0, K, K);
  for (k in 1:K) I[k,k] <- 1.0;
}
parameters {
  vector[K] theta;
}
model {
  theta ~ multi_student_t(nu, zero, I);
}
generated quantities {
  real SS;
  SS <- dot_self(theta);
}
'
bad_multi_t <- stan(model_code = mc, chains = 16, seed = 12345, data = list(K = 30, nu = 5))
bad_multi_t
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
theta[1] 0.00 0.01 1.33 -2.59 -0.71 0.00 0.73 2.56 16000 1.00
theta[2] -0.01 0.01 1.40 -2.80 -0.75 0.00 0.73 2.69 16000 1.00
theta[3] -0.01 0.01 1.28 -2.56 -0.71 -0.02 0.70 2.54 16000 1.00
theta[4] 0.00 0.01 1.36 -2.63 -0.72 0.00 0.70 2.66 16000 1.00
theta[5] 0.01 0.01 1.38 -2.60 -0.74 0.00 0.75 2.68 16000 1.00
theta[6] -0.02 0.01 1.33 -2.66 -0.74 0.00 0.70 2.64 16000 1.00
theta[7] 0.01 0.01 1.36 -2.60 -0.72 0.00 0.73 2.66 16000 1.00
theta[8] 0.00 0.01 1.34 -2.61 -0.73 0.00 0.73 2.57 16000 1.00
theta[9] 0.00 0.01 1.34 -2.63 -0.73 0.00 0.75 2.66 16000 1.00
theta[10] 0.00 0.01 1.38 -2.58 -0.72 -0.01 0.71 2.64 16000 1.00
theta[11] 0.00 0.01 1.36 -2.65 -0.72 -0.02 0.72 2.71 16000 1.00
theta[12] 0.00 0.01 1.34 -2.67 -0.73 0.01 0.73 2.69 16000 1.00
theta[13] -0.04 0.01 1.38 -2.71 -0.77 -0.03 0.70 2.58 16000 1.00
theta[14] 0.01 0.01 1.37 -2.69 -0.76 0.00 0.75 2.69 16000 1.00
theta[15] 0.00 0.01 1.38 -2.62 -0.73 0.00 0.74 2.61 16000 1.00
theta[16] 0.00 0.01 1.34 -2.62 -0.72 0.00 0.74 2.63 16000 1.00
theta[17] 0.00 0.01 1.37 -2.62 -0.72 -0.01 0.73 2.60 16000 1.00
theta[18] -0.02 0.01 1.35 -2.68 -0.72 -0.01 0.72 2.55 16000 1.00
theta[19] 0.01 0.01 1.33 -2.57 -0.71 0.00 0.74 2.62 16000 1.00
theta[20] -0.02 0.01 1.38 -2.61 -0.74 -0.01 0.72 2.60 16000 1.00
theta[21] 0.00 0.01 1.39 -2.59 -0.74 0.00 0.73 2.69 16000 1.00
theta[22] -0.01 0.01 1.37 -2.68 -0.74 -0.01 0.72 2.68 16000 1.00
theta[23] -0.01 0.01 1.35 -2.58 -0.73 0.00 0.71 2.56 16000 1.00
theta[24] 0.00 0.01 1.37 -2.66 -0.75 0.00 0.73 2.63 16000 1.00
theta[25] 0.00 0.01 1.34 -2.68 -0.71 0.00 0.72 2.63 16000 1.00
theta[26] 0.00 0.01 1.38 -2.60 -0.73 0.00 0.73 2.60 16000 1.00
theta[27] 0.01 0.01 1.37 -2.61 -0.74 0.01 0.75 2.61 16000 1.00
theta[28] 0.00 0.01 1.37 -2.59 -0.72 -0.01 0.72 2.64 16000 1.00
theta[29] 0.00 0.01 1.38 -2.64 -0.76 -0.01 0.74 2.71 16000 1.00
theta[30] 0.00 0.01 1.38 -2.68 -0.74 -0.01 0.71 2.66 16000 1.00
SS 55.54 3.89 125.55 9.64 21.45 33.77 57.19 210.45 1041 1.01
lp__ -37.59 0.49 12.19 -65.86 -44.11 -35.84 -29.15 -18.80 622 1.03
convergence_test(bad_multi_t, thin = 20, R = 199, split = TRUE)
[1] "Multivariate loss = 1.018"
disco(x = posterior, factors = data.frame(chain_id), R = R)
Distance Components: index 1.00
Source Df Sum Dist Mean Dist F-ratio p-value
chain_id 31 1262.50287 40.72590 1.353 0.005
Within 768 23118.71146 30.10249
Total 799 24381.21432
mc <-
'
data {
  int<lower=1> K;
  real<lower=0> nu;
}
parameters {
  vector[K] z;
  real<lower=0> u;
}
transformed parameters {
  vector[K] theta;
  theta <- sqrt(nu / u) * z;
}
model {
  z ~ normal(0.0, 1.0);
  u ~ chi_square(nu);
}
generated quantities {
  real SS;
  SS <- dot_self(theta);
}
'
convergence_test(good_multi_t, thin = 4, R = 199, split = TRUE)
[1] "Multivariate loss = 1.002"
disco(x = posterior, factors = data.frame(chain_id), R = R)
Distance Components: index 1.00
Source Df Sum Dist Mean Dist F-ratio p-value
chain_id 31 801.10788 25.84219 1.023 0.39
Within 3968 100260.43483 25.26725
Total 3999 101061.54271
<bad_multi_t.pdf>
Ben

With a t_5, perhaps the mean is not such a good summary. Another option would be to first rank and then z-score the data before doing split R-hat, as I discussed in a recent email exchange with Bob.

A
monitor(sims = apply(bad_multi_t, 2:3, FUN = function(x) scale(rank(x))),
warmup = 0, digits_summary = 2)
Inference for the input samples (16 chains: each with iter=1000; warmup=0):
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
theta[1] 0 0.01 1 -1.64 -0.87 0 0.87 1.64 16000 1.00
theta[2] 0 0.01 1 -1.64 -0.87 0 0.86 1.64 16000 1.00
theta[3] 0 0.01 1 -1.64 -0.86 0 0.86 1.64 16000 1.00
theta[4] 0 0.01 1 -1.65 -0.87 0 0.87 1.64 16000 1.00
theta[5] 0 0.01 1 -1.64 -0.86 0 0.87 1.64 16000 1.00
theta[6] 0 0.01 1 -1.64 -0.87 0 0.87 1.64 16000 1.00
theta[7] 0 0.01 1 -1.64 -0.86 0 0.87 1.64 16000 1.00
theta[8] 0 0.01 1 -1.64 -0.86 0 0.86 1.64 16000 1.00
theta[9] 0 0.01 1 -1.64 -0.87 0 0.86 1.64 16000 1.00
theta[10] 0 0.01 1 -1.64 -0.86 0 0.87 1.64 16000 1.00
theta[11] 0 0.01 1 -1.64 -0.87 0 0.87 1.64 16000 1.00
theta[12] 0 0.01 1 -1.64 -0.87 0 0.86 1.64 16000 1.00
theta[13] 0 0.01 1 -1.64 -0.87 0 0.86 1.64 16000 1.00
theta[14] 0 0.01 1 -1.64 -0.86 0 0.87 1.64 16000 1.00
theta[15] 0 0.01 1 -1.64 -0.87 0 0.86 1.64 16000 1.00
theta[16] 0 0.01 1 -1.64 -0.86 0 0.86 1.64 16000 1.00
theta[17] 0 0.01 1 -1.64 -0.87 0 0.86 1.64 16000 1.00
theta[18] 0 0.01 1 -1.64 -0.87 0 0.86 1.64 16000 1.00
theta[19] 0 0.01 1 -1.64 -0.86 0 0.87 1.64 16000 1.00
theta[20] 0 0.01 1 -1.64 -0.87 0 0.86 1.64 16000 1.00
theta[21] 0 0.01 1 -1.64 -0.87 0 0.86 1.64 16000 1.00
theta[22] 0 0.01 1 -1.64 -0.87 0 0.86 1.64 16000 1.00
theta[23] 0 0.01 1 -1.64 -0.87 0 0.86 1.64 16000 1.00
theta[24] 0 0.01 1 -1.64 -0.86 0 0.86 1.64 16000 1.00
theta[25] 0 0.01 1 -1.64 -0.87 0 0.86 1.64 16000 1.00
theta[26] 0 0.01 1 -1.64 -0.86 0 0.87 1.64 16000 1.00
theta[27] 0 0.01 1 -1.64 -0.87 0 0.86 1.64 16000 1.00
theta[28] 0 0.01 1 -1.64 -0.86 0 0.87 1.64 16000 1.00
theta[29] 0 0.01 1 -1.64 -0.87 0 0.87 1.64 16000 1.00
theta[30] 0 0.01 1 -1.64 -0.86 0 0.86 1.64 16000 1.00
SS 0 0.04 1 -1.64 -0.86 0 0.86 1.64 755 1.01
lp__ 0 0.04 1 -1.64 -0.86 0 0.86 1.64 755 1.01
I’m not quite sure but I don’t think that this scale(rank(x)) is quite working the way it’s supposed to. For each quantity of interest, you want to rank all the simulations in all the chains. I think that here you’re doing a separate ordering for each chain, although I can’t be sure.
rank_rhats <- function(stan_object) {
  sims <- apply(stan_object, 3, FUN = function(x) {
    qnorm(ecdf(x)(x) - .Machine$double.eps) # vector of length iter x chains
  })
  dim(sims) <- dim(stan_object)
  dimnames(sims) <- dimnames(stan_object)
  monitor(sims, warmup = 0, digits_summary = 2)
}
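For comparison, a minimal Python sketch (function names are mine) of the pooled version of the transform: rank all draws from all chains together, then push the pooled ranks through the normal quantile function. Ranking within each chain separately would erase exactly the between-chain differences the diagnostic is meant to detect.

```python
import numpy as np
from statistics import NormalDist

# A sketch of the pooled rank-normal transform: rank ALL draws from ALL
# chains together, then map the ranks through the normal quantile
# function. This is an illustration, not the rank_rhats R code above.
def rank_normalize(chains):
    flat = chains.ravel()
    ranks = flat.argsort().argsort() + 1   # pooled ranks 1..N (distinct values)
    p = ranks / (flat.size + 1)            # keep p strictly inside (0, 1)
    z = np.array([NormalDist().inv_cdf(pi) for pi in p])
    return z.reshape(chains.shape)

# two "chains" stuck at different locations: pooled ranks keep the offset
chains = np.vstack([np.linspace(-1.0, 0.0, 50), np.linspace(5.0, 6.0, 50)])
z = rank_normalize(chains)
print(z[0].mean() < 0.0 < z[1].mean())  # True: the offset survives
```

Per-chain ranking would map both chains onto the same grid of values and the split R-hat on the transformed draws would see nothing, which is the concern raised above.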
> rank_rhats(bad_multi_t)
Inference for the input samples (16 chains: each with iter=1000; warmup=0):
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
theta[1] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[2] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[3] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[4] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[5] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[6] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[7] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[8] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[9] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[10] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[11] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[12] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[13] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[14] 0 0.01 1.00 -1.96 -0.67 0 0.68 1.96 16000 1.00
theta[15] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[16] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[17] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[18] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[19] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[20] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[21] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[22] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[23] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[24] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[25] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[26] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[27] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[28] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[29] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
theta[30] 0 0.01 1.00 -1.96 -0.67 0 0.67 1.96 16000 1.00
SS 0 0.04 1.00 -1.96 -0.67 0 0.67 1.96 599 1.03
lp__ 0 0.04 1.01 -1.96 -0.67 0 0.67 1.96 595 1.03
Yes, that makes sense. So my point was a digression (although perhaps useful to get it out there).
Then you have the Rhats calculation, and you can do it on split chains if you want. If so, then the more subjective pros and cons of those 4 changes for answering the question of whether all the chains have converged in distribution to the posterior (as opposed to the question of whether the means have converged in probability to the posterior expectations) are:
Additionally, in practice, my approach seems to require more than 4 chains, which tends to expose more problems but takes more wall time to execute. Also, for values of the sensitivity parameter less than 1, the test is valid if the posterior distribution lacks finite variances, while the Rhat is not.
Is there anything more that should be considered? If not, I would conclude that users should test the null that all the thinned, split chains have the same distribution against the alternative that at least one has a different distribution, where the thinning interval is chosen from the dependence plot as the smallest one where the estimated dependence is near zero over the entire post-warmup period. Then, if they fail to reject that null --- and do not see any problems via other plots and stuff --- then they should look only at the Rhat(s) for the function(s) of interest to decide if they are estimating the expectation(s) precisely enough. Here, it is useful to know that means from the unthinned sample are converging in probability to the posterior expectations at least as fast as they would in a hypothetical simple random sample from the posterior whose size is equal to that of the thinned sample. But that can be a vacuous statement if the thinned sample is small, so an Rhat for the margin(s) of interest would be helpful information.
Michael, I think, is mostly arguing from the other direction, that in the absence of geometric ergodicity my approach could suffer from the same problems as Rhat. Based on my experience, I don't think that is true. Specifically, the fact that there are several examples where the null is rejected when the Rhats are small and no examples that I know of (with 8+ chains) where the null is not rejected when the Rhats are not small, suggests that it does not suffer as badly as Rhat in slow convergence situations. They both have difficulty with multimodal posteriors but that is more of an issue with the samplers than the diagnostics.
Ben