I started with linear regression. I had code that I’d copied from an earlier Stan manual in which the model was not vectorized. With 1000 data points it ran really slow. So I think it’s a really good thing that in the current Stan manual, we start with the vectorized version and then only give the non-vectorized later. Otherwise people will use the slow code and think that Stan is slow.
Then I explained the output. I have mixed feelings about showing the Monte Carlo standard error. At one level, it’s a good thing to know. But for a practitioner it’s a bit of a distraction. In practice, all the user really needs to know is that this MCSE is less than 1/10 (say) of the posterior se.
So I’m wondering if we actually need a new summary, not quite R.hat but this other thing.
Or maybe it’s fine to just use R.hat because, in practice, if R.hat is less than 1.1, the MCSE is probably so low that we don’t need to worry about it.
Right now I’m leaning toward reporting the median as the estimate, and reporting the half-width of the central 68% interval as the standard error. But I’m open to persuasion on this.
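A minimal sketch of the summary being proposed here (a hypothetical helper, not an existing rstan function), assuming a vector of posterior draws and an MCSE computed elsewhere, e.g. the se_mean column of rstan::monitor() output:

# Hypothetical helper (not in rstan): median as the point estimate,
# half-width of the central 68% interval as the "standard error",
# plus the "MCSE < 1/10 of the posterior sd" check described above.
summarize_draws <- function(x, mcse) {
  q <- quantile(x, probs = c(0.16, 0.50, 0.84))
  c(estimate = unname(q[2]),
    se = unname(q[3] - q[1]) / 2,              # half-width of central 68% interval
    mcse_ok = as.numeric(mcse < 0.1 * sd(x)))  # 1 if the MCSE rule is satisfied
}

With rstan::monitor() output (columns named as in its printed summary), the same check over all quantities is roughly all(mon[, "se_mean"] < 0.1 * mon[, "sd"]).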
On Feb 9, 2015, at 1:23 PM, Ben Goodrich <goodri...@gmail.com> wrote:
On Monday, February 9, 2015 at 12:14:34 PM UTC-5, Andrew Gelman wrote:
Then I explained the output. I have mixed feelings about showing the Monte Carlo standard error. At one level, it’s a good thing to know. But for a practitioner it’s a bit of a distraction. In practice, all the user really needs to know is that this MCSE is less than 1/10 (say) of the posterior se.
As Michael would say, conditional on the MCMC CLT holding, then the MCSE from NUTS will usually be small enough to ignore or not see. And if the MCMC CLT does not hold, then the MCSE calculation is not meaningful. But I don’t think it is correct to label an issue as important as this a "distraction".
So I’m wondering if we actually need a new summary, not quite R.hat but this other thing.
I think you need the plot that shows estimated within-chain dependence is negligible after a handful of lags and stays negligible throughout the post-warmup period. And a failure to reject the null hypothesis that all the chains have the same distribution when thinned to the critical number of lags.
Or maybe it’s fine to just use R.hat because, in practice, if R.hat is less than 1.1, the MCSE is probably so low that we don’t need to worry about it.
The problem with relying on R hat has never been that it produces high values when it should not; the problem is that it can produce low values when the chains have not converged or when the CLT does not hold. That, and it is marginal and only sensitive to means and variances.
Right now I’m leaning toward reporting the median as the estimate, and reporting the half-width of the central 68% interval as the standard error. But I’m open to persuasion on this.
On one hand, it is a question of whether you prefer squared error loss or absolute value loss.
What happened was, I ran it for 10 iterations.
So I’m wondering if we actually need a new summary, not quite R.hat but this other thing.
I think you need the plot that shows estimated within-chain dependence is negligible after a handful of lags and stays negligible throughout the post-warmup period. And a failure to reject the null hypothesis that all the chains have the same distribution when thinned to the critical number of lags.
A plot is good for hard problems, to diagnose what’s wrong. But I think we also need a simple number so that basic users can know to simulate longer or to ask an expert for help if there’s a problem.
Or maybe it’s fine to just use R.hat because, in practice, if R.hat is less than 1.1, the MCSE is probably so low that we don’t need to worry about it.
The problem with relying on R hat has never been that it produces high values when it should not; the problem is that it can produce low values when the chains have not converged or when the CLT does not hold. That, and it is marginal and only sensitive to means and variances.
Then let’s beef it up with some other checks. Rather than starting with the statistic and seeing what it does, let’s start with the goal—to give users a number they can use—and figure out how to get there. Split R-hat was a step forward, but let’s do more.
Right now I’m leaning toward reporting the median as the estimate, and reporting the half-width of the central 68% interval as the standard error. But I’m open to persuasion on this.
On one hand, it is a question of whether you prefer squared error loss or absolute value loss.
I’m not really thinking in terms of a loss function, it’s more of a question about having a general summary. In the applied world, people go with estimates and standard errors.
Hi, there are a few things going on here. First, we are currently preparing stan_lm, stan_glm, stan_lmer, etc., and these functions will be black boxes that can be treated as such, e.g.

Intercept ...
Age ...
...
sigma ...

rather than

beta[1] ...
beta[2] ...
...
sigma ...
On Feb 9, 2015, at 5:24 PM, Ben Goodrich <goodri...@gmail.com> wrote:
On Monday, February 9, 2015 at 4:33:02 PM UTC-5, Andrew Gelman wrote:
A plot is good for hard problems, to diagnose what’s wrong. But I think we also need a simple number so that basic users can know to simulate longer or to ask an expert for help if there’s a problem.
Perhaps, a p-value for the null hypothesis that all of the chains have the same distribution?
Or maybe it’s fine to just use R.hat because, in practice, if R.hat is less than 1.1, the MCSE is probably so low that we don’t need to worry about it.
The problem with relying on R hat has never been that it produces high values when it should not; the problem is that it can produce low values when the chains have not converged or when the CLT does not hold. That, and it is marginal and only sensitive to means and variances.
Then let’s beef it up with some other checks. Rather than starting with the statistic and seeing what it does, let’s start with the goal—to give users a number they can use—and figure out how to get there. Split R-hat was a step forward, but let’s do more.
Where's the beef? I know you favor doing R-hat on the z-scores, but I think that is a limited step.
I would agree with your goal of giving “a simple number so that basic users can know to simulate longer or to ask an expert for help if there’s a problem", but I don't understand why a p-value for the null hypothesis that the chains have the same distribution is not that number.
If you reject that null hypothesis, simulate longer or ask an expert for help. If you fail to reject that null hypothesis, you should ostensibly look for other evidence, but I don't know what else to systematically look for when the test has power against any alternative where at least one chain comes from a different distribution.
Right now I’m leaning toward reporting the median as the estimate, and reporting the half-width of the central 68% interval as the standard error. But I’m open to persuasion on this.
On one hand, it is a question of whether you prefer squared error loss or absolute value loss.
I’m not really thinking in terms of a loss function, it’s more of a question about having a general summary. In the applied world, people go with estimates and standard errors.
They can estimate the mean and standard deviation or they can estimate the median and use the scaled MAD (or similar) to estimate the standard deviation. But I don't know how you would choose between those options without saying what is to be avoided. If the thing to be avoided is providing estimates of moments that don’t exist, it seems reasonable to go with quantile-based things, but we’re going to have a real hard time convincing ourselves that the chains have converged in non-trivial problems when means and / or variances do not exist.
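For reference, the quantile-based version of that pair is a one-liner; R’s mad() already applies the 1.4826 scaling so that it estimates the standard deviation under normality:

# Robust location/scale summaries of a vector of draws x:
robust_summary <- function(x) c(median = median(x), sd_hat = mad(x))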
On Feb 9, 2015, at 5:24 PM, Ben Goodrich <goodri...@gmail.com> wrote:
On Monday, February 9, 2015 at 4:33:02 PM UTC-5, Andrew Gelman wrote:
A plot is good for hard problems, to diagnose what’s wrong. But I think we also need a simple number so that basic users can know to simulate longer or to ask an expert for help if there’s a problem.
Perhaps, a p-value for the null hypothesis that all of the chains have the same distribution?
No, not a p-value. The distributions are different, the question is, how different are they? R-hat answers that question.
sqrt((N - 1) / N * W + 1 / N * B) / sqrt(W)
We can discuss. But, no, the p-value is definitely not the right answer to the question.
I’m up for doing something better than R-hat. But not a p-value.
B / W
sqrt((N - 1) / N * W + 1 / N * B) / sqrt(W)
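For reference, cleaned-up LaTeX for the two quantities being contrasted in this thread: split R_hat (W and B are the within- and between-chain variance estimates, N the number of draws per split chain) versus the B / W ratio used in the permutation test (where B and W are instead average between- and within-chain distances):

\hat{R} = \sqrt{\frac{\tfrac{N-1}{N}\, W + \tfrac{1}{N}\, B}{W}}
\qquad \text{vs.} \qquad
\frac{B}{W}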
I think the code can be merged if it's in R. My opinion is that none of these is strictly necessary and it's up to the user to decide.
Jiqiang
What about a rule such as “keep going until MCSE/SD < 0.1 for all quantities of interest”? I can believe that such a rule is flawed but if there are examples where it doesn’t work, we can look at these and maybe do better.
stopifnot(require(parallel))
# Independence Metropolis-Hastings sampler targeting a standard normal,
# with a N(0, sigma) proposal (sigma < 1 gives a poorly mixing chain).
normal_is <- function(chain, sigma, sims, burnin = 0) {
  # chain: chain index passed by mclapply (unused)
  holder <- rep(NA_real_, sims - burnin)
  X <- rnorm(1)                                  # initial state
  for (i in 1:sims) {
    Y <- rnorm(1, mean = 0, sd = sigma)          # propose from N(0, sigma)
    r <- dnorm(Y) * dnorm(X, mean = 0, sd = sigma) /
         (dnorm(X) * dnorm(Y, mean = 0, sd = sigma))  # MH acceptance ratio
    if (runif(1) <= r) X <- Y
    if (i > burnin) holder[i - burnin] <- X
  }
  return(holder)
}
RNGkind("L")
set.seed(123)
posterior <- simplify2array(mclapply(1:16, mc.cores = 8L, normal_is, sigma = 0.5, sims = 8000))
dim(posterior) <- c(dim(posterior), 1L)
rstan::monitor(posterior, digits = 2)
plot(density(posterior), main = "", las = 1)
curve(dnorm(x), from = -3, to = 3, col = 2, lty = 2, add = TRUE)
Inference for the input samples (16 chains: each with iter=8000; warmup=4000):
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
V1 -0.04 0.05 0.92 -1.88 -0.69 -0.03 0.62 1.67 405 1.03
> colMeans(posterior)
[,1]
[1,] 0.095047920
[2,] 0.033490434
[3,] 0.028307452
[4,] -0.121544967
[5,] 0.032662167
[6,] 0.007854844
[7,] 0.158175078
[8,] -0.029664244
[9,] 0.001635766
[10,] 0.066116518
[11,] 0.094888171
[12,] -0.032819708
[13,] -0.062017140
[14,] -0.094099226
[15,] 0.113922192
[16,] -0.332537536
I’m confused. Is this an example where MCSE/SD < 0.1 for all quantities of interest but convergence is poor?
If we want to play the “total variation norm” thing, that’s another story, then indeed we’d want to look at things a different way. Perhaps, for example, there’s a way to estimate total variation norm from the different sequences, trying each one out against the mixture of all the others?
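A hedged sketch of one way to try that (an illustrative binned estimate, not an established diagnostic; draws is assumed to be an iterations x chains matrix for a single quantity): compare each chain’s histogram against the pooled histogram of the remaining chains and report half the L1 distance.

# Illustrative only: binned total-variation-style discrepancy between each
# chain and the mixture of all the other chains. Assumes continuous draws
# (distinct quantiles) and enough iterations per bin.
tv_vs_mixture <- function(draws, n_bins = 20) {
  breaks <- quantile(draws, probs = seq(0, 1, length.out = n_bins + 1))
  breaks[1] <- -Inf; breaks[length(breaks)] <- Inf
  sapply(seq_len(ncol(draws)), function(m) {
    p <- table(cut(draws[, m], breaks)) / nrow(draws)
    q <- table(cut(draws[, -m], breaks)) / (nrow(draws) * (ncol(draws) - 1))
    0.5 * sum(abs(p - q))   # TV distance between the two binned distributions
  })
}

For the independence-sampler example earlier in the thread, this would be tv_vs_mixture(posterior[, , 1]).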
Here, though, I’m talking about the more basic problem of getting good inferences for the quantities of interest.
I’m happy to run for more than 30 seconds if that’s what it takes. I just don’t think it makes sense to require Stan users to become experts on convergence diagnostics.
Think of it this way:
- Sometimes we’re fitting easy models such as linear regression, glms, etc. Stan should be able to do these as automatically as can be done using lm, glm, etc. Actually the Stan fit should be more stable because of issues such as separation which we can handle easily with priors.
- Sometimes we fit models like lmer/glmer which appear in R to be easy to fit, but in practice have stability problems that the user doesn’t see. Stan should be able to fit these, again better than lmer etc.
- Sometimes we have tricky new models such as that a*exp(-b*t) + c model from my earlier email. In that case the user has to be prepared for red flags. Red flags are ok. But I’d rather give red flags (with messages such as “Try running longer or simplify your model or talk to an expert”) than give tons of numbers with the expectation that the user will have Betancourt-level understanding.
- Beyond all this, this conversation is reminding me of the huge importance of fake-data checking and posterior predictive checking, which I think will catch a lot of the serious problems in a useful way.
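To make the posterior predictive checking point concrete, a hedged sketch for a linear regression (the names y, X, beta_draws, sigma_draws are hypothetical placeholders for the data and for draws extracted from a fitted model, e.g. via rstan::extract):

# Illustrative posterior predictive check: compare sd(y) to its distribution
# under replicated data sets drawn from the fitted model.
ppc_sd <- function(y, X, beta_draws, sigma_draws) {
  S <- nrow(beta_draws)
  stat_rep <- numeric(S)
  for (s in 1:S) {
    y_rep <- rnorm(length(y), mean = X %*% beta_draws[s, ], sd = sigma_draws[s])
    stat_rep[s] <- sd(y_rep)
  }
  mean(stat_rep >= sd(y))   # posterior predictive p-value for the sd statistic
}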
Firstly, Ben I implore you to start writing these proposed algorithms down in TeX
somewhere so that we can understand your implementations without having to appeal
to software archeology. The same problem has come up with LKJ priors — we really
need technical reports for these things to improve communication both internally and
externally.
That said, let me summarize my perspective on this matter.
— Let’s Review R_hat —
TL;DR Assuming the MC-CLT, R_hat is a natural diagnostic for MCMC. It’s just
not one that we can ever calibrate.
— DISCO —
One of the difficulties with this DISCO test is in constructing the proper significance
for the null hypothesis that the distributions being compared are exactly equal.
The authors of the above paper take a non-parametric approach with a permutation test, but my understanding is that permutation tests are computationally intensive and they can be sensitive to the number of permutations.
Like R_hat, there is no explicit alternative hypothesis so we can’t
calibrate any resulting statistic.
So why wouldn’t we want to be sensitive to the entire distribution of a
random variable?
————————————————————————————————————————
— “Convergence” and what we want in a practical convergence diagnostic —
TL;DR Practical convergence depends on the chosen random variable.
In practice the only robust way to use MCMC is to check R_hat for
every Monte Carlo estimator constructed.
This is especially a problem with Stan because Stan enables more ambitious hierarchical models and encourages introducing a lot of extra parameters through Matt tricks and matrix decompositions.
So, I think the workflow should be: First, convince yourself that all the chains have converged using functions of the joint distribution. Second, look at functions (n_eff, traceplots, MCSE, etc.) of the quantities that are of interest to see if you have enough detail for what you want to do and then ignore the rest of the posterior distribution. Once the first step is completed, I think we are agreeing.
What all of this means is that practical convergence depends on
the random variables, i.e. the functions whose expectations we
want to estimate. This is why I don’t think the added sensitivity of
the DISCO test is particularly useful — it requires more convergence
than necessary for a given application and can necessitate
running for much, much longer than necessary.
Now diagnosing geometric ergodicity is another matter altogether...
For LKJ, there isn't much to add that wasn't implied in the LKJ (2009) paper but that paper is not super-easy to follow and some stuff was less than explicit.
I haven't noticed sensitivity to the number of permutations although it is something to look into. But it isn't computationally intensive relative to MCMC. Yes, in toy problems it is a bit of an overkill and for people who insist on keeping a million iterations post-warmup it will exhaust their RAM. But for a couple thousand iterations, it only takes a few seconds and the flop count doesn't depend heavily on the number of parameters, so not that big a deal.
But unlike R_hat or anything DISCO with alpha = 2, the test is consistent against all alternatives. So, we would have to get hand-wavy about the number of chains and thinned iterations you need in practice, but we're good in principle. Oversensitivity when the null is only trivially false is the bigger concern, I think.
Here is where I think we have different criteria for a practical convergence diagnostic. From my perspective, one of the biggest problems with calculations on the marginals is that either
- People observe that the R_hat for their function of interest is near 1.0 and put little to no weight on the other margins, or
- People remember that their function of interest is not independent of the other margins and then fall into an undecidable situation where most of the R_hats are near 1.0 but some are kind of iffy
This is especially a problem with Stan because Stan enables more ambitious hierarchical models and encourages introducing a lot of extra parameters through Matt tricks and matrix decompositions.
So, I think the workflow should be: First, convince yourself that all the chains have converged using functions of the joint distribution. Second, look at functions (n_eff, traceplots, MCSE, etc.) of the quantities that are of interest to see if you have enough detail for what you want to do and then ignore the rest of the posterior distribution. Once the first step is completed, I think we are agreeing.
What all of this means is that practical convergence depends on
the random variables, i.e. the functions whose expectations we
want to estimate. This is why I don’t think the added sensitivity of
the DISCO test is particularly useful — it requires more convergence
than necessary for a given application and can necessitate
running for much, much longer than necessary.
Okay, that is closer to Andrew's position I think. And it does seem to require more chains than the default of 4, so how long it takes depends a lot on how much parallel capacity the user has.
Conceptually, yes. But this is where I am surprised that we disagree. The added sensitivity of the DISCO test is really useful for identifying when it is seemingly impossible to get good draws from the current parameterization. In the situations that I know of --- like Rosenthal and Roberts (2011) where we can control whether the independent MH sampler is or is not geometrically ergodic or things like the funnel where there are obviously good and bad Stan reparameterizations --- in the unfavorable cases you notice that there is no thinning interval where you can fail to reject the null that all the chains have the same distribution. You have a plot in front of you that casts doubt on alpha-mixing. You can see the dependence lines bounce when there are Hamiltonian divergences. True, it doesn't tell you that the fundamental problem is lack of geometric ergodicity, but it tells you clearly that there is a fundamental problem.
But unlike R_hat or anything DISCO with alpha = 2, the test is consistent against all alternatives. So, we would have to get hand-wavy about the number of chains and thinned iterations you need in practice, but we're good in principle. Oversensitivity when the null is only trivially false is the bigger concern, I think.
Consistent provided second moments exist, so there’s still the concern about heavy tails.
The thinning also bothers me as a uniformly-thinned chain should behave the same as the full chain provided it’s geometrically ergodic.
Here is where I think we have different criteria for a practical convergence diagnostic. From my perspective, one of the biggest problems with calculations on the marginals is that either
- People observe that the R_hat for their function of interest is near 1.0 and put little to no weight on the other margins, or
- People remember that their function of interest is not independent of the other margins and then fall into an undecidable situation where most of the R_hats are near 1.0 but some are kind of iffy
This is especially a problem with Stan because Stan enables more ambitious hierarchical models and encourages introducing a lot of extra parameters through Matt tricks and matrix decompositions.
The point I was trying to make is that if R_hat for the desired function is near 1 then it doesn’t matter what the R_hat for the other functions is!
So, I think the workflow should be: First, convince yourself that all the chains have converged using functions of the joint distribution. Second, look at functions (n_eff, traceplots, MCSE, etc.) of the quantities that are of interest to see if you have enough detail for what you want to do and then ignore the rest of the posterior distribution. Once the first step is completed, I think we are agreeing.
What all of this means is that practical convergence depends on
the random variables, i.e. the functions whose expectations we
want to estimate. This is why I don’t think the added sensitivity of
the DISCO test is particularly useful — it requires more convergence
than necessary for a given application and can necessitate
running for much, much longer than necessary.
Okay, that is closer to Andrew's position I think. And it does seem to require more chains than the default of 4, so how long it takes depends a lot on how much parallel capacity the user has.
Huh? Where did parallel chains come in? Splitting a few chains should be sufficient.
Conceptually, yes. But this is where I am surprised that we disagree. The added sensitivity of the DISCO test is really useful for identifying when it is seemingly impossible to get good draws from the current parameterization. In the situations that I know of --- like Rosenthal and Roberts (2011) where we can control whether the independent MH sampler is or is not geometrically ergodic or things like the funnel where there are obviously good and bad Stan reparameterizations --- in the unfavorable cases you notice that there is no thinning interval where you can fail to reject the null that all the chains have the same distribution. You have a plot in front of you that casts doubt on alpha-mixing. You can see the dependence lines bounce when there are Hamiltonian divergences. True, it doesn't tell you that the fundamental problem is lack of geometric ergodicity, but it tells you clearly that there is a fundamental problem.
I think one source of confusion here is that we’re confounding two problems. Firstly, there is the question: if we could compute expectations exactly, then could we diagnose convergence of a given Monte Carlo estimator? This is the question I was addressing in my previous email.
Then there is the more insidious question of what to do when we don’t have exact expectations for diagnosis but have to use Monte Carlo estimators to test Monte Carlo estimators. For example, a multimodal model may be geometrically ergodic with respect to a sampler (like Random Walk Metropolis in lots of cases!) but Monte Carlo estimates for finite sample size will be misleading because the chain takes _forever_ to switch between modes. Same thing with latent parameters in a hierarchical model. For a short enough period of time we’re really computing expectations conditional on the slow parameters, which is really, really bad. I agree that this can be a serious problem (and one we’ve encountered in Stan) but it is not a question of _convergence_ but rather of _mixing_. If we change the goal here and talk about using DISCO estimators to diagnose poor mixing then I am much less hesitant, although I think more work has to be done to determine where and when it works and doesn’t work.
The thinning is only necessary for the permutation test since the permutation test presumes exchangeability. That actually seems to work favorably because you reject the null if you don't thin enough. But the unthinned chain should still be used for subsequent inference.
I meant the other margins of the posterior distribution, rather than hypothetical functions that could have been calculated but are not of interest. So, you might have a big Cholesky factor of a correlation matrix declared in your parameters block that you don't really care about, except insofar as something in the generated quantities block depends on it.
They are definitely related questions. I would say the plot that I make in the first step is a mixing diagnostic in the sense that you have to find a thin interval to do the subsequent test. And if a Metropolis-based chain is geometrically ergodic, then it is alpha mixing and you should be able to find such a thin interval. When it doesn't look like you can or you think you have found one but the subsequent test rejects the null that all the chains have the same distribution, you know to be cautious. Maybe there is nothing that can be done in terms of reparameterizations etc. that can salvage the situation given the limitations of the sampler, but at least you would know if you do make inferences that you have to confine them to the center of fairly symmetric distributions. The permutation test of the null hypothesis that the chains all have the same joint distribution seems like an easier thing to explain as a convergence diagnostic, even if it is the case that you reject the null due to lack of exchangeability rather than a major difference in the distribution for at least one chain.
The thinning is only necessary for the permutation test since the permutation test presumes exchangeability. That actually seems to work favorably because you reject the null if you don't thin enough. But the unthinned chain should still be used for subsequent inference.
You mean you thin until you have effectively independent samples?
What guarantees are there that the significance is accurate (or biased the right way) when you don’t thin enough? In other words, the test won’t falsely reject until you thin to independent samples?
and the "within" (half-chains) part. So, you get this B / W thing that is supposed to be near 1 but you don't really know how much above 1 is a problem.
For essentially the same reasons as in the R_hat case, if the chains are not thinned enough, then the W is less than the B. Basically, the distance between consecutive draws of the same chain is expected to be smaller than the distance between draws from different chains.
And then it starts doing permutations where it basically randomly reassigns draws to chains in arbitrary order. So, adjacent permuted draws have a very low probability of having any dependence; they probably are not from the same chain and even if they are, they are likely to be many thinned iterations apart. Anyway, it calculates a distribution of B / W across a specified number of permutations and sees how far into the tail of that distribution the real B / W falls to get the p-value.
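A hedged sketch of that between/within permutation logic for a single scalar quantity (illustrative only, not the actual DISCO/energy-statistic code, which works on the joint draws and half-chains; draws is a thinned iterations x chains matrix, and the full distance matrix limits this to modest numbers of retained draws):

perm_test_BW <- function(draws, n_perm = 200, alpha = 1) {
  x  <- as.vector(draws)                    # column-major: one block per chain
  D  <- as.matrix(dist(x))^alpha            # pairwise |x_i - x_j|^alpha
  ut <- upper.tri(D)
  BW <- function(id) {                      # B / W for a given chain labelling
    same <- outer(id, id, "==")
    mean(D[!same & ut]) / mean(D[same & ut])
  }
  chain_id <- rep(seq_len(ncol(draws)), each = nrow(draws))
  observed <- BW(chain_id)                  # real chain assignment
  permuted <- replicate(n_perm, BW(sample(chain_id)))  # scrambled assignments
  c(B_over_W = observed, p_value = mean(permuted >= observed))
}

Under-thinned chains push the observed B / W far into the right tail of the permutation distribution, which is the behavior described above.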
But isn’t the proper thinning exactly what we don’t know when the mixing is poor?
I meant the other margins of the posterior distribution, rather than hypothetical functions that could have been calculated but are not of interest. So, you might have a big Cholesky factor of a correlation matrix declared in your parameters block that you don't really care about, except insofar as something in the generated quantities block depends on it.
Yeah, but if there’s some generated quantity that you want to estimate then you just compute the R_hat of that, which incorporates all of the necessary correlations, for validation. That’s all I’m saying.
But if it's important to one of the core developers, I think we should err on the side of inclusiveness (not of the core developers, but of their code).
I think the thinning-to-get-a-valid-permutation is a serious hesitation — this doesn’t give you a statistic but rather a procedure that you have to apply to try to infer a limiting value. In particular, what does this offer beyond reporting the max split R_hat, or a visual diagnostic of the split R_hat empirical distribution for all parameters sampled?
The question is which of these are good specializations?
On Friday, February 13, 2015 at 4:52:54 AM UTC-5, Michael Betancourt wrote:
I think the thinning-to-get-a-valid-permutation is a serious hesitation — this doesn’t give you a statistic but rather a procedure that you have to apply to try to infer a limiting value. In particular, what does this offer beyond reporting the max split R_hat, or a visual diagnostic of the split R_hat empirical distribution for all parameters sampled?
I've given lots of examples where this procedure reaches a different conclusion than R_hat. R_hat is a specialization of it where
- You set alpha = 2 instead of alpha < 2
- You call it on each margin separately instead of jointly
- You use sqrt( (N - 1) / N * W + 1 / N * B ) / sqrt(W) instead of B / W
- You stop instead of continuing on to do the permutation test (or tests if done marginally)
The question is which of these are good specializations?
- I don't see the case for only looking for mean differences. Maybe alpha should be closer to 2 (I've been using 1) but not exactly 2.
- We seem to be disagreeing on this one. I'd just refer you again to Jeff's paper.
- To me, the permutation test on the thinned samples gives you a reference point for the ratio you choose to calculate in (3), namely how does your realization compare to the permutation distribution? But if you do it unthinned (or underthinned), then by construction your W is smaller than your B and your B / W is larger than it is under almost all of the permutations that scramble the chain-time structure. In other words, if the chains are unthinned or underthinned, then you don't obtain a relevant reference point, regardless of whether the chains do or do not have the same distribution.
Do you think the thinning step in my procedure is somehow worse than the thinning step in the Raftery-Lewis diagnostic?
The question is which of these are good specializations?
- I don't see the case for only looking for mean differences. Maybe alpha should be closer to 2 (I've been using 1) but not exactly 2.
Because mathematically MCMC is all about computing means and that’s where all of the theoretical results lie. In particular, as I previously discussed, convergence is not a universal property but depends on a choice of function whose mean you want to compute.
- We seem to be disagreeing on this one. I'd just refer you again to Jeff's paper.
I think Jeff’s paper completely misses the point. Again, as I talked about before, trying to judge global convergence via something like the total variational distance is not a particularly practical game.
The problem here is the difference between exact and empirical expectations. Given exact means, marginal R_hats would diagnose problems just fine. In practice, however, empirical means require that the chain has gone through a few mixing times.
So what are examples where _all_ marginal R_hats look good but the chain has not yet mixed?
- To me, the permutation test on the thinned samples gives you a reference point for the ratio you choose to calculate in (3), namely how does your realization compare to the permutation distribution? But if you do it unthinned (or underthinned), then by construction your W is smaller than your B and your B / W is larger than it is under almost all of the permutations that scramble the chain-time structure. In other words, if the chains are unthinned or underthinned, then you don't obtain a relevant reference point, regardless of whether the chains do or do not have the same distribution.
That’s fine, but you’ve introduced the added problem of not really being sure when you’ve thinned enough to achieve a relevant reference point. This is not a diagnostic value so much as a diagnostic procedure which requires interactivity and knowledge from the user.
Do you think the thinning step in my procedure is somehow worse than the thinning step in the Raftery-Lewis diagnostic?
We’re not even considering RL so how is that relevant?
As I previously discussed, there’s no canonical way to threshold the TV norm. You write 0.01 — but is that good enough? What does that even mean in terms of the accuracy and precision of Monte Carlo estimators? The value of the TV norm can be converted into a reasonable number only once we select out some class of functions whose expectations we want to compute (again, the intuition being that smooth functions need less convergence in the TV norm to yield accurate estimates than rough functions). In practice those functions are marginal parameter means.
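A standard bound that makes this conversion explicit (stated for bounded rather than smooth functions, with d_TV taken as the supremum over sets of the difference in probabilities):

\left| \mathbb{E}_p[f] - \mathbb{E}_q[f] \right| \;\le\; 2\,\|f\|_\infty\, d_{\mathrm{TV}}(p, q)

So a threshold like 0.01 only translates into estimator accuracy relative to the scale of the functions whose expectations are being computed.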
I misspoke earlier when I said “models where the sampling is bad but all the R_hats look good”. I really should have said “models where the sampling is bad but all the R_hats look good and there are no outliers in the estimated ESS distribution”.
I am not against new diagnostics, I just don’t like these energy statistics. Firstly, they are sensitive to the structure of the intermediate distributions of the Markov chain that may or may not be relevant (i.e. it’s not clear how one calibrates them to the expectations of interest).
Secondly, they rely on permutation tests which then rely on getting the thinning just right. This requires user expertise in applying the test and interpreting the results which I feel seriously degrades the user experience (certainly at least until there’s some interactive Stan interface).
There’s some intuition as to how the test should behave when the chains are under-thinned, but what happens if you overshoot and over-thin the chains? You start losing power but how does that manifest in the test statistics? Is it obvious where to stop? Are there criteria to help guide that decision or is this all an art?