Hi all. As we’ve been discussing, R-hat could fail in settings such as a Cauchy distribution because it is based on means and variances, which are unstable there. There are two issues here: First, with a long-tailed distribution, any statistic based on means and variances will be so noisy as to be useless. Second, we can have a situation where the MCMC really does converge but we can’t notice it because R-hat is crap.
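(For reference, R-hat here is the usual potential scale reduction factor. For m chains of n draws each, with chain means $\bar\theta_{\cdot j}$ and within-chain variances $s_j^2$,

$$W = \frac{1}{m}\sum_{j=1}^{m} s_j^2,\qquad B = \frac{n}{m-1}\sum_{j=1}^{m}\big(\bar\theta_{\cdot j}-\bar\theta_{\cdot\cdot}\big)^2,\qquad \hat{R} = \sqrt{\frac{\frac{n-1}{n}\,W + \frac{1}{n}\,B}{W}},$$

so every ingredient is a chain mean or a chain variance.)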
// fragment from onion.stan: prior contribution for the standardized
// coefficients under the LKJ / onion construction
temp <- dot_self(beta_check * L);                // squared norm of beta_check * L
if (temp <= 1)
  increment_log_prob((eta - 1) * log1m(temp));   // LKJ(eta)-style kernel
else
  increment_log_prob(negative_infinity());       // outside the unit ball: reject
fit <- stan("onion.stan", chains = 0)
fit <- stan(fit = fit, iter = 20000, init_r = 0.5, data = list(eta = 1, K = 5))
print(fit, digits = 2)
Inference for Stan model: onion.
4 chains, each with iter=20000; warmup=10000; thin=1;
post-warmup draws per chain=10000, total post-warmup draws=40000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
l[1] -0.01 0.05 0.99 -1.87 -0.69 -0.02 0.65 1.97 356 1.02
l[2] 0.04 0.04 0.99 -1.90 -0.64 0.05 0.70 2.02 541 1.01
l[3] -0.02 0.04 0.95 -1.86 -0.68 -0.04 0.61 1.86 620 1.00
l[4] -0.04 0.04 0.95 -1.88 -0.70 -0.03 0.61 1.80 613 1.01
l[5] -0.04 0.05 0.99 -1.93 -0.72 -0.06 0.64 1.92 403 1.01
l[6] -0.07 0.05 1.02 -2.00 -0.76 -0.10 0.61 1.97 346 1.02
l[7] 0.02 0.05 1.01 -2.06 -0.65 0.03 0.71 1.96 441 1.00
l[8] 0.01 0.05 0.97 -1.89 -0.62 -0.01 0.64 1.97 364 1.01
l[9] 0.10 0.04 0.97 -1.74 -0.57 0.08 0.78 1.99 537 1.00
R2[1] 0.50 0.01 0.19 0.15 0.36 0.49 0.63 0.86 603 1.00
R2[2] 0.29 0.01 0.21 0.01 0.12 0.25 0.43 0.76 545 1.01
R2[3] 0.43 0.01 0.22 0.05 0.25 0.41 0.59 0.85 417 1.01
R2[4] 0.57 0.01 0.23 0.13 0.40 0.59 0.76 0.94 388 1.01
beta_check[1] 0.01 0.04 0.68 -1.26 -0.46 0.00 0.46 1.38 265 1.01
beta_check[2] 0.08 0.05 0.66 -1.19 -0.36 0.07 0.50 1.46 209 1.02
beta_check[3] -0.02 0.04 0.69 -1.42 -0.46 0.00 0.44 1.28 259 1.01
beta_check[4] -0.05 0.04 0.62 -1.31 -0.48 -0.05 0.39 1.08 231 1.02
beta_check[5] 0.07 0.04 0.67 -1.20 -0.38 0.05 0.50 1.40 306 1.00
lp__ -18.60 0.09 2.64 -24.73 -20.18 -18.22 -16.70 -14.49 816 1.00
pairs(fit, pars = c("beta_check", "lp__"))
> convergence(fit, R = 200, thin = 40)
[1] "Multivariate loss = 1.001"
disco(x = mat, factors = chain_id, distance = FALSE, R = R)
Distance Components: index 1.00
Source Df Sum Dist Mean Dist F-ratio p-value
factors 3 14.94633 4.98211 1.731 0.0049751
Within 996 2867.41348 2.87893
Total 999 2882.35981
> convergence(fit, R = 200, thin = 4)
[1] "Multivariate loss = 1"
disco(x = mat, factors = chain_id, distance = FALSE, R = R)
Distance Components: index 1.00
Source Df Sum Dist Mean Dist F-ratio p-value
factors 3 13.42599 4.47533 1.248 0.079602
Within 9996 35831.61495 3.58460
Total 9999 35845.04094

I foisted a third situation on my QMSS class a couple of weeks ago, where the chains didn't converge but the R-hat statistics weren't that bad (depending on the seed). The attached .stan file has no data but draws from the prior predictive distribution of standardized coefficients that is implied by an LKJ prior on the correlation matrix among the predictors and outcome.
Then I ran it in R for 10,000 iterations. I was surprised by the results. The first surprise was that cauchy_1 wasn’t so bad! It had some problems exploring the extremes, but it did reasonably well (if not perfectly) on its 95% intervals. The second surprise was that R-hat was stable for both models. I was expecting a noisy, meaningless R-hat in both cases because of the whole variance issue, but it did not happen. I suspect that what’s going on is that, yes, means and variances for the Cauchy are noisy (and, indeed, the posterior mean, se_mean, and sd of theta are meaningless for both models), but because R-hat is a _ratio_, the worst behavior ends up canceling between the numerator and denominator.
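That cancellation is easy to check with a minimal sketch, using iid Cauchy draws to stand in for perfectly converged chains (the function and settings below are illustrative, not part of the runs that follow):

rhat_once <- function(m = 4, n = 5000) {
  draws <- matrix(rcauchy(m * n), nrow = n, ncol = m)  # m "chains" of iid Cauchy
  B <- n * var(colMeans(draws))         # between-chain mean square
  W <- mean(apply(draws, 2, var))       # within-chain mean square
  sqrt(((n - 1) / n * W + B / n) / W)   # R-hat
}
set.seed(1)
summary(replicate(100, rhat_once()))  # B and W jump around, but the ratio stays near 1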
library(rstan)
library(coda)
library(parallel)
cauchy <- '
parameters {
  real theta;
}
model {
  theta ~ cauchy(0, 1);
}
'
fit <- stan(model_code = cauchy, chains = 0)  # compile only
# run 16 single-chain fits in parallel and stitch them together
fit <- sflist2stanfit(mclapply(1:16, mc.cores = detectCores(), FUN = function(i) {
  stan(fit = fit, chains = 1, seed = 12345, chain_id = i, refresh = 0, iter = 10000)
}))
print(fit, digits = 2)
# Inference for Stan model: cauchy.
# 16 chains, each with iter=10000; warmup=5000; thin=1;
# post-warmup draws per chain=5000, total post-warmup draws=80000.
#
# mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
# theta 0.46 0.73 47.97 -10.40 -0.94 -0.01 0.93 10.87 4322 1
# lp__ -1.28 0.03 1.70 -6.19 -1.75 -0.63 -0.15 0.00 4047 1
gelman.diag(rstan:::as.mcmc.list(fit), multivariate = FALSE)
# Potential scale reduction factors:
#
# Point est. Upper C.I.
# theta 1.2 1.22
# lp__ 1.0 1.01
normal_ratio <- '
parameters {
  real theta[2];
}
model {
  theta ~ normal(0, 1);
}
generated quantities {
  real ratio;
  ratio <- theta[1] / theta[2];  // ratio of two standard normals is Cauchy(0,1)
}
'
fit2 <- stan(model_code = normal_ratio, chains = 0)
fit2 <- sflist2stanfit(mclapply(1:16, mc.cores = detectCores(), FUN = function(i) {
stan(fit = fit2, chains = 1, seed = 12345, chain_id = i, refresh = 0, iter = 10000)
}))
print(fit2, digits = 2)
# Inference for Stan model: normal_ratio.
# 16 chains, each with iter=10000; warmup=5000; thin=1;
# post-warmup draws per chain=5000, total post-warmup draws=80000.
#
# mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
# theta[1] 0.00 0.00 1.00 -1.95 -0.67 0.00 0.68 1.94 47707 1
# theta[2] 0.00 0.00 1.00 -1.96 -0.67 0.00 0.68 1.96 49254 1
# ratio -0.88 0.76 208.78 -12.84 -1.00 0.00 1.00 12.34 76002 1
# lp__ -1.00 0.01 0.99 -3.69 -1.38 -0.69 -0.29 -0.03 25426 1
gelman.diag(rstan:::as.mcmc.list(fit2), multivariate = FALSE)
# Potential scale reduction factors:
#
# Point est. Upper C.I.
# theta[1] 1.00 1.00
# theta[2] 1.00 1.00
# ratio 1.07 1.07
# lp__ 1.00 1.00
library(energy)

# multivariate convergence check: an R-hat-style ratio built from the
# between-/within-chain distance components returned by energy::disco()
convergence <- function(object, R = 0, thin = 1, ...) {
  mat <- as.matrix(object)
  mat <- mat[(1:nrow(mat)) %% thin == 0, , drop = FALSE]  # keep every thin-th draw
  chains <- if (is.list(object)) length(object) else ncol(object)
  n <- nrow(mat) / chains              # draws per chain after thinning
  chain_id <- rep(1:chains, each = n)  # assumes rows are grouped by chain
  test <- disco(mat, factors = chain_id, distance = FALSE, R = R, ...)
  W <- test$within / test$Df.e         # within-chain mean distance component
  B <- test$between / (chains - 1)     # between-chain mean distance component
  names(B) <- NULL
  total <- (n - 1) / n * W + B / n     # pooled, as in R-hat
  loss <- sqrt(total / W)
  test$loss <- loss
  print(paste("Multivariate loss =", round(loss, 3)))
  return(test)
}
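For clarity, the loss above is the usual R-hat pooling applied to disco's distance components rather than to variances: with W = within / Df.e and B = between / (m - 1),

$$\text{loss} = \sqrt{\frac{\frac{n-1}{n}\,W + \frac{1}{n}\,B}{W}},$$

i.e. the same functional form as the R-hat formula at the top of the thread.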
It does seem plausible that B / W is bounded, in which case Stan's Rhat converges to 1. But if you do the degrees-of-freedom adjustment, then Rhat converges to the square root of (d + 3)/(d + 1). Coda does not estimate d to be 1 in this case, which may be reasonable since neither the numerator nor the denominator of d exists either.
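(The degrees-of-freedom adjustment in question is presumably the Brooks and Gelman (1998) correction that coda implements:

$$\hat{R}_c = \sqrt{\frac{d+3}{d+1}\cdot\frac{\hat{V}}{W}},\qquad \hat{V} = \frac{n-1}{n}\,W + \frac{1}{n}\,B,$$

so even when $\hat{V}/W \to 1$, $\hat{R}_c \to \sqrt{(d+3)/(d+1)}$, which is close to 1 only if the estimated $d$ is large.)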
Ben
The problem is the sparsity of theory regarding multivariate estimators from Markov chains. Rhat works because, assuming the Markov chain has converged, a univariate MCMC estimator converges to a Gaussian distribution, and the distribution of subsamples/different chains converges to an F distribution.
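Concretely, for m chains of n (effectively independent) draws from a common Gaussian, the classical one-way ANOVA result being leaned on is something like

$$\frac{B}{W} \sim F_{m-1,\; m(n-1)}.$$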
But here you’re trying to look at multivariate estimators, and there is no theory that we’ve been able to find that discusses the distribution to which these estimators converge. It’s probably going to be Gaussian, but will it be diagonal or will there be correlations? Obviously these details will be important to establishing a single number that attempts to quantify convergence.
The paper claims a nonparametric test via permutations, but there are two concerns. First, permutation tests have a reputation for horrible type II errors, making them weak tests in general; second, the significance test is trickier here. If the chain hasn’t converged then you’re not comparing samples from two different distributions but rather samples from an evolving distribution, and it’s not clear how robust permutations are to this difference.
If you want to turn this into a visual diagnostic that indicates a possible lack of convergence then I have no objection; my objection is to defining a new statistic for which one can claim “convergence” as we currently treat Rhat.
library(energy)

# revised version: requires an explicit 'thin' and, with split = TRUE,
# treats the two halves of each chain as separate chains (split-Rhat style)
convergence <- function(object, thin = NA_integer_, R = 0, split = TRUE, ...) {
  if (!isTRUE(thin > 0)) stop("'thin' must be a positive integer")
  mat <- as.matrix(object)
  if ((nrow(mat) %% thin) != 0) stop("'thin' value leaves a remainder")
  mat <- mat[(1:nrow(mat)) %% thin == 0, , drop = FALSE]
  chains <- if (is.list(object)) length(object) else ncol(object)
  if (split) chains <- chains * 2      # halves of each chain become their own groups
  n <- nrow(mat) / chains
  chain_id <- rep(1:chains, each = n)
  test <- disco(mat, factors = chain_id, distance = FALSE, R = R, ...)
  W <- test$within / test$Df.e         # within-chain mean distance component
  B <- test$between / (chains - 1)     # between-chain mean distance component
  names(B) <- NULL
  total <- (n - 1) / n * W + B / n     # pooled, as in R-hat
  loss <- sqrt(total / W)
  test$loss <- loss
  print(paste("Multivariate loss =", round(loss, 3)))
  return(test)
}
[1] "Multivariate loss = 1"
disco(x = mat, factors = chain_id, distance = FALSE, R = R)
Distance Components: index 1.00
Source Df Sum Dist Mean Dist F-ratio p-value
factors 15 8529.01958 568.60131 1.005 0.49
Within 3984 2253667.17960 565.67951
Total 3999 2262196.19918