things I learned from doing a Stan demo in an intro class

Andrew Gelman

Feb 9, 2015, 12:14:34 PM
to stan-...@googlegroups.com, stan...@googlegroups.com
Hi all. I did an unrehearsed live demo of Stan in my survey sampling class and I learned some interesting things.

I started with linear regression. I had code that I’d copied from an earlier Stan manual in which the model was not vectorized. With 1000 data points it ran really slow. So I think it’s a really good thing that in the current Stan manual, we start with the vectorized version and then only give the non-vectorized later. Otherwise people will use the slow code and think that Stan is slow.
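
Roughly, the contrast is between something like these two model blocks (a sketch, written as R strings; alpha, beta, sigma, x, y, N as in the usual regression example):

vectorized <- "model { y ~ normal(alpha + beta * x, sigma); }"
looped     <- "model { for (n in 1:N) y[n] ~ normal(alpha + beta * x[n], sigma); }"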

Then I explained the output. I have mixed feelings about showing the Monte Carlo standard error. At one level, it’s a good thing to know. But for a practitioner it’s a bit of a distraction. In practice, all the user really needs to know is that this MCSE is less than 1/10 (say) of the posterior se. So I’m wondering if we actually need a new summary, not quite R.hat but this other thing. Or maybe it’s fine to just use R.hat because, in practice, if R.hat is less than 1.1, the MCSE is probably so low that we don’t need to worry about it. Then there’s a similar issue with all the quantiles and the effective sample size: it’s more trouble to explain than it’s worth. I think the basic output should be estimate, sd, and some summary of OK-ness (R.hat or something else). And for this I’m not just thinking of basic users. Even advanced users such as myself, when we’re doing applied statistics, usually just want an estimate, an uncertainty, and an indication that we can trust the computation. The only thing I’m still undecided on is whether to use mean & sd, or median & quantiles. Right now I’m leaning toward reporting the median as the estimate, and reporting the half-width of the central 68% interval as the standard error. But I’m open to persuasion on this.
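
Concretely, for a single quantity that summary would just be something like this (sketch):

draws <- rnorm(4000)                             # stand-in for the posterior draws of one quantity
est <- median(draws)                             # report the median as the estimate
se  <- diff(quantile(draws, c(0.16, 0.84))) / 2  # half-width of the central 68% interval as the se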

Next I decided, just for fun, to fit a nonlinear model: y = a*exp(-b*x) + c + error, with N(0,sigma) errors. I simulated some data (x taking on the values between 0 and 23, setting a=.5, b=1/10, c=1, sigma=.2), and I fit it constraining all the parameters to be positive. (In retrospect I’m thinking I should’ve just constrained b and sigma to be positive; constraining a and c made sense in the context of my hypothetical example, but in general it’s poor practice to put in hard constraints unless absolutely necessary.)
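
For reference, the simulated data were along these lines:

x <- 0:23
a <- 0.5; b <- 1/10; c <- 1; sigma <- 0.2
y <- a * exp(-b * x) + c + rnorm(length(x), 0, sigma)   # fake data from the assumed model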

I simulated data and fit the model and all was fine. But then I repeated and the fit blew up: I was getting some estimates of b that were either really close to 0 (1e-10, etc.) or really high (1e10, etc.), I can’t remember which. The point is that there can be instability at the extremes, and I realized the model needed a bit of hand-holding. As a kluge I constrained b to be between 1/50 and 1 (which made sense in the context of the example). A hard constraint was ugly but was easy enough to understand. I didn’t want to confuse the class by introducing a normal prior here.
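
The kluged-up program looked something like this (a sketch, not the exact code from class):

model_code <- "
data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real<lower=0> a;
  real<lower=0.02, upper=1> b;   // the kluge: hard constraint 1/50 <= b <= 1
  real<lower=0> c;
  real<lower=0> sigma;
}
model {
  y ~ normal(a * exp(-b * x) + c, sigma);
}
"
# fit <- rstan::stan(model_code = model_code, data = list(N = length(x), x = x, y = y))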

So this second example was a bit of a mess. Which suggests to me that in the manual/book we need to throw in an example to explicitly address instability and constraints. Since, ultimately, this is an inherent difficulty with modeling, it’s not anything to do with Stan in particular.

A

Bob Carpenter

Feb 9, 2015, 12:39:48 PM
to stan...@googlegroups.com
The minimal thing to do would be to mark parameters with too high
R-hat with an asterisk or something (reversing the usual convention
of adding asterisks for significance levels).

I like median and intervals, but posterior mean is the classic
Bayesian estimate.

The R-hat, n_eff, and mcmc-se calcs are all for the mean, right?

I've been moving the manual over to vectorized versions first.
And I think I got rid of the last of the interval-constrained scale
parameters in this round, too, only to have you reintroduce them!
We should talk about what we recommend people do here and whether
we want to talk about adding interval constraints back in earlier
in the manual, or whether I should just address it in a new section
in the problematic posteriors chapter.

- Bob

Ben Goodrich

Feb 9, 2015, 1:23:23 PM
to stan...@googlegroups.com, gel...@stat.columbia.edu
On Monday, February 9, 2015 at 12:14:34 PM UTC-5, Andrew Gelman wrote:
I started with linear regression.  I had code that I’d copied from an earlier Stan manual in which the model was not vectorized.  With 1000 data points it ran really slow.  So I think it’s a really good thing that in the current Stan manual, we start with the vectorized version and then only give the non-vectorized later.  Otherwise people will use the slow code and think that Stan is slow.

Agreed. Actually, I don't even think the non-vectorized version should be shown.
 
Then I explained the output.  I have mixed feelings about showing the Monte Carlo standard error.  At one level, it’s a good thing to know.  But for a practitioner it’s a bit of a distraction.  In practice, all the user really needs to know is that this MCSE is less than 1/10 (say) of the posterior se.  

As Michael would say, conditional on the MCMC CLT holding, then the MCSE from NUTS will usually be small enough to ignore or not see. And if the MCMC CLT does not hold, then the MCSE calculation is not meaningful. But I don't think it is correct to label an issue as important as this a "distraction".
 
So I’m wondering if we actually need a new summary, not quite R.hat but this other thing.

I think you need the plot that shows estimated within-chain dependence is negligible after a handful of lags and stays negligible throughout the post-warmup period. And a failure to reject the null hypothesis that all the chains have the same distribution when thinned to the critical number of lags.
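
For one quantity, I mean something along these lines (a rough sketch; `sims` here is a stand-in for an iterations x chains matrix of post-warmup draws):

sims <- matrix(rnorm(4000), ncol = 4)   # stand-in for post-warmup draws, one column per chain
par(mfrow = c(2, 2))
for (j in 1:ncol(sims))
  acf(sims[, j], lag.max = 50, main = paste("chain", j))   # within-chain dependence by lag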
 
Or maybe it’s fine to just use R.hat because, in practice, if R.hat is less than 1.1, the MCSE is probably so low that we don’t need to worry about it.  

The problem with relying on R hat has never been that it produces high values when it should not; the problem is that it can produce low values when the chains have not converged or when the CLT does not hold. That, and it is marginal and only sensitive to means and variances.
 
Right now I’m leaning toward reporting the median as the estimate, and reporting the half-width of the central 68% interval as the standard error.  But I’m open to persuasion on this.

On one hand, it is a question of whether you prefer squared error loss or absolute value loss. On the other hand, I worry that default reporting of the median and some quantile-based measure of uncertainty will be used to further legitimate analyzing chains whose tails do not correspond to the posterior distribution.

Ben

Andrew Gelman

Feb 9, 2015, 4:27:12 PM
to stan...@googlegroups.com

> On Feb 9, 2015, at 12:38 PM, Bob Carpenter <ca...@alias-i.com> wrote:
>
> The minimal thing to do would be to mark parameters with too high
> R-hat with an asterisk or something (reversing the usual convention
> of adding asterisks for signficance levels).
>
> I like median and intervals, but posterior mean is the classic
> Bayesian estimate.

Sure, but there are some (rare) examples of long-tailed posteriors, in which case the mean can have problems. The median always exists. And this goes double for the sd.
>
> The R-hat, n_eff, and and mcmc-se calcs are all for the mean, right?

Yes, but we’ve talked about quantile-based versions of these statistics. Simplest (but perhaps not cheapest to compute) is, for each quantity of interest, to pass the simulations thru a ranking and then do R-hat etc. on that. I played around with this awhile ago and it works ok. As usual with rank-based methods, it trades off efficiency for robustness.
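
A rough sketch of the idea, for an iterations x chains matrix of draws for one quantity (just the idea, not production code):

rank_rhat <- function(sims) {
  z <- qnorm((rank(sims) - 0.5) / length(sims))   # pass the draws through a ranking, then to normal scores
  dim(z) <- dim(sims)
  half <- floor(nrow(z) / 2)                      # split each chain in half, as in split R-hat
  z <- cbind(z[1:half, , drop = FALSE], z[(half + 1):(2 * half), , drop = FALSE])
  n <- nrow(z)
  W <- mean(apply(z, 2, var))                     # within-chain variance
  B <- n * var(colMeans(z))                       # between-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}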

> I've been moving the manual over to vectorized versions first.
> And I think I got rid of the last of the interval-constrained scale
> parameters in this round, too, only to have you reintroduce them!
> We should talk about what we recommend people do here and whether
> we want to talk about adding interval constraints back in earlier
> in the manual, or whether I should just address it in a new section
> in the problematic posteriors chapter.

I _don’t_ think interval-constraint is the right way to handle this problem. It’s just something I came up with on the fly when teaching my class. Better would be to use informative priors and soft constraint.

Andrew Gelman

Feb 9, 2015, 4:33:02 PM
to Ben Goodrich, stan...@googlegroups.com
On Feb 9, 2015, at 1:23 PM, Ben Goodrich <goodri...@gmail.com> wrote:

On Monday, February 9, 2015 at 12:14:34 PM UTC-5, Andrew Gelman wrote:
Then I explained the output.  I have mixed feelings about showing the Monte Carlo standard error.  At one level, it’s a good thing to know.  But for a practitioner it’s a bit of a distraction.  In practice, all the user really needs to know is that this MCSE is less than 1/10 (say) of the posterior se.  

As Michael would say, conditional on the MCMC CLT holding, then the MCSE from NUTS will usually be small enough to ignore or not see. And if the MCMC CLT does not hold, then the MCSE calculation is not meaningful. But I don’t think it is correct to label an issue as important as this a "distraction".

What happened was, I ran it for 10 iterations.

 
So I’m wondering if we actually need a new summary, not quite R.hat but this other thing.

I think you need the plot that shows estimated within-chain dependence is negligible after a handful of lags and stays negligible throughout the post-warmup period. And a failure to reject the null hypothesis that all the chains have the same distribution when thinned to the critical number of lags.

A plot is good for hard problems, to diagnose what’s wrong.  But I think we also need a simple number so that basic users can know to simulate longer or to ask an expert for help if there’s a problem.
 
Or maybe it’s fine to just use R.hat because, in practice, if R.hat is less than 1.1, the MCSE is probably so low that we don’t need to worry about it.  

The problem with relying on R hat has never been that it produces high values when it should not; the problem is that it can produce low values when the chains have not converged or when the CLT does not hold. That, and it is marginal and only sensitive to means and variances.

Then let’s beef it up with some other checks.  Rather than starting with the statistic and seeing what it does, let’s start with the goal—to give users a number they can use—and figure out how to get there.  Split R-hat was a step forward, but let’s do more.
 
Right now I’m leaning toward reporting the median as the estimate, and reporting the half-width of the central 68% interval as the standard error.  But I’m open to persuasion on this.

On one hand, it is a question of whether you prefer squared error loss or absolute value loss. 

I’m not really thinking in terms of a loss function, it’s more of a question about having a general summary.  In the applied world, people go with estimates and standard errors.

Bob Carpenter

Feb 9, 2015, 4:36:49 PM
to stan...@googlegroups.com

> On Feb 9, 2015, at 4:27 PM, Andrew Gelman <Gel...@stat.columbia.edu> wrote:
>
>
>> On Feb 9, 2015, at 12:38 PM, Bob Carpenter <ca...@alias-i.com> wrote:

...

>> I've been moving the manual over to vectorized versions first.
>> And I think I got rid of the last of the interval-constrained scale
>> parameters in this round, too, only to have you reintroduce them!
>> We should talk about what we recommend people do here and whether
>> we want to talk about adding interval constraints back in earlier
>> in the manual, or whether I should just address it in a new section
>> in the problematic posteriors chapter.
>
> I _don’t_ think interval-constraint is the right way to handle this problem. It’s just something I came up with on the fly when teaching my class. Better would be to use informative priors and soft constraint.

The question's really what the manual should say. Should I include
interval constraints in a discussion of this model and problems it
causes? Did you have a better example?

- Bob

Michael Betancourt

Feb 9, 2015, 4:52:09 PM
to stan...@googlegroups.com
I just want to offer a completely different perspective.  It appears
that I use print very differently from most people — first looking at
R-hat for convergence, then ESS relative to the number of samples
for any HMC pathologies, then mean + MCSE for general precision,
and then mean vs quantiles to get an idea of the shape of the 
posterior.  If this is too much for a simple user then introduce a
new simple_print or the like, but don’t kill the basic information needed
to diagnose a chain.

And it’s important to remember that none of this stuff will ever be
completely black box.  There are always going to be pathologies (MC-CLT not
holding, multi-modality, etc.) that require the user to inspect more than
just a single number.  Personally, I think that selling the prospect of a
single-number diagnosis is really dangerous, _especially_ when you
want to propagate it to lots of people without the necessary expertise.

Andrew Gelman

Feb 9, 2015, 4:58:03 PM
to stan...@googlegroups.com
Hi, there are a few things going on here.  First, we are currently preparing stan_lm, stan_glm, stan_lmer, etc., and these functions will be black boxes that can be treated as such; indeed, it could be worth working on some auto-convergence rule so the user doesn’t even need to specify #iter and #chains.  Second, users will be fitting models for applied purposes.  For these users, we want some convergence information.  I don’t think it helps to give people several columns of convergence information, _by default_.  This info should all be accessible, but if we give people too many read-outs, it’s not like they’ll suddenly become Betancourts.  Instead they’ll get overwhelmed by the info.  So I agree that if we have one summary number we should not pretend it’s magic.  We should give people the best summary they can get, and we should also make it easy for people to do fake-data checking and posterior predictive checking.

Bob Carpenter

Feb 9, 2015, 4:59:49 PM
to stan...@googlegroups.com

> On Feb 9, 2015, at 4:52 PM, Michael Betancourt <betan...@gmail.com> wrote:
>
> I just want to offer a completely different persecutive. It appears
> that I use print very differently from most people — first looking at
> R-hat for convergence then Ess relative to the number of samples
> for any HMC pathologies then mean + MCSE for general precision
> and then mean vs quantiles to get an idea of the shape of the
> posterior. If this is too much for a simple user then introduce a
> new simple_print or the like, but don’t kill the basic information needed
> to diagnose a chain.

That's exactly what I do! Whatever we did, I'd just set it back to this
style print if it remains available.

> And it’s important to remember that none of this stuff will ever be
> completely black box. There are always going to be pathologies (MC-CLT not
> holding, multi-modalities, etc) that require the user inspect more than
> just a single number. Personally, I think that selling the prospect of a
> single-number diagnosis is really dangerous, _especially_ when you
> want to propagate it lots of people without the necessary expertise.

It's very good to think about who our audience is. I think we
should continue to focus on the hard models.

We can prepackage some simple models that we know aren't pathological
and perhaps simplify their diagnostics. And maybe give those simpler
interfaces.

At the very least, whatever we do automatically should encode what
Michael does looking at a model so that a warning light can go on
and a mechanic called in for a diagnosis.

Ben Goodrich

Feb 9, 2015, 5:24:12 PM
to stan...@googlegroups.com, gel...@stat.columbia.edu
On Monday, February 9, 2015 at 4:33:02 PM UTC-5, Andrew Gelman wrote:
What happened was, I ran it for 10 iterations.

Now you tell us :). In that case, the problem is less that the MCSE is a distraction; it is that any summary of 10 realizations from different distributions, none of which are the posterior distribution, is not meaningful. Just look at the draws in that case. We should have a minimum number of warmup and post-warmup iterations before print() says anything numerical.
 
So I’m wondering if we actually need a new summary, not quite R.hat but this other thing.

I think you need the plot that shows estimated within-chain dependence is negligible after a handful of lags and stays negligible throughout the post-warmup period. And a failure to reject the null hypothesis that all the chains have the same distribution when thinned to the critical number of lags.

A plot is good for hard problems, to diagnose what’s wrong.  But I think we also need a simple number so that basic users can know to simulate longer or to ask an expert for help if there’s a problem.

Perhaps a p-value for the null hypothesis that all of the chains have the same distribution?

Or maybe it’s fine to just use R.hat because, in practice, if R.hat is less than 1.1, the MCSE is probably so low that we don’t need to worry about it.  

The problem with relying on R hat has never been that it produces high values when it should not; the problem is that it can produce low values when the chains have not converged or when the CLT does not hold. That, and it is marginal and only sensitive to means and variances.

Then let’s beef it up with some other checks.  Rather than starting with the statistic and seeing what it does, let’s start with the goal—to give users a number they can use—and figure out how to get there.  Split R-hat was a step forward, but lets’ do more.

Where's the beef? I know you favor doing R-hat on the z-scores, but I think that is a limited step. I would agree with your goal of giving "a simple number so that basic users can know to simulate longer or to ask an expert for help if there’s a problem", but I don't understand why a p-value for the null hypothesis that the chains have the same distribution is not that number. If you reject that null hypothesis, simulate longer or ask an expert for help. If you fail to reject that null hypothesis, you should ostensibly look for other evidence, but I don't know what else to systematically look for when the test has power against any alternative where at least one chain comes from a different distribution.
 
Right now I’m leaning toward reporting the median as the estimate, and reporting the half-width of the central 68% interval as the standard error.  But I’m open to persuasion on this.

On one hand, it is a question of whether you prefer squared error loss or absolute value loss. 

I’m not really thinking in terms of a loss function, it’s more of a question about having a general summary.  In the applied world, people go with estimates and standard errors.

They can estimate the mean and standard deviation or they can estimate the median and use the scaled MAD (or similar) to estimate the standard deviation. But I don't know how you would choose between those options without saying what is to be avoided. If the thing to be avoided is providing estimates of moments that don't exist, it seems reasonable to go with quantile-based things, but we're going to have a real hard time convincing ourselves that the chains have converged in non-trivial problems when means and / or variances do not exist.

Ben

Ben Goodrich

Feb 9, 2015, 5:45:36 PM
to stan...@googlegroups.com, gel...@stat.columbia.edu
On Monday, February 9, 2015 at 4:58:03 PM UTC-5, Andrew Gelman wrote:
Hi, there are a few things going on here.  First, we are currently preparing stan_lm, stan_glm, stan_lmer, etc., and these functions will be black boxes that can be treated as such,

I don't know if I would go quite that far, but I would agree that for stan_foo(), the default summarized output should look roughly like that for foo(), including substituting variable names so it reads like

Intercept ...
Age ...
...
sigma ...

instead of

beta[1] ...
beta[2] ...
...
sigma ...


They might not even be doing MCMC, so we definitely need something like display() in arm.
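
Roughly like this (a sketch; `fit` and the design matrix `X` are placeholders):

sm <- summary(fit)$summary                                    # assumes rstan is loaded; fit is hypothetical
rownames(sm)[grepl("^beta\\[", rownames(sm))] <- colnames(X)  # substitute variable names
print(round(sm[, c("mean", "sd")], 2))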

Ben

Andrew Gelman

Feb 9, 2015, 8:19:43 PM
to Ben Goodrich, stan...@googlegroups.com
On Feb 9, 2015, at 5:24 PM, Ben Goodrich <goodri...@gmail.com> wrote:

On Monday, February 9, 2015 at 4:33:02 PM UTC-5, Andrew Gelman wrote:
A plot is good for hard problems, to diagnose what’s wrong.  But I think we also need a simple number so that basic users can know to simulate longer or to ask an expert for help if there’s a problem.

Perhaps, a p-value for the null hypothesis that all of the chains have the same distribution?

No, not a p-value.  The distributions are different, the question is, how different are they?  R-hat answers that question.

Or maybe it’s fine to just use R.hat because, in practice, if R.hat is less than 1.1, the MCSE is probably so low that we don’t need to worry about it.  

The problem with relying on R hat has never been that it produces high values when it should not; the problem is that it can produce low values when the chains have not converged or when the CLT does not hold. That, and it is marginal and only sensitive to means and variances.

Then let’s beef it up with some other checks.  Rather than starting with the statistic and seeing what it does, let’s start with the goal—to give users a number they can use—and figure out how to get there.  Split R-hat was a step forward, but lets’ do more.

Where's the beef? I know you favor doing R-hat on the z-scores, but I think that is a limited step.

Let’s start with the examples you have where poor convergence is not diagnosed by split R-hat, and we can go from there.

I would agree with your goal of giving “a simple number so that basic users can know to simulate longer or to ask an expert for help if there’s a problem", but I don't understand why a p-value for the null hypothesis that the chains have the same distribution is not that number.

No, not a p-value.  That’s wrong for a zillion reasons.

If you reject that null hypothesis, simulate longer or ask an expert for help. If you fail to reject that null hypothesis, you should ostensibly look for other evidence, but I don't know what else to systematically look for when the test has power against any alternative where at least one chain comes from a different distribution.
 
Right now I’m leaning toward reporting the median as the estimate, and reporting the half-width of the central 68% interval as the standard error.  But I’m open to persuasion on this.

On one hand, it is a question of whether you prefer squared error loss or absolute value loss. 

I’m not really thinking in terms of a loss function, it’s more of a question about having a general summary.  In the applied world, people go with estimates and standard errors.

They can estimate the mean and standard deviation or they can estimate the median and use the scaled MAD (or similar) to estimate the standard deviation. But I don't know how you would choose between those options without saying what is to be avoided. If the thing to be avoided is providing estimates of moments that don’t exist, it seems reasonable to go with quantile-based things, but we’re going to have a real hard time convincing ourselves that the chains have converged in non-trivial problems when means and / or variances do not exist.

Again, back to the users:  I think it’s important to have an estimate and a s.e. for each parameter.  I think the median is a good choice.

Ben Goodrich

Feb 9, 2015, 9:44:57 PM
to stan...@googlegroups.com, gel...@stat.columbia.edu
On Monday, February 9, 2015 at 8:19:43 PM UTC-5, Andrew Gelman wrote:
On Feb 9, 2015, at 5:24 PM, Ben Goodrich <goodri...@gmail.com> wrote:

On Monday, February 9, 2015 at 4:33:02 PM UTC-5, Andrew Gelman wrote:
A plot is good for hard problems, to diagnose what’s wrong.  But I think we also need a simple number so that basic users can know to simulate longer or to ask an expert for help if there’s a problem.

Perhaps, a p-value for the null hypothesis that all of the chains have the same distribution?

No, not a p-value.  The distributions are different, the question is, how different are they?  R-hat answers that question.

Now we're getting somewhere. I disagree that the R-hat answers the question of how different the distributions are; it sort of answers the question of how different the means of the distributions are. It sort of answers the question of how much the scale could potentially be reduced, conditional on the chains having converged and the MCMC CLT holding.

I could easily produce (in fact, I had it in for a while and took it out) an R-hat-like statistic instead of a p-value using the formula

sqrt((N - 1) / N * W + 1 / N * B) / sqrt(W)

where W and B are distance-based instead of variance-based. And I would get a number between 1 and 1.01 essentially every time. And I would have no clue as to what constitutes a "small" value vs. a "large" value except to say how often it is larger than the corresponding value when the null hypothesis is true.

If users see a value of the R-hat-like statistic of 1.005, they are going to conclude it is safe to analyze when oftentimes it is not. If users see a p-value of 0.005, they are going to conclude it is not safe to analyze, when sometimes it is. But nothing much is lost by running the chains more.

Ben

Andrew Gelman

Feb 9, 2015, 10:17:42 PM
to Ben Goodrich, stan...@googlegroups.com
Ben
We can discuss.  But, no, the p-value is definitely not the right answer to the question.
A

Ben Goodrich

Feb 9, 2015, 10:28:07 PM
to stan...@googlegroups.com, gel...@stat.columbia.edu
On Monday, February 9, 2015 at 10:17:42 PM UTC-5, Andrew Gelman wrote:
We can discuss.  But, no, the p-value is definitely not the right answer to the question.

I've been discussing it, almost entirely with myself, for 50 weeks now

https://groups.google.com/d/msg/stan-dev/_SJBiCaUbt4/I_G38zlHGmoJ

with no progress toward resolving the fundamental dilemma:
  • Mike thinks my proposal will fail to detect non-convergence too often
  • You think my proposal will reject convergence too often
  • No one else says anything
So, the code doesn't get merged.

Ben

Andrew Gelman

Feb 9, 2015, 10:32:16 PM
to stan...@googlegroups.com
I’m up for doing something better than R-hat.  But not a p-value.

Ben Goodrich

Feb 9, 2015, 10:39:36 PM
to stan...@googlegroups.com, gel...@stat.columbia.edu
On Monday, February 9, 2015 at 10:32:16 PM UTC-5, Andrew Gelman wrote:
I’m up for doing something better than R-hat.  But not a p-value.

Then you need to tell me at what value a user should decide to run the chains longer or ask for help based on distance-based versions of

B / W

or


sqrt((N - 1) / N * W + 1 / N * B) / sqrt(W)

Ben


Andrew Gelman

Feb 9, 2015, 10:42:51 PM
to Ben Goodrich, stan...@googlegroups.com
What about a rule such as “keep going until MCSE/SD < 0.1 for all quantities of interest”?
I can believe that such a rule is flawed but if there are examples where it doesn’t work, we can look at these and maybe do better.
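In rstan terms the check would be something like this (a sketch; `fit` here is a hypothetical stanfit object):

s <- summary(fit)$summary              # assumes rstan is loaded; fit is hypothetical
all(s[, "se_mean"] / s[, "sd"] < 0.1)  # keep going until this is TRUE for all quantities of interest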
A

Jiqiang Guo

Feb 9, 2015, 10:43:50 PM
to stan...@googlegroups.com

I think the code can be merged if it's in R.  My opinion is that all of these diagnostics are only necessary conditions, and it's up to the user to decide. 

Jiqiang

Andrew Gelman

Feb 9, 2015, 10:48:42 PM
to stan...@googlegroups.com
It may be up to the user to decide, but I want us to decide what makes sense to us, because we need to decide what goes into the default display.

Ben Goodrich

Feb 9, 2015, 11:12:10 PM
to stan...@googlegroups.com, gel...@stat.columbia.edu
On Monday, February 9, 2015 at 10:42:51 PM UTC-5, Andrew Gelman wrote:
What about a rule such as “keep going until MCSE/SD < 0.1 for all quantities of interest”?
I can believe that such a rule is flawed but if there are examples where it doesn’t work, we can look at these and maybe do better.

Good examples where we know (to what extent) the chain has converged are hard to come by. However, section 5 of Roberts and Rosenthal (2011)

http://www.probability.ca/jeff/ftpdir/kerrie.pdf

shows that for independent MH sampling of a standard normal target with a normal proposal distribution, if the standard deviation of the proposal is 0.5, then the chain will have converged in total variation norm to within 0.01 by 8000 iterations.

Let's do the R:

stopifnot(require(parallel))
# independence Metropolis-Hastings for a standard normal target with a N(0, sigma) proposal
normal_is <- function(chain, sigma, sims, burnin = 0) {
  holder <- rep(NA_real_, sims - burnin)
  names(holder) <- "X"
  X <- rnorm(1)
  for (i in 1:sims) {
    Y <- rnorm(1, mean = 0, sd = sigma)
    r <- dnorm(Y) * dnorm(X, mean = 0, sd = sigma) /
         (dnorm(X) * dnorm(Y, mean = 0, sd = sigma))
    if (runif(1) <= r) X <- Y
    if (i > burnin) holder[i - burnin] <- X
  }
  return(holder)
}
RNGkind("L")   # partial match for "L'Ecuyer-CMRG", for reproducible parallel streams
set.seed(123)
posterior <- simplify2array(mclapply(1:16, mc.cores = 8L, normal_is,
                                     sigma = 0.5, sims = 8000))
dim(posterior) <- c(dim(posterior), 1L)   # iterations x chains x parameters
rstan::monitor(posterior, digits = 2)
plot(density(posterior), main = "", las = 1)
curve(dnorm(x), from = -3, to = 3, col = 2, lty = 2, add = TRUE)

The tails are still not quite right:
Inference for the input samples (16 chains: each with iter=8000; warmup=4000):

    mean se_mean   sd  2.5%   25%   50%  75% 97.5% n_eff Rhat
V1 -0.04    0.05 0.92 -1.88 -0.69 -0.03 0.62  1.67   405 1.03

but the Rhat is small and the MCSE / SD is less than 0.1. And you would reach the same conclusion with far fewer samples.

As R&R claim, the problem is that for sigma < 1, the chain is not geometrically ergodic so, as Mike would say, the MCSE calculation is not valid. But there seems to be no thinning value where I can fail to reject the null that the chains have the same distribution, which makes sense because they don't have the same distribution, not even if we just focus on the means

> colMeans(posterior)
              [,1]
 [1,]  0.095047920
 [2,]  0.033490434
 [3,]  0.028307452
 [4,] -0.121544967
 [5,]  0.032662167
 [6,]  0.007854844
 [7,]  0.158175078
 [8,] -0.029664244
 [9,]  0.001635766
[10,]  0.066116518
[11,]  0.094888171
[12,] -0.032819708
[13,] -0.062017140
[14,] -0.094099226
[15,]  0.113922192
[16,] -0.332537536

Ben

Andrew Gelman

Feb 9, 2015, 11:15:29 PM
to Ben Goodrich, stan...@googlegroups.com
Ben:
I’m confused.  Is this an example where MCSE/SD < 0.1 for all quantities of interest but convergence is poor?
If we want to play the “total variation norm” thing, that’s another story; then indeed we’d want to look at things a different way.  Perhaps, for example, there’s a way to estimate the total variation norm from the different sequences, trying each one out against the mixture of all the others?  Here, though, I’m talking about the more basic problem of getting good inferences for the quantities of interest.
A

Ben Goodrich

Feb 10, 2015, 12:45:14 AM
to stan...@googlegroups.com, gel...@stat.columbia.edu
On Monday, February 9, 2015 at 11:15:29 PM UTC-5, Andrew Gelman wrote:
I’m confused.  Is this an example where MCSE/SD < 0.1for all quantities of interest but convergence is poor?

Yes. And an example of R-hat being low when the convergence is poor but the posterior distribution is a univariate standard normal. And one of the few examples where there are lower bounds and upper bounds on the total variation norm. And one of the few examples where we can make the chain geometrically ergodic or not depending on a tuning parameter. And one of the many examples where the poor convergence and bad mixing are evident under my approach.
 
If we want to play the “total variation norm” thing, that’s another story, then indeed we’d want to look at things a different way.  Perhaps, for example, there’s a way to estimate total variation norm from the different sequences, trying each one out against the mixture of all the others?  

I don't know of any practical way to estimate the total variation norm to the posterior distribution empirically. But that wasn't my aim, and you can use the energy distance metric between distributions instead of the total variation norm metric. In this case and many others, you can ascertain that things are not good with energy statistics because there is no thinning amount for which you can get a not-small p-value for the null that the chains have the same distribution.
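
Concretely, I have in mind something along these lines, using the energy package's eqdist.etest (a sketch; `sims` is a stand-in for an iterations x chains matrix and the thinning interval comes from the dependence plot):

library(energy)
sims <- matrix(rnorm(8000), ncol = 8)        # stand-in: post-warmup draws, one column per chain
thin <- 10                                   # chosen so within-chain dependence looks negligible
idx <- seq(1, nrow(sims), by = thin)
pooled <- matrix(as.vector(sims[idx, ]), ncol = 1)   # thinned draws stacked chain after chain
eqdist.etest(pooled, sizes = rep(length(idx), ncol(sims)), R = 999)   # permutation test of equal distributions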
 
Here, though, I’m talking about the more basic problem of getting good inferences for the quantities of interest.

That isn't more basic. If the chains haven't converged to the posterior distribution, then you can't rely too heavily on your inferences for the quantities of interest. Sometimes it is okay if you really only care about point estimates and sometimes it ends badly. But since software writers can't know what quantities of interest users are calculating with the draws from the software, I think we have to try really hard to get the whole distribution right and be really conservative about convergence because nothing is lost if users just do more iterations or more chains.

It seems that you have the opposite perspective: that software should be liberal with convergence diagnostics because if users think that their chains don't converge with Stan in 30 seconds, then they will go use some other software, and even if the chains don't have the same distribution, if their centers are about in the same place, then the inferences they make won't be that bad.

I could see how that perspective would have merit in the BUGS era, but I don't think it is right for Stan. You can usually get Stan to fail to reject a null hypothesis that the chains have the same distribution if you really work hard to write a good .stan program. And if it takes rejecting the null hypothesis that the chains have the same distribution to get people to work hard at writing good .stan programs, so be it.

Ben

Andrew Gelman

Feb 10, 2015, 1:15:01 AM
to Ben Goodrich, stan...@googlegroups.com
Ben
I’m happy to run for more than 30 seconds if that’s what it takes.  I just don’t think it makes sense to require Stan users to become experts on convergence diagnostics.  Think of it this way:
- Sometimes we’re fitting easy models such as linear regression, glms, etc.  Stan should be able to do these as automatically as can be done using lm, glm, etc.  Actually the Stan fit should be more stable because of issues such as separation which we can handle easily with priors.
- Sometimes we fit models like lmer/glmer which appear in R to be easy to fit, but in practice have stability problems that the user doesn’t see.  Stan should be able to fit these, again better than lmer etc.
- Sometimes we have tricky new models such as that a*exp(-b*t) + c model from my earlier email.  In that case the user has to be prepared for red flags.  Red flags are ok.  But I’d rather give red flags (with messages such as “Try running longer or simplify your model or talk to an expert”) than give tons of numbers with the expectation that the user will have Betancourt-level understanding.
- Beyond all this, this conversation is reminding me of the huge importance of fake-data checking and posterior predictive checking, which I think will catch a lot of the serious problems in a useful way.
A

Ben Goodrich

Feb 10, 2015, 1:44:36 AM
to stan...@googlegroups.com, gel...@stat.columbia.edu
On Tuesday, February 10, 2015 at 1:15:01 AM UTC-5, Andrew Gelman wrote:
I’m happy to run for more than 30 seconds if that’s what it takes.  I just don’t think it makes sense to require Stan users to become experts on convergence diagnostics.  

They need to know that they shouldn't be making strong inferences if there is evidence of non-convergence. They need to understand that MCMC can be highly dependent within-chain. If they can pick a thinning interval from a plot that makes the dependence negligible and read an ANOVA-like table of output, they can reject or fail to reject a null hypothesis that all the chains have the same distribution. I don't think that entails the user being an expert on convergence diagnostics.

If they are doing MCMC for models that are more complicated than what can easily be done by ML, then I'm sure they will run into cases where they reject the null of convergence. At that point, they probably do need to write a better .stan program (avoiding gamma(0,0) priors, doing Matt tricks, decomposing covariance matrices into standard deviations and correlation matrices, etc). Maybe that entails the user being something of an expert on Stan programming but that is a good thing.
 
Think of it this way:
- Sometimes we’re fitting easy models such as linear regression, glms, etc.  Stan should be able to do these as automatically as can be done using lm, glm, etc.  Actually the Stan fit should be more stable because of issues such as separation which we can handle easily with priors.

I doubt an easy model would have any difficulty failing to reject a null hypothesis that all the chains have the same distribution. Most of the BUGS examples fall in this category.
 
- Sometimes we fit models like lmer/glmer which appear in R to be easy to fit, but in practice have stability problems that the user doesn’t see.  Stan should be able to fit these, again better than lmer etc.
 
- Sometimes we have tricky new models such as that a*exp(-b*t) + c model from my earlier email.  In that case the user has to be prepared for red flags.  Red flags are ok.  But I’d rather give red flags (with messages such as “Try running longer or simplify your model or talk to an expert”) than give tons of numbers with the expectation that the user will have Betancourt-level understanding.

Right now, we have one R-hat per margin, which in non-simple models makes it virtually impossible to say anything useful about whether the chain has converged to the joint posterior distribution, much less the ensemble of chains. We need to get convergence diagnostics down to a few numbers by default and the only way I can imagine doing that is with scalar functions of the ensemble.
 
- Beyond all this, this conversation is reminding me of the huge importance of fake-data checking and posterior predictive checking, which I think will catch a lot of the serious problems in a useful way.

We need to get users to do that anyway, and since the posterior predictive distribution is a function of everything in the model that is a good place to look for differences across chains or over iterations. It is harder to automate though and not going to discriminate much with weakly identified posteriors or chains that fail to get into the tails enough.

Ben

Michael Betancourt

Feb 10, 2015, 7:34:09 AM
to stan...@googlegroups.com
Firstly, Ben, I implore you to start writing these proposed algorithms down in TeX
somewhere so that we can understand your implementations without having to appeal
to software archeology. The same problem has come up with LKJ priors — we really
need technical reports for these things to improve communication both internally and
externally.

That said, let me summarize my perspective on this matter.

--------------------------------------------------------------------------------
— Generalizing R_hat by moving from ANOVA to DISCO —


— Let’s Review R_hat —

TL;DR Assuming the MC-CLT, R_hat is a natural diagnostic for MCMC. It’s just
not one that we can ever calibrate.

Recall that R_hat is a transformed p-value arising from an ANOVA
test on realizations of random variables partitioned into different
groups. In particular, the p-value is transformed such that p = 0 (rejecting the null)
corresponds to R_hat = infinity and p = 1 (accepting the null) corresponds to R_hat = 1.

The null hypothesis in ANOVA is that the random variables in each group are
distributed according to the same Gaussian distribution, while the alternative
hypothesis is that the random variables are distributed as Gaussians in each
group but with differing means and a common variance.

One of the reasons why R_hat works so well for MCMC is that the null hypothesis
is actually valid conditioned on the MC-CLT. If the MC-CLT holds then once
we control for the MCSE the distribution for the random variable in any subsequence
of a Markov chain is the same Gaussian distribution.

The alternative hypothesis is another story. For the assumed ANOVA alternative
hypothesis to hold we’d have to have each chain converged to different Gaussians
with the same variances (this could conceivably happen in multimodal distributions
but it’s not generic). Without a pre-determined alternative hypothesis we can’t
ever really compute the power of the test, which means we don’t have an a priori
calibration of R_hat. This is why we can’t set some threshold like R_hat > 1.01
is always bad and instead have to resort to empirical intuition. Formally, we can’t
even assume that R_hat is consistent (it may not reject the null if the null is
false, even with infinite data).

— DISCO —

TL;DR The DISCO test would generalize R_hat to be more sensitive to the
entire distribution of each random variable, which may or may not be a good
thing.

DISCO is a generalization of ANOVA that uses different metrics besides the
usual Euclidean squared-difference metric, in particular D(x, y) = || x - y ||^{alpha}.
The important feature of this generalization is given in Theorem 1 of
http://projecteuclid.org/euclid.aoas/1280842151. While alpha = 2 (which is the
same thing as ANOVA) is sensitive to only the means of the random variable
being tested, 0 < alpha < 2 is sensitive to the entire distribution of the random
variable being tested. This certainly sounds appealing, but I argue below that
the additional sensitivity is largely irrelevant.
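
For intuition, in the univariate two-sample case the statistic in question is roughly the following (sketch):

e_stat <- function(x, y, alpha = 1) {
  d <- function(a, b) mean(abs(outer(a, b, "-"))^alpha)
  n <- length(x); m <- length(y)
  (n * m / (n + m)) * (2 * d(x, y) - d(x, x) - d(y, y))
}
# with alpha = 2 this reduces to a function of the two means alone;
# with 0 < alpha < 2 it responds to any difference in distribution
e_stat(rnorm(500), rnorm(500, mean = 0.3))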

One of the difficulties with this DISCO test is in constructing the proper significance
for the null hypothesis that the distributions being compared are exactly equal.
The authors of the above paper take a non-parametric approach with a
permutation test, but my understanding is that permutation tests are
computationally intense and they can be sensitive to the number of permutations
applied. Like R_hat, there is no explicit alternative hypothesis so we can’t
calibrate any resulting statistic.

So why wouldn’t we want to be sensitive to the entire distribution of a
random variable?

————————————————————————————————————————
—“Convergence” and what we want in a practical convergence diagnostic —

TL;DR Practical convergence depends on the chosen random variable.
In practice the only robust way to use MCMC is to check R_hat for
every Monte Carlo estimator constructed.

A critical concept we need to keep in mind here is that in MCMC there is no
such thing as “pre-convergence” and “post-convergence”. There is only
convergence. The total variation distance between the Markov chain and
the target distribution vanishes only in the limit of an infinite number of
transitions — in practice they will _never_ be equal.

When the Markov chain is geometrically ergodic, however, we know
something about _how_ the chain converges. Intuitively the Markov
chain first captures the gross features of the target distribution and
then eventually captures finer and finer details. If we only need
gross features of a distribution then we can stop early, but if we
look for exponentially finer details then we’ll have to wait exponentially
longer for the Markov chain to capture them.

Of course, in practice we don’t care about every little detail of the target
distribution. Given some level of accuracy we’d be perfectly fine
with the approximation at some point — which is where the intuition
of “pre-convergence” and “post-convergence” and warmup and sampling
all come from.

But how do we define our desired level of accuracy? ___We do it by
selecting a function whose expectation we want to estimate___.
In order to estimate the expectation of some wildly-varying
function we need a pretty accurate Markov chain, but in order to
estimate the expectation of some relatively smooth function we can
get away with only a mildly accurate chain. For example,
a constant function can be estimated accurately no matter how
long the chain has been running!

What all of this means is that practical convergence depends on
the random variables, i.e. the functions whose expectations we
want to estimate. This is why I don’t think the added sensitivity of
the DISCO test is particularly useful — it requires more convergence
than necessary for a given application and can necessitate
running for much, much longer than necessary. Any diagnostic
that tries to be sensitive to the convergence of the chain itself
will suffer from this problem.

Ultimately we’ll never be able to guarantee that the samples
generated by Stan can be used to construct good estimators
for _any_ random variable. The best we can do is offer a box
that takes in the samples and a desired random variable and
spits out the Monte Carlo estimator, an estimate of its standard
error, and the corresponding R_hat. Hey, that’s exactly what
bin/print does right now!

Now diagnosing geometric ergodicity is another matter altogether...

Ben Goodrich

Feb 10, 2015, 11:42:00 AM
to stan...@googlegroups.com
On Tuesday, February 10, 2015 at 7:34:09 AM UTC-5, Michael Betancourt wrote:
Firstly, Ben I implore you to start writing these proposed algorithms down in TeX
somewhere so that we can understand your implementations without having to appeal
to software archeology.  The same problem has come up with LKJ priors — we really
need technical reports for these things to improve communication both internally and
externally.

For LKJ, there isn't much to add that wasn't implied in the LKJ (2009) paper but that paper is not super-easy to follow and some stuff was less than explicit. For this convergence thing, I totally agree, particularly with the thinning step that doesn't arise in the DISCO paper. I started to write over winter break but then the semester started.
 
That said, let me summarize my perspective on this matter.

— Let’s Review R_hat --

TL;DR Assuming the MC-CLT, R_hat is a natural diagnostic for MCMC.  It’s just
not one that we can ever calibrate.

Agreed on R_hat
 
— DISCO —

One of the difficulties with this DISCO test is in constructing the proper significance
for the null hypothesis that the distributions being compared are exactly equal.
The authors of the above paper take a non-parametric approach with a
permutation test, but my understanding is that permutation tests are
computationally intense and they can be sensitive to the number of permutations

I haven't noticed sensitivity to the number of permutations although it is something to look into. But it isn't computationally intensive relative to MCMC. Yes, in toy problems it is a bit of an overkill and for people who insist on keeping a million iterations post-warmup it will exhaust their RAM. But for a couple thousand iterations, it only takes a few seconds and the flop count doesn't depend heavily on the number of parameters, so not that big a deal.
 
Like R_hat, there is no explicit alternative hypothesis so we can’t
calibrate any resulting statistic.

But unlike R_hat or anything DISCO with alpha = 2, the test is consistent against all alternatives. So, we would have to get hand-wavy about the number of chains and thinned iterations you need in practice, but we're good in principle. Oversensitivity when the null is only trivially false is the bigger concern, I think.
 
So why wouldn’t we want to be sensitive to the entire distribution of a
random variable?

————————————————————————————————————————
—“Convergence” and what we want in a practical convergence diagnostic —

TL;DR Practical convergence depends on the chosen random variable.
In practice the only robust way to use MCMC is to check R_hat for
every Monte Carlo estimator constructed.

Here is where I think we have different criteria for a practical convergence diagnostic. From my perspective, one of the biggest problems with calculations on the marginals is that either
  1. People observe that the R_hat for their function of interest is near 1.0 and put little to no weight on the other margins, or
  2. People remember that their function of interest is not independent of the other margins and then fall into an undecidable situation where most of the R_hats are near 1.0 but some are kind of iffy.

This is especially a problem with Stan because Stan enables more ambitious hierarchical models and encourages introducing a lot of extra parameters through Matt tricks and matrix decompositions.

So, I think the workflow should be: First, convince yourself that all the chains have converged using functions of the joint distribution. Second, look at functions (n_eff, traceplots, MCSE, etc.) of the quantities that are of interest to see if you have enough detail for what you want to do and then ignore the rest of the posterior distribution. Once the first step is completed, I think we are agreeing.

What all of this means is that practical convergence depends on
the random variables, i.e. the functions whose expectations we
want to estimate.  This is why I don’t think the added sensitivity of
the DISCO test is particularly useful — it requires more convergence
than necessary for a given application and can necessitate
running for much, much longer than necessary.  

Okay, that is closer to Andrew's position I think. And it does seem to require more chains than the default of 4, so how long it takes depends a lot on how much parallel capacity the user has. In response, I would say that we need to be conservative about giving users the impression of convergence and that there is little cost to users running their chains longer or asking for help reparameterizing. Usually, even with non-trivial models, you can fail to reject the null that all the chains have the same distribution if you have a good .stan program and non-BUGS priors.

Andrew seems to want some convergence diagnostic that works reasonably reliably for a more limited class of models, one that is overrepresented among people who are just starting to get into Bayesian analysis. For something like rstanarm where we write Bayesian analogues of popular models that would otherwise be estimated by ML, convergence might indeed not be a big issue much of the time. But I haven't seen the DISCO test rejecting the null in situations like those.

Now diagnosing geometric ergodicity is another matter altogether...

Conceptually, yes. But this is where I am surprised that we disagree. The added sensitivity of the DISCO test is really useful for identifying when it is seemingly impossible to get good draws from the current parameterization. In the situations that I know of --- like Roberts and Rosenthal (2011), where we can control whether the independent MH sampler is or is not geometrically ergodic, or things like the funnel where there are obviously good and bad Stan reparameterizations --- in the unfavorable cases you notice that there is no thinning interval where you can fail to reject the null that all the chains have the same distribution. You have a plot in front of you that casts doubt on alpha-mixing. You can see the dependence lines bounce when there are Hamiltonian divergences. True, it doesn't tell you that the fundamental problem is lack of geometric ergodicity, but it tells you clearly that there is a fundamental problem.

Ben

Michael Betancourt

Feb 10, 2015, 12:28:10 PM
to stan...@googlegroups.com
For LKJ, there isn't much to add that wasn't implied in the LKJ (2009) paper but that paper is not super-easy to follow and some stuff was less than explicit.

Which is why a readable paper is desperately needed.  Especially one discussing
the partial correlations interpretation of the unconstrained parameterization.

I haven't noticed sensitivity to the number of permutations although it is something to look into. But it isn't computationally intensive relative to MCMC. Yes, in toy problems it is a bit of an overkill and for people who insist on keeping a million iterations post-warmup it will exhaust their RAM. But for a couple thousand iterations, it only takes a few seconds and the flop count doesn't depend heavily on the number of parameters, so not that big a deal.

Sure, but it’ll take some substantial empirical familiarity before we can comfortably
release some diagnostic with a default set of parameters.

One of the things I like about the R_hat null is that it’s true if the MC-CLT is true,
which means that’s a necessary but not sufficient condition for Monte Carlo
estimators to be well-behaved and “necessary but not sufficient” is something
that can be communicated to a general audience.

The necessity of the DISCO criterion, on the other hand, depends on the tuning
of the permutations relative to the target distribution.  I know it’s not a show stopper,
but this kind of sensitivity always bothers me.

But unlike R_hat or anything DISCO with alpha = 2, the test is consistent against all alternatives. So, we would have to get hand-wavy about the number of chains and thinned iterations you need in practice, but we're good in principle. Oversensitivity when the null is only trivially false is the bigger concern, I think.

Consistent provided second moments exist so there’s still the concern about heavy tails.
The thinning also bothers me as a uniformly-thinned chain should behave the same as
the full chain provided it’s geometrically ergodic.

Here is where I think we have different criteria for a practical convergence diagnostic. From my perspective, one of the biggest problems with calculations on the marginals is that either
  1. People observe that the R_hat for their function of interest is near 1.0 and but little to no weight on the other margins or
  2. People remember that their function of interest is not independent of the other margins and then fall into an undecideable situation where most of the R_hats are near 1.0 but some are kind of iffy

This is especially a problem with Stan because Stan enables more ambituous hierarchical models and encourages introducing a lot of extra parameters through Matt tricks and matrix decompositions.

The point I was trying to make is that if R_hat for the desired function is near 1 then it doesn’t
matter what the R_hat for the other functions is!  The utility of the samples depends on the
properties of the function whose expectations we want to take and not on all possible functions
whose expectation we could have taken.  In particular, for any finite number of samples there is 
no such thing as “joint convergence”.

See below for more on (2).

So, I think the workflow should be: First, convince yourself that all the chains have converged using functions of the joint distribution. Second, look at functions (n_eff, traceplots, MCSE, etc.) of the quantities that are of interest to see if you have enough detail for what you want to do and then ignore the rest of the posterior distribution. Once the first step is completed, I think we are agreeing.

What all of this means is that practical convergence depends on
the random variables, i.e. the functions whose expectations we
want to estimate.  This is why I don’t think the added sensitivity of
the DISCO test is particularly useful — it requires more convergence
than necessary for a given application and can necessitate
running for much, much longer than necessary.  

Okay, that is closer to Andrew's position I think. And it does seem to require more chains than the default of 4, so how long it takes depends a lot on how much parallel capacity the user has.

Huh?  Where did parallel chains come in?  Splitting a few chains
should be sufficient.

Conceptually, yes. But this is where I am surprised that we disagree. The added sensitivity of the DISCO test is really useful for identifying when it is seemingly impossible to get good draws from the current parameterization. In the situations that I know of --- like Rosenthal and Roberts (2011) where we can control whether the indepenent MH sampler is or is not geometrically ergodic or things like the funnel where there are obviously good and bad Stan reparameterizations --- in the unfavorable cases you notice that there is no thinning interval where you can fail to reject the null that all the chains have the same distribution. You have a plot in front of you that casts doubt on alpha-mixing. You can see the dependence lines bounce when there is Hamiltonians divergence. True, it doesn't tell you that the fundamental problem is lack of geometric ergodicity, but it tells you clearly that there is a fundamental problem.

I think one source of confusion here is that we’re confounding two problems.

Firstly, there is the question: if we could compute expectations exactly, could we
diagnose convergence of a given Monte Carlo estimator?  This is
the question I was addressing in my previous email.

Then there is the more insidious question of what to do when we don’t have
exact expectations for diagnosis but have to use Monte Carlo estimators
to test Monte Carlo estimators.  For example, a multimodal model may
be geometrically ergodic with respect to a sampler (like Random Walk
Metropolis in lots of cases!) but Monte Carlo estimates for finite sample
size will be misleading because the chain takes _forever_ to switch
between modes.  Same thing with latent parameters in a hierarchical
model.  For a short enough period of time we’re really computing
expectations conditional on the slow parameters which is really, really bad.
I agree that this can be a serious problem (and one we’ve encountered
in Stan) but it is not a question of _convergence_ but rather of _mixing_.

If we change the goal here and talk about using DISCO estimators to
diagnose poor mixing then I am much less hesitant, although I think
more work has to be done to determine where and when it works
and doesn’t work.


Ben Goodrich

unread,
Feb 10, 2015, 2:26:53 PM2/10/15
to stan...@googlegroups.com
On Tuesday, February 10, 2015 at 12:28:10 PM UTC-5, Michael Betancourt wrote:
But unlike R_hat or anything DISCO with alpha = 2, the test is consistent against all alternatives. So, we would have to get hand-wavy about the number of chains and thinned iterations you need in practice, but we're good in principle. Oversensitivity when the null is only trivially false is the bigger concern, I think.

Consistent provided second moments exist so there’s still the concern about heavy tails.

In the remark after the consistency result, they claim it would continue to hold for any alpha < epsilon / 2 where the epsilon-moment exists, although it would be difficult empirically in the heavy tailed case to recognize at what epsilon the epsilon-moment exists.
 
The thinning also bothers me as a uniformly-thinned chain should behave the same as
the full chain provided it's geometrically ergodic.

The thinning is only necessary for the permutation test since the permutation test presumes exchangeability. That actually seems to work favorably because you reject the null if you don't thin enough. But the unthinned chain should still be used for subsequent inference.
 
Here is where I think we have different criteria for a practical convergence diagnostic. From my perspective, one of the biggest problems with calculations on the marginals is that either
  1. People observe that the R_hat for their function of interest is near 1.0 and put little to no weight on the other margins, or
  2. People remember that their function of interest is not independent of the other margins and then fall into an undecidable situation where most of the R_hats are near 1.0 but some are kind of iffy

This is especially a problem with Stan because Stan enables more ambitious hierarchical models and encourages introducing a lot of extra parameters through Matt tricks and matrix decompositions.

The point I was trying to make is that if R_hat for the desired function is near 1 then it doesn’t
matter what the R_hat for the other functions is!  

I meant the other margins of the posterior distribution, rather than hypothetical functions that could have been calculated but are not of interest. So, you might have a big Cholesky factor of a correlation matrix declared in your parameters block that you don't really care about, except insofar as something in the generated quantities block depends on it.
 
So, I think the workflow should be: First, convince yourself that all the chains have converged using functions of the joint distribution. Second, look at functions (n_eff, traceplots, MCSE, etc.) of the quantities that are of interest to see if you have enough detail for what you want to do and then ignore the rest of the posterior distribution. Once the first step is completed, I think we are agreeing.
What all of this means is that practical convergence depends on
the random variables, i.e. the functions whose expectations we
want to estimate.  This is why I don’t think the added sensitivity of
the DISCO test is particularly useful — it requires more convergence
than necessary for a given application and can necessitate
running for much, much longer than necessary.  

Okay, that is closer to Andrew's position I think. And it does seem to require more chains than the default of 4, so how long it takes depends a lot on how much parallel capacity the user has.

Huh?  Where did parallel chains come in?  Splitting a few chains
should be sufficient.

I just meant running in parallel reduces the wall time. You can do the chains serially and the diagnostic on one core, but it takes longer. By default, it is doing a two-way ANOVA-like test stratifying by chain, by half, and by their interaction.
Conceptually, yes. But this is where I am surprised that we disagree. The added sensitivity of the DISCO test is really useful for identifying when it is seemingly impossible to get good draws from the current parameterization. In the situations that I know of --- like Roberts and Rosenthal (2011) where we can control whether the independent MH sampler is or is not geometrically ergodic or things like the funnel where there are obviously good and bad Stan reparameterizations --- in the unfavorable cases you notice that there is no thinning interval where you can fail to reject the null that all the chains have the same distribution. You have a plot in front of you that casts doubt on alpha-mixing. You can see the dependence lines bounce when there are Hamiltonian divergences. True, it doesn't tell you that the fundamental problem is lack of geometric ergodicity, but it tells you clearly that there is a fundamental problem.

I think one source of confusion here is that we’re confounding two problems.

Firstly, there is the question: if we could compute expectations exactly, could we
diagnose convergence of a given Monte Carlo estimator?  This is
the question I was addressing in my previous email.

I think the answer to this question is "yes, under the null" but what precisely happens when the null is false is a bit up in the air.
 
Then there is the more insidious question of what to do when we don’t have
exact expectations for diagnosis but have to use Monte Carlo estimators
to test Monte Carlo estimators.  For example, a multimodal model may
be geometrically ergodic with respect to a sampler (like Random Walk
Metropolis in lots of cases!) but Monte Carlo estimates for finite sample
size will be misleading because the chain takes _forever_ to switch
between modes.  Same thing with latent parameters in a hierarchical
model.  For a short enough period of time we’re really computing
expectations conditional on the slow parameters which is really, really bad.
I agree that this can be a serious problem (and one we’ve encountered
in Stan) but it is not a question of _convergence_ but rather of _mixing_.

If we change the goal here and talk about using DISCO estimators to
diagnose poor mixing then I am much less hesitant, although I think
more work has to be done to determine where and when it works
and doesn’t work.

 They are definitely related questions. I would say the plot that I make in the first step is a mixing diagnostic in the sense that you have to find a thin interval to do the subsequent test. And if a Metropolis-based chain is geometrically ergodic, then it is alpha mixing and you should be able to find such a thin interval. When it doesn't look like you can or you think you have found one but the subsequent test rejects the null that all the chains have the same distribution, you know to be cautious. Maybe there is nothing that can be done in terms of reparameterizations etc. that can salvage the situation given the limitations of the sampler, but at least you would know if you do make inferences that you have to confine them to the center of fairly symmetric distributions. The permutation test of the null hypothesis that the chains all have the same joint distribution seems like an easier thing to explain as a convergence diagnostic, even if it is the case that you reject the null due to lack of exchangeability rather than a major difference in the distribution for at least one chain.

Ben

Michael Betancourt

unread,
Feb 10, 2015, 4:41:54 PM2/10/15
to stan...@googlegroups.com
The thinning is only necessary for the permutation test since the permutation test presumes exchangeability. That actually seems to work favorably because you reject the null if you don't thin enough. But the unthinned chain should still be used for subsequent inference.

You mean you thin until you have effectively independent samples?
What guarantees are there that the significance is accurate (or biased
the right way) when you don’t thin enough?  In other words, the test
won’t falsely reject until you thin to independent samples?

But isn’t the proper thinning exactly what we don’t know when the
mixing is poor?  

I meant the other margins of the posterior distribution, rather than hypothetical functions that could have been calculated but are not of interest. So, you might have a big Cholesky factor of a correlation matrix declared in your parameters block that you don't really care about, except insofar as something in the generated quantities block depends on it. 

Yeah, but if there’s some generated quantity that you want to estimate
then you just compute the R_hat of that, which incorporates all of the
necessary correlations, for validation.  That’s all I’m saying.

 They are definitely related questions. I would say the plot that I make in the first step is a mixing diagnostic in the sense that you have to find a thin interval to do the subsequent test. And if a Metropolis-based chain is geometrically ergodic, then it is alpha mixing and you should be able to find such a thin interval. When it doesn't look like you can or you think you have found one but the subsequent test rejects the null that all the chains have the same distribution, you know to be cautious. Maybe there is nothing that can be done in terms of reparameterizations etc. that can salvage the situation given the limitations of the sampler, but at least you would know if you do make inferences that you have to confine them to the center of fairly symmetric distributions. The permutation test of the null hypothesis that the chains all have the same joint distribution seems like an easier thing to explain as a convergence diagnostic, even if it is the case that you reject the null due to lack of exchangeability rather than a major difference in the distribution for at least one chain.

I just think we have to figure out something theoretically well-motivated
before worrying about what to sell to the users.  Again, I think that
having the same joint distribution is nice but in practice that may be
too much of an ideal when all MCMC is trying to estimate are marginal
expectations.  

One immediate compromise is to have some outlier detection, whether it
be graphical or numerical, for the R_hat and ESS distribution over all
of the nominal parameters in order to identify poor mixing.
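
Something like this is all I have in mind (a rough sketch, assuming an RStan fit object `fit` and its usual summary columns; the thresholds below are placeholders, not recommendations):

    s     <- summary(fit)$summary            # rows are parameters, columns include n_eff and Rhat
    n_eff <- s[, "n_eff"]
    rhat  <- s[, "Rhat"]
    # crude outlier rules: ESS far below the bulk of the other parameters, or Rhat drifting up
    flag  <- (n_eff < 0.25 * median(n_eff)) | (rhat > 1.1)
    if (any(flag)) print(s[flag, c("n_eff", "Rhat")]) else message("no obvious stragglers")

A histogram of n_eff over all parameters would do just as well as the numerical rule; the point is only to surface the slow margins automatically.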

Bob Carpenter

unread,
Feb 10, 2015, 10:52:50 PM2/10/15
to stan...@googlegroups.com
In Ben's defense, he did write all of his LKJ stuff in LaTeX.
I think that was well before Michael joined the team, so it
went into the code and doc. Then it all got written up again in
his joint paper.

Is there something else that needs to go into the doc that's
not there?

- Bob

Bob Carpenter

unread,
Feb 10, 2015, 10:57:49 PM2/10/15
to stan...@googlegroups.com
I don't feel like I understand the issue deeply enough to
add anything other than noise. For instance, I have no idea
why Andrew's objecting to p-values.

But my silence shouldn't be taken as saying I don't think
it should go in. The only downside to adding more code is the
usual one --- it makes all our work in the future harder because we
have to doc and maintain it. And it can confuse users when there
are oodles of options for them to choose among.

But if it's important to one of core developers, I think we should
err on the side of inclusiveness (not of core developers, but of
their code).

I'm torn on Andrew's dumbing down of the default print to
"no signal, assume all is well" vs. "Danger, Will Robinson".
I see the motivation for that, but I agree with Michael that
it'll give users a false sense of security.

The way things are set up now, it'd have to be implemented
in all the interfaces anyway. That may change fairly soon if
the common/command/services or whatever it's now called gets
beefed up to where it can handle RStan and PyStan.

I think it should be up to you and Jiqiang (with input from
Andrew) as to what goes into RStan.


- Bob






Andrew Gelman

unread,
Feb 10, 2015, 11:02:59 PM2/10/15
to stan...@googlegroups.com
I’m objecting to p-values for good reasons that I’m just too exhausted to put into email. But, believe me, these are very good reasons.

Michael Betancourt

unread,
Feb 11, 2015, 6:40:31 AM2/11/15
to stan...@googlegroups.com
> In Ben's defense, he did write all of his LKJ stuff in LaTeX.
> I think that was well before Michael joined the team, so it
> went into the code and doc. Then it all got written up again in
> his joint paper.
>
> Is there something else that needs to go into the doc that's
> not there?

I think the doc is fine for Stan, but there’s all kinds of interesting
structure (in particular, _why_ is the LKJ prior so nice?) that
isn’t there nor readily accessible in one place. A short paper
reviewing the transformation, the Jacobian-as-distribution prior,
and the rich statistical interpretation of the transformation
(partial correlations, etc) would be huge in motivating the
statistical community at large.

Ben Goodrich

unread,
Feb 12, 2015, 1:25:51 AM2/12/15
to stan...@googlegroups.com
On Tuesday, February 10, 2015 at 4:41:54 PM UTC-5, Michael Betancourt wrote:
The thinning is only necessary for the permutation test since the permutation test presumes exchangeability. That actually seems to work favorably because you reject the null if you don't thin enough. But the unthinned chain should still be used for subsequent inference.

You mean you thin until you have effectively independent samples?

Yes, but only for testing, not for inference.
 
What guarantees are there that the significance is accurate (or biased
the right way) when you don’t thin enough?  In other words, the test
won’t falsely reject until you thin to independent samples?

With the permutation test, it estimates the energy distance of all the (thinned) draws to all the other thinned draws and then decomposes that sum ANOVA-style into the "between" parts that are attributable to
  • Different chains
  • Across all chains, the draws obtained before halftime and the draws obtained after halftime
  • The interaction of those two things (split half-chains basically)

and the "within" (half-chains) part. So, you get this B / W thing that is supposed to be near 1 but you don't really know how much above 1 is a problem.

For essentially the same reasons as in the R_hat case, if the chains are not thinned enough, then the W is less than the B. Basically, the distance between consecutive draws of the same chain is expected to be smaller than the distance between draws from different chains.

And then it starts doing permutations where it basically randomly reassigns draws to chains in arbitrary order. So, adjacent permuted draws have a very low probability of having any dependence; they probably are not from the same chain and even if they are, they are likely to be many thinned iterations apart. Anyway, it calculates a distribution of B / W across a specified number of permutations and sees how far into the tail of that distribution the real B / W falls to get the p-value.

Thus, you can get a "false" rejection if the chains do in fact have the same distribution, but were not thinned enough to make the draws virtually exchangeable. And that is observationally equivalent to a "true" rejection where the chains do not all have the same distribution but are essentially exchangeable within chains after thinning. In the first case, you can thin a bit more and try the test again, which might invite a slight multiple testing problem if you don't rerun the chains but probably is not so bad. In the second case, you need to run the chains longer or reparameterize. Either way, you know that you are not out of the woods yet. Presumably, it is better to err on the side of overthinning based on the initial plot. 
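
In code the logic is something like this (base R, one-way only, chains without the half-chain and interaction strata, so it is a sketch of the idea rather than the actual implementation; `draws` is an iterations x chains x parameters array of already-thinned draws, e.g. from rstan::extract(fit, permuted = FALSE)):

    bw_energy_test <- function(draws, alpha = 1, n_perm = 200) {
      n_iter  <- dim(draws)[1]
      n_chain <- dim(draws)[2]
      pooled  <- do.call(rbind, lapply(seq_len(n_chain),
                         function(j) matrix(draws[, j, ], nrow = n_iter)))
      labels  <- rep(seq_len(n_chain), each = n_iter)
      D <- as.matrix(dist(pooled))^alpha        # all pairwise Euclidean distances, raised to alpha
      ratio <- function(lab) {
        # mean within-chain distance (zero diagonal included; the permutations share the convention)
        W <- mean(sapply(seq_len(n_chain), function(j) mean(D[lab == j, lab == j])))
        B <- mean(D[outer(lab, lab, "!=")])     # mean distance between draws from different chains
        B / W
      }
      observed <- ratio(labels)
      perm     <- replicate(n_perm, ratio(sample(labels)))  # scramble the chain labels
      list(B_over_W = observed, p_value = mean(perm >= observed))
    }

The real procedure also weights by group sizes and stratifies by half-chains and their interaction, but the permutation logic is the same: thin first, then compare the realized B / W to its distribution under label scrambling.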
 
But isn’t the proper thinning exactly what we don’t know when the
mixing is poor? 

When the mixing is poor, it does seem more difficult to discern the thinning interval precisely but it is evident the thinning interval is not small. By that I mean you are deciding whether to thin every 20th iteration or every 24th, instead of some single-digit number that tends to be adequate in the good-mixing cases. An extreme example was this Gibbs sampler vs. NUTS, where the Gibbs sampler couldn't get near exchangeable even after 600 lags and NUTS was near exchangeable in single-digit lags (and then failed to reject the null).
 
I meant the other margins of the posterior distribution, rather than hypothetical functions that could have been calculated but are not of interest. So, you might have a big Cholesky factor of a correlation matrix declared in your parameters block that you don't really care about, except insofar as something in the generated quantities block depends on it. 

Yeah, but if there’s some generated quantity that you want to estimate
then you just compute the R_hat of that, which incorporates all of the
necessary correlations, for validation.  That’s all I’m saying.

But if anything upstream of that generated quantity doesn't adequately get to the tails by the time you stop, then your generated quantity can look better than it should. Again, it's both a mixing thing and a convergence thing, but this was the paper

http://www.artsci.wustl.edu/~jgill/papers/converge_gill.pdf

that really convinced me that we needed to think harder about multivariate convergence diagnostics.

Ben

Ben Goodrich

unread,
Feb 12, 2015, 1:42:49 AM2/12/15
to stan...@googlegroups.com
On Tuesday, February 10, 2015 at 10:57:49 PM UTC-5, Bob Carpenter wrote:
But if it's important to one of core developers, I think we should
err on the side of inclusiveness (not of core developers, but of
their code).

It's pretty important to me that code one developer favors does not get merged when there is a lot of opposition from other developers. We don't want a systemd situation in Stan or any other open source project for that matter.

But the reality of it is, until Andrew tells users to use a different convergence diagnostic, they are going to keep using R_hat. And all Andrew is saying about my proposal is that he won't accept anything involving p-values, for reasons which, if they are the same reasons given in three editions of BDA, I agreed with when considering BUGS but don't think are persuasive for Stan. Hypothesizing that the chains have the same distribution under the default settings is a reasonable null iff you have a good .stan program (and you typically fail to reject that null). It is a stricter test and would generate a lot of noise on stan-users if Stan users adopted it, but it would be a learning opportunity for them to write their .stan programs better and to get rid of their gamma(0.001, 0.001) priors and whatnot.

Ben


Michael Betancourt

unread,
Feb 13, 2015, 4:52:54 AM2/13/15
to stan...@googlegroups.com
I think the thinning-to-get-a-valid-permutation is a serious hesitation — this doesn’t
give you a statistic but rather a procedure that you have to apply to try to infer a
limiting value.  In particular, what does this offer beyond reporting the max
split R_hat, or a visual diagnostic of the split R_hat empirical distribution for
all parameters sampled?

Bob Carpenter

unread,
Feb 13, 2015, 4:58:52 AM2/13/15
to stan...@googlegroups.com

> On Feb 12, 2015, at 5:42 PM, Ben Goodrich <goodri...@gmail.com> wrote:
>
> On Tuesday, February 10, 2015 at 10:57:49 PM UTC-5, Bob Carpenter wrote:
> But if it's important to one of core developers, I think we should
> err on the side of inclusiveness (not of core developers, but of
> their code).
>
> It's pretty important to me that code one developer favors does not get merged when there is a lot of opposition from other developers. We don't want a systemd situation in Stan or any other open source project for that matter.

I don't know what a "systemd situation" is, but the kind
of thing you're proposing wouldn't interfere with anything
else if it were added. It'd just be another diagnostic.

> But the reality of it is, until Andrew tells users to use a different convergence diagnostic, they are going to keep using R_hat.

Oh, if only Andrew had that power. Many users write in and
say their bosses require one Coda convergence diagnostic or
the other instead of R-hat.

If we had a general Coda-like program that wasn't part
of Stan (it could be part of ShinyStan or standalone), then
it would make sense to have any options people think aren't
just fundamentally broken.

- Bob

Ben Goodrich

unread,
Feb 14, 2015, 12:30:11 AM2/14/15
to stan...@googlegroups.com
On Friday, February 13, 2015 at 4:52:54 AM UTC-5, Michael Betancourt wrote:
I think the thinning-to-get-a-valid-permutation is a serious hesitation — this doesn’t
give you a statistic but rather a procedure that you have to apply to try to infer a
limiting value.  In particular, what does this offer beyond reporting the max
split R_hat, or a visual diagnostic of the split R_hat empirical distribution for
all parameters sampled?

I've given lots of examples where this procedure reaches a different conclusion than R_hat. R_hat is a specialization of it where
  1. You set alpha = 2 instead of alpha < 2
  2. You call it on each margin separately instead of jointly
  3. You use sqrt( (N - 1) / N * W + 1 / N * B ) / sqrt(W) instead of B / W
  4. You stop instead of continuing on to do the permutation test (or tests if done marginally)

The question is which of these are good specializations?

  1. I don't see the case for only looking for mean differences. Maybe alpha should be closer to 2 (I've been using 1) but not exactly 2.
  2. We seem to be disagreeing on this one. I'd just refer you again to Jeff's paper.
  3. This is probably not a big deal because we can't say what is a "big" value of either ratio absolutely. It does seem that there can be situations where B / W is diverging slowly but B / (N * W) is converging to zero. (There is a small numerical sketch of the two ratios at the end of this message.)
  4. To me, the permutation test on the thinned samples gives you a reference point for the ratio you choose to calculate in (3), namely how does your realization compare to the permutation distribution? But if you do it unthinned (or underthinned), then by construction your W is smaller than your B and your B / W is larger than it is under almost all of the permutations that scramble the chain-time structure. In other words, if the chains are unthinned or underthinned, then you don't obtain a relevant reference point, regardless of whether the chains do or do not have the same distribution.
Do you think the thinning step in my procedure is somehow worse than the thinning step in the Raftery-Lewis diagnostic?
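
To make (1) and (3) concrete, here is the scalar comparison in a few lines of base R (for alpha = 2 and a single margin, the energy decomposition is essentially the classical ANOVA decomposition, so ordinary variances are enough to show the difference between the two ratios; the toy data are just white noise):

    x <- matrix(rnorm(4000), ncol = 4)      # toy example: 1000 iid draws in each of 4 "chains"
    rhat_vs_ratio <- function(x) {
      N <- nrow(x)                          # iterations per (half-)chain
      B <- N * var(colMeans(x))             # between-chain variance, scaled by N as in BDA
      W <- mean(apply(x, 2, var))           # mean within-chain variance
      c(R_hat = sqrt(((N - 1) / N * W + B / N) / W), B_over_W = B / W)
    }
    split_halves <- function(x) {           # split version: treat each half-chain as its own chain
      h <- nrow(x) %/% 2
      cbind(x[1:h, , drop = FALSE], x[(h + 1):(2 * h), , drop = FALSE])
    }
    rhat_vs_ratio(split_halves(x))          # both hover around 1 here because the chains are iid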

Ben

Michael Betancourt

unread,
Feb 14, 2015, 5:52:12 AM2/14/15
to stan...@googlegroups.com
On Feb 14, 2015, at 5:30 AM, Ben Goodrich <goodri...@gmail.com> wrote:

On Friday, February 13, 2015 at 4:52:54 AM UTC-5, Michael Betancourt wrote:
I think the thinning-to-get-a-valid-permutation is a serious hesitation — this doesn’t
give you a statistic but rather a procedure that you have to apply to try to infer a
limiting value.  In particular, what does this offer beyond reporting the max
split R_hat, or a visual diagnostic of the split R_hat empirical distribution for
all parameters sampled?

I've given lots of examples where this procedure reaches a different conclusion than R_hat. R_hat is a specialization of it where
  1. You set alpha = 2 instead of alpha < 2
  2. You call it on each margin separately instead of jointly
  3. You use sqrt( (N - 1) / N * W + 1 / N * B ) / sqrt(W) instead of B / W
  4. You stop instead of continuing on to do the permutation test (or tests if done marginally)

The question is which of these are good specializations?

  1. I don't see the case for only looking for mean differences. Maybe alpha should be closer to 2 (I've been using 1) but not exactly 2.
Because mathematically MCMC is all about computing means and that’s
where all of the theoretical results lie.  In particular, as I previously discussed
convergence is not a universal property but depends on a choice of function
whose mean you want to compute.
  2. We seem to be disagreeing on this one. I'd just refer you again to Jeff's paper.
I think Jeff’s paper completely misses the point.  Again, as I talked about
before, trying to judge global convergence via something like the total
variation distance is not a particularly practical game.

The problem here is the difference between exact and empirical expectations.
Given exact means, marginal R_hats would diagnose problems just fine.
In practice, however, empirical means require that the chain has gone
through a few mixing times.  More particularly, a given mean requires
running for O(longest autocorrelation time of any parameters on which
the current parameter is conditionally dependent).  In hierarchical models
these would be the latent parameters.

So what are examples where _all_ marginal R_hats look good but the
chain has not yet mixed?
  4. To me, the permutation test on the thinned samples gives you a reference point for the ratio you choose to calculate in (3), namely how does your realization compare to the permutation distribution? But if you do it unthinned (or underthinned), then by construction your W is smaller than your B and your B / W is larger than it is under almost all of the permutations that scramble the chain-time structure. In other words, if the chains are unthinned or underthinned, then you don't obtain a relevant reference point, regardless of whether the chains do or do not have the same distribution.
That’s fine, but you’ve introduced the added problem of not really being
sure when you’ve thinned enough to achieve a relevant reference point.
This is not a diagnostic value so much as a diagnostic procedure which
requires interactivity and knowledge from the user.

Do you think the thinning step in my procedure is somehow worse than the thinning step in the Raftery-Lewis diagnostic?

We’re not even considering RL so how is that relevant?

Bob Carpenter

unread,
Feb 14, 2015, 8:23:53 AM2/14/15
to stan...@googlegroups.com

> On Feb 14, 2015, at 9:52 PM, Michael Betancourt <betan...@gmail.com> wrote:
>
> ...
> Because mathematically MCMC is all about computing means and that’s
> where all of the theoretical results lie. In particular, as I previously discussed
> convergence is not a universal property but depends on a choice of function
> whose mean you want to compute.

I'm confused now. I thought before you were arguing that it is
global and we should only be declaring convergence when all (well
behaved in some sense?) functions converge.

If you have time, could you take a look at the discussion of
convergence I inserted into the language manual and suggest corrections
or clarifications?

- Bob

Michael Betancourt

unread,
Feb 14, 2015, 8:42:17 AM2/14/15
to stan...@googlegroups.com
> I'm confused now. I thought before you were arguing that it is
> global and we should only be declaring convergence when all (well
> behaved in some sense?) functions converge.

No, no.

_Convergence_ requires the choice of function to define any kind
of convergence threshold (for example, if you have exponential
decay then when are you close enough to zero?).

Mixing is a trickier problem because we’re computing empirical
Monte Carlo standard errors. If we knew the marginal variances
and autocorrelations exactly then poor mixing would result in
huge MCSE along with the poor Monte Carlo estimators. But
in practice we estimate the marginal variances and ESS
numerically and when the autocorrelation time of any
parameter is much larger than the sampling time these
numerical estimators are extremely imprecise.

Another way of thinking about it is that if any of the parameters
have huge autocorrelations then any Monte Carlo estimator
is really estimating the expectation _conditional_ on that
parameter instead of marginal to it.
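
A toy version of what I mean, with a hand-rolled AR(1) chain standing in
for a slow latent parameter (not a Stan fit, just an illustration):

    set.seed(1)
    n   <- 500
    rho <- 0.999                                 # integrated autocorrelation time >> n
    phi <- numeric(n)
    phi[1] <- 3                                  # an unlucky but perfectly legal starting point
    for (t in 2:n)
      phi[t] <- rho * phi[t - 1] + sqrt(1 - rho^2) * rnorm(1)
    y <- rnorm(n, mean = phi, sd = 1)            # "fast" parameter, y | phi ~ N(phi, 1)
    c(mean_y = mean(y), mean_phi = mean(phi))    # both sit far above the true marginal mean of 0

Within this window the y draws look nearly independent, so the estimate is
really E[y | phi near its starting value] rather than the marginal E[y] = 0.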

> If you have time, could you take a look at the discussion of
> convergence I inserted into the language manual and suggest corrections
> or clarifications?

I’ll try, but I think the entire Monte Carlo section needs to
be rewritten if we want to touch on the practical consequences
of some of these theoretical issues. Not that I know how to do
that, as every time I think I can explain it, it turns out that I can’t.

Ben Goodrich

unread,
Feb 15, 2015, 11:31:10 AM2/15/15
to stan...@googlegroups.com
On Saturday, February 14, 2015 at 5:52:12 AM UTC-5, Michael Betancourt wrote:
The question is which of these are good specializations?
  1. I don't see the case for only looking for mean differences. Maybe alpha should be closer to 2 (I've been using 1) but not exactly 2.
Because mathematically MCMC is all about computing means and that’s
where all of the theoretical results lie.  In particular, as I previously discussed
convergence is not a universal property but depends on a choice of function
whose mean you want to compute.

A common convention in the literature is to define convergence as the condition that the total variation norm is below some threshold like 0.01. That does not require you to specify what your quantity of interest is, only what the posterior distribution is and what distribution is induced by the transition kernel. I realize it is not practical to estimate the total variation norm, but one could define convergence as the condition that the energy distance to the posterior is below some threshold. Then, you are on the road to asking practical questions like how we can decompose the estimated energy distances and whether the within-chain distance is similar to the between-chain distance under the DISCO test.

I agree that we are typically interested in computing means for inference purposes, but to know whether those means are reliable for inference, it would be good to marshal all the information we have available in the samples.
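
For concreteness, the two-sample energy distance I keep invoking is only a few lines of base R (V-statistic convention; alpha is the exponent on the Euclidean distances; X and Y are matrices with draws in rows):

    e_dist <- function(X, Y, alpha = 1) {
      n <- nrow(X); m <- nrow(Y)
      D <- as.matrix(dist(rbind(X, Y)))^alpha
      cross    <- mean(D[1:n, n + (1:m)])        # average distance between an X draw and a Y draw
      within_X <- mean(D[1:n, 1:n])              # includes the zero diagonal (V-statistic form)
      within_Y <- mean(D[n + (1:m), n + (1:m)])
      2 * cross - within_X - within_Y            # >= 0; in population, 0 iff same distribution (0 < alpha < 2)
    }
    # e.g. between two chains of a hypothetical iterations x chains x parameters array `draws`:
    # e_dist(as.matrix(draws[, 1, ]), as.matrix(draws[, 2, ]))

Roughly speaking, the DISCO decomposition is then a weighted sum of these pairwise distances split into between-group and within-group pieces.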
  2. We seem to be disagreeing on this one. I'd just refer you again to Jeff's paper.
I think Jeff’s paper completely misses the point.  Again, as I talked about
before, trying to judge global convergence via something like the total
variation distance is not a particularly practical game.

Misses whose point? You asked in your previous post what value my procedure has over reporting all the R_hats.  Jeff gives theory and examples where having a small R_hat on one margin or on all margins does not yield good inferences or is overly sensitive to the choice of hyperparameters.

That is why in the Recommendations section he says that he would prefer a joint measure, but basically resigns himself to the fact that, at the time, no good ones existed (except the multivariate R_hat, which isn't reported much and isn't reported by Stan). And we know that the multivariate R_hat is an upper bound to the maximum of the marginal R_hats that is achieved by the worst linear projection. And we know that the multivariate R_hat is hard to estimate accurately. So, things could be worse than they seem even if you are only interested in mean differences.
 
The problem here is the difference between exact and empirical expectations.
Given exact means, marginal R_hats would diagnose problems just fine.
In practice, however, empirical means require that the chain has gone
through a few mixing times.  

I think I am agreeing with you about the limitations of empirical means. The means and variances with which the marginal R_hats are estimated are not even admissible in light of the Stein paradox. Given they are not reliable, it makes sense to look beyond the marginal means for additional evidence of problems.
 
So what are examples where _all_ marginal R_hats look good but the
chain has not yet mixed?
  • There is one in table 2 of Jeff's paper.
  • There are two mentioned previously in this thread (the hierarchical Poisson via Gibbs, plus Roberts and Rosenthal's (2011) example of independent MH of a standard normal)
  • The same happens with Roberts and Rosenthal's (2011) exponential example.
  • The funnel example (sometimes, depending on the seed) in Stan but the funnel_reparam example passes.
  • The centered parameterizations in your paper with Mark on hierarchical models, although I can't seem to get R_hat values as low as the ones in Figure 12 with multiple chains and Stan 2.6.
  • Four out of the six BUGS models (excluding birats and stagnant) under the heading "Stan has bad mixing at a minimum" at https://groups.google.com/d/msg/stan-dev/jEZvT9sOwnA/A4uCIQGsH5gJ (although you have to increase the warmup to get a low maximum R_hat for the magnesium model)
  • The eyes model under the heading "Stan doesn't converge" at the previous link
  • Arguably the example in the MCMCpack R package given for MCMClogit with Cauchy priors at https://groups.google.com/d/msg/stan-dev/_SJBiCaUbt4/MblkpLElI4MJ . The mixing is not horrible for this M-H model but much worse than for Gibbs sampling of a probit model with data augmentation or the same logit model with NUTS.
  • Drawing from the distribution of standardized regression coefficients implied by a uniform LKJ prior on the correlation matrix at https://groups.google.com/d/msg/stan-dev/89GSQ9wrIQI/ulzkFzpQ2UgJ but only for the bad parameterization
  • The multivariate student t distribution with dimension much greater than the degrees of freedom at https://groups.google.com/d/msg/stan-dev/lMy2CZJD9fk/zZmusos9slkJ but only for the bad parameterization
  • The bivariate Clayton copula with two standard normal marginals at https://groups.google.com/d/msg/stan-dev/7pssn2kkw-I/S7e7hBFqF_AJ but probably any copula with normal marginals or even symmetric marginals.
  • The most literal analogues of the CAR models that GeoBUGS estimates at https://groups.google.com/d/msg/stan-users/M7T7EIlyhoo/CffB9HE6LcUJ although Jonah has had better luck with some better Stan reparameterizations
  • Sampling from an inverse Wishart distribution of order K with K degrees of freedom and scale matrix K * I for non-tiny K
Does anyone have a link to the PDF I sent out that had 4 dependence plots on a page for all the different examples Stan had at the time? I can't find it anymore but there were more examples there.
  4. To me, the permutation test on the thinned samples gives you a reference point for the ratio you choose to calculate in (3), namely how does your realization compare to the permutation distribution? But if you do it unthinned (or underthinned), then by construction your W is smaller than your B and your B / W is larger than it is under almost all of the permutations that scramble the chain-time structure. In other words, if the chains are unthinned or underthinned, then you don't obtain a relevant reference point, regardless of whether the chains do or do not have the same distribution.
That’s fine, but you’ve introduced the added problem of not really being
sure when you’ve thinned enough to achieve a relevant reference point.
This is not a diagnostic value so much as a diagnostic procedure which
requires interactivity and knowledge from the user.

I would say it is a diagnostic procedure that results in a p-value but all null hypothesis tests are procedures in that sense. I don't mind the interactivity, but you could try a few lags automatically without doing all the permutations until you get a realized B / W under 1.25 or so. It usually rejects above that.
Do you think the thinning step in my procedure is somehow worse than the thinning step in the Raftery-Lewis diagnostic?

We’re not even considering RL so how is that relevant?

We're considering thinning. The Raftery and Lewis procedure does thinning. That is where I got the idea from, having not seen anyone object to the fact that their procedure entails thinning in the 20 years since that paper was published.

Ben

Michael Betancourt

unread,
Feb 16, 2015, 6:48:07 PM2/16/15
to stan...@googlegroups.com
Let me summarize:

As I previously discussed, there’s no canonical way to threshold the TV norm.
You write 0.01 — but is that good enough?  What does that even mean in
terms of the accuracy and precision of Monte Carlo estimators?  The value
of the TV norm can be converted into a reasonable number only once we
select out some class of functions whose expectations we want to compute
(again, the intuition being that smooth functions need less convergence in the
TV norm to yield accurate estimates than rough functions).  In practice those 
functions are marginal parameter means.

Given nice chains (geometric ergodicity) and exact expectations, the marginal R_hats 
are entirely satisfactory for determining sufficient accuracy.  The problem is
whether or not the chains are nice and then, even if they are, if the numerical
estimators are sufficiently precise.  In latent models the latent parameters can
have very high autocorrelations which induce huge estimator variances that
are hard to diagnose — but in those cases the high autocorrelations should be
sufficient auxiliary diagnostics.  I misspoke earlier when I said “models where
the sampling is bad but all the R_hats look good”.  I really should have said
“models where the sampling is bad but all the R_hats look good and there
are no outliers in the estimated ESS distribution”.

I am not against new diagnostics, I just don’t like these energy statistics.
Firstly, they are sensitive to the structure of the intermediate distributions
of the Markov chain that may or may not be relevant (i.e it’s not clear how
one calibrates them to the expectations of interest).  Secondly, they rely
on permutation tests which then rely on getting the thinning just right.
This requires user expertise in applying the test and interpreting the
results which I feel seriously degrades the user experience (certainly
at least until there’s some interactive Stan interface).  

There’s some intuition as to how the test should behave when the chains 
are under thinned, but what happens if you overshoot and over thin the 
chains?  You start losing power but how does that manifest in the test 
statistics?  Is it obvious where to stop?  Are there criteria to help guide
that decision or is this all an art?  

Ultimately I feel that it’s an awkward procedure that tests for the wrong
features of the Markov chain.  That’s my opinion and I’ll let others
express their own.


Ben Goodrich

unread,
Feb 17, 2015, 2:22:34 PM2/17/15
to stan...@googlegroups.com
On Monday, February 16, 2015 at 6:48:07 PM UTC-5, Michael Betancourt wrote:
As I previously discussed, there’s no canonical way to threshold the TV norm.
You write 0.01 — but is that good enough?  What does that even mean in
terms of the accuracy and precision of Monte Carlo estimators?  The value
of the TV norm can be converted into a reasonable number only once we
select out some class of functions whose expectations we want to compute
(again, the intuition being that smooth functions need less convergence in the
TV norm to yield accurate estimates than rough functions).  In practice those 
functions are marginal parameter means.

OK, that is a fair point. But I think these 1-dimensional independent M-H samplers of Roberts and Rosenthal are useful because they come with lower and upper bounds on the iteration time to reach an arbitrarily specified threshold for the TV norm. For example, between 4000 and 8000 iterations in the case of a standard normal posterior when the proposal normal has a standard deviation of 0.5. And the R_hat for the parameter gets below the arbitrary 1.1 threshold before 4000 iterations and the DISCO test (with an arbitrary 0.05 cutoff) rejects the null past 8000 iterations. So, we know the relative sensitivity of the three quantities here and can go back to discussing whether more sensitivity to more of the distribution is a good thing or not.
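
(For anyone who wants to poke at this example, the sampler is a few lines of R. This is just my reconstruction from the description above, not Roberts and Rosenthal's code: independence Metropolis-Hastings with a standard normal target and a N(0, 0.5^2) proposal, whose light tails are what make the far-out draws sticky.)

    set.seed(2)
    n_iter <- 8000
    x <- numeric(n_iter)                          # starts at 0
    for (t in 2:n_iter) {
      prop  <- rnorm(1, 0, 0.5)
      # log acceptance ratio for an independence sampler: pi(prop) q(x) / (pi(x) q(prop))
      log_r <- dnorm(prop, log = TRUE) - dnorm(x[t - 1], log = TRUE) +
               dnorm(x[t - 1], 0, 0.5, log = TRUE) - dnorm(prop, 0, 0.5, log = TRUE)
      x[t]  <- if (log(runif(1)) < log_r) prop else x[t - 1]
    }
    # run a handful of these as separate "chains", then compare what split R_hat
    # and the permutation test say about the same draws at 4000 vs. 8000 iterations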
 
I misspoke earlier when I said “models where
the sampling is bad but all the R_hats look good”.  I really should have said
“models where the sampling is bad but all the R_hats look good and there
are no outliers in the estimated ESS distribution”.

At a minimum, we would need an ESS outlier detection procedure and given that, the "no outliers in the estimated ESS distribution" qualification should be added to the manual, BDA IV, presentations, etc. Although I still favor energy statistics stuff. Also, if you can find a thin interval that gets you virtually independent observations within chains, you have a lower bound to the ESS for any function. Often, this bound is too loose to be helpful, in which case the ESS that Stan calculates is better. But sometimes the ESS that Stan calculates is below this lower bound, in which case the MCMC-CLT is probably suspect.
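
The bookkeeping for that lower bound is trivial; for example, with made-up numbers:

    n_iter <- 1000; n_chains <- 4; thin <- 20        # hypothetical run
    n_chains * (n_iter %/% thin)                     # = 200 near-independent draws, for any function
    # compare against the n_eff column of summary(fit)$summary; entries well below
    # this bound are a hint that the autocorrelation-based estimate (and the CLT
    # behind the reported MCSE) shouldn't be trusted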
 
I am not against new diagnostics, I just don’t like these energy statistics.
Firstly, they are sensitive to the structure of the intermediate distributions
of the Markov chain that may or may not be relevant (i.e it’s not clear how
one calibrates them to the expectations of interest).  

Yes, and I don't have a firm understanding of this, but it seems to be good at rejecting when you do a two-way ANOVA-like decomposition of energy distance into a contribution from the chains, a contribution from the two halves of the sampling period, and the interaction of the two. So, if all the chains are roughly in the same intermediate distribution when you stop, the p-value on the chains stratum might not be small, but the other two should be.
 
Secondly, they rely
on permutation tests which then rely on getting the thinning just right.
This requires user expertise in applying the test and interpreting the
results which I feel seriously degrades the user experience (certainly
at least until there’s some interactive Stan interface).  

Jonah wants to include them in ShinyStan, which is a start. My QMSS class last year did not have much trouble with them. If you know how to read a correlogram and know how to read an ANOVA table, then the energy distance analogues are not much of a leap.
 
There’s some intuition as to how the test should behave when the chains 
are under thinned, but what happens if you overshoot and over thin the 
chains?  You start losing power but how does that manifest in the test 
statistics?  Is it obvious where to stop?  Are there criteria to help guide
that decision or is this all an art?  

I used to worry about this, too. But the power mostly comes from having multiple chains (because there are chains choose 2 distances among chains) and from having lots of parameters in the parameters block (because it is joint). We need to calibrate it for MCMC, but the DISCO papers show good power with like 30 exchangeable observations per stratum. In that case, with 1000 realizations you could thin to every 20th for the test. But if you have to thin to every 20th, you already have poor mixing by Stan standards (where a thin interval of 5 is enough in nice cases). For Gibbs or MH, you would need a lot more than 1000 realizations to start with.

Ben

Bob Carpenter

unread,
Feb 18, 2015, 12:24:12 AM2/18/15
to stan...@googlegroups.com
My own feeling about this, without understanding much of the
technical details, is that if it's better at detecting
non-convergence than R-hat, then it'd be useful to have.

I think that's true even if there are some cases R-hat
diagnoses that B-hat won't ("B-hat" is what I think
Andrew's been calling your approach).

- Bob

Andrew Gelman

unread,
Feb 18, 2015, 4:03:00 PM2/18/15
to stan...@googlegroups.com
I agree with Bob’s “If . . . then . . .” statement.