Rhat for lp__


Joshua Rosenau

Jul 31, 2014, 7:34:14 PM
to stan-...@googlegroups.com
Forgive me if this is a dumb question (and if it is, please let me know why so I can correct my confusion).

For a model I’m running, the parameters I’m interested in are all showing good convergence (high n_eff, Rhat near 1, traceplots look well-mixed), but Rhat for lp__ is 1.60. Does that indicate that I should let the model run longer, or is it not crucial that lp__ converge as long as everything else converges?

Thanks.

Josh

Bob Carpenter

Jul 31, 2014, 7:50:28 PM
to stan-...@googlegroups.com
I don't think it's a dumb question. It's something we should
answer in our doc, because I think it's a common occurrence.

I too wonder why we always see Rhat so high for lp__ and if it
should be a concern. It's definitely a concern if you want to
estimate lp__ accurately. So the real question is whether we can
trust parameter estimates with high Rhat values for lp__.

I'll write the answer into the manual.

lp__ is the log posterior up to a constant. Because it's
a highly non-linear combination of the parameters, there's
no reason to expect it to have the same convergence properties.

Another case to think about is X^2. That's what you need to
converge to get a good estimate of the posterior variance of X.
And it will often converge more slowly than X does (I don't know
if the reverse will ever be true).
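To make the X versus X^2 comparison concrete, here is a minimal sketch of the usual split-R-hat in Python with NumPy (an illustration of the standard formula only, not Stan's actual implementation); applying it to draws of X and of X^2 separately shows how each transform gets its own diagnostic:

```python
import numpy as np

def split_rhat(draws):
    """Split R-hat for a (chains, iterations) array of draws.

    Each chain is split in half, so a trend within a chain shows up
    as between-chain variance and inflates R-hat.
    """
    m, n = draws.shape
    half = n // 2
    # 2m half-chains, each of length n // 2.
    split = draws[:, :2 * half].reshape(2 * m, half)
    chain_means = split.mean(axis=1)
    between = half * chain_means.var(ddof=1)    # between-chain variance B
    within = split.var(axis=1, ddof=1).mean()   # within-chain variance W
    var_plus = (half - 1) / half * within + between / half
    return float(np.sqrt(var_plus / within))

# For independent draws both diagnostics sit near 1; for correlated
# MCMC draws, X and X^2 can give noticeably different values.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1000))
print(split_rhat(x), split_rhat(x ** 2))
```

For a real Stan fit you would substitute the extracted draws for x; the same computation applied to the lp__ column is what produces the Rhat in question.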

- Bob

Joshua Rosenau

Jul 31, 2014, 7:54:27 PM
to stan-...@googlegroups.com
Thanks, that clarifies the issue.

Bob Carpenter

Jul 31, 2014, 11:08:54 PM
to stan-...@googlegroups.com
Not for me! I still want to know why the Rhats for lp__ are
usually so much higher and whether it could be problematic for
the parameter estimates if lp__ estimates don't seem to have
converged according to Rhat.

- Bob

Ben Goodrich

Aug 1, 2014, 12:03:46 AM
to stan-...@googlegroups.com
On Thursday, July 31, 2014 11:08:54 PM UTC-4, Bob Carpenter wrote:
Not for me!  I still want to know why the Rhats for lp__ are
usually so much higher and whether it could be problematic for
the parameter estimates if lp__ estimates don't seem to have
converged according to Rhat.

Perhaps, but we already have examples where the Rhat is close to 1 for everything, including lp__, despite the chains not having converged.

To me, the bigger question is whether it makes sense to make a decision about a chain-level phenomenon (convergence) from the margins? A related question: since the Markov chain operates in the unconstrained space, does it make sense to evaluate convergence in the constrained space? The lp__ is at least a function of all the parameters, and it is the log-posterior with respect to the parameters in the unconstrained space. The Rhat makes more sense when you really care about one thing (which is presumably not lp__) and want to know how much the variance is inflated by having only a given number of draws.

Ben

Bob Carpenter

Aug 1, 2014, 1:00:53 AM
to stan-...@googlegroups.com

On Aug 1, 2014, at 12:03 AM, Ben Goodrich <goodri...@gmail.com> wrote:

> On Thursday, July 31, 2014 11:08:54 PM UTC-4, Bob Carpenter wrote:
> Not for me! I still want to know why the Rhats for lp__ are
> usually so much higher and whether it could be problematic for
> the parameter estimates if lp__ estimates don't seem to have
> converged according to Rhat.
>
> Perhaps, but we already have examples where the Rhat is close to 1 for everything, including lp__, despite the chains not having converged.

I know that Rhat near 1 is necessary for convergence, but
not sufficient.

The question is really whether lp__ failing to converge according
to Rhat has dire implications for the other parameters, which do
have Rhat near 1.

> To me, the bigger question is whether it makes sense to make a decision about a chain-level phenomenon (convergence) from the margins?

We can reject chain-level convergence of a vector
if one of the elements hasn't converged, right?

> A related question: since the Markov chain operates in the unconstrained space, does it make sense to evaluate convergence in the constrained space?

I don't see why not. It's just a transform of the parameters,
just like X^2, and I know that makes sense to monitor.

> The lp__ is at least a function of all the parameters, and it is the log-posterior with respect to the parameters in the unconstrained space.

Right. Does that mean we should always run long enough that
Rhat for lp__ is close to 1?

> The Rhat makes more sense when you really care about one thing (which is presumably not lp__) and want to know how much the variance is inflated by having only a given number of draws.

I understand that, too. We could invent all kinds of crazy functions
of parameters that wouldn't converge. I'm just asking if lp__ is
one of them or whether its nonconvergence signals problems for the chain
as a whole.

- Bob

Michael Betancourt

Aug 1, 2014, 5:07:47 AM
to stan-...@googlegroups.com
1) Markov chain convergence is a global property and does not depend on the choice of function.
Rhat considers the composition of a Markov chain and a function, and if the Markov chain has
converged then each Markov chain + function composition will have converged. Multivariate
functions converge when all of their margins have converged, by the Cramér-Wold theorem.

2) The transformation from unconstrained space to constrained space is just another function,
so it does not affect convergence.

3) Different functions may have different autocorrelations, but if the Markov chain has equilibrated
then all Markov chain + function compositions should be consistent with convergence. Formally,
any function that appears inconsistent is of concern, and although it would be unreasonable to test
_every_ function, lp__ should at least be consistent.

The obvious difference with lp__ is that it tends to vary quickly with position and is consequently
susceptible to outliers. Its magnitude is also typically large, so it’s worth
looking into how accurate the floating-point computations in the Rhat calculation are.

Andrew Gelman

Aug 1, 2014, 10:12:49 AM
to stan-...@googlegroups.com
I think it’s a mistake to declare convergence in any practical sense if Rhat for lp__ is high. If lp__ has not converged then the different chains are really in different parts of the space.

Avraham Adler

Aug 1, 2014, 10:33:44 AM
to stan-...@googlegroups.com, gel...@stat.columbia.edu
That should probably be in the manual somewhere.

Avi

Bob Carpenter

Aug 1, 2014, 11:47:40 AM
to stan-...@googlegroups.com
Absolutely --- that's why I was persistent in trying to
get an explanation. I added a comment to the ongoing issue
for the next manual:

https://github.com/stan-dev/stan/issues/786#issuecomment-50900058

- Bob

Avraham Adler

Aug 1, 2014, 12:20:24 PM
to stan-...@googlegroups.com
Great!

Thanks.

Avi

Bob Carpenter

Aug 1, 2014, 12:37:37 PM
to stan-...@googlegroups.com
Following up, given the way we're evaluating convergence, namely
with R-hat, we get different answers depending on what the transform
is.

That is, X^2 converges at a different rate than X if we're looking
at R-hat. So you're saying that X itself hasn't converged unless
X^2 also converges by R-hat? And X^3, and all the other moments,
as well as exp(X), log(X) and so on? (I know we can't evaluate all
these, I'm just not sure I understand what's required.)

- Bob

Andrew Gelman

Aug 1, 2014, 3:55:18 PM
to stan-...@googlegroups.com
I’ve played around with doing R-hat nonparametrically by first replacing the simulations by their ranks (for each scalar quantity being monitored) and then applying R-hat on the ranks. It works ok and solves problems such as the Cauchy where the mean is meaningless. So that would be one way to go. However it could be costly to compute these ranks, no?

Bob Carpenter

Aug 1, 2014, 5:00:01 PM
to stan-...@googlegroups.com
I'm not sure what you mean by "replacing simulations by their ranks."
Are the draws for a single parameter sorted and then ranked from 1 to
num_samples? I don't even see how that'd make sense.

It would also be relatively costly to compute compared to
doing nothing: O(N log N) for N draws per parameter,
plus a bit of fussing with memory allocation and freeing to
do the sort, which can't be done in place if we want to save the
parameters.

But I'm guessing you mean something else.

- Bob

Andrew Gelman

Aug 1, 2014, 5:02:08 PM
to stan-...@googlegroups.com
I can’t find the code right now, but the idea goes as follows.

You have m chains, each of length n. (I’m assuming we’ve already split the chains, so if we’ve run 4 chains, then m=8.)
Suppose we’re looking at a scalar quantity, alpha.

Then we first take the mn values of alpha and rank them (from 1 to mn, using the usual convention for ties). Then we apply R-hat to these values.
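A sketch of that rank step in Python/NumPy (illustrative only; this breaks ties arbitrarily rather than by the usual average-rank convention Andrew mentions): pool the mn draws, replace each by its pooled rank, and put the ranks back in their original chain positions, after which R-hat is applied to the rank array exactly as it would be to the draws.

```python
import numpy as np

def rank_transform(draws):
    """Replace draws (an m-chains by n-iterations array) with their
    pooled ranks, 1 .. m*n, keeping each rank in its original position.

    Ties are broken arbitrarily here; a production version would assign
    average ranks to tied values.
    """
    flat = draws.ravel()
    ranks = np.empty(flat.size)
    # argsort gives the permutation that sorts the pooled draws;
    # scattering 1..m*n through it assigns each draw its rank.
    ranks[np.argsort(flat)] = np.arange(1, flat.size + 1)
    return ranks.reshape(draws.shape)

draws = np.array([[0.1, 2.5, -0.3],
                  [1.0, -2.0, 0.4]])
# Pooled order: -2.0 < -0.3 < 0.1 < 0.4 < 1.0 < 2.5
print(rank_transform(draws))  # [[3. 6. 2.], [5. 1. 4.]]
```

The single argsort is the O(N log N) cost Bob estimated, and because the ranks land back in their original positions, no extra bookkeeping is needed to recover the chains.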

Bob Carpenter

Aug 1, 2014, 5:07:39 PM
to stan-...@googlegroups.com
Doing that, and then sorting the ranks back out to the various chains,
incurs additional overhead, because now when we sort we have
to keep track of the original indices. But it's not too hard
if we use a custom sorting functor in C++ that creates an
array of indices, then sorts that array with a comparator that
compares the values at those indices.

We could certainly try it.

- Bob

Andrew Gelman

Aug 1, 2014, 5:26:03 PM
to stan-...@googlegroups.com
Perhaps worth doing. Yet another twist would be to replace the ranks by the corresponding quantiles of the normal distribution. In practice I don’t see the need for this, but I could see that it could appeal to some people, because then the use of first and second moments would be approximately appropriate if the target distribution really is normal.
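That twist, now usually called rank-normalization, can be sketched as follows (Python with NumPy/SciPy, illustrative only; the (rank - 1/2)/N fractional rank used here is one common convention for keeping the extreme quantiles finite):

```python
import numpy as np
from scipy.stats import norm

def rank_normalize(draws):
    """Map each draw to the standard-normal quantile of its pooled rank.

    After this transform the first and second moments are meaningful
    even for heavy-tailed targets, so the usual R-hat machinery applies.
    """
    flat = draws.ravel()
    ranks = np.empty(flat.size)
    ranks[np.argsort(flat)] = np.arange(1, flat.size + 1)
    # Fractional ranks in (0, 1), then through the normal inverse CDF.
    z = norm.ppf((ranks - 0.5) / flat.size)
    return z.reshape(draws.shape)

# Cauchy draws have no mean or variance, but their rank-normalized
# values are approximately standard normal.
heavy = np.random.default_rng(2).standard_cauchy((4, 1000))
z = rank_normalize(heavy)
print(z.mean(), z.std())  # approximately 0 and 1
```

This is exactly the Cauchy case Andrew raised earlier: the raw draws defeat any moment-based diagnostic, while the rank-normalized draws do not.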

Michael Betancourt

Aug 2, 2014, 7:12:15 AM
to stan-...@googlegroups.com
Issue One: What is Convergence?

There is no hard cutoff between “transience” and “equilibrium”. What happens is that as N -> infinity the distribution of possible states in the Markov chain approaches the target distribution and _in that limit_ the expected value of the Monte Carlo estimator of _any integrable function_ converges to the true expectation. There is nothing like warmup here because in the N -> infinity limit the effects of the initial state are completely washed out.

The problem is what happens for finite N. If we can prove a strong _geometric ergodicity_ property (which depends on the sampler and the target distribution) then one can show that there exists a finite time after which the chain forgets its initial state with a large probability. This is both the autocorrelation time and the warmup time — but even if you can show it exists and is finite (which is nigh impossible) you can’t compute an actual value analytically.

So what you do in practice is hope that N is large enough for the expectations to be reasonably accurate. Removing warmup iterations improves the accuracy of the expectations but there is no guarantee that removing any finite number of samples will be enough.

Issue Two: Why Inconsistent Rhats?

There are two things to worry about here.

Firstly, as noted above, for any finite N there will always be some residual effect of the initial state, which typically manifests as some small (or large if the autocorrelation time is huge) probability of having a large outlier. Functions robust to such outliers (say, quantiles) will appear more stable and have better Rhats. Functions vulnerable to such outliers may show fragility.

Secondly, Rhat makes very strong assumptions. In particular, it assumes that the functions being considered are Gaussian (or it uses only the first two moments and assumes some kind of independence — the point is that strong assumptions are made), and those assumptions do not always hold. In particular, the distribution of the log posterior density almost never looks Gaussian; instead it features long tails that can lead to large Rhats even in the large-N limit.

The tweaks that Andrew keeps talking about all have the flavor of making the samples of interest more Gaussian and hence the Rhat calculation more accurate.

Conclusion:

“Convergence” is a global property and holds for all integrable functions at once, but Rhat requires additional assumptions so may not work for all functions equally well.

Note that if you just compare the _expectations_ between chains then we can rely on the Markov chain asymptotics for Gaussian distributions and can apply the standard tests.

Bob Carpenter

Aug 2, 2014, 12:11:34 PM
to stan-...@googlegroups.com
Thanks so much. That was super helpful.

You should write an intro to MCMC book! Or longish
survey article.

- Bob
