More iterations -> worse convergence?


sp_r...@yahoo.it

Oct 12, 2013, 9:08:23 AM
to stan-...@googlegroups.com
I've generated fake data for a multilevel varying-intercept logistic model (code below). Then I've compiled and run a Stan model (code below).
After 2000 iterations, some Rhat values were greater than 1.1.
After 4000 iterations convergence was reached (max Rhat = 1.06).
So I've tried 8000 and 16000 iterations and looked at the traceplot()s (attached).
Surprise!
4000 iterations:
-- max Rhat == 1.06
-- mu_a: imperfect mixing, but within the overall variability
-- sigma_a: imperfect mixing, a bit beyond the overall variability
8000 iterations:
-- Rhat: mu_a = 1.20, sigma_a = 1.13
-- worse mixing after warmup (i.e. after 4000 iterations)
16000 iterations:
-- max Rhat == 1.09, but:
-- several paths beyond the overall variability

It looks somewhat strange. Doesn't it?
Thanks
Sergio

# Fake data generation

set.seed(1)
N <- 1500
G <- 5
mu_a <- 1.25
sigma_a <- 1.4
# a: varying intercept with mean(a) = mu_a, sd(a) = sigma_a
a <- rnorm(G, mu_a, sigma_a)
a <- ( (a - mean(a)) / sd(a) ) * sigma_a + mu_a
b <- 0.65
group <- rep(1:G, 1:G)
group <- rep(group, each = N / length(group))
# x: predictor with mean(x) = 0, sd(x) = 2
x <- rnorm(N, 0, 2)
x <- ( (x - mean(x)) / sd(x) ) * 2
# outcome
y <- rbinom(N, 1, plogis(a[group] + b * x))

# Stan model

data {
    int<lower=1> N;
    int<lower=1> G;
    int<lower=0, upper=1> y[N];
    vector[N] x;
    int<lower=1, upper=G> group[N];
}
parameters {
    vector[G] a_std;
    real mu_a;
    real<lower=0> sigma_a;
    real b;
}
transformed parameters {
    vector[G] a;
    a <- mu_a + sigma_a * a_std;
}
model {
    a_std ~ normal(0, 1);
    for (n in 1:N)
        y[n] ~ bernoulli_logit(a[group[n]] + b * x[n]);
}
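
(For reference, a minimal sketch of how the model above might be compiled and run from R with the rstan of that era; the file name, data list, and settings below are illustrative, not from the original post.)

library(rstan)

# "mlogis.stan" is a hypothetical file holding the model above
stan_data <- list(N = N, G = G, y = y, x = x, group = group)
fit <- stan(file = "mlogis.stan", data = stan_data,
            iter = 4000, chains = 4, seed = 1)
print(fit, pars = c("mu_a", "sigma_a", "b"))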


tp4000.png
tp8000.png
tp16000.png

Michael Betancourt

Oct 12, 2013, 1:19:58 PM
to stan-...@googlegroups.com
> It looks somewhat strange. Doesn't it?

Not at all.

Remember that HMC, like most MCMC algorithms, satisfies detailed balance, which means that the probability of transitioning from neighborhood A to neighborhood B is the same as that of transitioning from neighborhood B to neighborhood A. When you're in the bulk of the distribution this ensures quick mixing, but it also means that if it's hard to transition from the bulk to the tails, then it will be hard to transition from the tails back to the bulk, and this manifests as chains getting "stuck" in the tails for long periods of time. Usually the probability of a transition into the tails is small enough that you never see this behavior, but the longer you run, the greater the chance of getting stuck at least once, and all it takes is one sojourn into the tails to mess up the ESS and Rhat calculations.

You can try to avoid these sojourns by making your priors tighter, by decreasing the step size below the nominal adapted value (taking care to increase max_treedepth if necessary), or by adding jitter to the step size.
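
A sketch of what that advice looks like in practice, using the tuning-parameter names that appear later in this thread (delta raises the target acceptance rate and so lowers the adapted step size, epsilon_pm jitters it, max_treedepth allows longer trajectories); how these are passed depends on the rstan version, and in later releases they become control = list(adapt_delta = ...):

# hedged sketch; fit and stan_data as in the run above
fit2 <- stan(fit = fit, data = stan_data,
             iter = 16000, chains = 4,
             delta = 0.9,        # higher target acceptance => smaller adapted step size
             epsilon_pm = 1,     # jitter the step size by +/- 100% each transition
             max_treedepth = 12) # longer trajectories to compensate for small steps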

Andrew Gelman

Oct 12, 2013, 5:32:48 PM
to stan-...@googlegroups.com
What Mike said.  Indeed, this happened with the 8-schools model before we parameterized it using the Matt trick.  R-hat is an excellent summary of mixing, but it is not perfect; it does not catch places where the chains have not gone.  One thing that we are working on is diagnostics that would use gradient information to reveal the convergence problems in such cases.
A


Sergio Polini

Oct 13, 2013, 10:38:29 AM
to stan-...@googlegroups.com
Thank you. However:
a) I can't think of tighter priors than:
mu_a ~ normal(1.25, 1); // 1.25 is the "true" mean
sigma_a ~ cauchy(0, 2.5);
but they don't avoid those sojourns (traceplot attached, file
tp16000_priors.png);
b) I used the Matt trick (I suppose...):
parameters {
vector[G] a_std;
...
}
transformed parameters {
vector[G] a;
a <- mu_a + sigma_a * a_std;
}
model {
a_std ~ normal(0, 1);
...
}

So I've tried another solution: increasing the number of groups (G = 50,
was 5) and observations (N = 12750, was 1500).
As far as I can tell, the traceplots look better (file
tp16000_size.png attached).

Should I conclude that even in the simplest multilevel logistic model one
needs a fair amount of data (groups and/or observations) to avoid
convergence problems?
Should I conclude that a multilevel logistic model needs more data than a
linear one?

Thanks
Sergio


tp16000_priors.png
tp16000_size.png

Bob Carpenter

Oct 13, 2013, 3:15:40 PM
to stan-...@googlegroups.com


On 10/13/13 10:38 AM, Sergio Polini wrote:
> Thank you. However:
> a) I can't think of tighter priors than:
> mu_a ~ normal(1.25, 1); // 1.25 is the "true" mean

You could always reduce the standard deviation.

What I found fitting an IRT model was that if I gave
the item difficulties a hierarchical model, the scale
was fit to about 0.2, so that even a normal(0,1)
prior was too fat. It actually fit much faster with
the hierarchical component than without. That problem
had a lot of data (thousands of observations), but it
still had trouble fitting with a much-fatter-than-required
prior.

> sigma_a ~ cauchy(0, 2.5);

The Cauchy is a very fat-tailed prior!

> but they don't avoid those sojourns (traceplot attached, file tp16000_priors.png);
> b) I used the Matt trick (I suppose...):
> parameters {
> vector[G] a_std;
> ...
> }
> transformed parameters {
> vector[G] a;
> a <- mu_a + sigma_a * a_std;
> }
> model {
> a_std ~ normal(0, 1);
> ...
> }
>
> So I've tried another solution: increasing the number of groups (G = 50, was 5) and observations (N = 12750, was 1500).
> As far as I can tell, the traceplots look better (file tp16000_size.png).
>
> Should I conclude that even in the simplest multilevel logistic model one needs a fair amount of data (groups and/or
> observations) to avoid convergence problems?

Not quite. We've fit some pretty large regression problems
with fairly small data sizes per group. But if you want to
use small data sizes per group, you need a good hierarchical
prior (that is, more pooling) in order to get tight estimates.

> Should I conclude that a multilevel logistic model needs more data than a linear one?

Yes, because at the logistic level the data get quantized
to 0 or 1 (or 1:K in the multi-logit case),
rather than just giving you the linear predictor directly.
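
(A toy illustration of that information loss, not from the thread: fit the same linear predictor to a continuous outcome and to its 0/1 quantization and compare the standard errors.)

set.seed(2)
n <- 1000
x_toy <- rnorm(n, 0, 2)
eta <- 1.25 + 0.65 * x_toy
y_cont <- eta + rnorm(n)             # continuous outcome keeps eta's information
y_bin  <- rbinom(n, 1, plogis(eta))  # 0/1 outcome quantizes it away
summary(lm(y_cont ~ x_toy))$coefficients
summary(glm(y_bin ~ x_toy, family = binomial))$coefficients
# the logistic slope's standard error comes out noticeably larger at the same n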

Logistic regression also introduces separability issues
if there are predictors that are aligned only with one
category.
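
(Again a toy illustration, not from the thread: with complete separation the logistic MLE does not exist.)

x_sep <- c(-3, -2, -1, 1, 2, 3)
y_sep <- c( 0,  0,  0, 1, 1, 1)  # x_sep > 0 perfectly predicts y_sep = 1
glm(y_sep ~ x_sep, family = binomial)
# coefficients run off toward infinity; glm() warns that fitted
# probabilities numerically 0 or 1 occurred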

- Bob

Matt Hoffman

Oct 13, 2013, 5:46:46 PM
to stan-...@googlegroups.com
>> sigma_a ~ cauchy(0, 2.5);
>
>
> The Cauchy is a very fat-tailed prior!

I had the same thought! But looking more closely at the model, I don't
see this prior actually appearing anywhere—it looks like sigma_a has
an improper uniform prior. (Which is about as far from tight as it
gets.)

Best,
Matt

Sergio Polini

Oct 14, 2013, 2:23:32 AM
to stan-...@googlegroups.com
On 13/10/2013 23:46, Matt Hoffman wrote:
>>> sigma_a ~ cauchy(0, 2.5);
>>
>>
>> The Cauchy is a very fat-tailed prior!
>
> I had the same thought! But looking more closely at the model, I don't
> see this prior actually appearing anywhere -- it looks like sigma_a has
> an improper uniform prior.

In the model I attached to my first post sigma_a has an improper uniform
prior indeed, but I've tried several models.
In the "best" one (absolutely no problem) mu_a and sigma_a are constants
defined in the transformed data block and are assigned their true
values, but it's just cheating.
I've tried mu_a ~ normal(1.25, 1) and sigma_a ~ cauchy(0, 2.5) too (and
attached the respective traceplot as tp16000_priors.png). I know that
the cauchy is fat-tailed, but I felt guilty because defining mu_a as
normal(1.25, 1) _is_ cheating ;-)

Defining tight priors is easy when using fake data (I know the "true"
priors), but what could I do if they were real data? Where could tight
priors come from?
I've tried complete pooling, no pooling (group as a factor), and separate
regressions, but I can't see any clear way to extract tight priors from
their results.

This is why I was starting to think that multilevel logistic models require
a fair amount of data, much more than linear ones.

Sergio


Bob Carpenter

Oct 14, 2013, 3:54:25 AM
to stan-...@googlegroups.com


On 10/14/13 2:23 AM, Sergio Polini wrote:
> On 13/10/2013 23:46, Matt Hoffman wrote:
>>>> sigma_a ~ cauchy(0, 2.5);
>>>
>>>
>>> The Cauchy is a very fat-tailed prior!
>>
>> I had the same thought! But looking more closely at the model, I don't
>> see this prior actually appearing anywhere -- it looks like sigma_a has
>> an improper uniform prior.
>
> In the model I attached to my first post sigma_a has an improper uniform prior indeed, but I've tried several models.
> In the "best" one (absolutely no problem) mu_a and sigma_a are constants defined in the transformed data block and are
> assigned their true values, but it's just cheating.
> I've tried mu_a ~ normal(1.25, 1) and sigma_a ~ cauchy(0, 2.5) too (and attached the respective traceplot as
> tp16000_priors.png). I know that the cauchy is fat-tailed, but I felt guilty because defining mu_a as normal(1.25, 1)
> _is_ cheating ;-)
>
> Defining tight priors is easy when using fake data (I know the "true" priors), but what could I do if they were real
> data? Where could tight priors come from?

Ideally, knowledge of the problem.

Otherwise, you're going to have to use trial and error, which
is really just a kind of ad-hoc "empirical Bayes", and you typically
don't want answers that are that sensitive to the prior.

> I've tried complete pooling, no pooling (group as a factor), separate regressions, but I can't see any clear way to
> extract tight priors from their results.

Can you fit the multilevel model (i.e., where you
fit the amount of pooling)? Also, what's the difference
between no pooling and separate regressions?

> This is why I was starting to think that multilevel logistic models require a fair amount of data, much more than linear ones.

I don't know that it's the multilevel aspect of it
so much as the fact that you lose information by only
observing a 0/1 outcome vs. scalar outcomes.

- Bob

Michael Betancourt

Oct 14, 2013, 4:22:19 AM
to stan-...@googlegroups.com
> In the model I attached to my first post sigma_a has an improper uniform prior indeed, but I've tried several models.
> In the "best" one (absolutely no problem) mu_a and sigma_a are constants defined in the transformed data block and are assigned their true values, but it's just cheating.
> I've tried mu_a ~ normal(1.25, 1) and sigma_a ~ cauchy(0, 2.5) too (and attached the respective traceplot as tp16000_priors.png). I know that the cauchy is fat-tailed, but I felt guilty because defining mu_a as normal(1.25, 1) _is_ cheating ;-)
>
> Defining tight priors is easy when using fake data (I know the "true" priors), but what could I do if they were real data? Where could tight priors come from?
> I've tried complete pooling, no pooling (group as a factor), separate regressions, but I can't see any clear way to extract tight priors from their results.

The assumption to make here is one of sparsity, or "most effects are probably going to be small". For models like these you want to define your model so that a value of 0 corresponds to no effect (as in the usual parameterizations of linear and generalized linear models), with "average" effect sizes around 1. Regarding the latter, the idea is that an effect size of 5 or higher would be considered unlikely, either for physical reasons or because such a large effect would have been seen already. With this kind of model in hand, priors like N(0, 1) or Cauchy(0, 2.5) are very natural and about as "tight" as is comfortable.
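
Concretely, for the model in this thread that amounts to adding weakly informative priors to the model block, along these lines (a sketch using the scales named above, not tuned values):

model {
    mu_a ~ normal(0, 1);       // effects centered at zero on the logit scale
    sigma_a ~ cauchy(0, 2.5);  // half-Cauchy, given the <lower=0> constraint
    b ~ normal(0, 1);
    a_std ~ normal(0, 1);
    for (n in 1:N)
        y[n] ~ bernoulli_logit(a[group[n]] + b * x[n]);
}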

> This is why I was starting to think that multilevel logistic models require a fair amount of data, much more than linear ones.

Let's step back a second here. What is the fundamental problem you're seeing? The issue is that in big hierarchical models you're using partial pooling to learn from limited data, and the trade-off is that the resulting posterior is pretty wacky, with significant variations in curvature within the neighborhood of high posterior probability mass. This kind of curvature variation stresses all MCMC algorithms (it's even worse for something like Metropolis, but Metropolis is so slow you never see it!), and vanilla HMC is no exception. Here the problem manifests as no single step size being sufficient for efficient integration everywhere.

Tighter priors constrain the neighborhood of high posterior probability, reducing the curvature variation and hence the pathology. It's not about needing more data; it's that more information leads to a better-behaved model.

If you can't change the model then the safest thing to do is to lower the step size by hand, or use the step size jitter options to vary the step size with each transition (in fact, you should probably do the former anyway, just to check that you're not missing any regions of high curvature only because the step size was set too high). Riemannian HMC will also help here.

Sergio Polini

Oct 14, 2013, 9:32:38 AM
to stan-...@googlegroups.com
Thank you for your kind and clear lesson.
I hadn't paid attention to the step size, because it looked somewhat
mysterious to me.
But I'm a learner, so I'll keep learning. And I'll try to understand
what the step size is and how to lower and jitter it.
And I'll let you know my results, if you are willing.
Thanks
Sergio

Bob Carpenter

Oct 14, 2013, 1:08:14 PM
to stan-...@googlegroups.com


On 10/14/13 9:32 AM, Sergio Polini wrote:
> Thank you for your kind and clear lesson.
> I hadn't paid attention to the step size, because it looked somewhat mysterious to me.
> But I'm a learner, so I'll keep learning. And I'll try to understand what the step size is and how to lower and jitter it.

We're all learning how this works together!

> And I'll let you know my results, if you are willing.

Please do share.

- Bob

Sergio Polini

Oct 15, 2013, 6:16:25 PM
to stan-...@googlegroups.com
On 14/10/2013 10:22, Michael Betancourt wrote:
> If you can't change the model then the safest thing to do is to lower
> the step size by hand, or use the step size jitter options to vary the
> step size with each transition
>

I've tried delta = 0.9 to lower the step size and epsilon_pm = 1 to
jitter it. Together.
The traceplot looks better indeed (attached as tp16000_stepsize.png).
There are many long but narrow spikes, but I'd think that there is not
enough "information" to get a better profile.

Thanks again for your advice.

Sergio

tp16000_stepsize.png

Michael Betancourt

Oct 15, 2013, 6:32:14 PM
to stan-...@googlegroups.com
> I've tried delta = 0.9 to lower the step size and epsilon_pm = 1 to jitter it. Together.
> The traceplot looks better indeed (attached as tp16000_stepsize.png).
> There are many long but narrow spikes, but I'd think that there is not enough "information" to get a better profile.

The narrow spikes are actually good! Hierarchical models tend to have fat tails, and that spiking behavior is exactly
what you want to see when sampling fat tails efficiently.

You can probably stick to a higher delta and leave epsilon_pm at zero. In fact, if you really want to find the optimal
performance, vary delta between 0.6 and 0.99 and record the inferences (mean +/- std-dev or, even better, the percentiles)
of sigma. You should see the inferences stabilize at some value of delta (i.e., they're constant for all greater values of delta) --
that will be the optimal value to use.
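
A sketch of that sweep in R (assuming, as elsewhere in this thread, that delta can be passed straight to stan(); mlogis.sf stands in for the compiled fit object):

deltas <- c(0.6, 0.7, 0.8, 0.9, 0.95, 0.99)
for (d in deltas) {
  fit_d <- stan(fit = mlogis.sf, data = stan_data,
                iter = 16000, chains = 4, delta = d)
  cat("delta =", d, "\n")
  print(fit_d, pars = "sigma_a", probs = c(0.025, 0.5, 0.975))
}
# look for the delta beyond which the sigma_a quantiles stop moving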

Also keep an eye on the tree depth. If it starts to push up against the default max you'll have to increase max_treedepth
yourself to maintain efficient sampling.
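
One way to check that from R (a sketch; treedepth__ is how rstan labels the column in get_sampler_params() output, at least in later versions):

sp <- get_sampler_params(fit_d)  # one matrix per chain, warmup included
sapply(sp, function(ch) max(ch[, "treedepth__"]))
# values sitting at max_treedepth mean trajectories are being truncated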

Sergio Polini

Oct 16, 2013, 7:40:23 AM
to stan-...@googlegroups.com
On 16/10/2013 00:32, Michael Betancourt wrote:
> The narrow spikes are actually good! Hierarchical models tend to have fat tails, and that spiking behavior is exactly
> what you want to see when sampling fat tails efficiently.

Good. But what about the pairs() output?
The off-diagonal plots (file pairs.png attached) look, how should I put it,
somewhat strangely shaped to me.

> You can probably stick to a higher delta and leave epsilon_pm at zero.

I've tried. You are right. Of course ;-)
However, if I set delta = 0.99 and leave epsilon_pm at zero (and probs =
c(0.025, 0.5, 0.975)) I get:

            mean se_mean   sd  2.5%  50% 97.5% n_eff Rhat
mu_a        1.14    0.02 1.05 -0.81 1.11  3.28  4062    1
sigma_a     1.98    0.03 1.37  0.80 1.62  5.48  2517    1

i.e. mu_a in (-0.81, 3.28), sigma_a in (0.80, 5.48).

Setting epsilon_pm = 1 and delta = 0.9:

            mean se_mean   sd  2.5%  50% 97.5% n_eff Rhat
mu_a        1.12    0.02 0.94 -0.81 1.12  3.09  2597    1
sigma_a     1.91    0.03 1.13  0.79 1.60  4.95  1752    1

that looks a bit better. Just by chance?

> In fact, if you really want to find the optimal
> performance vary delta between 0.6 and 0.99 and record the inferences (mean+/- std-dev or, even better, the percentiles)
> of sigma. You should see the inferences stabilize at some value of delta (i.e. they're constant for all greater values of delta) --
> that will be the optimal value to use.

I'm playing with a lot of toy models because I have to work on a tough
real model: a multilevel logistic non-nested model, two 10x10 covariance
matrices, just 25000 observations. A nightmare.
Looking for the optimal value of delta could take a long time...
What if I just try 0.99? Any problems?

> Also keep an eye out on the tree_depth. If it starts to push up against the default max you'll have to increase the max_depth
> yourself to maintain efficient sampling.

May I say that get_sampler_params() isn't that convenient? ;-)
I've written a plot_sampler_params() function (code and plots attached).
Yes, looking at the plots it looks as if I really should increase
max_treedepth.
I'll try.

Thanks again!

Sergio

pairs.png
plot_sampler_params.R
samplerparams_incwarmup.png
samplerparams_nowarmup.png

Michael Betancourt

Oct 16, 2013, 8:32:56 AM
to stan-...@googlegroups.com
Good. But what about the pairs() output?
The off-diagonal plots (file pairs.png attached) look, how should I put it, somewhat strangely shaped to me.

I would say they look as expected -- the mu/sigma correlations exhibit the canonical "funnel" shape, and the banana-like
shape of the sigma/lp correlation is pretty common for multiplicative dependencies.
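
For anyone trying to reproduce that picture, a sketch of one way to get it in R (extract() returns the posterior draws; mlogis.sf again stands in for the fit object):

draws <- extract(mlogis.sf, pars = c("mu_a", "sigma_a", "lp__"))
pairs(as.data.frame(draws))  # the funnel shows up in the mu_a vs. sigma_a panel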
 
You can probably stick to a higher delta and leave epsilon_pm at zero.

I've tried. You are right. Of course ;-)
However, if I set delta = 0.99 and leave epsilon_pm at zero (and probs = c(0.025, 0.5, 0.975)) I get:

            mean se_mean   sd  2.5%  50% 97.5% n_eff Rhat
mu_a        1.14    0.02 1.05 -0.81 1.11  3.28  4062    1
sigma_a     1.98    0.03 1.37  0.80 1.62  5.48  2517    1

i.e. mu_a in (-0.81, 3.28), sigma_a in (0.80, 5.48).

Setting epsilon_pm = 1 and delta = 0.9:

            mean se_mean   sd  2.5%  50% 97.5% n_eff Rhat
mu_a        1.12    0.02 0.94 -0.81 1.12  3.09  2597    1
sigma_a     1.91    0.03 1.13  0.79 1.60  4.95  1752    1

that looks a bit better. Just by chance?

The improvement comes from the lower step sizes you get with the jitter, but you also get half of your transitions performing worse
because for them the jittered step size is larger. I would just keep increasing delta to push the average step size lower.
 
I'm playing with a lot of toy models because I have to work on a tough real model: a multilevel logistic non-nested model, two 10x10 covariance matrices, just 25000 observations. A nightmare.
Looking for the optimal value of delta could take a long time...
What if I just try 0.99? Any problems?

Oh, sure.  A smaller step size is always going to be safer than a larger step size.  But a larger step size is also more efficient, as fewer leapfrog iterations are needed for each trajectory.
If factors of O(1) aren't super important then just run with 0.99, but be sure to run again with an even larger delta (say, when you're writing everything up) to ensure you get consistent results.
 
May I say that get_sampler_params() isn't that convenient? ;-)

I'm a command line guy, so I'll leave others to comment...
 
I've written a plot_sampler_params() function (code and plots attached). Yes, looking at the plots it looks as if I really should increase max_treedepth.
I'll try.

Yeah, if you ever hit up against the max_treedepth boundary it can skew the results.  Increase it and see if the results look the same.

Bob Carpenter

Oct 16, 2013, 1:14:35 PM
to stan-...@googlegroups.com
...snip...

> May I say that get_sampler_params() isn't that convenient? ;-)

How would you like to see it behave? Or see a different
function behave?

- Bob

Sergio Polini

Oct 16, 2013, 1:38:28 PM
to stan-...@googlegroups.com
get_sampler_params() is useful, but very often a plot is better than a
table.
I've already written a plot_sampler_params() function, but I'm improving
it (thin != 1, max_treedepth == -1, ...).
I'll send it in a few hours... or days ;-)

BTW: perhaps get_sampler_params() should show what print() doesn't show
(epsilon, epsilon_pm, delta, gamma, max_treedepth) and then a summary of
treedepth and stepsize.
I'll submit a concrete example -- words are just words.

Sergio

Bob Carpenter

Oct 16, 2013, 1:43:55 PM
to stan-...@googlegroups.com
There is a plot(fit), but the issue with something like this is
the number of parameters can get out of hand in latent variable
models. Same with print(fit), though at least that will all dump
out and let you scroll.

Words are what doc and specifications (there's a thin line between
the two in my world) are made of!

- Bob

Sergio Polini

Oct 16, 2013, 4:07:02 PM
to stan-...@googlegroups.com
I attach my plot_sampler_params() function and an example.
Please look at the example. If you see the color change in the
left-column plots, you know that the tree depth is hitting max_treedepth, so
you _must_ increase max_treedepth (or set it to -1).

"My" get_sampler_params() is coming.

Sergio

plot_sampler_params.R
plot_sampler_params.png
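
(The attached file isn't reproduced in this archive. A minimal sketch of what such a function might look like, assuming get_sampler_params() returns one matrix per chain with columns named treedepth__ and stepsize__; the actual attachment may differ.)

plot_sampler_params <- function(fit, max_treedepth = 10) {
  sp <- get_sampler_params(fit)  # list with one matrix per chain
  op <- par(mfrow = c(length(sp), 2), mar = c(4, 4, 2, 1))
  on.exit(par(op))
  for (i in seq_along(sp)) {
    td <- sp[[i]][, "treedepth__"]
    ss <- sp[[i]][, "stepsize__"]
    # flag iterations whose tree depth hits the cap
    plot(td, pch = 20, cex = 0.3,
         col = ifelse(td >= max_treedepth, "red", "black"),
         xlab = "iteration", ylab = "tree depth",
         main = paste("chain", i, "tree depth"))
    plot(ss, type = "l", log = "y",
         xlab = "iteration", ylab = "step size",
         main = paste("chain", i, "step size"))
  }
}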

sp_r...@yahoo.it

Oct 16, 2013, 4:49:22 PM
to stan-...@googlegroups.com
I attach "my" get_sampler_params().
The output:

> sample_params <- get_sampler_params_2(mlogis_i.sf)
Chain 1
epsilon = -1, epsilon_pm = 1, delta = 0.9, gamma = 0.05, max_treedepth = 10
tree depth: 1 0 0 3 3 4 3 6 7 3 5 6 5 2 4 4 6 5 6 6 ... [TRUNCATED]
step size : 0.0625 0.6844809 0.2434576 0.06489445 ... [TRUNCATED]

Chain 2
epsilon = -1, epsilon_pm = 1, delta = 0.9, gamma = 0.05, max_treedepth = 10
tree depth: 2 0 0 2 2 5 2 4 4 4 3 4 4 3 5 4 3 5 3 3 ... [TRUNCATED]
step size : 0.0625 0.6844809 0.2434576 0.06489445 ... [TRUNCATED]

Chain 3
epsilon = -1, epsilon_pm = 1, delta = 0.9, gamma = 0.05, max_treedepth = 10
tree depth: 0 0 0 2 2 4 3 2 3 3 3 5 3 2 5 4 6 3 2 3 ... [TRUNCATED]
step size : 0.125 0.5515415 0.1498416 0.03424575 ... [TRUNCATED]

Chain 4
epsilon = -1, epsilon_pm = 1, delta = 0.9, gamma = 0.05, max_treedepth = 10
tree depth: 1 0 0 2 4 3 4 6 4 4 4 5 6 4 5 5 5 6 3 6 ... [TRUNCATED]
step size : 0.0625 0.6844809 0.2434576 0.06489445 ... [TRUNCATED]

Of course, the non-truncated data are available in sample_params.

HTH
Sergio

get_sampler_params_2.R