Improving warmup adaptation


Charles Driver

Jun 17, 2016, 9:35:00 AM
to stan-...@googlegroups.com
I've been testing a hierarchical state space model with 500 iterations, adapt_delta = .8, and max_treedepth = 12. I noticed that with the default init buffer and window settings, the stepsize increases substantially, and the treedepth correspondingly drops, after iteration 100. I'm just wondering what the logic behind the warmup window adaptation parameter values is - are there pitfalls to setting the initial buffer and initial window to much lower values, like 5? So far, doing so seems to cut fitting time by about 5x and, if anything, improves Rhat and n_eff.

Bob Carpenter

Jun 17, 2016, 3:23:01 PM
to stan-...@googlegroups.com
The bonus and pitfall are the same: increased
adaptation speed.

We haven't really done systematic tuning of those
window parameters across models, so would actually be
interested in what you're observing.

In particular, you want to check that you get to the
same adaptation parameters (step size and mass matrix).
Adaptation converges when it's generated a good enough
set of draws to estimate (co)variance of the typical
set (where most of the posterior mass is and where MCMC
visits).
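The cross-chain comparison Bob describes can be sketched roughly as follows. This is an illustrative check only - the adapted step sizes and diagonal inverse metrics would have to be pulled out of each chain's output first (e.g. from the CmdStan CSV comments), and the numbers below are made up:

```python
import numpy as np

def adaptation_agrees(step_sizes, inv_metrics, rel_tol=0.5):
    """Check that all chains adapted to roughly the same step size
    and (diagonal) inverse metric, within a relative tolerance."""
    step_sizes = np.asarray(step_sizes, dtype=float)
    inv_metrics = np.asarray(inv_metrics, dtype=float)
    # Relative spread of step sizes across chains.
    step_ok = (step_sizes.max() - step_sizes.min()) / step_sizes.mean() < rel_tol
    # Per-parameter relative spread of the inverse metric diagonals.
    spread = (inv_metrics.max(axis=0) - inv_metrics.min(axis=0)) / inv_metrics.mean(axis=0)
    metric_ok = bool((spread < rel_tol).all())
    return bool(step_ok and metric_ok)

# Hypothetical adapted values from four chains (two parameters each).
step_sizes = [0.042, 0.039, 0.044, 0.041]
inv_metrics = [[1.1, 0.52], [0.9, 0.48], [1.0, 0.55], [1.05, 0.50]]
print(adaptation_agrees(step_sizes, inv_metrics))  # these chains roughly agree
```

The tolerance is a judgment call; wide disagreement in either quantity across chains is the warning sign.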

- Bob

> On Jun 17, 2016, at 9:35 AM, Charles Driver <cdriv...@gmail.com> wrote:
>
> I've been testing a hierarchical state space model with 500 iterations, adapt_delta = .8 and max_treedepth=12. I noticed that the stepsize increases substantially, and treedepth correspondingly drops, after iteration 100. I'm just wondering what the logic behind the warmup window adaptation parameter values is - are there pitfalls to setting the initial buffer and initial window to much lower values, like 5? So far, this seems to cut fitting time by about 5x, and if anything, improves rhat and neff .
>

Charles Driver

Jun 18, 2016, 8:56:30 AM
to Stan users mailing list
OK. I doubt the mass matrix and step size are 'the same' in the 500-iteration case, but why is the slower adaptation version necessarily the better one? So far it seems the opposite is the case.
Testing on a more complex model, this took fitting time from days or weeks down to hours, with good n_eff in the output and no divergences... is there some concern that, even with apparently good results according to all checks, the output may be in some way biased by the faster adaptation approach?
The manual speaks of 'expanding windows' during slow adaptation - does this mean that if I specify a window of 5, this is the first window and later windows are larger? How much larger?

Bob Carpenter

Jun 18, 2016, 9:16:58 AM
to stan-...@googlegroups.com



> On Jun 18, 2016, at 8:56 AM, Charles Driver <cdriv...@gmail.com> wrote:
>
> ok. I doubt the mass matrix and step size are 'the same' for the 500 iteration case, but why is the slower adaptation version necessarily the better one?

If everything's working, the adaptation in all chains should
hit roughly the same step size and mass matrix. Getting to the
right answer faster is better.

> It seems so far the opposite is the case.
> Testing on a more complex model this took fitting time from days or weeks to hours, with good neff in output and no divergences... is there some concern that even with apparently good output results according to all checks, they may be in some way biased by the faster adaptation approach?

It is possible to adapt too quickly and get stuck outside
of the typical set you want to visit for sampling. So we tend to
be conservative in our approach in the hopes that more models adapt
properly.

But as I said last time, we've tried to be conservative with
adaptation and haven't done a lot of testing on different window sizes
or alternative adaptation strategies.

> The manual speaks of 'expanding windows' during slow adaptation - does this mean that if I specify a window of 5, that this is the first window and later windows are larger? how much larger?

Yes. They double each time up until the final window.
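That doubling scheme can be sketched roughly like this. It's a simplified reimplementation for illustration, not Stan's actual code; the default buffer sizes (init_buffer=75, window=25, term_buffer=50) are the documented CmdStan defaults:

```python
def slow_windows(num_warmup=1000, init_buffer=75, term_buffer=50, window=25):
    """Sketch of Stan's expanding-window schedule: after the initial
    step-size buffer, (co)variance-estimation windows double in size,
    with the last window stretched to fill the remaining slow phase."""
    slow_end = num_warmup - term_buffer
    windows, start, size = [], init_buffer, window
    while start < slow_end:
        end = start + size
        # If the next (doubled) window would overrun the slow phase,
        # extend the current window to its end instead.
        if end + 2 * size > slow_end:
            end = slow_end
        windows.append((start, end))
        start, size = end, size * 2
    return windows

print(slow_windows())                               # default schedule
print(slow_windows(500, init_buffer=5, window=5))   # the settings discussed above
```

With a starting window of 5, the early windows are tiny (5, 10, 20, ...), but the final window still absorbs most of the slow phase, which is why the late-warmup estimates can still end up reasonable.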

- Bob

Charles Driver

Jun 18, 2016, 9:54:41 AM
to Stan users mailing list

> If everything's working, the adaptation in all chains should
> hit roughly the same step size and mass matrix. Getting to the
> right answer faster is better.


Right, yes. I guess this is a good indicator that even though the chains looked OK to my eye, the adaptation parameters on the slow version were not - there were quite substantial differences in step size.
 
> It is possible to adapt too quickly and get stuck outside
> of the typical set you want to visit for sampling.

But do such cases generate otherwise healthy looking output? I would find that surprising and a little worrying!
 
> So we tend to
> be conservative in our approach in the hopes that more models adapt
> properly.

I agree that increasing robustness at a possible cost in speed is a reasonable approach for defaults, within limits of course :) Though given the results here, I do wonder how many other 'slow hierarchical model' questions are in part a result of this. I was dropping parameters from the model to deal with it before stumbling on this issue...
 
> The manual speaks of 'expanding windows' during slow adaptation - does this mean that if I specify a window of 5, that this is the first window and later windows are larger? how much larger?

> Yes. They double each time up until the final window.

Great, then it seems so long as sampling reaches the typical set, there should be no concerns about having started with a low value... 

Krzysztof Sakrejda

Jun 18, 2016, 4:06:08 PM
to Stan users mailing list
Charles, can you post an example where this occurs? I've only ever run into examples where the fast version is pathological. Krzysztof

Bob Carpenter

Jun 18, 2016, 5:03:20 PM
to stan-...@googlegroups.com

> On Jun 18, 2016, at 9:54 AM, Charles Driver <cdriv...@gmail.com> wrote:
>
>
> If everything's working, the adaptation in all chains should
> hit roughly the same step size and mass matrix. Getting to the
> right answer faster is better.
>
>
> Right yes. I guess this is a good indicator that even though the chains looked ok to my eye, the adaptation parameters on the slow version were not - there were quite some differences in step size.

Were you getting divergences? That can indicate that
the step size is too large. Otherwise, it should be OK.

>
> It is possible to adapt too quickly and get stuck outside
> of the typical set you want to visit for sampling.
>
> But do such cases generate otherwise healthy looking output? I would find that surprising and a little worrying!

They could, but typically won't for HMC. What will happen
is that there will be divergences or bad mixing when
starting from diffuse starting points.

> So we tend to
> be conservative in our approach in the hopes that more models adapt
> properly.
>
> I agree that increasing robustness at possible cost of speed is a reasonable approach for defaults, within limits of course :) Though given the results here, I do wonder how many other 'slow hierarchical model' questions are in part a result of this. I was dropping parameters from the model to deal with it before stumbling on this issue...

Certainly something we could look into. As Krzysztof said,
it'd help if you can share the example.

> > The manual speaks of 'expanding windows' during slow adaptation - does this mean that if I specify a window of 5, that this is the first window and later windows are larger? how much larger?
>
> Yes. They double each time up until the final window.
>
> Great, then it seems so long as sampling reaches the typical set, there should be no concerns about having started with a low value...

No, it'll wind up with a window of roughly half the warmup iterations.
The main problem that arises is if one chain doesn't hit the proper
mean and variance; R-hat will usually pick that up if you have diffuse
inits.
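As a rough illustration of that check, here's a minimal potential-scale-reduction computation in the style of Gelman and Rubin. This is a simplified (non-split) version for illustration - Stan's interfaces compute a split variant - and the synthetic chains below just mimic one chain missing the right mean:

```python
import numpy as np

def rhat(chains):
    """Basic potential scale reduction factor.
    `chains` is an (m, n) array: m chains of n draws each."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled variance estimate
    return float(np.sqrt(var_plus / W))

rng = np.random.default_rng(0)
good = rng.normal(0.0, 1.0, size=(4, 1000))  # all chains share one target
bad = good + np.arange(4)[:, None]           # each chain shifted to a different mean
print(rhat(good))  # close to 1
print(rhat(bad))   # well above 1
```

A chain that adapted badly and got stuck off-target inflates the between-chain variance and pushes R-hat above 1, provided the inits were diffuse enough to expose it.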

- Bob

Charles Driver

Jun 19, 2016, 6:14:51 AM
to Stan users mailing list
I can't post the real-data problem where it has been most evident, but I'm running a few simulations and will see if I can recreate it...