Stan in parallel: chains versus iterations

John Palmer

Feb 16, 2015, 6:10:14 AM2/16/15
to stan-...@googlegroups.com
I've been trying to figure out the best way to run Stan in a parallel environment, and I would be very curious to hear people's thoughts on the best ways to choose number of chains and number of iterations. After some experiments, I've come to the conclusion that, if one has access to multiple cores, then it makes sense to set iterations to a value just above the number necessary for convergence, and then run as many chains as are needed for the desired sample size. So for instance, if my model seems to be converging after 500 iterations, I could run 10 chains (each on its own core) with 600 iterations each, set the warmup to 500, and end up with a total sample size of 1000 after combining the chains.
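
A sketch of that budget arithmetic in plain Python (no Stan call involved): the combined sample size is just chains times post-warmup iterations per chain.

```python
# Sketch of the chain/iteration budget described above:
# total post-warmup draws = chains * (iter - warmup).

def post_warmup_draws(chains, iters, warmup):
    """Total post-warmup draws pooled across chains."""
    if warmup >= iters:
        raise ValueError("warmup must be smaller than iters")
    return chains * (iters - warmup)

# The example above: 10 chains, 600 iterations each, 500 of them warmup.
print(post_warmup_draws(chains=10, iters=600, warmup=500))  # 1000
```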

Does this make sense? Are there better ways to approach this? My goal is to parallelize as much as possible in order to optimize speed (and I am dealing with models that otherwise take a long time because of the size of the datasets).

Thanks,
John

Tamas Papp

Feb 16, 2015, 6:36:38 AM2/16/15
to stan-...@googlegroups.com
I am curious how you can determine the number of iterations "just above
the number necessary for convergence". In my experience, diagnosing
(non)convergence is tricky.

My approach is somewhat unscientific, but this is what I do:

1. Build up models gradually, making sure that the results make sense
with the addition of each new component. If something causes a drastic
change, it is suspect.

2. Try to get effective samples, using transformations, the "Matt
trick", etc. Also I try to optimize the Stan program for speed
(vectorization, etc) whenever possible. For the latter, I do test runs
on a single chain, for about 100 iterations, and time them. I have found
profiling with operf & co cumbersome and not really informative (maybe I
am not doing it right), but for my purposes simply benchmarking and
tweaking work fine.

3. If I have achieved reasonably effective sampling, 1000 samples per
chain are often enough for my purposes.

4. Then I run 3 to 7 chains (depending on whether I use my 4-core laptop
or a more powerful multicore server), with overdispersed starting
points, and check convergence diagnostics.

IMO step 3 above is the most work but also provides the largest
benefit. YMMV.
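
The diagnostic in step 4 can be sketched in a few lines. Below is a simplified split-R̂ in numpy: split each chain in half, compare between-half and within-half variances. This is a sketch of the idea, not the exact estimator Stan's interfaces report.

```python
import numpy as np

def split_rhat(chains):
    """Split-Rhat sketch: split each chain in half and compare
    between-half and within-half variances.
    chains: array of shape (n_chains, n_draws), n_draws even."""
    m, n = chains.shape
    halves = chains.reshape(2 * m, n // 2)   # each row split into two halves
    h = halves.shape[1]
    means = halves.mean(axis=1)
    W = halves.var(axis=1, ddof=1).mean()    # mean within-half variance
    B = h * means.var(ddof=1)                # between-half variance
    var_plus = (h - 1) / h * W + B / h       # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 1000))       # well-mixed chains: Rhat near 1
stuck = mixed + np.arange(4)[:, None]    # offset chains: Rhat well above 1
print(split_rhat(mixed), split_rhat(stuck))
```

Values near 1 are consistent with convergence; values well above 1 mean the overdispersed chains have not found the same distribution.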

HTH,

Tamas

Bob Carpenter

Feb 16, 2015, 6:57:56 AM2/16/15
to stan-...@googlegroups.com
The main problem is that the number of iterations required
for convergence is a function of initialization, and pretty
much impossible to diagnose until you've gone way past what
you need. Until you've made 50 or so draws *after warmup*, it's hard
to diagnose non-convergence.

This is why we haven't automated any of this or
worried too much about restarting.

There are some interesting open research questions around all of
this, such as sharing adaptation across chains and minimizing the
number of post-warmup iterations required to diagnose convergence. The whole
convergence thing is a can of worms, as you may have gathered if
you're following Ben's and Andrew's and Michael's discussion on
the stan-dev mailing list.

To address Tamas's specific procedure:

(1) by all means, this is the way to go.

(2) ditto.

(4) ditto.

(3) usually, we run until the MCMC std error is an
order of magnitude or so below the posterior std deviation.
And that's usually well below 1000 iterations total unless
we need very precise answers or have a model with difficult geometry
where the effective sample size rate per iteration is low.

We usually get there by doubling. Start with 100 iterations
total, half warmup, half sampling, and if that's not enough,
start again and double it. At most you waste about 50% of
your time compared to the absolute ideal (which as Tamas points
out is very hard to diagnose).
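
That stopping rule is easy to state as code. A sketch in numpy, with made-up helper names; the effective sample size `ess` would come from a separate estimate (e.g. the one Stan's interfaces print), not from this snippet:

```python
import numpy as np

def mcmc_se(draws, ess):
    """MCMC standard error of the posterior mean, given an
    effective-sample-size estimate computed elsewhere."""
    return draws.std(ddof=1) / np.sqrt(ess)

def precise_enough(draws, ess, ratio=0.1):
    """Stop-doubling rule: true once the MCMC standard error is
    at most `ratio` times the posterior sd, i.e. ess >= 1/ratio**2."""
    return mcmc_se(draws, ess) <= ratio * draws.std(ddof=1)

rng = np.random.default_rng(1)
draws = rng.normal(size=200)
print(precise_enough(draws, ess=200))  # True: the rule needs ess >= 100
print(precise_enough(draws, ess=50))   # False: keep doubling
```

With ratio 0.1 the rule reduces to needing an effective sample size of at least 100, which is why the doubling usually stops well under 1000 iterations for easy geometries.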

- Bob
> --
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

Tamas Papp

Feb 16, 2015, 7:23:10 AM2/16/15
to stan-...@googlegroups.com
Regarding 4: is there a way to make it easier (semi-automated) to run
chains with overdispersed starting points?

I was thinking of something like this: given a posterior distribution of
parameters in the R^n space that Stan uses internally, estimate a
multivariate normal or T, scale up the variance, and draw some initial
values from that.

This would be obvious to implement if it wasn't for the transformation
between R^n and the constrained parameter space: given the posterior I
would need to undo this, and then in rstan I could not find a way to
init with internal R^n values. So mostly now my overdispersed points are
ad hoc.
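
The proposal itself is only a few lines, with the caveat raised above: `draws` would have to be on the unconstrained R^n scale already, which is the part rstan did not make easy. A numpy sketch (function name made up):

```python
import numpy as np

def overdispersed_inits(draws, n_chains, inflate=4.0, seed=0):
    """Fit a multivariate normal to posterior draws
    (shape n_draws x n_params, assumed already unconstrained)
    and sample starting points with the covariance inflated."""
    rng = np.random.default_rng(seed)
    mu = draws.mean(axis=0)
    cov = np.cov(draws, rowvar=False)
    return rng.multivariate_normal(mu, inflate * cov, size=n_chains)

# e.g. draws from a previous 2-parameter run -> 4 overdispersed starts
prev = np.random.default_rng(2).normal(size=(500, 2))
inits = overdispersed_inits(prev, n_chains=4)
print(inits.shape)  # (4, 2)
```

Inflating the covariance by 4 doubles each marginal standard deviation; any transform back to the constrained scale is left out here.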

The recent discussion about convergence was interesting, but at some
point started going over my head. Having a nice canned convergence
metric would be useful I guess, but for me the most reliable indicator
of convergence is stability of results when building up the model
gradually. Writing multiple models used to be very costly for me, Stan
helped a lot here so it is not a big deal any more (surprisingly, now
analysing the results takes more time than programming/running the
model).

Best,

Tamas

Bob Carpenter

Feb 16, 2015, 7:34:55 AM2/16/15
to stan-...@googlegroups.com

> On Feb 16, 2015, at 11:23 PM, Tamas Papp <tkp...@gmail.com> wrote:
>
> Regarding 4: is there a way to make it easier (semi-automated) to run
> chains with overdispersed starting points?

That's the default, taking random inits in (-2,2) on the unconstrained
scale.

> I was thinking of something like this: given a posterior distribution of
> parameters in the R^n space that Stan uses internally, estimate a
> multivariate normal or T, scale up the variance, and draw some initial
> values from that.

Isn't that circular? How do you know what the posterior is?

> This would be obvious to implement if it wasn't for the transformation
> between R^n and the constrained parameter space: given the posterior I
> would need to undo this, and then in rstan I could not find a way to
> init with internal R^n values. So mostly now my overdispersed points are
> ad hoc.

It's possible --- the transforms are exposed. Ben or Jiqiang can
probably give you a one-liner (I'm not an R expert and pretty much
always just use our default init strategy).

> The recent discussion about convergence was interesting, but at some
> point started going over my head.

Me, too.

> Having a nice canned convergence
> metric would be useful I guess, but for me the most reliable indicator
> of convergence is stability of results when building up the model
> gradually. Writing multiple models used to be very costly for me, Stan
> helped a lot here so it is not a big deal any more (surprisingly, now
> analysing the results takes more time than programming/running the
> model).

Glad to hear it. I keep trying to convince our users it's
faster to start simple and build to complex than to start
complex and try to debug. This is related to the computer science
dictum that it's easier to optimize a working program than debug
an optimized program.

Would you mind sharing what you were using before Stan and what the
difficulty was? I used the same strategy with BUGS/JAGS and the
main problem I had was chains getting stuck without reasonable inits.

- Bob

Tamas Papp

Feb 16, 2015, 7:44:37 AM2/16/15
to stan-...@googlegroups.com

On Mon, Feb 16 2015, Bob Carpenter wrote:

>> On Feb 16, 2015, at 11:23 PM, Tamas Papp <tkp...@gmail.com> wrote:
>>
>> I was thinking of something like this: given a posterior distribution of
>> parameters in the R^n space that Stan uses internally, estimate a
>> multivariate normal or T, scale up the variance, and draw some initial
>> values from that.
>
> Isn't that circular? How do you know what the posterior is?

Sorry, I did not mention that I would estimate from a previous run. So I
would run the chain twice, once with the default inits, then, based on
the results, with the overdispersed inits.

This is probably horribly unscientific though and may be a habit that is
now not necessary.

> Would you mind sharing what were you using before Stan and what the
> difficulty was? I used the same strategy with BUGS/JAGS and the
> main problem I had was chains getting stuck without reasonable inits.

Custom code with a mix of Gibbs and Metropolis jumps, which was a
debugging nightmare but I could make it fast. Now completely superseded
by Stan for my purposes.

Best,

Tamas

John Palmer

Feb 16, 2015, 7:49:15 AM2/16/15
to stan-...@googlegroups.com
Thanks very much to both of you - this is all very helpful.

I realize convergence is a big can of worms and I'm probably overly confident when I look at traceplots and rhat values and tell myself that my models have converged. (And I know I can only tell myself that there is no evidence that they have not converged, not that there is evidence they have converged.) I hadn't been following the stan-dev discussion but I will do so now.

I also agree with Tamas about how nice it is to work with Stan in terms of building up models. I had been using BUGS and writing different BUGS code for each model. The key for me has been seeing Germán Rodríguez's suggestion of using a row vector for predictors in Stan and thus making the Stan code much more generalizable: http://data.princeton.edu/pop510/hospStan.html. Perhaps this is also possible with BUGS, and I just hadn't realized it. My approach now is to write one or two different skeletons in Stan code (e.g. a one-level version and a two-level version), and then write a function in R that lets me fit each of these skeletons with different sets of variables. Building up the models then becomes a lot like the way one would approach it with a set of single-level GLMs (apart from the fact that fitting time is usually longer).

I'm still left with a question about how best to make use of available cores. Say I take all of Tamas's suggestions for optimization, I use Bob's doubling approach to come up with some target number of iterations, I have reason to think that this number will hold for a set of similar models that I am fitting, and someone now gives me access to lots of cores -- like 16 or even 40. How can I take most advantage of those cores? Since increasing the number of chains will speed things up only if I also cut down on iterations, the question is really how to shift work from iterations to chains. From Bob's post, I take it that I should make sure to have at least 50 iterations beyond warmup, otherwise diagnosing non-convergence will be harder. But within that constraint, is it reasonable to shift post-warmup iterations onto extra chains in order to take advantage of the cores?
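
The shift being asked about here is mostly arithmetic. A hypothetical sketch (`plan_chains` is made up, not part of any Stan interface): pack as many chains onto cores as possible while keeping at least ~50 post-warmup draws per chain so non-convergence stays diagnosable.

```python
import math

def plan_chains(target_draws, warmup, cores, min_draws=50):
    """Split a target post-warmup sample across cores, keeping at
    least min_draws post-warmup iterations on every chain."""
    chains = min(cores, max(1, target_draws // min_draws))
    per_chain = math.ceil(target_draws / chains)
    return {"chains": chains,
            "iter": warmup + per_chain,   # per-chain total iterations
            "post_warmup": per_chain}

print(plan_chains(target_draws=1000, warmup=500, cores=16))
# {'chains': 16, 'iter': 563, 'post_warmup': 63}

print(plan_chains(target_draws=1000, warmup=500, cores=40))
# the 50-draw floor caps this at 20 chains, not 40
```

Note that every extra chain still pays the full 500-iteration warmup, which is exactly the overhead discussed in the replies below.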

Thanks again,
John 



Bob Carpenter

Feb 16, 2015, 5:41:56 PM2/16/15
to stan-...@googlegroups.com
I'm afraid we don't yet have any way to take advantage of multiple
cores beyond one per chain. That's going to be on us to
replace things like matrix operations with parallelized
versions. Or, more ambitiously, to figure out how to
multithread the gradient calculations.

Given that warmup takes longer than sampling in most cases,
it's hard to get much speedup that way --- the upper bound is
a factor of a bit less than 2.
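
That bound follows from Amdahl-style arithmetic: each chain still pays its own W warmup iterations, so splitting S sampling iterations over c chains gives a wall-clock speedup of (W + S) / (W + S/c), which tends to (W + S) / W as c grows. A sketch, assuming comparable cost per iteration:

```python
def speedup(warmup, sampling, chains):
    """Single-chain wall-clock time over parallel wall-clock time,
    with warmup repeated on every chain."""
    return (warmup + sampling) / (warmup + sampling / chains)

print(speedup(500, 500, 10))    # ~1.82
print(speedup(500, 500, 1000))  # ~2.0: the ceiling when warmup == sampling
```

When warmup and sampling are the same length, no number of chains gets past a factor of 2, matching the bound above.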

- Bob

Michael Betancourt

Feb 16, 2015, 6:28:55 PM2/16/15
to stan-...@googlegroups.com
A few comments:

Assuming well-behaved chains (geometric ergodicity and the like)
then running chains in parallel should yield the same inference as
running a full chain, so running in parallel would give you a pretty
solid 1 / cores speed up.

In practice parallel chains are most useful for identifying when
the chains can’t converge or mix efficiently. I think we’ve focused
(and rightly so) more on robustness than raw performance when
thinking about parallelization.

Most of the time we spend in warmup for Stan is actually focused
on learning the mass matrix and step size — convergence for most
models is _really_ fast. Eventually we’ll admit user-defined mass
matrices where you can set a good estimate of the mass matrix
and avoid all of this tuning overhead when running in parallel.

As Bob notes, the really scalable speedups will come from
internal parallelization of matrix calcs and probability distribution calcs.
But this is going to be haaaaaaard.

Dan Stowell

Feb 17, 2015, 3:41:31 AM2/17/15
to stan-...@googlegroups.com
Michael,

So this suggests that if there were a way to run one chain and then
export its decisions about mass matrix and step size, then we could
import those settings into a set of parallel chains with much shorter
warmup, and that would then be an efficient way to run on a compute
cluster. That should be much easier to implement than the internal
parallelisation. Without it, this discussion suggests that the need
for each parallel chain to run its own burn-in makes it rarely
worth parallelising?

Best
Dan
http://www.mcld.co.uk

Michael Betancourt

Feb 17, 2015, 4:25:57 AM2/17/15
to stan-...@googlegroups.com
The only way you can guarantee correct sampling is when a single
chain visits every state of high probability — in a multimodal posterior
this means visiting each mode. Running multiple chains that each
fall into different modes is not helpful as there’s no information on
how to weight the modes relative to each other.

Consequently, _if we assume that our sampler can explore the
entire state space efficiently_ then running multiple chains should
just give a linear speed up. If we do not have the efficient exploration
then running multiple chains doesn’t offer much help because naively
concatenating their samples does not yield correct inferences.
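
Toy numbers make the weighting problem concrete. Two chains, each stuck in one mode of a 0.9/0.1 mixture of N(-5, 1) and N(+5, 1): concatenating them weights the modes 50/50, so the estimated mean lands far from the truth.

```python
import numpy as np

rng = np.random.default_rng(0)
chain_a = rng.normal(-5.0, 1.0, size=5000)   # stuck in the 0.9 mode
chain_b = rng.normal(+5.0, 1.0, size=5000)   # stuck in the 0.1 mode

true_mean = 0.9 * (-5.0) + 0.1 * 5.0         # -4.0 under the mixture
naive_mean = np.concatenate([chain_a, chain_b]).mean()

# The naive pooled estimate sits near 0, not near -4: nothing in the
# two chains says how to weight the modes against each other.
print(true_mean, round(naive_mean, 2))
```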

In practice we get sublinear speed up because of the adaptation
during warmup. Improving adapting by starting with good initial
values or sharing information between chains should reduce this
overhead and bring you closer to the ideal linear speed up.

But this all assumes that your sampler is ideal. The most powerful
use of multiple chains is to check for poor convergence or mixing,
and so it’s more about adding robustness than improving speed.