Continuing sampling


Jonathon Love

Apr 10, 2015, 1:34:05 PM
to stan-...@googlegroups.com
hi,

after performing some sampling, with say the stan() function (in say, Rstan), is it possible to come back and continue sampling?

i.e. in the situation that the chains haven't converged yet, can you choose to run another 1000 samples? or is it necessary to start over (including the warm-up period)?

with thanks

Luc Coffeng

Apr 10, 2015, 3:00:36 PM
to stan-...@googlegroups.com
You have to start over as the full state of the Hamiltonian is not saved in the Stan object. What I usually do is start with a small number of iterations, and when I don't reach convergence, double that number. Rinse and repeat.

Andrew Gelman

Apr 10, 2015, 4:22:48 PM
to stan-...@googlegroups.com
Soon I think we’ll have a version that allows the mass matrix to be specified in the Stan call!

On Apr 10, 2015, at 3:00 PM, Luc Coffeng <lucco...@gmail.com> wrote:

> You have to start over as the full state of the Hamiltonian is not saved in the Stan object. What I usually do is start with a small number of iterations, and when I don't reach convergence, double that number. Rinse and repeat.


Bob Carpenter

Apr 10, 2015, 11:11:57 PM
to stan-...@googlegroups.com
I need to add some description of this to the manual, because
this keeps coming up.

You absolutely cannot just keep going. Nor do you want to.
The problem is that if you haven't reached convergence, then
you want to do more warmup, and then more sampling. So you'd
have to toss out the sampling part of the run anyway.

Andrew's the one who urged us to do it this way in the first
place based on the following argument.

If you run 1000 warmup iterations and 1000 sampling iterations
and haven't converged, then you want to go back and run
2000 warmup iterations and 2000 sampling iterations. If you
were to restart after the original warmup iterations, you'd save
around 15% of your work. Here's an example with the default
number of iterations (which is probably way too high):

rerun from the beginning:

1000 + 1000
2000 + 2000
-----------
3000 + 3000

rewarmup:

1000 + 1000
0 + 1000 + 2000
-----------
2000 + 3000

The exception to this is when you have converged during warmup, but
just haven't drawn enough iterations for the n_eff you need. In that
case, you'd get this:

resample:

1000 + 1000
0 + 1000
-----------
3000

restart:

1000 + 1000
1000 + 2000
-----------
5000

So you'd save about 40% of your total effort if you could just run more
samples.
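Bob's bookkeeping above is easy to check mechanically. Here's a small Python sketch (the iteration counts are just the ones from his example, nothing Stan-specific):

```python
# Tally total iterations (warmup + sampling) for each strategy in the
# example above. Each run is a (warmup, sampling) pair.
def total_cost(runs):
    return sum(w + s for w, s in runs)

# Not converged after 1000 + 1000, so everything gets doubled:
rerun    = total_cost([(1000, 1000), (2000, 2000)])  # restart from scratch
rewarmup = total_cost([(1000, 1000), (1000, 2000)])  # re-warm, then sample
savings_rewarmup = 1 - rewarmup / rerun              # about 17%, i.e. "around 15%"

# Converged during warmup, just short on n_eff:
resample = total_cost([(1000, 1000), (0, 1000)])     # hypothetical "keep sampling"
restart  = total_cost([(1000, 1000), (1000, 2000)])  # what you must do instead
savings_resample = 1 - resample / restart            # 40%
```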

- Bob

P.S. I don't know about "soon" --- there's a huge backlog on our
to-do list.

Jonathon Love

Apr 11, 2015, 3:51:21 AM
to stan-...@googlegroups.com
thanks for the explanation bob.

this needing to rerun the warmup stage is a slightly different way of thinking about sampling for me (and one of the unique things about stan from what i can gather)

cheers




Bob Carpenter

Apr 11, 2015, 5:07:57 AM
to stan-...@googlegroups.com
The issue of needing to throw away warmup iterations before
convergence is not at all unique to Stan.

Many MCMC algorithms perform a number of warmup (burnin, adaption,
or whatever you call it) iterations to tune parameters of
the algorithm, then fix the algorithm parameters before converting
to a properly Markovian regime to do sampling. The reason you
can't keep the warmup iterations is twofold:

1. they often don't form a Markov chain. This is true for Stan
and it would also be true of a Metropolis algorithm where you
are estimating the covariance matrix with which to do jumping
proposals and estimating a step size to tune rejection rate.

2. they typically aren't reasonable draws from the posterior if
you start with random inits far out in the tails. If you keep
the warmup iterations then you'll bias the final estimates.

I'm not sure what BUGS does for its adaptive rejection
sampling during warmup --- that is, I don't know if you can
properly use all the draws or if you need to throw away the
warmup draws. Usually you want to throw away warmup draws
before you've converged to the high mass volume of the
posterior anyway because even though they will wash out
asymptotically, they will bias a small-ish finite sample.

I'm also not sure what JAGS does with its slice sampling, which
can also be adapted.

Some conjugate models don't need to be adapted at all, but
those still suffer from issue (2).

A way to get around issue (1) is to gradually decline adaption
according to something like a Robbins-Monro strategy --- that
can make the result acceptable, but it'll still suffer from
issue (2) if you don't start from a reasonable draw from the
posterior.
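For the curious, declining adaptation in the Robbins-Monro style looks something like this toy random-walk Metropolis sampler (purely illustrative Python; this is not Stan's adaptation scheme, and the 0.44 target acceptance rate is just the classic 1-D rule of thumb):

```python
import math
import random

def rwm_robbins_monro(log_p, x0=0.0, iters=5000, target=0.44, seed=1):
    """Random-walk Metropolis whose proposal scale is adapted by a
    Robbins-Monro rule: the adjustment shrinks like t^-0.6, so the
    adaptation gradually dies out instead of stopping abruptly."""
    rng = random.Random(seed)
    x, log_scale = x0, 0.0
    draws = []
    for t in range(1, iters + 1):
        prop = x + math.exp(log_scale) * rng.gauss(0, 1)
        accept_prob = min(1.0, math.exp(log_p(prop) - log_p(x)))
        if rng.random() < accept_prob:
            x = prop
        # Decaying gain: late iterations behave (approximately) like
        # draws from a fixed Markov chain.
        log_scale += t ** -0.6 * (accept_prob - target)
        draws.append(x)
    return draws, math.exp(log_scale)

# Standard-normal target; the adapted scale should settle near the
# classic 1-D optimum of roughly 2.4.
draws, scale = rwm_robbins_monro(lambda x: -0.5 * x * x)
```

As Bob notes, even with this trick the early draws are still biased by the initialization, so you would discard them anyway.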

- Bob

Jonathon Love

Apr 11, 2015, 5:31:14 AM
to stan-...@googlegroups.com
The issue of needing to throw away warmup iterations before
convergence is not at all unique to Stan.


ah yes, sorry, i was meaning a distinction between 'warm-up' and 'burn-in'.

in my experience, burn-in is just discarding samples from before convergence is achieved. i.e. the following pseudocode would work:

while convergence not achieved
    draw more samples

discard samples  # this is discarding your burn-in

while not enough samples for posterior estimates
    draw more samples
   
end pseudo code

in this scenario, it's easy to continue sampling if convergence isn't achieved, however the 'warm-up' is a bit more complicated; requiring the following:

do
    draw N warm-up samples
    draw M normal samples
   
    if converged
        break
    else
        double N, and M

while true

end pseudo code

so that's what i mean about warm-up being unique to stan (although it is probably done in other places as well, i'm quite new to this).

please let me know if my understanding (and my two blocks of pseudocode) are wrong!

many thanks
 

Bob Carpenter

Apr 11, 2015, 11:44:57 AM
to stan-...@googlegroups.com

> On Apr 11, 2015, at 7:31 PM, Jonathon Love <jonath...@gmail.com> wrote:
>
> ah yes, sorry, i was meaning a distinction between 'warm-up' and 'burn-in'.

That's just Andrew wanting to change the names of everything.
He'd rename what BUGS and JAGS called it if they'd let him.

> in my experience,

What MCMC is that?

> burn-in is just discarding samples from before convergence is achieved. i.e. the following pseudocode would work:
>
> while convergence not achieved
> draw more samples

How do you test if you've achieved convergence?

> discard samples # this is discarding your burn-in
>
> while not enough samples for posterior estimates
> draw more samples
>
> end pseudo code

You can't do the same thing in Stan or any other MCMC that
uses adaptation during burnin/warmup/adaptation, which includes
adaptive Metropolis, for example.

> in this scenario, it's easy to continue sampling if convergence isn't achieved, however the 'warm-up' is a bit more complicated; requiring the following:
>
> do
> draw N warm-up samples
> draw M normal samples
>
> if converged
> break
> else
> double N, and M
>
> while true
>
> end pseudo code

That's what you want to do, though the doubling of N and
M isn't necessary --- you can use any size increase you want.

The convergence test is R-hat, but that won't quite technically
be an "if converged", but rather "if can't reject non-convergence",
which is subtly different --- (split) R-hat being near 1 doesn't guarantee
you've converged. And it can be especially misleading if you only
run a single chain.
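As a rough illustration of what the split R-hat diagnostic computes (a simplified sketch in Python, not RStan's exact implementation):

```python
import math
import random

def split_rhat(chains):
    """Simplified split potential scale reduction: halve each chain,
    then compare between- and within-half-chain variances."""
    halves = []
    for c in chains:
        n = len(c) // 2
        halves += [c[:n], c[n:2 * n]]
    m, n = len(halves), len(halves[0])
    means = [sum(h) / n for h in halves]
    grand = sum(means) / m
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)  # between
    w = sum(sum((v - mu) ** 2 for v in h) / (n - 1)
            for h, mu in zip(halves, means)) / m              # within
    return math.sqrt(((n - 1) / n * w + b / n) / w)

# Four well-mixed chains of iid draws: R-hat should sit very near 1.
rng = random.Random(0)
chains = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
rhat = split_rhat(chains)
```

Note Bob's caveat applies here too: a value near 1 is consistent with convergence but does not prove it, especially with a single chain.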

We don't have a way to keep adding more sampling iterations, but will
add that at some point in the future. But as I said in the last
mail, the savings won't be great.

> so that's what i mean about warm-up being unique to stan (although it is probably done in other places as well, i'm quite new to this).

Does anybody know what BUGS or JAGS do for their "burnin" period?
Is there any kind of adaptation that renders the result not from
an MCMC chain?

- Bob

Michael Betancourt

Apr 12, 2015, 6:21:19 AM
to stan-...@googlegroups.com
Markov Chain Monte Carlo 101:

In general we construct a Markov chain that generates
a series of states, x_{n}, such that as the number of
states, N, goes to infinity,

lim_{N \rightarrow \infty} 1/N \sum_{n = 0}^{N} f(x_{n})
= \int dx p(x) f(x),

where p(x) is our posterior distribution and f(x) is any
sufficiently nice function.

In practice we’re not interested in the N \rightarrow \infty
limit because we’ll never get there — we want to know
what happens for finite N. Assuming that the Markov
chain and the posterior distribution play nicely with
each other (a property called geometric ergodicity)
then we have for some large enough but still finite N,

1/N \sum_{n = 0}^{N} f(x_{n}) ~ normal( \int dx p(x) f(x), Var[f] / N).

In words, the Markov chain Monte Carlo estimator,
1/N \sum_{n = 0}^{N} f(x_{n}), is distributed around the
true expectation, \int dx p(x) f(x), with a variance that
shrinks with the number of samples generated. We
say that the Markov chain Monte Carlo estimator
_converges in distribution_ towards the true value.
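A quick way to see this shrinking variance numerically, using plain independent draws rather than a Markov chain (illustrative Python, with f(x) = x^2 and p = N(0, 1), so the true expectation is 1):

```python
import random
import statistics

def mc_estimate(n, rng):
    """Monte Carlo estimator of E[f(x)] with f(x) = x^2, x ~ N(0, 1);
    the true value is 1."""
    return sum(rng.gauss(0, 1) ** 2 for _ in range(n)) / n

# The spread of the estimator over many replications shrinks like
# sqrt(Var[f] / N): multiplying N by 100 cuts the spread about 10x.
rng = random.Random(7)
sd_small = statistics.pstdev([mc_estimate(100, rng) for _ in range(200)])
sd_big   = statistics.pstdev([mc_estimate(10000, rng) for _ in range(200)])
```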

But we can actually do better — although we have the
above property for a Markov chain started at _any_ initial
point, those chains started at points far off in the tails
away from the posterior mass will take a while to reach
the posterior mass. Those initial samples bias Markov
chain Monte Carlo estimators away from their true values
and we need a pretty large N in order for the rest of
the samples to wash out that initial bias and recover
the convergence above. But if we simply remove those
initial samples then we can dramatically improve the
precision of our Markov chain Monte Carlo estimator.

Warmup is just the discarding of these initial samples
to yield better estimators. It’s not necessary, but it
means that we can get better estimators in fewer
samples. Why call it warmup and not burn in? Well
burn in has a certain connotation to it — when you
burn a product in you stress test it, throwing away
items that fail and keeping the ones that work. Running
a bunch of Markov chains and then throwing away the
ones that fail (get stuck, behave weird, etc.) is very,
very bad (with Markov chains, if one chain fails then
they’ve all failed). Warmup is more evocative of
what’s really going on, which is an equilibration process.

Finally, what’s all this adaptation business about?
The above results all hold for a Markov chain that is
constant — if you start modifying the Markov chain
_while_ generating samples then you lose all of the
precious convergence results. There are ways of
very carefully modifying Markov chains while maintaining
some results, but they are fragile and hard to implement
and you only get N \rightarrow \infty results, not finite
N results. Because we discard warmup iterations,
however, we can screw with the Markov chain in that
stage without compromising the validity of the samples
that we keep.
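The bias from a bad initialization, and how discarding the transient removes it, can be seen with a toy example (an AR(1) chain whose stationary distribution is N(0, 1), deliberately started far out in the tail; illustrative Python, nothing to do with Stan's actual sampler):

```python
import math
import random

rng = random.Random(42)

# AR(1) chain with stationary distribution N(0, 1), started way out
# in the tail at x0 = 100 to mimic a bad random initialization.
rho, x = 0.95, 100.0
draws = []
for _ in range(2000):
    x = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
    draws.append(x)

# The raw estimator of E[x] = 0 is badly biased by the initial
# transient; discarding the first 500 "warmup" draws removes most of it.
mean_raw  = sum(draws) / len(draws)
mean_warm = sum(draws[500:]) / len(draws[500:])
```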


Andrew Gelman

Apr 12, 2015, 4:54:02 PM
to stan-...@googlegroups.com
What Betancourt said.

Nathaniel

Nov 16, 2016, 8:23:16 PM
to Stan users mailing list, gel...@stat.columbia.edu
^ Lol

I am a bit late to the party here... but this was an excellent discussion. 