estimating run time upper limit?

Andria Dawson

Nov 27, 2013, 3:40:37 PM
to stan-...@googlegroups.com
Hi (again) Stan team!

Thanks for all the hard work that went into Stan 2.0; it works great! 

I do have a question (not specific to 2.0) about timing. The models we are running have large data sets and take a (very) long time to run. I've been trying to figure out what sorts of runs are realistic in terms of timing, so I have been playing with estimating upper bounds for run times. I can time how long it takes to evaluate the model block code in the cpp file, and this time seems to be roughly consistent for each step. Then, using my maximum tree depth (max_depth), I know the maximum number of steps per iteration. To compute a run time upper bound, I can then do ub = 2^max_depth * (step_time) * (num_iters), assuming my understanding is correct that the maximum number of leapfrog steps per iteration is 2^max_depth. (Similarly for a lower bound.)
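For concreteness, the calculation I have in mind looks like this in R (all numbers made up for illustration):

  # Back-of-the-envelope upper bound on total run time
  step_time <- 1.9     # seconds per model-block eval (measured, roughly constant)
  max_depth <- 10      # NUTS maximum tree depth
  num_iters <- 2000    # warmup + sampling iterations
  ub <- 2^max_depth * step_time * num_iters   # upper bound, in seconds
  ub / 3600                                   # about 1081 hours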

The question is whether this is a reasonable way to get an upper bound, or whether you have ideas for a better way to get at this.

Cheers, and thanks again for all your hard work! Stan is a great tool.

Andria

Michael Betancourt

Nov 27, 2013, 4:09:01 PM
to stan-...@googlegroups.com
max_depth is the depth of the last subtree, so the total number of steps is actually

sum_{i = 0}^{max_depth} 2^{i} \approx 2^{max_depth + 1}

If you assume the same adaptation results then your strategy will indeed provide
a reasonable extrapolation. Even better if you save the seed and reuse that.

Bob Carpenter

Nov 27, 2013, 4:25:23 PM
to stan-...@googlegroups.com


On 11/27/13, 4:09 PM, Michael Betancourt wrote:
> max_depth is the depth of the last subtree, so the total number of steps is actually
>
> sum_{i = 0}^{max_depth} 2^{i} \approx 2^{max_depth + 1}

2^0 + ... + 2^N = 2^(N+1) - 1

But this isn't the whole story in terms of number of leapfrog
steps (and hence number of log prob + gradient evals).

The issue is that if NUTS hits a U-turn while constructing
the last subtree, that subtree won't go into the calculation of
the tree depth; it will just be rejected. That can happen at any
point along the last subtree. So there's no exact way to recover
the number of evals from the tree depth.

That means the bound you can compute from the outside is:

2^{max_depth + 2} - 1

- Bob

Andria Dawson

Nov 27, 2013, 6:16:03 PM
to stan-...@googlegroups.com
Hi Bob and Michael,

Thanks for the quick replies!

I am still confused about where the additional factor of 2 comes from in the 2^{max_depth + 2} - 1 (i.e., why is it not 2^{max_depth + 1} - 1?). If the last subtree is rejected without having to go into the calculation of max_depth, shouldn't the bound be 2^{max_depth + 1} - 1 or even lower? 

Thanks!
Andria

Michael Betancourt

Nov 27, 2013, 6:44:11 PM
to stan-...@googlegroups.com
Bob's argument is that the last subtree could have gone through almost all of its 2^{depth} iterations before
being rejected, and none of those would count because the depth counter hadn't been incremented yet. But that's
only a concern if you're trying to determine the number of iterations given some particular value of the
tree depth.

max_depth is actually checked before the loop starts, so this concern isn't an issue when determining
the largest possible number of iterations. 2^{max_depth + 1} - 1 should be fine.

Bob Carpenter

Nov 28, 2013, 1:24:41 AM
to stan-...@googlegroups.com


On 11/27/13, 6:44 PM, Michael Betancourt wrote:
> Bob's argument is that the last subtree could have gone through almost all of its 2^{depth} iterations before
> being rejected, and none of those would count because the depth counter hadn't been incremented yet. But that's
> only a concern if you're trying to determine the number of iterations given some particular value of the
> tree depth.
>
> max_depth is actually checked before the loop starts, so this concern isn't an issue when determining
> the largest possible number of iterations. 2^{max_depth + 1} - 1 should be fine.

Sorry --- Michael's absolutely right, as usual.
I was mistakenly reading max_depth just as depth.

Most models aren't going to take you anywhere near max_depth,
and if the tree depths start running up against max_depth and
there's not a bug in the model, then you probably want to
increase max_depth.

- Bob

Andria Dawson

Nov 29, 2013, 1:19:26 PM
to stan-...@googlegroups.com
Thanks for all your help Bob and Michael!

Andria

Andria Dawson

Jan 3, 2014, 1:24:01 PM
to stan-...@googlegroups.com
Happy new year, Stan team & users!

Just wanted to follow up on this thread. I noticed that there is a new message in the dev version of stan that provides info about the time it takes to evaluate the gradient - cool! This is really helpful, thanks for adding this. 

I did notice that the value printed is quite different from my previous attempts based on timing only the model block, by more than a factor of two (model block eval: ~1.9s; gradient eval: ~4.7s). I am pretty sure this is what we expect, because evaluating the gradient involves evaluating the model block as well as doing some other computations. From what I understand about NUTS, the gradient needs to be evaluated each time my position changes (so for every leapfrog step) - is this correct? 
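(To spell out my mental model, here is one leapfrog step in R-style pseudocode, with a unit mass matrix assumed and grad_lp standing in for Stan's log prob gradient:

  leapfrog <- function(q, p, eps, grad_lp) {
    p <- p + 0.5 * eps * grad_lp(q)   # half step for momentum
    q <- q + eps * p                  # full step for position
    p <- p + 0.5 * eps * grad_lp(q)   # half step for momentum
    list(q = q, p = p)
  }

Written this way it looks like two gradient evals per step, but I gather the trailing half-step gradient can be cached and reused as the leading half-step of the next step, so a long trajectory costs roughly one gradient eval per leapfrog step.)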

Also, when trying to figure out timing previously, I noticed that the time it takes to evaluate the model block decreases over the first ~5-10 evals, and then levels off. Perhaps this happens as a result of starting at initial conditions that make computation a little harder? I am wondering whether the printed gradient timing may be larger than what it will be for most of the evals, since it is measured so early on. Any thoughts?

Thanks again for all your hard work on stan, and for maintaining an awesome users list! 

Andria

Michael Betancourt

Jan 3, 2014, 1:30:00 PM
to stan-...@googlegroups.com
The timing estimate is very, very crude and really just meant to give users an idea of the order of magnitude their
code will take to run (i.e., is this going to take a few seconds or a few days?). In particular, it ignores
not only overhead but also warmup, so it's not unreasonable for it to overestimate simple models but underestimate
complex models.

The behavior you're seeing is entirely expected during warmup, where the chain needs to move far to get
from the tail into the typical set. Once in the typical set it tends to move in much shorter trajectories.

Marcus Brubaker

Jan 3, 2014, 1:37:53 PM
to stan-...@googlegroups.com
Also, on a technical level, the first evaluation of the model spends a bunch of time allocating memory that is then re-used in future iterations, so the very first evaluation is likely to be a bit more expensive than any subsequent ones. I can't really say how much more, as it really depends on the details of the model, but for some models it could be significant, particularly if the system is under memory pressure.

Cheers,
Marcus

Michael Betancourt

Jan 3, 2014, 1:58:22 PM
to stan-...@googlegroups.com
The timing estimate does not use the first evaluation, it uses a realistic estimate of the gradient evaluation time.

Marcus Brubaker

Jan 3, 2014, 2:06:52 PM
to stan-...@googlegroups.com
I figured as much, but I wasn't sure how Andria might have been computing things and thought that this could explain some of the difference in timings she saw.

Cheers,
Marcus

Andria Dawson

Jan 3, 2014, 2:35:46 PM
to stan-...@googlegroups.com
Thanks again for your quick responses. Good to know about the crudeness and memory allocation. I'm really not looking for an exact timing either, just trying to get at the order of magnitude. 

Some of my models seem to be on the scale of weeks to months so I'm trying to 
1) speed up my code (I'm not a programmer, but have gotten lots of great tips from everyone answering questions on the list), and 
2) determine what size of problem is realistic in terms of being able to get a large enough effective sample size from the posterior. I don't think I can run the model I want on the domain I would like, but can scale back to something doable by using a subdomain, aggregating, or changing my assumptions for now.

The challenge is that the difference between the *crude* lower and upper timing bounds is so wide that my estimates are sometimes not that helpful (I jokingly say a run could be done NOW, NEVER, or some time in between). They are more helpful when even the *crude* lower bound is much larger than my tolerance. 

Bob Carpenter

Jan 3, 2014, 3:28:13 PM
to stan-...@googlegroups.com
Another uncertainty in the estimates is that the warmup iterations
often tend to take longer than the sampling iterations. I've
found it can be as much as a factor of 3 in models we've
been playing with.

Also, the report is for the time it takes to do 10
leapfrog steps (and yes, each of those does a log prob and
gradient calc). The tree depth we report as treedepth__ is
actually the depth - 1, at least as computer scientists measure
depth. Another twist is that we may throw away some leapfrog
steps due to the NUTS condition, though that usually doesn't
happen very often.

Also, you might want to evaluate how quickly your model
converges, not only to the high mass volume of the posterior,
but also in terms of the adaptation parameters. You don't
need to warm up for longer than that.

Also, you should tune how long you run to the number of effective
samples you need. Our defaults of 1000 warmup and 1000 sampling
iterations are usually way more than we need. We'd talked about
scaling the default back to 200 and 200, but never seem to have
done that. So I'd start small and only increase the number of warmup
iterations if you haven't converged (both in terms of parameter estimates
entering the high-mass volume of the posterior and the adaptation parameters)
and only increase the number of sampling iterations if you need more effective
samples.

How many effective samples you need will depend on what you're doing.
If you're evaluating tail probabilities (such as 99.9% intervals), you'll need a lot.
If you're evaluating 90% intervals, you'll need fewer. If you're evaluating
posterior variance, even fewer. If you're evaluating just posterior means,
you don't need that many at all (as in 20 or so should probably be enough).
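To make that last point concrete, a sketch (my rule of thumb, not a Stan feature): the Monte Carlo standard error of a posterior mean scales as the posterior sd divided by sqrt(n_eff), so in R

  mcse <- function(sd, n_eff) sd / sqrt(n_eff)
  mcse(sd = 1, n_eff = 20)     # ~0.22, fine for a rough posterior mean
  mcse(sd = 1, n_eff = 1000)   # ~0.03, needed for tighter summaries

which is why 20 or so effective samples can be enough for a mean.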

- Bob




Michael Betancourt

Jan 3, 2014, 3:34:34 PM
to stan-...@googlegroups.com
Yeah, timing difficult models can be really hard. There are two big complications:

1) Warmup, which is hugely affected by the initial conditions.

2) Position-dependent curvature.

(1) is really hard to estimate and can be anywhere from much faster than sampling to much slower.
(2) makes estimating timing hard, because the NUTS tree might be huge in one region of parameter space
but small in another.

If you've been able to converge then you can get a reasonable estimate of timing once you've reached a few
effective samples. That should extrapolate pretty accurately and allow you to estimate the number of chains
or days/weeks/years you'll need.
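As a sketch, with made-up numbers:

  time_so_far  <- 3600    # seconds of sampling so far
  n_eff_so_far <- 50      # effective samples accumulated so far
  n_eff_target <- 1000    # effective samples you actually need
  time_so_far * n_eff_target / n_eff_so_far / 3600   # ~20 hours total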

If you're still having problems with warmup there is, unfortunately, not much principled strategy for estimating
the timing.

Andrew Gelman

Jan 3, 2014, 1:40:21 PM
to stan-...@googlegroups.com
The next step is for our run-time estimates to include standard errors . . . We could perhaps fit a Stan model to estimate these!

Bob Carpenter

Jan 4, 2014, 11:38:38 AM
to stan-...@googlegroups.com
Ideally, we'd be able to provide an ongoing estimate
of the remaining time until completion.

The problem I see with the current estimate is that it
gives you time for 10 leapfrog steps, but we
don't know how many leapfrog steps are going to be needed.
At the upper limit, maybe 1000 and at the lower limit, 1.
So it doesn't really give you an idea of how long the
runs are going to be.

- Bob

Michael Betancourt

Jan 4, 2014, 12:52:32 PM
to stan-...@googlegroups.com
> Ideally, we'd be able to provide an ongoing estimate
> of the remaining time until completion.

Doable; add it to the list.

> The problem I see with the current estimate is that it
> gives you time for 10 leapfrog steps, but we
> don't know how many leapfrog steps are going to be needed.
> At the upper limit, maybe 1000 and at the lower limit, 1.
> So it doesn't really give you an idea of how long the
> runs are going to be.

Sure, but again the warning was never intended to be taken
that precisely. Its original intention was to prevent people
from spinning off runs that would take weeks without any
warning.

Andria Dawson

Jan 7, 2014, 4:21:36 PM
to stan-...@googlegroups.com
Thanks for all the info, comments, and advice. 

As for the gradient timing info that is printed out, I was a bit confused about the statement '1000 transitions using 10 leapfrog steps', but I found the first line, which seems to give the time for a single gradient eval, helpful. Is 'transition' referring to an iteration?

For reference, the printed warning I get for one of my models is:
Gradient evaluation took 4.060000e+00 seconds
1000 transitions using 10 leapfrog steps per transition would take 4.060000e+04 seconds.
Adjust your expectations accordingly!


For what it's worth, I'll share the following, which may or may not be useful to others.

Because timing is so variable, and some of the models I run can take a while, I figure out a reasonable warmup and then set num_samples to something fairly high. Since the output is written to file often (every iteration, maybe?), I can check and analyze my results before the run is finished. To do this I just have to edit the csv to add the 'Elapsed Time' footer (with fake timing info) and update num_samples in the header to the appropriate value. Then I can use the read_stan_csv function in R. Sometimes I get what I need well before my program finishes executing, in which case I can just kill it. 
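Roughly, in R (a sketch only; the footer format below is copied by eye from my own output files and may well differ across Stan versions, and num_samples in the commented header still has to be edited by hand to match the number of draws actually present):

  library(rstan)
  # Work on a copy so the sampler can keep appending to the original.
  file.copy("output.csv", "patched.csv", overwrite = TRUE)
  # Append a fake 'Elapsed Time' footer so the reader accepts the file.
  cat("#  Elapsed Time: 1 seconds (Warm-up)\n",
      "#                1 seconds (Sampling)\n",
      "#                2 seconds (Total)\n",
      file = "patched.csv", append = TRUE, sep = "")
  fit <- read_stan_csv("patched.csv")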

Also, I wanted to note that my model is not slow because it's implemented in Stan (or because it uses NUTS). The models that are slow for me are relatively complex. Smaller-scale versions of the slowest one were previously implemented using a combination of R & C++ with another MCMC approach, and that was also slow because it was difficult to get things to move, even with much effort spent trying to tune the model. I want to make sure no one is deterred from using Stan because of my comments :) . 

Marcus Brubaker

Jan 7, 2014, 7:07:54 PM
to stan-...@googlegroups.com
In this discussion of runtime one thing that is important to remember is that there are two relatively orthogonal things that determine how long it takes Stan to produce a given number of effective samples.

1. Sampling efficiency.  If your posterior has highly variable curvature, the chain is poorly initialized, etc., it will be hard for the sampling algorithm to move around efficiently.  This will typically manifest as either poor movement in certain parts of the space (aka "sticking") or very long NUTS trajectories (large values of the treedepth parameter).

In the case of sticking, Rhat will be high and the number of effective samples will be low because it will be hard to explore the posterior well.  It may appear that models are running "very fast" in this case, getting through a large number of iterations in very short periods of time.  This is because "bad" trajectories can be quickly rejected, but the quality of the samples will be terrible.

In the case of long trajectories, iterations will take a huge amount of computation time because a large number of gradient evaluations will be necessary for each iteration.  (Potentially thousands...)  To add insult to injury, it's also possible that these iterations still aren't moving that much and the quality of these samples will range from poor to so-so.

Sampling efficiency can be improved by changing your model (i.e., changing the posterior you're sampling from).  Things like reparameterizing (e.g., the "Matt trick", illustrated after point 2 below), tightening priors, simplifying models, marginalizing over parameters instead of sampling them, ensuring appropriate bounds are declared on parameters, etc.

2. Efficient model specification.  I.e., writing your models so that it's efficient to evaluate the gradient.  This is a little tricky (in part because we keep optimizing parts of Stan, so the "best" way to do some things changes) but basically it means trying to minimize the number of operations required to compute the posterior.  Things like using vectorized distributions, caching and reusing computations, and expressing things in terms of matrix-vector operations instead of scalar operations all help make the same model faster (see the sketch below).  When Stan prints out "Gradient evaluation took 4.060000e+00 seconds", it's referring to this.
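To loosely illustrate both points (my sketches, not Stan code): the "Matt trick" mentioned under point 1 is the non-centered parameterization.  Instead of sampling

  theta ~ normal(mu, tau)

directly, you sample a standardized variable and rescale it deterministically,

  theta_raw ~ normal(0, 1)
  theta = mu + tau * theta_raw

which leaves the distribution of theta unchanged but gives the sampler much nicer geometry when tau is itself a parameter.  And the flavor of the vectorization advice under point 2, in R (Stan's vectorized sampling statements get their speedup analogously, by sharing work across the whole vector):

  # Scalar loop: one density call per data point.
  lp_loop <- function(y, mu, sigma) {
    lp <- 0
    for (n in seq_along(y)) lp <- lp + dnorm(y[n], mu, sigma, log = TRUE)
    lp
  }
  # Vectorized: a single call over the whole vector.
  lp_vec <- function(y, mu, sigma) sum(dnorm(y, mu, sigma, log = TRUE))

In Stan the corresponding move is writing y ~ normal(mu, sigma); once for the whole vector instead of looping over y[n].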

By the way, Andria, your strategy of looking at results as they're being produced is a good one.  It would be great if we could adjust bin/print to work without the footer data.

Cheers,
Marcus



Andria Dawson

Jan 10, 2014, 12:38:22 AM
to stan-...@googlegroups.com
Thanks Marcus for the advice. I am noticing a decrease in runtime as I continue to refine both my model and code, so these tips are really helpful.

It would be great to be able to more easily work with the 'intermediate' output files.

Also, just wanted to point out a super-small typo in the manual: on page 70, in the "code never lies" section, "documentation" is spelled "docuementation". Maybe it's been fixed already, but in case it hasn't... Aside from this typo, the new manual is great :) . 

Andria

Bob Carpenter

Jan 10, 2014, 5:35:41 AM
to stan-...@googlegroups.com


On 1/10/14, 6:38 AM, Andria Dawson wrote:
> Thanks Marcus for the advice. I am noticing a decrease in runtime as I continue to refine both my model and code, so
> these tips are really helpful.

Glad to hear it.

> It would be great to be able to more easily work with the 'intermediate' output files.

I'm not sure what you mean by intermediate output files. In RStan,
there is no intermediate file-based representation. In CmdStan, there's
only a CSV output file.

> Also, just wanted to point out a super-small typo in the manual. On page 70, in the code never lies section.
> Documentation is spelled docuementation. Maybe it's been fixed already, but in case it hasn't... Aside from this typo,
> the new manual is great :) .

Thanks --- I added it to the "to fix" list. I've been too lazy
to run a spell checker because of all the terminology and the length
of the manual, but I should do that and save our users having to report
these things to us. To track:

https://github.com/stan-dev/stan/issues/485

- Bob

Andria Dawson

Jan 10, 2014, 10:40:21 AM
to stan-...@googlegroups.com
I was referring to the csv file generated by CmdStan ('intermediate' in the sense that the model hasn't finished running yet). Sorry for the confusion!

Bob Carpenter

Jan 10, 2014, 2:03:40 PM
to stan-...@googlegroups.com


On 1/10/14, 4:40 PM, Andria Dawson wrote:
> I was referring to the csv file generated by CmdStan ('intermediate' in the sense that the model hasn't finished running
> yet). Sorry for the confusion!

No problem --- this stuff tends to be confusing because there
are so many moving pieces.

But the answer's that they're not available in RStan because there's
no file getting written out.

We've been thinking recently about being able to do some online
analysis ("online" in the computer science sense, i.e., while the model's
running). At the very least, we're thinking we should print out the log
prob in each iteration to get a sense of convergence or of chains going
awry.

- Bob
