Diagnosing slow performance


Kevin Van Horn

Dec 23, 2014, 1:29:50 PM
to stan-...@googlegroups.com
I am running a nested multinomial logit model with a hierarchical prior and I am getting very slow performance. Mixing looks good, but it's averaging around 80 seconds for ONE iteration. This is a problem with 8 choice tasks for each of 1620 individuals, with each task being a choice between 3 alternatives and "none", and each alternative described by a total of 23 covariates (after appropriately encoding discrete attributes). I have tried all of the optimization tricks mentioned in the Stan user manual that appear relevant.

I'm not sure how to proceed in trying to figure out how to make this faster. (I'm interested in more than just this specific data set -- I plan to use this model repeatedly on a variety of projects.) How does one diagnose slow performance in Stan?

Some more info: this was a timing test, so I only ran for 200 iterations (100 warmup, 100 sampling). In the warmup phase n_leapfrog__ was often as high as 2047, and post-warmup it was 1023 on every iteration.



Bob Carpenter

Dec 23, 2014, 2:25:30 PM
to stan-...@googlegroups.com
There are two issues: speed of the implementation code and
speed of the model.

The 1000+ leapfrog steps are the problem because Stan's
evaluating log probs and gradients for each of those steps.
The only way to fix this is to reparameterize or tighten priors.

More warmup iterations might help. Starting at a lower step
size might also help the warmup go faster.

How is the mixing? I'd suggest raising the treedepth bound even
higher. I hope its maxing out at different numbers in warmup and
during sampling isn't indicative of an off-by-one error in the config
(that won't affect sampling accuracy---just how deep the tree can grow
in NUTS).
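For reference, here is roughly how those knobs can be set from RStan. This is only a sketch, and the `control` argument names may differ across versions, so check your interface's documentation:

```r
fit <- stan(file = "model.stan", data = stan_data, chains = 4,
            iter = 2000, warmup = 1000,
            control = list(max_treedepth = 15,  # raise the treedepth bound
                           stepsize = 0.01))    # start warmup at a smaller step size
```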

- Bob

Kevin Van Horn

Dec 23, 2014, 3:58:46 PM
to stan-...@googlegroups.com
The mixing looks pretty good. I have weakly informative priors, and I've already reparameterized. Nearly two hours ago I started another test run with max_treedepth set to 13; three of the chains are somewhere between iteration 40 and 60, and the other one is somewhere between iteration 20 and 40, which works out to at least 170 seconds per iteration for the slowest chain. (I'm running the chains in parallel.)

Kevin Van Horn

Jan 6, 2015, 5:30:36 PM
to stan-...@googlegroups.com
I have some more information on this. With a simplified version of the model running for 1000+1000 iterations I found:

* Average n_leapfrog__, first 100 warmup iterations: 1310.

* Average n_leapfrog__, remaining 1900 iterations: 65.

The sum of n_leapfrog__ over the first 100 iterations is greater than the sum over all remaining iterations. Looking at a plot of n_leapfrog__, I see a sudden drop after iteration 100 in each of the four chains.

Any ideas why the first 100 iterations would average 20 times as many leapfrog steps as later iterations? Or what I could do to improve that?

More context: the simplification was that I restricted the covariance matrix for the regression coefficients to be diagonal. I ran 4 chains and made an effort to get reasonably good initial values.
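A quick arithmetic sanity check on those averages (plain Python; it assumes exactly 100 early and 1900 remaining iterations per chain):

```python
# n_leapfrog__ averages from the runs described above
early_avg, early_iters = 1310, 100
late_avg, late_iters = 65, 1900

early_total = early_avg * early_iters  # 131000
late_total = late_avg * late_iters     # 123500

# the first 100 iterations really do account for more than half the work
print(early_total > late_total)        # True
print(round(early_avg / late_avg, 1))  # 20.2
```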

Bob Carpenter

Jan 6, 2015, 7:08:30 PM
to stan-...@googlegroups.com
The manual explains the stages that warmup goes through. There's
a part near the end with a chapter on Hamiltonian Monte Carlo and
how it's implemented in Stan.

The only thing you can control is how long the various stages
are and where you start in terms of step size. But this might
be enough to control your issues for a particular model.

What's happening early on is that you're probably needing to
find a very small step size before the scales are adapted per
parameter. This means multiple leapfrog steps per iteration.
Then, once Stan's sorted out the scales, it can increase the
step size and thus cut down on the number of leapfrog steps.

Another thing that can help is scaling the data so all the parameters
are roughly on the unit scale.
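To make the scaling suggestion concrete, here is a minimal, hypothetical sketch (not the actual covariates from this model): standardize each column of the design matrix before passing it to Stan, so the regression coefficients end up roughly on the unit scale.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical design matrix whose columns live on very different scales.
X = np.column_stack([rng.normal(0.0, 1.0, 500),
                     rng.normal(100.0, 15.0, 500),
                     rng.normal(0.0, 0.001, 500)])

# Center and scale each column to mean 0, sd 1.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_std.mean(axis=0), 0.0))  # True
print(np.allclose(X_std.std(axis=0), 1.0))   # True
```

(Remember to apply the same centering and scaling when interpreting or back-transforming the fitted coefficients.)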

- Bob

Sebastian Weber

Jan 7, 2015, 5:04:44 AM
to stan-...@googlegroups.com
Hi!

It looks to me like you are using too short a warmup phase of only 100 steps. As far as I recall, Stan turns off some quite important warmup adaptation stages if you warm up for fewer than 250 iterations. Could you check what the elements of the mass matrix are? I remember that with fewer than 250 warmup iterations I only got a diagonal mass matrix. Only with more than 250 warmup iterations will Stan make a smart guess at the mass matrix, which is really important for good performance during sampling.

You can make Stan come up with an adapted mass matrix in fewer than 250 warmup iterations if you play around with the different windows (phases) during warmup - Michael is the expert here. However, I have never dared to play with this and simply always go with at least 250 warmup iterations. This situation will get better once we can pass a mass matrix into Stan manually, but that is still on the todo list as far as I recall.
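For the record, those warmup windows are exposed as CmdStan arguments; this is a sketch from memory (names as in CmdStan 2.x, shown with their defaults; check your version's help output):

```
./model sample num_warmup=250 num_samples=1000 \
    adapt init_buffer=75 term_buffer=50 window=25 \
    data file=model.data.R
```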

Best,
Sebastian

Kevin Van Horn

Jan 8, 2015, 12:13:30 PM
to stan-...@googlegroups.com
Sebastian,

Some of my experiments did use a warmup phase of only 100 steps, and I'm disregarding the results of those runs for the reasons you mention. The more recent experiments I've run have all used 1000 warmup plus 1000 sampling iterations.

A lesson here for anyone else diagnosing slow performance: Running for 100+100 iterations doesn't tell you anything useful about what will happen when you run for 1000+1000 iterations.

Kevin Van Horn

Jan 8, 2015, 12:42:11 PM
to stan-...@googlegroups.com
Bob,

This was very useful information. To verify that different scales for the different parameters were the issue, I tried optimally scaling the (untransformed) parameters. Specifically, for each parameter I found the posterior mean and std dev from a previous run and did a new run with the parameters scaled so that all have a posterior s.d. of 1. This made a substantial difference:

* The initial run had a total of 523,977 leapfrog steps summed over the first 100 iterations of all four chains; 492,946 leapfrog steps summed over the remaining 900+1000 iterations of all four chains; and a total of 1,016,923 leapfrog steps summed over all iterations and all chains.

* The run with optimal scaling had a total of 27,212 leapfrog steps summed over the first 100 iterations of all four chains; 484,120 leapfrog steps summed over the remaining 900+1000 iterations of all four chains; and a total of 511,332 leapfrog steps summed over all iterations and all chains.

So optimal scaling cut the total computational effort roughly in half (511,332 / 1,016,923 ≈ 0.50).

Of course, you don't know the optimal scaling in advance, but this suggests to me that it really can be worthwhile to get some idea of the width of the posterior for each parameter and scale accordingly.

As a guide for anyone else who might want to experiment with this kind of thing, here are some details:

The original model had the following parameters block:

    parameters {
      matrix[K+1, R] Epsilon;
      real<lower=0, upper=1> lambda;
      vector<lower=0>[K+1] sigma;
      matrix[K+1, G] preTheta;
    }

I did an initial run of 1000 warmup and 1000 sampling iterations, with the result stored in the variable "fit", then executed this code to find the posterior widths:

    samples <- extract(fit)
    logit <- qlogis  # logit isn't in base R; qlogis computes log(p / (1 - p))
    mu.Epsilon <- apply(samples$Epsilon, 2:3, mean)
    sd.Epsilon <- apply(samples$Epsilon, 2:3, sd)
    mu.logit.lambda <- mean(logit(samples$lambda))
    sd.logit.lambda <- sd(logit(samples$lambda))
    mu.log.sigma <- colMeans(log(samples$sigma))
    sd.log.sigma <- apply(log(samples$sigma), 2, sd)
    mu.preTheta <- apply(samples$preTheta, 2:3, mean)
    sd.preTheta <- apply(samples$preTheta, 2:3, sd)

For the optimal scaling run, I made modifications to the data, parameters, and transformed parameters sections of the model. I added the following to the data block:

    data {
      ...
      matrix[K+1, R] mu_Epsilon;
      matrix<lower=0>[K+1, R] sd_Epsilon;
      real mu_logit_lambda;
      real<lower=0> sd_logit_lambda;
      vector[K+1] mu_log_sigma;
      vector<lower=0>[K+1] sd_log_sigma;
      matrix[K+1,G] mu_preTheta;
      matrix<lower=0>[K+1,G] sd_preTheta;
    }   

I changed the parameters block to be

    parameters {
      matrix[K+1, R] scaled_Epsilon;
      real scaled_logit_lambda;
      vector[K+1] scaled_log_sigma;
      matrix[K+1, G] scaled_preTheta;
    }   

and added these lines to the "transformed parameters" block:

    transformed parameters {
      matrix[K+1, R] Epsilon;
      matrix[K+1, G] preTheta;
      vector[K+1] log_sigma;
      vector<lower=0>[K+1] sigma;
      real<lower=0, upper=1> lambda;
      ...

      Epsilon <- mu_Epsilon + sd_Epsilon .* scaled_Epsilon;
      preTheta <- mu_preTheta + sd_preTheta .* scaled_preTheta;
      log_sigma <- mu_log_sigma + sd_log_sigma .* scaled_log_sigma;
      sigma <- exp(log_sigma);
      lambda <- inv_logit(mu_logit_lambda + sd_logit_lambda * scaled_logit_lambda);

      ...
    }

Finally, since the model block includes prior distributions for sigma and lambda, which are now transformed parameters, I included a Jacobian adjustment:

    model {
      lambda ~ uniform(0, 1);
      // log Jacobian of lambda = inv_logit(...): log(lambda) + log(1 - lambda)
      increment_log_prob(log(lambda * (1 - lambda)));
      sigma ~ normal(0, 5);
      // log Jacobian of sigma = exp(log_sigma): log_sigma itself
      increment_log_prob(log_sigma);
      ...
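Those two Jacobian terms can be sanity-checked numerically. Here's a throwaway Python check (not part of the model) comparing each analytic derivative against a central finite difference:

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def fd(f, x, h=1e-6):
    """Central finite-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# lambda = inv_logit(x)  =>  dlambda/dx = lambda * (1 - lambda)
x = 0.7
lam = inv_logit(x)
print(abs(fd(inv_logit, x) - lam * (1.0 - lam)) < 1e-8)  # True

# sigma = exp(log_sigma)  =>  dsigma/d(log_sigma) = sigma,
# so the log-Jacobian is log_sigma itself
v = -0.3
print(abs(fd(math.exp, v) - math.exp(v)) < 1e-8)         # True
```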

Andrew Gelman

Jan 8, 2015, 5:47:11 PM
to stan-...@googlegroups.com
Yes, but here’s another lesson: There are lots of easy problems where 100+100 is more than enough.

And another lesson: Often we have huge errors in our Stan programs which can show up after we run 100+100 (or even 10+10).

Sebastian Weber

Jan 9, 2015, 3:08:46 AM
to stan-...@googlegroups.com, gel...@stat.columbia.edu
I agree; most errors pop up just from trying to compile my Stan model and then catching one of those syntax errors...

Would it be possible to make rescaling the data less important? I thought that using the Matt trick should already be sufficient for most cases.

Could one avoid the need to rescale the data by providing a better initial guess of the diagonal of the mass matrix? That is, I would much prefer to pass a sensible mass matrix for warmup into Stan rather than doing these data rescalings... at least that is my understanding of what the alternative to manual rescaling would be. Is anything like this planned?

Sebastian

Bob Carpenter

Jan 9, 2015, 3:49:33 PM
to stan-...@googlegroups.com
We've been planning to allow users to provide fixed mass matrices
to use without adaptation.

I'm not sure how to provide scaling suggestions. Perhaps as
a different prior than our default?

- Bob

Kevin Van Horn

Jan 9, 2015, 8:28:19 PM
to stan-...@googlegroups.com
In my case I wouldn't want to use a *fixed* mass matrix, but I would like to be able to specify the *initial* mass matrix, so that warmup starts at a better place.

Michael Betancourt

Jan 10, 2015, 5:44:07 AM
to stan-...@googlegroups.com
Let me just summarize a few of the key points here.

Warmup:

Conceptually, MCMC warmup times are basically equivalent to
the autocorrelation time: because HMC chains tend to be only
weakly autocorrelated, they also tend to converge really, really
quickly.

The HUGE caveat is that such an argument assumes uniformity
of curvature across the parameter space, an assumption which
is violated in many of the complex models we see.  Very often
the tails have large curvature while the bulk is relatively
well-behaved; in other words, warmup is slow not because
the actual convergence time is slow but rather because the
cost of an HMC iteration is more expensive out in the tails.

Poor behavior in the tails is the kind of pathology that Andrew
notes we can find by running only a few warmup iterations.
By looking at the acceptance probabilities and step sizes of
the first few iterations you can get an idea of how bad the
problem is and whether you need to address it with modeling
efforts.

The Mass Matrix (or, more formally, the Euclidean Metric):

The mass matrix can compensate for linear (i.e. global)
correlations in the posterior which can dramatically improve
the performance of HMC in some problems.  Of course this
requires that we know the global correlations a priori.

In complex models this is incredibly difficult (for example,
nonlinear model components convolve the scales of the 
data, so standardizing the data doesn’t always help) so in 
Stan we learn these correlations online with an adaptive
warmup.  In models with strong nonlinear (i.e. local)
correlations this learning can be slow, even with regularization.
This is ultimately why warmup in Stan often needs to be
so long, and why a sufficiently long warmup can yield such
substantial performance improvements.
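As a toy numerical illustration of the point about the metric compensating for linear correlations (a sketch, not Stan internals): if the posterior were Gaussian with covariance Sigma, running HMC with Sigma as the inverse metric is equivalent to sampling in whitened coordinates, where the correlations vanish.

```python
import numpy as np

# Toy "posterior": a strongly correlated 2-d Gaussian.
Sigma = np.array([[1.0, 0.99],
                  [0.99, 1.0]])

# With x = L z and L L^T = Sigma, the whitened coordinates z = L^{-1} x
# have identity covariance: no narrow ridge left for HMC to fight.
L = np.linalg.cholesky(Sigma)
Linv = np.linalg.inv(L)
cov_z = Linv @ Sigma @ Linv.T

print(np.allclose(cov_z, np.eye(2)))  # True
print(round(np.linalg.cond(Sigma)))   # 199: badly conditioned
print(round(np.linalg.cond(cov_z)))   # 1: perfectly conditioned
```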