max tree-depth in warmup and sampling


Guido Biele

Jul 28, 2016, 5:39:27 AM
to Stan users mailing list
Hello,

I have a model that reaches large tree-depths during warmup (up to 17) 
and has smaller tree-depths during sampling (around 8, never over 10).

Warmup is very slow with those large tree-depths, so I am wondering
if I could set max treedepth to a value that is well above the
expected tree-depth during sampling but below the maximum reached
during warmup (say 14). I would basically set max treedepth to a
value that results in some, but not too many, max-treedepth
iterations during warmup.

I understand Michael Betancourt's response in another thread to mean
that one only needs to be concerned about reaching max treedepth if
the sampler saturates at the maximum during sampling. That is, it
should not be a problem if max treedepth is reached a few times
during warmup.

Does this sound like a reasonable approach?
Or does hitting max tree-depth during warmup disturb the adaptation
of the sampler parameters?


Best - Guido

Michael Betancourt

Jul 28, 2016, 5:49:23 AM
to stan-...@googlegroups.com
It will reduce exploration efficiency in warmup, making the adaptation 
more difficult and ultimately slower.  You really want to look at the 
output mass matrix and rescale your parameters accordingly to
prevent the slow adaptation periods.

Guido Biele

Jul 28, 2016, 6:38:42 AM
to Stan users mailing list
Thanks Michael,
I am using non-centered parameterization where possible.*
(The model I use is an adaptation of the hierarchical regression
model with multivariate priors described in the Stan manual.)

How do I identify the problematic parameters from the mass matrix?
(I guess you mean, more specifically, the diagonal elements of the
inverse mass matrix, which is what CmdStan saves.)

Guido


*(I am not sure anymore if I should have used it, given that
I have lots of data; I remember a discussion suggesting the non-centered
parameterization might hurt when one has large data sets.)

Michael Betancourt

Jul 28, 2016, 7:08:34 AM
to stan-...@googlegroups.com
High treedepth in warmup but reasonable treedepth in sampling
is caused by a small step size in warmup, which in turn is caused by
parameters with very different posterior variances.  Just scan
through the diagonal elements of the mass matrix to find any
extremely high or small values (the values are just the marginal
posterior variances, or approximations thereof) and target
the corresponding parameters for rescaling (note that the mass
matrix elements correspond to parameters on the unconstrained
space, in case you have any constraints).
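In code, that scan might look something like the following sketch in Python. It assumes the "# Diagonal elements of inverse mass matrix:" comment block that CmdStan writes into its output CSV after adaptation; the exact layout may differ by version, so check your own output file, and the example values below are hypothetical.

```python
import re

def extreme_mass_elements(csv_lines, n_flag=3):
    """Pull the diagonal inverse-mass-matrix values out of CmdStan CSV
    comment lines and return the indices of the most extreme entries."""
    for i, line in enumerate(csv_lines):
        if "inverse mass matrix" in line:
            # The values sit on the next comment line, comma separated.
            raw = csv_lines[i + 1].lstrip("# ").strip()
            values = [float(v) for v in re.split(r"[,\s]+", raw) if v]
            break
    else:
        raise ValueError("no inverse mass matrix block found")
    # Sort parameter indices by marginal variance; flag both tails.
    order = sorted(range(len(values)), key=lambda j: values[j])
    return values, order[:n_flag], order[-n_flag:]

# Hypothetical snippet of a CmdStan output file:
lines = [
    "# Adaptation terminated",
    "# Step size = 0.0123",
    "# Diagonal elements of inverse mass matrix:",
    "# 0.0001, 0.9, 1.1, 0.8, 120.0",
]
values, smallest, largest = extreme_mass_elements(lines, n_flag=1)
# smallest/largest hold the (0-based) indices of the parameters to
# consider rescaling.
```

The flagged indices refer to the unconstrained parameterization, so map them back through any constraining transforms before rescaling.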

For example, if mu and sigma are the marginal posterior mean
and standard deviation of the parameter x, then the model

parameters {
  real x;
}

model {
  x ~ blah;
}

would be easier to adapt with the modification

transformed data {
  real mu;
  real sigma;
  mu = …;
  sigma = …;
}

parameters {
  real x_tilde;
}

model {
  mu + sigma * x_tilde ~ blah; // no jacobian needed for linear transform
}

--
You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
To post to this group, send email to stan-...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Guido Biele

Jul 28, 2016, 8:21:22 AM
to Stan users mailing list
Thanks for the constructive support Michael!

I think I understand what you mean.
I think it would follow that chains with slow warmup should 
have especially high or low values in the mass matrix. 

However, if I plot both as in the attached figure, I can't see
a clear association between max or min values in the
diagonal mass matrix and warmup time.

In the attached plot, parameters are on the x axis, each chain
is indexed by a number, the size of the numbers is proportional
to the warmup time (which varies by a factor of 10), and the
chains with the slowest warmup time are highlighted in red.

Is my thought that chains with slow warmup should have extreme
values in the mass matrix wrong?

Best - Guido

Charles Driver

Jul 28, 2016, 3:21:18 PM
to Stan users mailing list
Guido - I had similar difficulties, which were substantially improved by dropping the init window and window size parameters to 2.
Michael - I was also thinking I would like the same thing, re treedepth restriction during warmup. Re the inefficient exploration you say this would cause: might that inefficiency be offset by adapting the step size and mass matrix more frequently (in a time sense)?

Michael Betancourt

Jul 28, 2016, 4:45:22 PM
to stan-...@googlegroups.com
No, there is a dangerous trade off there.  If you cap the max treedepth
then you can get drastically fewer effective samples in each adaptation
window, which would lead to _noisier_ mass matrix estimation and
_even worse trajectories in the next window_.

The problem is that in order to speed things up you have to very
delicately tune the adaptation parameters, such as the various window sizes,
in order to get just enough effective samples to get a reasonably convergent
series of mass matrix estimates.  There’s no general solution to this, and
I certainly don’t feel comfortable relying on it myself.

I think it’s much easier to focus on the latent cause, which is the huge
variation in marginal variances.  Having parameters on similar scales
usually leads to more interpretable models anyway, and huge scale
differences may indicate a somewhat ill-posed model.

Guido Biele

Jul 28, 2016, 6:33:07 PM
to Stan users mailing list
Michael,
I typically try to achieve parameters on similar scales by scaling the predictors
and outcome of a regression model. However, this is not always possible.
For example, intercepts in some generalized linear models can be much larger
than the slope parameters. (I think this will happen in many models where
one can't simply scale the outcome variable, e.g. in logistic regression or
with count data.)

Having said that, when I look at the marginal variances from different chains
with vastly different warmup times, I do not see a relationship between the variation
(or min or max) of the marginal variances (i.e. the values in the diagonal of the
mass matrix) and the warmup time.

So, either I am not understanding well how I should examine the mass matrix, 
or something else is going on.

I am attaching the plot of marginal variances I apparently forgot to attach in my
last message. In the plot, parameters are on the x axis, each chain is indexed
by a number, the size of the numbers is proportional to the warmup time (which
varies by a factor of 10), and the chains with the slowest warmup time are
highlighted in red.

Maybe you can glean something from this plot that escapes me?

Best - Guido


PS: I do get the model to converge with only 100 warmup iterations. 
I suspect one reason that I need relatively few warmup iterations
is that I have lots of data (rows).
MassMatrices.pdf

Michael Betancourt

Jul 29, 2016, 6:32:50 PM
to stan-...@googlegroups.com
You have some marginal variances on the order of 1e-4, and some on
the order of 1.  Without the proper mass matrix, HMC will typically require a
step size sqrt(1e-4 / 1) = 1e-2 times smaller, and hence trajectories with
100 times more leapfrog steps.  This can lead to slow adaptation, especially
if your model is already difficult to sample.
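To put rough numbers on that (a back-of-the-envelope sketch, not output from the model in question):

```python
import math

# Extreme marginal variances, as read off the plot:
var_min, var_max = 1e-4, 1.0

# With a unit mass matrix, the stable step size scales with the
# smallest marginal standard deviation:
step_ratio = math.sqrt(var_min / var_max)   # = 1e-2

# Covering the same distance with a 100x smaller step needs ~100x
# more leapfrog steps, i.e. roughly log2(100) extra treedepth:
extra_treedepth = math.log2(1.0 / step_ratio)   # ~6.6 doublings
```

That extra six to seven levels of treedepth is consistent with warmup hitting depth 17 while sampling sits around 8-10.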

You can always arbitrarily rescale your parameters, as I tried to explain
before.  If you know the parameter x has posterior mean mu and
posterior standard deviation sigma, say from a previous run, then 
rewrite your model as

parameters {
  real x_tilde;
}
transformed parameters {
  real x;
  x = mu + sigma * x_tilde;
}
model {
  // use x
}

By construction x_tilde will have unit marginal variance and hence
admit much easier adaptation.


Guido Biele

Jul 30, 2016, 4:13:59 AM
to Stan users mailing list
Thanks for your patience Michael,
I had understood how to adjust the parameters. (I think an alternative would be to rescale the predictor variables based on an initial parameter estimate.)

What I did not understand is why, for the same model and data, the warm-up time can vary by a factor of ten when the minimum and maximum values of the diagonal of the mass matrix are, respectively, on the same order of magnitude across all chains. However, I haven't tested whether the ratio of the min and max values for each chain predicts warm-up time. I'll check that!

I also would have thought that the mass matrices would be more similar from chain to chain, but maybe they aren't because I use only 100 warm-up samples (which are enough to make the models converge).

Guido

Krzysztof Sakrejda

Jul 30, 2016, 5:55:50 PM
to Stan users mailing list
This is key: short warm-ups are _noisy_, so the estimates are likely to be very different from each other. Additionally, whether the sampler wanders into a region that takes a long time to get out of is random, so even if two chains have the same mass matrix, one might take much longer.

Krzysztof

 


Bob Carpenter

Jul 31, 2016, 11:30:49 AM
to stan-...@googlegroups.com
The notion of "convergence" for warmup is that all the
chains find the same mass matrix and step size.

There are several obstacles to this, though. First is
that you need to get to the typical set (where the posterior
mass is and where you find truly random draws from the posterior)
and you then need to spend enough time there to get a good
estimate of the mass matrix. This is made more difficult (in
a good diagnostic way) by starting with diffuse initializations
and the general random nature of MCMC (and hence our adaptation,
which is very much like MCMC even though it's not properly
Markovian). Another big obstacle is varying posterior curvature
(i.e., models where the matrix of second derivatives of the log
density w.r.t. the parameters varies around the posterior); then
there isn't a good global mass matrix.

I believe there is a lot of room for improvement in adaptation using
parallelization and second derivative calculations. We're looking
for a grad student or postdoc to help us investigate. Hint hint.

- Bob

Guido Biele

Jul 31, 2016, 4:23:09 PM
to stan-...@googlegroups.com
Thanks everyone for your clarifying answers!

@Krzysztof: I agree, the combination of a random sampling process
and relatively few warm-up samples sounds like a good explanation for
the variation between mass matrices I observed.

@Bob: When I wrote that the model converges, I meant that the
post-warmup samples converged (low Rhats, no divergent iterations,
low MCSE; I don't know about the BFMI as I am using Stan 2.9), even
though the mass matrices appear to be somewhat different. I say "appear"
because I am not sure how big a difference is relevant.



Looking at the plot of the mass matrix values again, I was wondering
whether a large variation of the marginal posterior variance between
chains is an indication that this variable could benefit from rescaling?


Cheers - Guido




Bob Carpenter

Jul 31, 2016, 5:35:10 PM
to stan-...@googlegroups.com

> On Jul 31, 2016, at 4:22 PM, Guido Biele <guido...@gmail.com> wrote:
>
> Thanks everyone for your clarifying answers!
>
> @Krzysztof: I agree, the combination of a random sampling process
> and relatively few warm-up samples sounds like a good explanation for
> the variation between mass matrices I observed.

It's very hard to estimate variance with 50 approximate draws before
convergence (we always take the second half of the iterations to estimate
the mass matrix [inverse covariance]).
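A quick simulation illustrates just how noisy that is. This is an idealized sketch with independent standard-normal draws; real warmup draws are correlated and pre-convergence, so the practical noise is even worse:

```python
import random
import statistics

random.seed(0)
n, reps = 50, 2000   # 50 draws per window, repeated many times

# True variance is 1.0; look at the spread of the n=50 sample variance.
estimates = []
for _ in range(reps):
    draws = [random.gauss(0.0, 1.0) for _ in range(n)]
    estimates.append(statistics.variance(draws))

# Since the true variance is 1, this sd is the relative noise of the
# variance estimate:
rel_noise = statistics.stdev(estimates)

# For iid normal draws, theory gives sd = sqrt(2/(n-1)) ~ 0.20,
# i.e. roughly 20% noise on each mass matrix element.
theory = (2.0 / (n - 1)) ** 0.5
```

With ~20% noise per element even under ideal conditions, chain-to-chain differences in the adapted mass matrices after 100 warmup iterations are entirely expected.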

> @Bob: When I wrote that the model converges, I meant that the
> post-warmup samples converged (low Rhats, no divergent iterations,
> low MCSE; I don't know about the BFMI as I am using Stan 2.9), even
> though the mass matrices appear to be somewhat different. I say "appear"
> because I am not sure how big a difference is relevant.

A smaller step size can compensate for inaccurate mass matrices.

> Looking at the plot of the mass matrix values again, I was wondering if
> maybe a large variation of the marginal posterior variance between
> chains is an indication that this variable could benefit from rescaling?

Probably.

- Bob
