max tree-depth in warmup and sampling


Guido Biele

Jul 28, 2016, 5:39:27 AM
to Stan users mailing list
Hello,

I have a model that reaches large tree-depths during warmup (up to 17) 
and has smaller tree-depths during sampling (around 8, never over 10).

Warmup is very slow with those large tree-depths, so I am wondering
if I could set max treedepth to a value that is well above the
expected tree-depth during sampling but below the maximum reached
during warmup (say 14). I would basically set max treedepth to a
value that results in some, but not too many, max-treedepth
iterations during warmup.

I understand Michael Betancourt's response in another thread to mean
that one only needs to be concerned about reaching max treedepth if
the sampler saturates at the maximum during sampling. That is, it
should not be a problem if max treedepth is reached a few times
during warmup.

Does this sound like a reasonable approach?
Or does hitting max tree-depth during warmup disturb the adaptation
of the sampler parameters?


Best - Guido

Michael Betancourt

Jul 28, 2016, 5:49:23 AM
to stan-...@googlegroups.com
It will reduce exploration efficiency in warmup, making the adaptation 
more difficult and ultimately slower.  You really want to look at the 
output mass matrix and rescale your parameters accordingly to
prevent the slow adaptation periods.

Guido Biele

Jul 28, 2016, 6:38:42 AM
to Stan users mailing list
Thanks Michael,
I am using non-centered parameterization where possible.*
(The model I use is an adaptation of the hierarchical regression
model with multivariate priors described in the Stan manual.)

How do I identify the problematic parameters from the mass matrix?
(I guess you mean, more specifically, the diagonal elements of the
inverse mass matrix, which is what CmdStan saves.)

Guido


*(I am not sure anymore if I should have used it, given that
I have lots of data; I remember a discussion suggesting the non-centered
parameterization might hurt when one has large data sets.)

Michael Betancourt

Jul 28, 2016, 7:08:34 AM
to stan-...@googlegroups.com
High treedepth in warmup but reasonable treedepth in sampling
is caused by a small step size in warmup, which in turn is caused by
parameters with very different posterior variances.  Just scan
through the diagonal elements of the mass matrix to find any
extremely high or small values (the values are just the marginal
posterior variances, or approximations thereof) and target
the corresponding parameters for rescaling (note that the mass
matrix elements correspond to parameters on the unconstrained
space, in case you have any constraints).
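In code, that scan might look something like the following sketch in Python. It assumes the "# Diagonal elements of inverse mass matrix:" comment block that CmdStan writes into its output CSV after adaptation; the exact layout may differ by version, so check your own output file, and the example values below are hypothetical.

```python
import re

def extreme_mass_elements(csv_lines, n_flag=3):
    """Pull the diagonal inverse-mass-matrix values out of CmdStan CSV
    comment lines and return the indices of the most extreme entries."""
    for i, line in enumerate(csv_lines):
        if "inverse mass matrix" in line:
            # The values sit on the next comment line, comma separated.
            raw = csv_lines[i + 1].lstrip("# ").strip()
            values = [float(v) for v in re.split(r"[,\s]+", raw) if v]
            break
    else:
        raise ValueError("no inverse mass matrix block found")
    # Sort parameter indices by marginal variance; flag both tails.
    order = sorted(range(len(values)), key=lambda j: values[j])
    return values, order[:n_flag], order[-n_flag:]

# Hypothetical snippet of a CmdStan output file:
lines = [
    "# Adaptation terminated",
    "# Step size = 0.0123",
    "# Diagonal elements of inverse mass matrix:",
    "# 0.0001, 0.9, 1.1, 0.8, 120.0",
]
values, smallest, largest = extreme_mass_elements(lines, n_flag=1)
# smallest/largest hold the (0-based) indices of the parameters to
# consider rescaling.
```

The flagged indices refer to the unconstrained parameterization, so map them back through any constraining transforms before rescaling.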

For example, if mu and sigma are the marginal posterior mean
and standard deviation of the parameter x, then the model

parameters {
  real x;
}

model {
  x ~ blah;
}

would be easier to adapt with the modification

transformed data {
  real mu;
  real sigma;
  mu = …;
  sigma = …;
}

parameters {
  real x_tilde;
}

model {
  mu + sigma * x_tilde ~ blah; // no jacobian needed for linear transform
}

--
You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
To post to this group, send email to stan-...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Guido Biele

Jul 28, 2016, 8:21:22 AM
to Stan users mailing list
Thanks for the constructive support Michael!

I think I understand what you mean.
I think it would follow that chains with slow warmup should 
have especially high or low values in the mass matrix. 

However, if I plot both as in the attached figure, I can't see
a clear association between max or min values in the
diagonal mass matrix and warmup time.

In the attached plot, parameters are on the x axis, each chain
is indexed by a number, the size of the numbers is proportional
to the warmup time (which varies by a factor of 10), and the
chains with the slowest warmup time are highlighted in red.

Is my thought that chains with slow warmup should have extreme
values in the mass matrix wrong?

Best - Guido

Charles Driver

Jul 28, 2016, 3:21:18 PM
to Stan users mailing list
Guido - I had similar difficulties, which were substantially improved by dropping the init window and window size parameters to 2.
Michael - I was also thinking I would like the same thing, re treedepth restriction during warmup. Re the inefficient exploration you say this would cause: might that inefficiency be offset by adapting the step size and mass matrix more frequently (in a time sense)?

Michael Betancourt

Jul 28, 2016, 4:45:22 PM
to stan-...@googlegroups.com
No, there is a dangerous trade off there.  If you cap the max treedepth
then you can get drastically fewer effective samples in each adaptation
window, which would lead to _noisier_ mass matrix estimation and
_even worse trajectories in the next window_.

The problem is that in order to speed things up you have to very
delicately tune the adaptation parameters, such as the various window sizes,
in order to get just enough effective samples to get a reasonably convergent
series of mass matrix estimates.  There’s no general solution to this, and
I certainly don’t feel comfortable relying on it myself.

I think it’s much easier to focus on the latent cause, which is the huge
variation in marginal variances.  Having parameters on similar scales
usually leads to more interpretable models anyway, and huge scale
differences may indicate a somewhat ill-posed model.

Guido Biele

Jul 28, 2016, 6:33:07 PM
to Stan users mailing list
Michael,
I typically try to achieve parameters on similar scales by scaling the predictors
and outcome of a regression model. However, this is not always possible.
For example, intercepts in some generalized linear models can be much larger
than the slope parameters. (I think this will happen in many models where
one can't simply scale the outcome variable, e.g. in logistic regression or
with count data.)

Having said that, when I look at the marginal variances from different chains
with vastly different warmup times, I do not see a relationship between the variation
(or min or max) of the marginal variances (i.e. the values in the diagonal of the
mass matrix) and the warmup time.

So, either I am not understanding well how I should examine the mass matrix, 
or something else is going on.

I am attaching the plot of marginal variances I apparently forgot to attach in my
last message. In the plot, parameters are on the x axis, each chain is indexed
by a number, the size of the numbers is proportional to the warmup time (which
varies by a factor of 10), and the chains with the slowest warmup time are
highlighted in red.

Maybe you can glean something from this plot that escapes me?

Best - Guido


PS: I do get the model to converge with only 100 warmup iterations. 
I suspect one reason that I need relatively few warmup iterations
is that I have lots of data (rows).
MassMatrices.pdf

Michael Betancourt

Jul 29, 2016, 6:32:50 PM
to stan-...@googlegroups.com
You have some marginal variances on the order of 1e-4, and some on
the order of 1.  Without the proper mass matrix, HMC will typically require a
step size sqrt(1e-4 / 1) = 1e-2 times smaller, and hence trajectories with
100 times more leapfrog steps.  This can lead to slow adaptation, especially
if your model is already difficult to sample.
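To put rough numbers on that (a back-of-the-envelope sketch, not output from the model in question):

```python
import math

# Extreme marginal variances, as read off the plot:
var_min, var_max = 1e-4, 1.0

# With a unit mass matrix, the stable step size scales with the
# smallest marginal standard deviation:
step_ratio = math.sqrt(var_min / var_max)   # = 1e-2

# Covering the same distance with a 100x smaller step needs ~100x
# more leapfrog steps, i.e. roughly log2(100) extra treedepth:
extra_treedepth = math.log2(1.0 / step_ratio)   # ~6.6 doublings
```

That extra six to seven levels of treedepth is consistent with warmup hitting depth 17 while sampling sits around 8-10.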

You can always arbitrarily rescale your parameters, as I tried to explain
before.  If you know the parameter x has posterior mean mu and
posterior standard deviation sigma, say from a previous run, then 
rewrite your model as

parameters {
  real x_tilde;
}
transformed parameters {
  real x;
  x = mu + sigma * x_tilde;
}
model {
  // use x
}

By construction x_tilde will have unit marginal variance and hence
admit much easier adaptation.


Guido Biele

Jul 30, 2016, 4:13:59 AM
to Stan users mailing list
Thanks for your patience Michael,
I had understood how to adjust the parameters. (I think an alternative would be to rescale the predictor variables based on an initial parameter estimate.)

What I did not understand is why, for the same model and data, the warm-up time can vary by a factor of ten when the minimum and maximum values of the diagonal of the mass matrix are, respectively, on the same order of magnitude across all chains. However, I haven't tested whether the ratio of the min and max values for each chain predicts warm-up time. I'll check that!

I also would have thought that the mass matrices would be more similar from chain to chain, but maybe they aren't because I use only 100 warm-up samples (which are enough to make the models converge).

Guido

Krzysztof Sakrejda

Jul 30, 2016, 5:55:50 PM
to Stan users mailing list
This is key: short warm-ups are _noisy_, so the estimates are likely to be very different from each other. Additionally, whether the sampler wanders into a region that takes a long time to get out of is random, so even if two chains have the same mass matrix, one might take much longer.

Krzysztof

 


Bob Carpenter

Jul 31, 2016, 11:30:49 AM
to stan-...@googlegroups.com
The notion of "convergence" for warmup is that all the
chains find the same mass matrix and step size.

There are several obstacles to this, though. First is
that you need to get to the typical set (where the posterior
mass is and where you find truly random draws from the posterior)
and you then need to spend enough time there to get a good
estimate of the mass matrix. This is made more difficult (in
a good diagnostic way) by starting with diffuse initializations
and the general random nature of MCMC (and hence our adaptation,
which is very much like MCMC even though it's not properly
Markovian). Another big obstacle is varying posterior curvature
(i.e., models where the matrix of second derivatives of the log
density w.r.t. the parameters varies around the posterior); then
there isn't a good global mass matrix.

I believe there is a lot of room for improvement in adaptation using
parallelization and second derivative calculations. We're looking
for a grad student or postdoc to help us investigate. Hint hint.

- Bob

Guido Biele

Jul 31, 2016, 4:23:09 PM
to stan-...@googlegroups.com
Thanks everyone for your clarifying answers!

@Krzysztof: I agree, the combination of a random sampling process
and relatively few warm-up samples sounds like a good explanation for
the variation between mass matrices I observed.

@Bob: When I wrote that the model converges, I meant that the
post-warmup samples converged (low Rhats, no divergent iterations,
low MCSE; I don't know about the BFMI as I am using Stan 2.9), even
though the mass matrices appear to be somewhat different. I say "appear"
because I am not sure how big a difference is relevant.



Looking at the plot of the mass matrix values again, I was wondering
whether a large variation of the marginal posterior variance between
chains is an indication that this variable could benefit from rescaling?


Cheers - Guido




Bob Carpenter

Jul 31, 2016, 5:35:10 PM
to stan-...@googlegroups.com

> On Jul 31, 2016, at 4:22 PM, Guido Biele <guido...@gmail.com> wrote:
>
> Thanks everyone for your clarifying answers!
>
> @Krzysztof: I agree, the combination of a random sampling process
> and relatively few warm-up samples sounds like a good explanation for
> the variation between mass matrices I observed.

It's very hard to estimate variance with 50 approximate draws before
convergence (we always take the second half of the iterations to estimate
the mass matrix [inverse covariance]).
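A quick simulation illustrates just how noisy that is. This is an idealized sketch with independent standard-normal draws; real warmup draws are correlated and pre-convergence, so the practical noise is even worse:

```python
import random
import statistics

random.seed(0)
n, reps = 50, 2000   # 50 draws per window, repeated many times

# True variance is 1.0; look at the spread of the n=50 sample variance.
estimates = []
for _ in range(reps):
    draws = [random.gauss(0.0, 1.0) for _ in range(n)]
    estimates.append(statistics.variance(draws))

# Since the true variance is 1, this sd is the relative noise of the
# variance estimate:
rel_noise = statistics.stdev(estimates)

# For iid normal draws, theory gives sd = sqrt(2/(n-1)) ~ 0.20,
# i.e. roughly 20% noise on each mass matrix element.
theory = (2.0 / (n - 1)) ** 0.5
```

With ~20% noise per element even under ideal conditions, chain-to-chain differences in the adapted mass matrices after 100 warmup iterations are entirely expected.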

> @Bob: When I wrote that the model converges, I meant that the
> post-warmup samples converged (low Rhats, no divergent iterations,
> low MCSE; I don't know about the BFMI as I am using Stan 2.9), even
> though the mass matrices appear to be somewhat different. I say "appear"
> because I am not sure how big a difference is relevant.

A smaller step size can compensate for inaccurate mass matrices.

> Looking at the plot of the mass matrix values again, I was wondering if
> maybe a large variation of the marginal posterior variance between
> chains is an indication that this variable could benefit from rescaling?

Probably.

- Bob
