Determining impact of divergent transitions after warmup


Philip Boonstra

Apr 1, 2016, 1:52:01 PM
to Stan users mailing list
Hi! First, thanks to the development team for such a fantastic tool for practical Bayesian analysis. Count me a big Stan fan.

Second, I'm in a bit of a conundrum. Session info is at the bottom. Here are the salient points.

(i) I fit a logistic regression with a horseshoe prior (i.e., a two-level half-Cauchy prior), parameterized as follows:
data {
  int<lower=0> n_stan; // num obs
  int<lower=0> p_stan; // num covariates
  int<lower=0,upper=1> y_stan[n_stan]; // outcome
  matrix[n_stan,p_stan] x_scaled_stan; // covariates (no intercept)
}
parameters {
  real mu;
  vector[p_stan] beta_raw;
  real<lower=0> tau;
  vector<lower=0>[p_stan] lambda;
}
transformed parameters {
  vector[p_stan] beta;
  beta <- tau*(lambda .* beta_raw); // non-centered horseshoe: global scale tau times local scales lambda
}
model {
  lambda ~ cauchy(0,1);
  tau ~ cauchy(0,1);
  beta_raw ~ normal(0,1);
  y_stan ~ bernoulli_logit(mu + x_scaled_stan * beta);
}
(ii) I fit the model to about 1,600 observations and 54 predictors (after transforming categorical variables to dummies and specifying my interactions).
(iii) There were missing data, mostly sporadic. My strategy for handling this was as follows: construct 100 multiply imputed datasets, run the sampler on each imputed dataset for a warmup of 400 iterations plus 300 post-warmup iterations, and concatenate the results for posterior inference on the combined 100 x 300 = 30,000 draws, as sketched below.
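
A minimal sketch of that workflow in R, assuming mice and rstan; raw_data and make_stan_data() are hypothetical placeholders:

library(mice)
library(rstan)

imp <- mice(raw_data, m = 100)             # 100 multiply imputed datasets
horseshoe <- stan_model("horseshoe.stan")  # compile once, reuse for every imputation

draws <- vector("list", imp$m)
for (i in seq_len(imp$m)) {
  fit <- sampling(horseshoe,
                  data = make_stan_data(complete(imp, i)),  # hypothetical helper
                  chains = 1, warmup = 400, iter = 700)     # 300 post-warmup draws
  draws[[i]] <- as.matrix(fit)
}
combined <- do.call(rbind, draws)          # 100 x 300 = 30,000 draws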
(iv) I originally did the analysis using rstan v2.5.0, so I did not see warnings about divergent iterations. Everything looked reasonable, e.g. Rhats < 1.02, reasonable traceplots, clinically sensible findings. And the covariates the model identified as most impactful coincided with my collaborator's clinical prior. Our model performed decently, though not stellarly: AUC of about 0.68 in the training data and about 0.64 in the validation data, and fairly well calibrated.
(v) However, while doing some follow-up work (and having learned about the issues of sampling from the Cauchy distribution in Stan), and with the latest and greatest rstan (v2.9.0-3) in hand, I see now that about 1-2% of my iterations are reported as divergent, i.e. about 3-7 per chain of 300 iterations (counted as in the sketch below). Looking at paired plots and shinystan, the divergent iterations do not jump out at me as inherently pathological: no funnel shapes, no divergent iterations near the tails of the parameters or the posterior mode.
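
A sketch of the divergence count in rstan, with fit being the stanfit object for one imputed dataset:

sp <- get_sampler_params(fit, inc_warmup = FALSE)
n_div <- sapply(sp, function(x) sum(x[, "divergent__"]))  # divergences per chain
n_div / sapply(sp, nrow)                                  # proportion per chain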
My questions: The analysis is finalized at this point, with the results already in press. Much of the language in the RStan vignette and in the Google group discussions here suggests that observing any divergent iterations after warmup is unacceptable. Is that a fair interpretation? Is it the presence of divergent iterations or the proportion of divergent iterations that matters more? Or is that too general a question to ask? Is re-running the analysis with a larger target acceptance rate and/or t_3 tails warranted here, at least as a sensitivity analysis?

Thanks in advance,
Phil

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.2 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] shinystan_2.1.0    shiny_0.13.1       betareg_3.0-5      rstan_2.9.0-3      ggplot2_2.1.0    
[6] readstata13_0.8.2  mice_2.25          Rcpp_0.12.3        RColorBrewer_1.1-2


Bob Carpenter

Apr 5, 2016, 3:51:34 PM
to stan-...@googlegroups.com
(i) Michael --- are you still running a zero-tolerance policy on
divergent transitions?

(iii) Philip --- that imputation should be OK if all the chains
converged. 100 imputations is probably overkill. But I'm going to
post another question with a fresh subject to try to get Andrew's attention
to see what he thinks the impact on R-hat etc. would be.

(v) Cauchy posteriors are bad news because expectations aren't defined.
But Cauchy priors don't lead to Cauchy posteriors when there are more than
a few data points (get a theoretician or someone with experimental cycles
to answer more precisely).

Running with a higher adapt_delta (target acceptance rate) should answer
the question of whether the divergences matter here.
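
In rstan that is just a control argument; a sketch, reusing the hypothetical
names from the model-fitting sketch above:

fit2 <- sampling(horseshoe,
                 data = make_stan_data(complete(imp, i)),
                 chains = 1, warmup = 400, iter = 700,
                 control = list(adapt_delta = 0.99))  # default target is 0.8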

- Bob

Michael Betancourt

Apr 5, 2016, 4:10:45 PM
to stan-...@googlegroups.com

(i)  Michael --- are you still running a zero-tolerance policy on
divergent transitions?

Yes, with caveats for people who really know what they are doing (i.e. me*).
Basically, if the divergences can be ignored then you should be able to
get rid of them by increasing adapt_delta by a small amount.  If that
doesn’t work, then the divergences are meaningful.

*The only time I personally ignore divergences is when I can isolate
their origin to hierarchical effects that should be implemented with
a partially-centered parameterization, so not quite centered but not
quite noncentered.  But this requires a lot of expertise and I make
no guarantees of valid results if you ignore any divergences.

Bob Carpenter

Apr 5, 2016, 4:57:34 PM
to stan-...@googlegroups.com
That's roughly been the case for me up to a point. That is,
I can go from 10 or 20 divergences to 1 or 2 by monkeying with
adapt_delta and stepsize, but when I'm at adapt_delta=0.999 and
I've already moved to a non-centered parameterization, I often
still get 1 or 2 divergent transitions with 1K iterations in each
of 4 chains.

- Bob

Michael Betancourt

Apr 5, 2016, 5:16:32 PM
to stan-...@googlegroups.com
I’d look at plots of the hierarchical effects vs. the hierarchical variance.
My guess is that you have enough data that the non-centered parameterization
isn’t ideal, which manifests as an inverted funnel with divergences in the tail.
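
A sketch of that plot in R, with divergent draws highlighted; theta[1] and tau
are placeholder names for one hierarchical effect and the hierarchical scale:

post <- rstan::extract(fit, permuted = FALSE)[, 1, ]  # chain 1: iterations x parameters
div  <- get_sampler_params(fit, inc_warmup = FALSE)[[1]][, "divergent__"] == 1
plot(post[, "theta[1]"], log(post[, "tau"]),
     col = ifelse(div, "red", "grey"), pch = 16,
     xlab = "theta[1]", ylab = "log(tau)")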

Sometimes certain groups are more informative than others and you can
switch centering/noncentering group by group.

Once I get time I’ll start experimenting with partial centering so that we
can play around with that, too.

Michael Betancourt

Apr 5, 2016, 5:17:24 PM
to stan-...@googlegroups.com
But divergences that persist even at adapt_delta = 0.999 would make me
anxious. I usually accept the occasional divergence only when I’m near
the nominal adapt_delta = 0.8.

Bob Carpenter

Apr 5, 2016, 6:28:36 PM
to stan-...@googlegroups.com
These are all hierarchical logistic regressions with
mixed amounts of data by group. For instance, I'm working
on a very unbalanced hierarchical binomial model now that has
a lot of 0/0 observations and some 30K/60K observations.

Those 0/0 cases also dominate the hierarchical variance (basically
pegging it to the next level up's variance), so I've
got bigger problems for this model. The ones I did for the
case study were well behaved enough that I could get rid of all the
divergences with a low enough step size.

- Bob

Michael Betancourt

Apr 5, 2016, 6:35:21 PM
to stan-...@googlegroups.com
Then you’ll probably get a sufficient fix by mixing centered
and non-centered parameterizations by group.
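
A sketch of what that could look like in Stan, with hypothetical names;
is_centered flags which groups get the centered form:

data {
  int<lower=1> J;                       // number of groups
  int<lower=0,upper=1> is_centered[J];  // 1 = centered, 0 = non-centered
}
parameters {
  real mu;
  real<lower=0> sigma;
  vector[J] theta_raw;
}
transformed parameters {
  vector[J] theta;
  for (j in 1:J) {
    if (is_centered[j] == 1)
      theta[j] <- theta_raw[j];               // raw parameter is theta itself
    else
      theta[j] <- mu + sigma * theta_raw[j];  // raw parameter is standardized
  }
}
model {
  for (j in 1:J) {
    if (is_centered[j] == 1)
      theta_raw[j] ~ normal(mu, sigma);  // centered groups
    else
      theta_raw[j] ~ normal(0, 1);       // non-centered groups
  }
  // ... likelihood in terms of theta ...
}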

Philip Boonstra

Apr 5, 2016, 9:35:23 PM
to Stan users mailing list

(iii) Philip --- that imputation should be OK if all the chains
converged.  100 imputations is probably overkill.   But I'm going to
post another question with a fresh subject to try to get Andrew's attention
to see what he thinks the impact on R-hat etc. would be.

You're right, 100 isn't the "right" number, but some folks have suggested that the standard imputation recipe of 5-10 is probably insufficient for propagating imputation uncertainty if you're going to run MCMC on each imputed dataset, e.g. http://www.xzlab.org/papers/2010_Zhou&Reiter_TAS.pdf. Gelman et al. make this point in BDA, but I don't think they discuss the sufficient number of imputations.

Philip Boonstra

Apr 5, 2016, 10:35:58 PM
to Stan users mailing list


On Tuesday, April 5, 2016 at 5:17:24 PM UTC-4, Michael Betancourt wrote:
But divergences that persist even at adapt_delta = 0.999 would make me
anxious.  I usually accept the occasional divergence only when I’m near
the nominal adapt_delta = 0.8.

Hi. Thanks for your help. I'm still struggling to understand what's going on; I probably need to do some more reading on my own. Are any of these the right way to think about what's happening here?
(i) Divergence indicates that my posterior "distribution" is actually not a proper distribution. From this perspective, any divergent iteration is bad news.
(ii) Divergent iterations are contamination in my posterior sample that simply needs to be pruned out.
(iii) Divergent iterations are a symptom of something gone awry with the sampler, but not necessarily with the posterior distribution.

Thanks, Phil

Bob Carpenter

Apr 6, 2016, 12:50:33 AM
to stan-...@googlegroups.com
The answer is (iii) --- it's a computational issue.

There's not much out there to read! Basically just what Michael's
written --- in papers (see arXiv), in our manual, and on this
list and the dev list.

It's almost always caused by areas of high curvature in the posterior
which are difficult to follow using only small steps along the gradient.
It almost always manifests as floating-point underflow or overflow in
the density or gradient calculations, which we report as "divergent"
iterations.

If the curvature (second derivative matrix) doesn't vary by position
in the posterior (as in the multivariate normal), then adaptation
is much easier. When curvature varies a lot, as in a hierarchical
model (see the funnel example in the manual), adaptation to a single
step size/mass matrix isn't possible, and areas of the posterior become
hard to explore.

This also explains why reparameterization can help. Again, see the
funnel example in the manual for an extreme example, or read Michael's
paper with Mark Girolami on hierarchical models for much more detail
and an explanation of how data can tame the funnel prior.
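
For readers without the manual at hand, the funnel reparameterization looks
roughly like this: rather than sampling y ~ normal(0, 3) and
x ~ normal(0, exp(y/2)) directly, sample standardized variables and rescale:

parameters {
  real y_raw;
  vector[9] x_raw;
}
transformed parameters {
  real y;
  vector[9] x;
  y <- 3.0 * y_raw;         // implies y ~ normal(0, 3)
  x <- exp(y / 2) * x_raw;  // implies x ~ normal(0, exp(y/2))
}
model {
  y_raw ~ normal(0, 1);
  x_raw ~ normal(0, 1);
}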

It's hard to quantify how much bias there might be without getting
a better handle on what's really going on in a particular posterior. Usually
it's not very big in the examples I've looked at. You'll wind up getting
the same kinds of biases with Gibbs, Metropolis, and ensemble samplers; these are
hard problems computationally, and we're just trying to be helpful in providing
diagnostics.

- Bob

Guido Biele

Apr 8, 2016, 3:08:35 AM
to Stan users mailing list
I think the discussion of the relationship between the amount of data
and non-centered parameterization is important, but as far
as I can tell it is not in the manual. It would be great if the
Stan dev team could add a few sentences about this issue*,
even if there is no time for a full discussion of partial centering.

Guido

*in particular, when does one have "enough data so that
non-centered parameterization isn't ideal"? This could explain
why I recently could not fit a non-hierarchical regression
with non-centered parameterization (with N = 45 0000),
whereas a hierarchical regression with ~20 groups with group sizes
between 5 and 10 000 worked.
