Divergent transitions (what to try after adapt_delta and non-centered parameterisation)


Jon Sjoberg

Mar 7, 2017, 8:09:42 AM
to Stan users mailing list
I have this model for hierarchical binomial regression. It seems to work well, except that it gets 1 - 3 divergent transitions even with adapt_delta = .999. I have a fair amount of data, so if I use a non-centered parameterisation I get around 200 divergent transitions.

From what I've understood from Michael Betancourt's talks, even these few divergent transitions could be a sign of a real problem. How can I find out whether this is something I need to worry about?

What else can I try (apart from adapt_delta and a non-centered parameterisation) to get rid of the divergent transitions?

Could it be a problem if the groups in the hierarchies are un-balanced in size, or that there just isn't enough (information in the) data to properly fit this model?

...
parameters {
  real a;
  vector[N_p] ap;
  vector[N_c] ac;
  vector[N_s] as;
  vector[N_c_c] acc;
  vector[N_lc] alc;
  real alf;
  real aser;

  real<lower=0> ss;
  real<lower=0> sp;
  real<lower=0> sc;
  real<lower=0> scc;
  real<lower=0> slc;

  real bme;
}

model {
  vector[N] t;

  for (n in 1:N) {
    t[n] = a + ap[p[n]] + ac[c[n]] + as[s[n]] + acc[c_c[n]] + alc[lc[n]] + alf*lf[n] + aser*ser[n] + bme*m_e[n];
  }

  u_s_a ~ binomial_logit(u_s_o, t);

  a ~ normal(0, 10);
  ap ~ normal(0, sp);
  ac ~ normal(0, sc);
  as ~ normal(0, ss);
  acc ~ normal(0, scc);
  alc ~ normal(0, slc);

  alf ~ normal(0, 1);
  aser ~ normal(0, 1);

  sp ~ cauchy(0, 1);
  sc ~ cauchy(0, 1);
  ss ~ cauchy(0, 1);
  scc ~ cauchy(0, 1);
  slc ~ cauchy(0, 1);
}
...

Michael Betancourt

Mar 7, 2017, 8:44:27 AM
to stan-...@googlegroups.com

To figure out what’s going on you’ll have to start investigating where the divergences
occur in the posterior.  With only 1-3 that won’t be easy but it can provide some critical
information.

Are your fixed or random effects becoming strongly linearly correlated?  That can happen
when you have this many and lots of data.  Really strong linear correlations can cause
stability issues which trigger divergences.  Ultimately you might need to tighten up
your priors a bit.

What’s n_eff / iteration for all of the parameters?


Jon Sjoberg

Mar 7, 2017, 9:19:23 AM
to Stan users mailing list
Yeah, there seems to be quite strong linear correlation between at least some of the random effects, so I'll look into the priors.

The n_eff/iteration is pretty bad, ~0.02 - ~0.25, with most of them around ~0.1.

Michael Betancourt

Mar 7, 2017, 9:25:36 AM
to stan-...@googlegroups.com
Are you hitting the max_treedepth limit?


    t[n] = a + ap[p[n]] + ac[c[n]] + as[s[n]] + acc[c_c[n]] + alc[lc[n]] + alf*lf[n] + aser*ser[n] + bme*m_e[n];

Jon Sjoberg

Mar 7, 2017, 10:38:35 AM
to Stan users mailing list

I'm not sure how one finds the treedepth information other than from shinystan, where it looks like the attached image. Based on that, and since I don't see any other warnings, I would say no, I'm not hitting the limit.

Michael Betancourt

Mar 7, 2017, 10:57:18 AM
to stan-...@googlegroups.com

I think there may be an off-by-one error in the visualization as it certainly
looks like you’re saturating the tree depth, which would also account for
the exceedingly slow mixing.  It could also be caused by the intense
correlations of the weakly-identified posterior causing the NUTS criterion
to be prematurely satisfied.

Aside from increasing max_treedepth and seeing if that changes anything,
you’ll want to look into putting some kind of sum-to-one constraint on the
fixed/random effects (typically by setting the last one to be 1 - sum of the
others) or using the QR decomposition, as recently discussed on the list
and, I believe, in the manual.
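
The QR suggestion above could look roughly like the following. This is only a minimal sketch, not the poster's model: it assumes the three predictors (lf, ser, m_e) are packed into a hypothetical design matrix X, that u_s_o holds the trial counts and u_s_a the success counts, and that a recent Stan version is used (qr_thin_Q, qr_thin_R and the array[N] int syntax; older versions write int u_s_o[N];). The group intercepts are left out for brevity.

```stan
data {
  int<lower=2> N;                  // number of observations
  int<lower=1> K;                  // number of predictors (3 in the thread's model)
  matrix[N, K] X;                  // hypothetical design matrix: columns lf, ser, m_e
  array[N] int<lower=0> u_s_o;     // assumed to be the number of trials
  array[N] int<lower=0> u_s_a;     // assumed to be the number of successes
}
transformed data {
  // Thin QR decomposition, rescaled so the columns of Q_ast have unit scale
  matrix[N, K] Q_ast = qr_thin_Q(X) * sqrt(N - 1);
  matrix[K, K] R_ast = qr_thin_R(X) / sqrt(N - 1);
  matrix[K, K] R_ast_inverse = inverse(R_ast);
}
parameters {
  real a;                          // global intercept
  vector[K] theta;                 // coefficients on the orthogonalised predictors
}
model {
  a ~ normal(0, 10);
  theta ~ normal(0, 1);            // note: this prior is on theta, not on the original slopes
  u_s_a ~ binomial_logit(u_s_o, a + Q_ast * theta);
}
generated quantities {
  // Coefficients on the original predictor scale
  vector[K] beta = R_ast_inverse * theta;
}
```

The reparameterisation only removes correlation that comes from the predictors themselves, so it would not by itself address the trade-off between the global intercept and the group intercepts.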

Bob Carpenter

Mar 7, 2017, 10:59:25 AM
to stan-...@googlegroups.com

> On Mar 7, 2017, at 10:38 AM, Jon Sjoberg <jon.s...@gmail.com> wrote:
>
> I'm not sure how one finds the treedepth information, other than from shinystan,

https://cran.r-project.org/web/packages/rstan/vignettes/stanfit-objects.html

All those different group intercepts plus a global intercept mean that the
model's only identified by the prior --- you can add to one group and subtract
from another with no difference in likelihood.

The program you provided is not non-centered, because sigma != 1. You
only non-centered the location (necessary given that you have a global intercept)
and not the scale. You could also try starting without the hierarchical structure
and see what you get.

You can also vectorize this

for(n in 1:N)
t[n] = a + ap[p[n]] + ac[c[n]] + as[s[n]] + acc[c_c[n]] + alc[lc[n]] + alf*lf[n] + aser*ser[n] +bme*m_e[n];

as

t = a + ap[p] + ac[c] + as[s] + acc[c_c] + ... + aser * ser + bme * m_e;

assuming these are all defined as vectors rather than arrays.

- Bob
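
For reference, non-centering both the location and the scale of one group effect might look like the sketch below. It is reduced to the global intercept and the ap effect only; ap_raw is a hypothetical name, the data block is assumed (the original post did not show it), and the array[N] int syntax requires a recent Stan version.

```stan
data {
  int<lower=1> N;                        // number of observations
  int<lower=1> N_p;                      // number of groups for the ap effect
  array[N] int<lower=1, upper=N_p> p;    // group index per observation
  array[N] int<lower=0> u_s_o;           // assumed to be the number of trials
  array[N] int<lower=0> u_s_a;           // assumed to be the number of successes
}
parameters {
  real a;                                // global intercept
  vector[N_p] ap_raw;                    // standardised group effects (hypothetical name)
  real<lower=0> sp;                      // group-level scale
}
transformed parameters {
  // Scale is applied here, so ap_raw itself has a unit-scale prior
  vector[N_p] ap = sp * ap_raw;
}
model {
  a ~ normal(0, 10);
  ap_raw ~ normal(0, 1);                 // implies ap ~ normal(0, sp)
  sp ~ cauchy(0, 1);
  u_s_a ~ binomial_logit(u_s_o, a + ap[p]);
}
```

The same pattern would be repeated for ac, as, acc and alc with their own scales. Whether this samples better than the centered form depends on how informative the data are for each group, which fits the original report that non-centering made things worse here.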

Ben Goodrich

Mar 7, 2017, 12:03:01 PM
to Stan users mailing list
> On Tuesday, March 7, 2017 at 10:57:18 AM UTC-5, Michael Betancourt wrote:
>
> I think there may be an off-by-one error in the visualization as it certainly
> looks like you’re saturating the tree depth, which would also account for
> the exceedingly slow mixing.

For the default value of 10 for max_treedepth, it can U-turn on 10 or not U-turn on 10. But in the latter case, there will be warnings (top left) and the pairs plot will have yellow dots (by default) whenever that occurs.



Ben

Jon Sjoberg

Mar 7, 2017, 6:03:59 PM
to Stan users mailing list
Thanks for all the input! Increasing max_treedepth helped bring up the n_eff/iteration to ~.125 - ~.5, but the divergent transitions are still there.

You don't happen to have an example of a sum-to-one constraint in a model like this? I have an idea of how I would implement it, but it's always nice to have something to compare to so I don't miss anything.

Anyway, I will look into that and QR decomposition tomorrow, again, thanks for all the help!

Bob Carpenter

Mar 8, 2017, 12:56:24 AM
to stan-...@googlegroups.com
I think Michael meant sum-to-zero. There's an explanation
in the manual in section 8.7, Parameterizing Centered Vectors.

- Bob
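
A sum-to-zero constraint along those lines can be sketched as follows, shown for a single group effect. This is an illustration rather than the manual's exact text: ap_free is a hypothetical name, the data block is assumed, and the array[N] int syntax requires a recent Stan version.

```stan
data {
  int<lower=1> N;                        // number of observations
  int<lower=2> N_p;                      // number of groups for the ap effect
  array[N] int<lower=1, upper=N_p> p;    // group index per observation
  array[N] int<lower=0> u_s_o;           // assumed to be the number of trials
  array[N] int<lower=0> u_s_a;           // assumed to be the number of successes
}
parameters {
  real a;                                // global intercept
  vector[N_p - 1] ap_free;               // first N_p - 1 group effects (hypothetical name)
  real<lower=0> sp;                      // group-level scale
}
transformed parameters {
  // Last element is the negative sum of the others, so sum(ap) == 0
  vector[N_p] ap = append_row(ap_free, -sum(ap_free));
}
model {
  a ~ normal(0, 10);
  sp ~ cauchy(0, 1);
  ap_free ~ normal(0, sp);               // prior on the free elements only
  u_s_a ~ binomial_logit(u_s_o, a + ap[p]);
}
```

With the effects forced to sum to zero, a constant can no longer be moved between a and ap without changing the likelihood. One caveat: putting the prior on ap_free treats the last group slightly differently from the others, so the posterior is not identical to that of the unconstrained model.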

Jon Sjoberg

Mar 16, 2017, 3:54:54 AM
to Stan users mailing list
It seems like one of my problems was that one of the parameters I used a vector for actually had only two values in the data I used to fit the model. Fixing that, together with increasing the treedepth, made everything a lot better. But thanks for the tips about QR decomposition and parameterizing centered vectors; they will surely come in handy in the future.