Dealing with divergent transitions in example model: From Wishart to LKJ priors for correlation.

2219 views
Skip to first unread message

Henrik Singmann

unread,
Feb 3, 2016, 4:05:11 PM2/3/16
to Stan users mailing list
Hi there,

When running the example model form the Lee and Wagenmakers book on MPTs (https://github.com/stan-dev/example-models/blob/master/Bayesian_Cognitive_Modeling/CaseStudies/MPT/MPT_2_Stan.R) it gives a lot of divergent transitions. 

Specifically, the first fit (samples_1) gives 191 divergent transitions by 2000 post warm-up samples in 3 chains when producing the start values with set.seed(1). My question is how to deal with this by reparameterizing the model following the Stan recommendations. I am obviously willing to submit a pull request with the updated example model, once this is sorted out.

To give some background, the model is a hierarchical cognitive model of multinomial data. The hierarchical structure is implemented on the real space and then mapped onto the probability space using the probit (i.e., Phi). Further, the group means (i.e., hyperparameters) are independent parameters. The individual levels are modeled as displacements from the group means with a multivariate normal with mean 0 and covariance matrix Sigma. Following Gelman and Hill, the Sigma uses a Wishart prior. Parameter expansion is used as well on the displacements.

As far as I understand the documentation and mailing list, parameter expansions are not well suited for the Stan algorithm and the Wishart prior is also not the first choice anymore. Consequently, my first steps in updating the models were to removing the parameter expansion and switch from a Wishart to a LKJ prior.

Following the documentation (section 6. 12) this means the main change in the model is from

  Sigma ~ wishart(df, I);
  deltahat ~ multi_normal(mudeltahat, Sigma);

to
 
  tau ~ cauchy(0,2.5);
  
Omega ~ lkj_corr(4);
  deltahat
~ multi_normal(mudeltahat, quad_form_diag(Omega, tau));

, where mudeltahat is a vector of zeros. Some other changes obviously are also necessary in the parameter block (e.g., removing the parameter expansion parameters).

This reduces the number of divergent transitions with the same initial values to 1. However, there is still this one and increasing adapt_delta does not seem to help here (I find some even with 0.95). Also, the traceplots (see attached) for the tau parameter and the log-posterior do not look totally optimal. And there are occasional "lkj_corr_log: y is not positive definite:" messages.

I now would like to implement the Cholesky Factorization as the next step also given in the documentation, but I fear am lacking the required Math skills to map the example given there onto my problem. Also, this problem seems well suited to what is commonly referred to as the Matt trick and I also have problems applying this here (I basically have no idea how to do it). 

I attache the updated (and shortened) example which contains the model with the LKJ prior. Any help would be highly appreciated.

Cheers,
Henrik



traceplot_mpt3.pdf
MPT_3_Stan.R

Bob Carpenter

unread,
Feb 3, 2016, 4:31:26 PM2/3/16
to stan-...@googlegroups.com
We don't have the Wisharts implemented for Cholesky factors
and the Jacobians are pretty hairy to code up directly in Stan.

We're torn between wanting to reproduce the examples from
books and wanting to use priors we like better, like the scaled
LKJ priors on correlations.

So I'd suggest reducing initial step size and upping adapt_delta
to whatever it takes to get rid of the divergent transitions (.99
usually works). Then writing a separate model that uses the
Cholesky factor of the correlation matrix --- it's much more stable
numerically.

For that, you just want to follow the example in the manual:

parameters {
...
cholesky_factor_corr[K] L_Omega;
vector<lower=0>[K] sigma;
...
model {
...
L_Omega ~ lkj_cholesky_corr(4);
sigma ~ cauchy(0, 2.5);
deltahat ~ multi_normal_cholesky(mudeltahat, diag_pre_multiply(sigma, L_Omega));

This is assuming you don't have more info pushing L_Omega to unit
correlation or information about the scale of sigma.

- Bob
> --
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> To post to this group, send email to stan-...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
> <traceplot_mpt3.pdf><MPT_3_Stan.R>

Michael Betancourt

unread,
Feb 3, 2016, 4:49:00 PM2/3/16
to stan-...@googlegroups.com
You can also non-center the multivariate normal, a la

deltahat_tilde ~ normal(0, 1);
deltahat <- mudeltahat + diag_pre_multiply(sigma, L_Omega) * deltahat_tilde;

The long excursion to large values in the tau trace indicates
that your data is not really informing tau at all, which leaves
you with the heavy-tailed prior that HMC will fit okay but not
super effectively. You may need to move to a more informative
half Gaussian prior.

Bob Carpenter

unread,
Feb 3, 2016, 4:57:25 PM2/3/16
to stan-...@googlegroups.com
I just tried the multivariate non-centering with the multivariate Poisson
regressions I posted yesterday and it didn't help. It got the same answer,
but didn't mix any more efficienty. That was a small (K=4 or something like that)
zero mean but full rank covariance case.

- Bob

Henrik Singmann

unread,
Feb 3, 2016, 6:45:19 PM2/3/16
to Stan users mailing list
Hi Bob and Michael, 

Thanks a lot for your very helpful responses. Let me address them in order.

1. Original model (using Wishart). Simply increasing adapt delta to .99 or .even 999 does not remove all divergent transitions, but most (there are still < 10). I didn't know and couldn't find out how to reduce the initial step size. Can you please tell me what is the corresponding control parameter?

2. The model that simply uses the Cholesky factor of the LKJ also still has very few divergent transitions. This can be dealt with by using a tighter prior on the sigma as suggested by Michael. When using normal(0, 1) it basically works and the log-posterior and posteriors of the sigmas look better (but still not awesome). The model then looks like:
parameters {
  vector[nparams] deltahat[nsubjs]; 
  cholesky_factor_corr[nparams] L_Omega; 
  vector<lower=0>[nparams] sigma; 
...

model {
  L_Omega ~ lkj_corr_cholesky(4); 
  sigma ~ normal(0, 1); 
  deltahat ~ multi_normal_cholesky(mudeltahat, diag_pre_multiply(sigma, L_Omega)); 
...
}
generated quantities {
  corr_matrix[nparams] Omega;  
  Omega <- L_Omega * L_Omega';
...
}

3. non-centering the multivariate works wonders. It took me a little to figure the right dimensions out, but then it completely removes the divergent transitions and makes all other statistics and the traceplot look pretty, pretty good. The model then looks like this:
parameters {
  matrix[nparams,nsubjs] deltahat_tilde; 
  cholesky_factor_corr[nparams] L_Omega; 
  vector<lower=0>[nparams] sigma; 
...
transformed parameters {  
  matrix[nsubjs,nparams] deltahat; 
  deltahat <- (diag_pre_multiply(sigma, L_Omega) * deltahat_tilde)'; 
...
}
model {
  L_Omega ~ lkj_corr_cholesky(4); 
  sigma ~ cauchy(0, 2.5); 
  to_vector(deltahat_tilde) ~ normal(0, 1); 
...
}
generated quantities {
  corr_matrix[nparams] Omega;  
  Omega <- L_Omega * L_Omega';
...
}"

Given that in this case the displacements deltahat are zero centered, I removed the additive term before the parentheses in the transformed parameters block. Otherwise, it also would need to be a nsubjs by nparams matrix, mudeltahat, consisting of zeros. 
Just to be clear, this third option is the Matt trick?

I will now prepare the pull request with the new versions as well as an accompanying README file detailing why and how it diverges from the book. Please just let me know, what I should do with the original model (just increase adapt_delta or more)?

Thanks,
Henrik

PS: Attached the two full model versions.
MPT_4_Stan.R
MPT_5_Stan.R

Jonah Gabry

unread,
Feb 4, 2016, 12:35:17 AM2/4/16
to stan-...@googlegroups.com
Hi Henrik,

If you're using RStan you can specify stepwise using the argument

control = list(stepsize = value)

You can also add adapt_delta to the control list together with stepwise.

Also, yes, the non-centered parameterization has also been referred to as the Matt Trick, 

although we're trying to use a more informative name. The name "non-centered parameterization" 

also leaves something to be desired.

Jonah

Michael Betancourt

unread,
Feb 4, 2016, 4:00:18 AM2/4/16
to stan-...@googlegroups.com
> 1. Original model (using Wishart). Simply increasing adapt delta to .99 or .even 999 does not remove all divergent transitions, but most (there are still < 10). I didn't know and couldn't find out how to reduce the initial step size. Can you please tell me what is the corresponding control parameter?

Increasing adapt_delta isn’t fool proof — it can resolve small issues but in many cases
diverges are caused by regions of high curvature in the posterior that would require a
vanishing step size to fully explore.

> Just to be clear, this third option is the Matt trick?

Sort of. The “Matt trick” was originally a reparameterization of random effects in
hierarchical models which was developed at Columbia independently of a few
other places (one of Radford Neal’s students, for example). More recently we’ve
been applying it to other places (time series, multivariate Gaussians, etc)

But the general approach actually has a long history in the Markov chain Monte
Carlo literature — it’s been used to speed up Gibbs samplers for a long time, but
for whatever reason it wasn’t all that well known in the applied community.

“Centered” and “non-centered” have technical meanings and these names
make sense in that technical context.

Henrik Singmann

unread,
Feb 5, 2016, 7:46:19 PM2/5/16
to Stan users mailing list
Thanks for those clarifications. Again very helpful.

I have now managed to get the original model to run without divergent transitions and also tweaked the reparameterized models to run without those. Consequently, I have submitted a pull request with my changes so that the example model will get updated to not produce any more divergent transitions and also include the recommended parameterizations: https://github.com/stan-dev/example-models/pull/42

I have only one more question: The stepsize control argument to Stan does not seem to have a default value (at least there is no one given in the R help page). Does this mean the value is picked algorithmically? How does picking a small value like 0.001 help with convergence then?
I unfortunately could not figure this out from the documentation.

Cheers,
Henrik

Bob Carpenter

unread,
Feb 6, 2016, 12:34:14 AM2/6/16
to stan-...@googlegroups.com
I believe the default value is 1 --- that's the default in
CmdStan.

The step size is just the initial step size. It lets
the first few iterations move around a bit and set relative
scales on the parameters. It'll also reduce numerical issues.
On the negative side, it will also be slower because it'll take more
steps at a smaller step size before hitting a U-turn.

The adapt_delta (target acceptance rate) determines what the
step size will be during sampling --- the higher the accept
rate, the lower the step size has to be. The lower the step
size, the less likely there are to be divergent (numerically
unstable) transitions.

- Bob

Jonah Sol Gabry

unread,
Feb 6, 2016, 1:14:39 AM2/6/16
to stan-...@googlegroups.com
For v3 maybe we should rename the argument initial_stepsize. This question has come up a bunch and it's usually just confusion over the name.

Bob Carpenter

unread,
Feb 6, 2016, 7:46:32 PM2/6/16
to stan-...@googlegroups.com
That would've been a better name --- it's overloaded as
initial step size for adaptation and the only step size
if there's no adaptation. This is one of the problems with
having all the arguments depend on each other.

- Bob

> On Feb 6, 2016, at 1:14 AM, Jonah Sol Gabry <jga...@gmail.com> wrote:
>
> For v3 maybe we should rename the argument initial_stepsize. This question has come up a bunch and it's usually just confusion over the name.
>
>

Michael Betancourt

unread,
Feb 6, 2016, 8:02:01 PM2/6/16
to stan-...@googlegroups.com
Except adding more arguments doesn’t necessarily help as engaging
adaptation would make arguments like “step size” irrelevant entirely.
Unless function calls were refined down to every possible configuration
and then users would have to continuously look up which function they
should call in which case.

Bob Carpenter

unread,
Feb 7, 2016, 4:02:47 PM2/7/16
to stan-...@googlegroups.com
Everyone agrees the following two things are good:

1. small number of function names

2. clearly named arguments with minimial interactions

The only problem is that they pull the interface in opposite directions.

At one extreme is the hierarchical CmdStan structure, though
arguably, that breaks down into a few subfunctions based on
whether it's "sample" or "optimize" and then again based
on the engine (can't recall if that's overloaded for sampling
and optimization).

At the two extremes, we have

hmc_static_adaptive_diag(...)

hmc_static_adaptive_dense(...)

hmc_static_fixed_diag(...)

...

nuts_fixed_dense(...)

vs.

stan(engine=hmc, algorithm=static, adapt=yes, mass=diag_e, ...)

stan(engine=nuts, algorithm=fixed, adapt=yes, mass=dense_e, ...)

In the former case, all the interaction is hacked into the names.
To learn the whole command structure, you need to understand the
range of functions. In the latter case, you need to understand how
all the arguments interact.

The doc for the former will require a high-level overview, then a lot
of redundancy per function; the doc for the latter will in one place,
but it'll be deep and have argument interactions.

- Bob
Reply all
Reply to author
Forward
0 new messages