2,395 views

Skip to first unread message

Feb 3, 2016, 4:05:11 PM2/3/16

to Stan users mailing list

Hi there,

to

When running the example model form the Lee and Wagenmakers book on MPTs (https://github.com/stan-dev/example-models/blob/master/Bayesian_Cognitive_Modeling/CaseStudies/MPT/MPT_2_Stan.R) it gives a lot of divergent transitions.

Specifically, the first fit (samples_1) gives 191 divergent transitions by 2000 post warm-up samples in 3 chains when producing the start values with set.seed(1). My question is how to deal with this by reparameterizing the model following the Stan recommendations. I am obviously willing to submit a pull request with the updated example model, once this is sorted out.

To give some background, the model is a hierarchical cognitive model of multinomial data. The hierarchical structure is implemented on the real space and then mapped onto the probability space using the probit (i.e., Phi). Further, the group means (i.e., hyperparameters) are independent parameters. The individual levels are modeled as displacements from the group means with a multivariate normal with mean 0 and covariance matrix Sigma. Following Gelman and Hill, the Sigma uses a Wishart prior. Parameter expansion is used as well on the displacements.

As far as I understand the documentation and mailing list, parameter expansions are not well suited for the Stan algorithm and the Wishart prior is also not the first choice anymore. Consequently, my first steps in updating the models were to removing the parameter expansion and switch from a Wishart to a LKJ prior.

Following the documentation (section 6. 12) this means the main change in the model is from

` Sigma ~ wishart(df, I); deltahat ~ multi_normal(mudeltahat, Sigma);`

` tau ~ cauchy(0,2.5);`

Omega ~ lkj_corr(4);

deltahat ~ multi_normal(mudeltahat, quad_form_diag(Omega, tau));

, where mudeltahat is a vector of zeros. Some other changes obviously are also necessary in the parameter block (e.g., removing the parameter expansion parameters).

This reduces the number of divergent transitions with the same initial values to 1. However, there is still this one and increasing adapt_delta does not seem to help here (I find some even with 0.95). Also, the traceplots (see attached) for the tau parameter and the log-posterior do not look totally optimal. And there are occasional "lkj_corr_log: y is not positive definite:" messages.

I now would like to implement the Cholesky Factorization as the next step also given in the documentation, but I fear am lacking the required Math skills to map the example given there onto my problem. Also, this problem seems well suited to what is commonly referred to as the Matt trick and I also have problems applying this here (I basically have no idea how to do it).

I attache the updated (and shortened) example which contains the model with the LKJ prior. Any help would be highly appreciated.

Cheers,

Henrik

Henrik

Feb 3, 2016, 4:31:26 PM2/3/16

to stan-...@googlegroups.com

We don't have the Wisharts implemented for Cholesky factors

and the Jacobians are pretty hairy to code up directly in Stan.

We're torn between wanting to reproduce the examples from

books and wanting to use priors we like better, like the scaled

LKJ priors on correlations.

So I'd suggest reducing initial step size and upping adapt_delta

to whatever it takes to get rid of the divergent transitions (.99

usually works). Then writing a separate model that uses the

Cholesky factor of the correlation matrix --- it's much more stable

numerically.

For that, you just want to follow the example in the manual:

parameters {

...

cholesky_factor_corr[K] L_Omega;

vector<lower=0>[K] sigma;

...

model {

...

L_Omega ~ lkj_cholesky_corr(4);

sigma ~ cauchy(0, 2.5);

deltahat ~ multi_normal_cholesky(mudeltahat, diag_pre_multiply(sigma, L_Omega));

This is assuming you don't have more info pushing L_Omega to unit

correlation or information about the scale of sigma.

- Bob

> --

> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.

> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.

> To post to this group, send email to stan-...@googlegroups.com.

> For more options, visit https://groups.google.com/d/optout.

> <traceplot_mpt3.pdf><MPT_3_Stan.R>

and the Jacobians are pretty hairy to code up directly in Stan.

We're torn between wanting to reproduce the examples from

books and wanting to use priors we like better, like the scaled

LKJ priors on correlations.

So I'd suggest reducing initial step size and upping adapt_delta

to whatever it takes to get rid of the divergent transitions (.99

usually works). Then writing a separate model that uses the

Cholesky factor of the correlation matrix --- it's much more stable

numerically.

For that, you just want to follow the example in the manual:

parameters {

...

cholesky_factor_corr[K] L_Omega;

vector<lower=0>[K] sigma;

...

model {

...

L_Omega ~ lkj_cholesky_corr(4);

sigma ~ cauchy(0, 2.5);

deltahat ~ multi_normal_cholesky(mudeltahat, diag_pre_multiply(sigma, L_Omega));

This is assuming you don't have more info pushing L_Omega to unit

correlation or information about the scale of sigma.

- Bob

> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.

> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.

> To post to this group, send email to stan-...@googlegroups.com.

> For more options, visit https://groups.google.com/d/optout.

> <traceplot_mpt3.pdf><MPT_3_Stan.R>

Feb 3, 2016, 4:49:00 PM2/3/16

to stan-...@googlegroups.com

You can also non-center the multivariate normal, a la

deltahat_tilde ~ normal(0, 1);

deltahat <- mudeltahat + diag_pre_multiply(sigma, L_Omega) * deltahat_tilde;

The long excursion to large values in the tau trace indicates

that your data is not really informing tau at all, which leaves

you with the heavy-tailed prior that HMC will fit okay but not

super effectively. You may need to move to a more informative

half Gaussian prior.

deltahat_tilde ~ normal(0, 1);

deltahat <- mudeltahat + diag_pre_multiply(sigma, L_Omega) * deltahat_tilde;

The long excursion to large values in the tau trace indicates

that your data is not really informing tau at all, which leaves

you with the heavy-tailed prior that HMC will fit okay but not

super effectively. You may need to move to a more informative

half Gaussian prior.

Feb 3, 2016, 4:57:25 PM2/3/16

to stan-...@googlegroups.com

I just tried the multivariate non-centering with the multivariate Poisson

regressions I posted yesterday and it didn't help. It got the same answer,

but didn't mix any more efficienty. That was a small (K=4 or something like that)

zero mean but full rank covariance case.

- Bob

regressions I posted yesterday and it didn't help. It got the same answer,

but didn't mix any more efficienty. That was a small (K=4 or something like that)

zero mean but full rank covariance case.

- Bob

Feb 3, 2016, 6:45:19 PM2/3/16

to Stan users mailing list

Hi Bob and Michael,

Thanks a lot for your very helpful responses. Let me address them in order.

1. Original model (using Wishart). Simply increasing adapt delta to .99 or .even 999 does not remove all divergent transitions, but most (there are still < 10). I didn't know and couldn't find out how to reduce the initial step size. Can you please tell me what is the corresponding control parameter?

2. The model that simply uses the Cholesky factor of the LKJ also still has very few divergent transitions. This can be dealt with by using a tighter prior on the sigma as suggested by Michael. When using normal(0, 1) it basically works and the log-posterior and posteriors of the sigmas look better (but still not awesome). The model then looks like:

`parameters { vector[nparams] deltahat[nsubjs]; cholesky_factor_corr[nparams] L_Omega; vector<lower=0>[nparams] sigma; ...} `

model { L_Omega ~ lkj_corr_cholesky(4); sigma ~ normal(0, 1);

` deltahat ~ multi_normal_cholesky(mudeltahat, diag_pre_multiply(sigma, L_Omega)); `

`...`

}

generated quantities { corr_matrix[nparams] Omega; Omega <- L_Omega * L_Omega';

...}

3. non-centering the multivariate works wonders. It took me a little to figure the right dimensions out, but then it completely removes the divergent transitions and makes all other statistics and the traceplot look pretty, pretty good. The model then looks like this:

`parameters {`

matrix[nparams,nsubjs] deltahat_tilde; cholesky_factor_corr[nparams] L_Omega; vector<lower=0>[nparams] sigma; ...} transformed parameters { matrix[nsubjs,nparams] deltahat; deltahat <- (diag_pre_multiply(sigma, L_Omega) * deltahat_tilde)';

...}

model { L_Omega ~ lkj_corr_cholesky(4); sigma ~ cauchy(0, 2.5); to_vector(deltahat_tilde) ~ normal(0, 1); ...

}

generated quantities { corr_matrix[nparams] Omega; Omega <- L_Omega * L_Omega';

...

}"

Given that in this case the displacements deltahat are zero centered, I removed the additive term before the parentheses in the transformed parameters block. Otherwise, it also would need to be a nsubjs by nparams matrix, mudeltahat, consisting of zeros.

Just to be clear, this third option is the Matt trick?

Thanks,

Henrik

PS: Attached the two full model versions.

Feb 4, 2016, 12:35:17 AM2/4/16

to stan-...@googlegroups.com

Hi Henrik,

If you're using RStan you can specify stepwise using the argument

control = list(stepsize = value)

You can also add adapt_delta to the control list together with stepwise.

Also, yes, the non-centered parameterization has also been referred to as the Matt Trick,

although we're trying to use a more informative name. The name "non-centered parameterization"

also leaves something to be desired.

Jonah

Feb 4, 2016, 4:00:18 AM2/4/16

to stan-...@googlegroups.com

> 1. Original model (using Wishart). Simply increasing adapt delta to .99 or .even 999 does not remove all divergent transitions, but most (there are still < 10). I didn't know and couldn't find out how to reduce the initial step size. Can you please tell me what is the corresponding control parameter?

Increasing adapt_delta isn’t fool proof — it can resolve small issues but in many cases
diverges are caused by regions of high curvature in the posterior that would require a

vanishing step size to fully explore.

> Just to be clear, this third option is the Matt trick?

hierarchical models which was developed at Columbia independently of a few

other places (one of Radford Neal’s students, for example). More recently we’ve

been applying it to other places (time series, multivariate Gaussians, etc)

But the general approach actually has a long history in the Markov chain Monte

Carlo literature — it’s been used to speed up Gibbs samplers for a long time, but

for whatever reason it wasn’t all that well known in the applied community.

“Centered” and “non-centered” have technical meanings and these names

make sense in that technical context.

Feb 5, 2016, 7:46:19 PM2/5/16

to Stan users mailing list

Thanks for those clarifications. Again very helpful.

I have now managed to get the original model to run without divergent transitions and also tweaked the reparameterized models to run without those. Consequently, I have submitted a pull request with my changes so that the example model will get updated to not produce any more divergent transitions and also include the recommended parameterizations: https://github.com/stan-dev/example-models/pull/42

I have only one more question: The stepsize control argument to Stan does not seem to have a default value (at least there is no one given in the R help page). Does this mean the value is picked algorithmically? How does picking a small value like 0.001 help with convergence then?

I unfortunately could not figure this out from the documentation.

Cheers,

Henrik

I have now managed to get the original model to run without divergent transitions and also tweaked the reparameterized models to run without those. Consequently, I have submitted a pull request with my changes so that the example model will get updated to not produce any more divergent transitions and also include the recommended parameterizations: https://github.com/stan-dev/example-models/pull/42

I have only one more question: The stepsize control argument to Stan does not seem to have a default value (at least there is no one given in the R help page). Does this mean the value is picked algorithmically? How does picking a small value like 0.001 help with convergence then?

I unfortunately could not figure this out from the documentation.

Cheers,

Henrik

Feb 6, 2016, 12:34:14 AM2/6/16

to stan-...@googlegroups.com

I believe the default value is 1 --- that's the default in

CmdStan.

The step size is just the initial step size. It lets

the first few iterations move around a bit and set relative

scales on the parameters. It'll also reduce numerical issues.

On the negative side, it will also be slower because it'll take more

steps at a smaller step size before hitting a U-turn.

The adapt_delta (target acceptance rate) determines what the

step size will be during sampling --- the higher the accept

rate, the lower the step size has to be. The lower the step

size, the less likely there are to be divergent (numerically

unstable) transitions.

- Bob

CmdStan.

The step size is just the initial step size. It lets

the first few iterations move around a bit and set relative

scales on the parameters. It'll also reduce numerical issues.

On the negative side, it will also be slower because it'll take more

steps at a smaller step size before hitting a U-turn.

The adapt_delta (target acceptance rate) determines what the

step size will be during sampling --- the higher the accept

rate, the lower the step size has to be. The lower the step

size, the less likely there are to be divergent (numerically

unstable) transitions.

- Bob

Feb 6, 2016, 1:14:39 AM2/6/16

to stan-...@googlegroups.com

For v3 maybe we should rename the argument initial_stepsize. This question has come up a bunch and it's usually just confusion over the name.

Feb 6, 2016, 7:46:32 PM2/6/16

to stan-...@googlegroups.com

That would've been a better name --- it's overloaded as

initial step size for adaptation and the only step size

if there's no adaptation. This is one of the problems with

having all the arguments depend on each other.

- Bob

> On Feb 6, 2016, at 1:14 AM, Jonah Sol Gabry <jga...@gmail.com> wrote:

>

> For v3 maybe we should rename the argument initial_stepsize. This question has come up a bunch and it's usually just confusion over the name.

>

>

initial step size for adaptation and the only step size

if there's no adaptation. This is one of the problems with

having all the arguments depend on each other.

- Bob

> On Feb 6, 2016, at 1:14 AM, Jonah Sol Gabry <jga...@gmail.com> wrote:

>

> For v3 maybe we should rename the argument initial_stepsize. This question has come up a bunch and it's usually just confusion over the name.

>

>

Feb 6, 2016, 8:02:01 PM2/6/16

to stan-...@googlegroups.com

Except adding more arguments doesn’t necessarily help as engaging

adaptation would make arguments like “step size” irrelevant entirely.

Unless function calls were refined down to every possible configuration

and then users would have to continuously look up which function they

should call in which case.

adaptation would make arguments like “step size” irrelevant entirely.

Unless function calls were refined down to every possible configuration

and then users would have to continuously look up which function they

should call in which case.

Feb 7, 2016, 4:02:47 PM2/7/16

to stan-...@googlegroups.com

Everyone agrees the following two things are good:

1. small number of function names

2. clearly named arguments with minimial interactions

The only problem is that they pull the interface in opposite directions.

At one extreme is the hierarchical CmdStan structure, though

arguably, that breaks down into a few subfunctions based on

whether it's "sample" or "optimize" and then again based

on the engine (can't recall if that's overloaded for sampling

and optimization).

At the two extremes, we have

hmc_static_adaptive_diag(...)

hmc_static_adaptive_dense(...)

hmc_static_fixed_diag(...)

...

nuts_fixed_dense(...)

vs.

stan(engine=hmc, algorithm=static, adapt=yes, mass=diag_e, ...)

stan(engine=nuts, algorithm=fixed, adapt=yes, mass=dense_e, ...)

In the former case, all the interaction is hacked into the names.

To learn the whole command structure, you need to understand the

range of functions. In the latter case, you need to understand how

all the arguments interact.

The doc for the former will require a high-level overview, then a lot

of redundancy per function; the doc for the latter will in one place,

but it'll be deep and have argument interactions.

- Bob

1. small number of function names

2. clearly named arguments with minimial interactions

The only problem is that they pull the interface in opposite directions.

At one extreme is the hierarchical CmdStan structure, though

arguably, that breaks down into a few subfunctions based on

whether it's "sample" or "optimize" and then again based

on the engine (can't recall if that's overloaded for sampling

and optimization).

At the two extremes, we have

hmc_static_adaptive_diag(...)

hmc_static_adaptive_dense(...)

hmc_static_fixed_diag(...)

...

nuts_fixed_dense(...)

vs.

stan(engine=hmc, algorithm=static, adapt=yes, mass=diag_e, ...)

stan(engine=nuts, algorithm=fixed, adapt=yes, mass=dense_e, ...)

In the former case, all the interaction is hacked into the names.

To learn the whole command structure, you need to understand the

range of functions. In the latter case, you need to understand how

all the arguments interact.

The doc for the former will require a high-level overview, then a lot

of redundancy per function; the doc for the latter will in one place,

but it'll be deep and have argument interactions.

- Bob

Reply all

Reply to author

Forward

0 new messages

Search

Clear search

Close search

Google apps

Main menu