Hierarchical shrinkage prior

255 views
Skip to first unread message

David West

unread,
Sep 5, 2016, 7:48:11 PM9/5/16
to Stan users mailing list
Dear experts,

I'm just a beginner in Bayesian statistics but really likes Stan and the nice discussions on this mailing list. I'm trying to implement a hierarchical shrinkage prior with fat tails in Stan. This leads to many divergent transitions. I use a non-centered parametrization for the normal distribution in the model but this didn't solve all the divergent transitions. I'm currently in doubt what I should do next to prevent divergent transitions. From the manual and discussion on this mailing list I learnt that alternative parameterizations might prevent divergent transitions. From the manual I understood that there are possibilities to use a non-centered parameterization of the Dirichlet distribution (but I had troubles to implement this) and from discussions on this mailing list I understood that there might be alternative parameterizations of the gamma distribution. Or should I go another way?

On the other hand, although there are many divergent transitions, estimates were pretty good in simulated data. From posts of Michael Betancourt I know however, that I shouldn't trust outcomes if there are any divergent transitions so I hope that anyone can help me solving these divergent transitions. I included R code to simulate data and a stan file including the model.

Any help would be greatly appreciated!
D
Simulated_data1.R
HierShrPrior1.stan

Bob Carpenter

unread,
Sep 6, 2016, 7:03:47 AM9/6/16
to stan-...@googlegroups.com
The real problem you have is a multiplicative non-identifiability:

parameters {
vector[D] beta_raw;
vector<lower=0>[D] psi;
simplex[D] phi;
real<lower=0> omega;
}
transformed parameters {
vector[D] beta;
beta = beta_raw .* (0.5 * omega * (psi .* phi));
}
model {
omega ~ gamma(a, ksi);
ksi ~ gamma(b, 1);
phi ~ dirichlet(a_pi);
psi ~ exponential(0.5);
beta_raw ~ normal(0, 1);
temp_intercept ~ normal(0, 10);
y ~ bernoulli_logit(x * beta + temp_intercept);

(Variables not included above are data.)

The scale of phi is fixed as a simplex, but the omega * psi
term isn't identifiable (multiply omega and divide psi by
a constant) without a prior that fixes the scale of one of
them. I'm not sure psi ~ exponential(0.5) is enough to do that
effectively.

- Bob
> --
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> To post to this group, send email to stan-...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
> <Simulated_data1.R><HierShrPrior1.stan>

David West

unread,
Sep 6, 2016, 2:55:58 PM9/6/16
to Stan users mailing list
Hi Bob,

Thanks a lot for your explanation! I read Gelman and Hill 2007 about identifiability. I think that I understand the concept but have troubles how to solve it in this model. Changing exponential(beta), with beta ranging from 0.1 to 5 did not change anything because the multiplicative non-identifiability was left unchanged, in fact. You suggested to fix the scale of psi or omega with a prior but I do not understand how to do this because I have already a prior on psi and omega (psi as exponential distribution with a fixed inverse scale of 0.5). Your help would be greatly appreciated.

D

Op dinsdag 6 september 2016 13:03:47 UTC+2 schreef Bob Carpenter:

Bob Carpenter

unread,
Sep 7, 2016, 10:21:38 AM9/7/16
to stan-...@googlegroups.com
Simplest thing to do is just fix one of the parameters to
a constant and see if that works. It always helps to develop
code if you start with simple models that work and build up to
more complex models. Then you can see when the problems get
introduced.

Otherwise, a tighter prior than exponential(0.5) would help.

Or reformulating in a way that has a direct representation rather
than something multiplicatively scaled.

- Bob

David West

unread,
Sep 7, 2016, 8:33:39 PM9/7/16
to Stan users mailing list
Hi Bob,

Thanks a lot for your advice and patience! I tried to follow your advice as good as possible:
1. Fixing one of the parameters. The only parameter that was possible to fix was ksi. I changed omega ~ gamma(a, ksi) to omega ~ gamma(a, 0.1). a is a constant, in this example 1. This did improve number of divergent transitions but there were still 171 divergent transition with adapt_delta=0.99. Additionally, number of effect samples sharply decreased to <10 for many variables.

2. Starting with a simple model. I'm always trying to do this but for hierarchical shrinkage priors this is difficult. I performed a test by omitting omega, thus changing beta = beta_raw .* (0.5 * omega * (psi .* phi)) to beta = beta_raw .* (0.5 * (psi .* phi)). This did not improve number of divergent transitions but effective number of samples and estimates looked good. I also removed psi from the model (instead of psi), thus beta = beta_raw .* (0.5 * omega * phi), but this did also not improve number of divergent transitions.

3. Tighten the exponential(0.5) prior. I changed this prior to exponential(10) but this did unfortunately not improve the number of divergent transitions and because of the multiplication in the model omega increased largely. I therefore additionally fixed omega to omega ~ gamma(1, 0.1) but this also did not change number of divergent transitions (but estimates looked good).

4. Reformulate in something not multiplicatively scaled. With the values of a_pi I'm using, the model is in fact equivalent to:
ksi ~ gamma(b, 1);
lambda ~ gamma(a_pi, ksi);
beta ~ double_exponential(0, sqrt_vec(lambda * 0.5)); //sqrt_vec is square root of a vector
This model needs less multiplication but unfortunately did not improve the number of divergent transitions. Additionally, it is hard to make hierarchical shrinkage priors without multiplication of the scale (the horseshoe prior needs also multiplication of the scale for example). The model I would like to implement in Stan can be found in formula 8 of https://arxiv.org/pdf/1609.00046v1.pdf. This paper also shows how to implement a Gibbs sampler but because I understood that Hamiltonian Monte Carlo is more reliable I would really like to implement this in Stan. Do you think this is possible?

Many thanks,
D


Op woensdag 7 september 2016 16:21:38 UTC+2 schreef Bob Carpenter:

David West

unread,
Sep 12, 2016, 5:55:11 AM9/12/16
to Stan users mailing list
Dear Bob,

Congratulations with the 2000 registrations for Stan (http://andrewgelman.com/2016/09/09/stan-users-group-hits-2000-members/)! I think people really like the friendly and helpful answers of you and the other members of the Stan team! I'm really hoping that you or other users can also help me making the next step in solving the issue I formulated in this post, but if not possible or my formulation is unclear or I'm too impatient or something else, please let me know.

D

Op donderdag 8 september 2016 02:33:39 UTC+2 schreef David West:

Bob Carpenter

unread,
Sep 12, 2016, 10:39:40 AM9/12/16
to stan-...@googlegroups.com

> On Sep 7, 2016, at 8:33 PM, David West <davidw...@gmail.com> wrote:
>
> Hi Bob,
>
> Thanks a lot for your advice and patience! I tried to follow your advice as good as possible:
> 1. Fixing one of the parameters. The only parameter that was possible to fix was ksi. I changed omega ~ gamma(a, ksi) to omega ~ gamma(a, 0.1). a is a constant, in this example 1. This did improve number of divergent transitions but there were still 171 divergent transition with adapt_delta=0.99. Additionally, number of effect samples sharply decreased to <10 for many variables.

Can you put a more informative prior on omega than gamma(1, 0.1)?

> 2. Starting with a simple model. I'm always trying to do this but for hierarchical shrinkage priors this is difficult. I performed a test by omitting omega, thus changing beta = beta_raw .* (0.5 * omega * (psi .* phi)) to beta = beta_raw .* (0.5 * (psi .* phi)). This did not improve number of divergent transitions but effective number of samples and estimates looked good. I also removed psi from the model (instead of psi), thus beta = beta_raw .* (0.5 * omega * phi), but this did also not improve number of divergent transitions.
>
> 3. Tighten the exponential(0.5) prior. I changed this prior to exponential(10) but this did unfortunately not improve the number of divergent transitions and because of the multiplication in the model omega increased largely. I therefore additionally fixed omega to omega ~ gamma(1, 0.1) but this also did not change number of divergent transitions (but estimates looked good).
>
> 4. Reformulate in something not multiplicatively scaled. With the values of a_pi I'm using, the model is in fact equivalent to:
> ksi ~ gamma(b, 1);
> lambda ~ gamma(a_pi, ksi);
> beta ~ double_exponential(0, sqrt_vec(lambda * 0.5)); //sqrt_vec is square root of a vector
> This model needs less multiplication but unfortunately did not improve the number of divergent transitions. Additionally, it is hard to make hierarchical shrinkage priors without multiplication of the scale (the horseshoe prior needs also multiplication of the scale for example). The model I would like to implement in Stan can be found in formula 8 of https://arxiv.org/pdf/1609.00046v1.pdf. This paper also shows how to implement a Gibbs sampler but because I understood that Hamiltonian Monte Carlo is more reliable I would really like to implement this in Stan. Do you think this is possible?

No idea and I don't really have time for another research
project right now. Gibbs can work well when there's a lot of
scale variation because it's completely scale invariate when
conjugate.

- Bob

David West

unread,
Sep 12, 2016, 3:50:41 PM9/12/16
to Stan users mailing list
Dear Bob,

Thanks a lot for your help.

D

Op maandag 12 september 2016 16:39:40 UTC+2 schreef Bob Carpenter:
Reply all
Reply to author
Forward
0 new messages