Bayesian Variable Selection models


Haziq Jamil

Oct 13, 2016, 3:49:46 PM
to Stan users mailing list
Hello all,

I am looking to fit the following variable selection for linear regression models (Kuo & Mallick, 1998) in Stan:

y_i ~ N(gamma_1 * beta_1 * X_{i1} + ... + gamma_p * beta_p * X_{ip}, sigma^2)
beta_j ~ N(0, 100)
gamma_j ~ Bern(0.5)
sigma ~ some weakly informative prior

i = 1,...,n
j = 1,...,p

The gamma are indicator variables denoting whether or not each covariate is included in the model. The posterior samples of gamma are of interest, since posterior model probabilities can be computed from them. E.g., gamma_1 = ... = gamma_p = 1 is the "full model" in which all variables are included, and the proportion of posterior samples in which this configuration occurs gives an estimate of the posterior model probability of the full model.
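A small sketch of that computation (my own illustration, with made-up draws): posterior model probabilities estimated as the relative frequency of each gamma configuration among the posterior samples.

```python
# Sketch (hypothetical draws): estimate posterior model probabilities as
# the relative frequency of each gamma configuration in the posterior draws.
from collections import Counter

# Each tuple is one posterior draw of (gamma_1, ..., gamma_p), p = 3 here.
gamma_draws = [
    (1, 1, 1), (1, 0, 1), (1, 1, 1), (1, 1, 1), (0, 0, 1),
    (1, 1, 1), (1, 0, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1),
]

n = len(gamma_draws)
model_probs = {model: count / n for model, count in Counter(gamma_draws).items()}

# The "full model" (1, 1, 1) appears in 7 of these 10 draws:
print(model_probs[(1, 1, 1)])  # 0.7
```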

While this can be implemented in BUGS or JAGS using Gibbs sampling, this of course can't be done in Stan because sampling from discrete distributions is not possible. There are two ways I can think of to make this happen in Stan:

APPROACH 1: Marginalise over gamma

As suggested in the manuals, one could marginalise the model over the gamma variables, and use the generated quantities block to obtain posterior samples for gamma.

ISSUE: Scalability, as p gets large. If I understand correctly, marginalising over all gamma variables requires the calculation of (the log-likelihood value of) all possible 2^p models. I am concerned about efficiency - I don't know how to code this marginalisation without having to call p nested loops.
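For what it's worth, the enumeration itself doesn't need p nested loops; something like itertools.product iterates the 2^p configurations in one flat loop. A Python sketch of the pattern (my own illustration; `log_lik_given_gamma` is a hypothetical stand-in for the model's log-likelihood, and the exponential cost of course remains):

```python
# Sketch (not from the Stan manual): marginalise over all 2^p gamma
# configurations in a single loop via itertools.product, combining the
# per-model log joint densities with log-sum-exp.
import itertools
import math

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def marginal_log_lik(log_lik_given_gamma, p, log_prior=math.log(0.5)):
    # log p(y) = log sum_gamma p(y | gamma) p(gamma), with gamma_j ~ Bern(0.5)
    terms = [log_lik_given_gamma(gamma) + p * log_prior
             for gamma in itertools.product((0, 1), repeat=p)]  # 2^p models
    return log_sum_exp(terms)

# Toy check: if p(y | gamma) = 1 for every model, the marginal is 1.
print(marginal_log_lik(lambda g: 0.0, 5))  # ~ 0.0 (= log 1)
```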

APPROACH 2: Latent continuous variable

Assume for each gamma_j there is an underlying continuous latent variable Z_j which has a standard normal distribution. Define gamma_j = 1 if Z_j >= 0 and gamma_j = 0 otherwise. This should be an equivalent representation of the original model, at least in intention, because P[Z_j >= 0] = P[gamma_j = 1] = 0.5
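A quick simulation check of that claim (my own sketch):

```python
# Sanity check: with Z_j ~ N(0, 1) and gamma_j = 1 if Z_j >= 0 else 0,
# the implied distribution of gamma_j is Bernoulli(0.5).
import random

random.seed(0)
draws = [1 if random.gauss(0, 1) >= 0 else 0 for _ in range(100_000)]
print(sum(draws) / len(draws))  # close to 0.5
```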

ISSUE: I am not sure I have implemented this correctly in Stan. For this simple example (p = 5, variables 1 and 2 not included in true model) it gives good results, in the sense that the posterior model probability of the true model does turn out to be the highest of all. The concern here is that the samples turn out to be severely autocorrelated, and sometimes divergent transitions also occur.

-------------------------------------------------------------------------------------

Would anybody care to give any input as to whether Approach 1 or 2 is better, or neither, and if so, what's a better way of fitting this variable selection model in Stan? Thanks in advance.

Haziq Jamil

Oct 13, 2016, 3:54:29 PM
to Stan users mailing list
The Stan model for Approach 2:

data {
  int<lower=0> n; // number of data
  int<lower=0> p; // number of variables
  vector[n] Y; // centred data Y
  matrix[n, p] X; // standardised covariates X
}
parameters {
  real<lower=0> sigma;
  vector[p] beta;
  vector[p] Z;
}
transformed parameters {
  vector[p] gamma;
  for (j in 1:p) {
    if (Z[j] >= 0) {
      gamma[j] = 1;
    } else {
      gamma[j] = 0;
    }
  }
}
model {
  vector[n] mu;
  mu = X * (beta .* gamma);

  Z ~ normal(0, 1);
  target += normal_lpdf(beta | 0, 4);
  target += normal_lpdf(Y | mu, sigma);
}

As in the original post, here's a minimal working example.

Bob Carpenter

Oct 13, 2016, 4:06:28 PM
to stan-...@googlegroups.com
Right. Your approach 1 is what you need to do, but it's
intractable, requiring 2^N evaluations for N variables.

Your approach 2 won't work because it will break derivatives.
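To illustrate Bob's point (my own sketch with arbitrary toy numbers): the likelihood as a function of the latent Z_j is piecewise constant, so its gradient with respect to Z_j is zero everywhere except at the jump, and HMC gets no information about which side of zero Z_j should sit on.

```python
# Sketch: the hard threshold gamma = 1[Z >= 0] makes the log-likelihood a
# step function of Z, with zero gradient away from the discontinuity.
import math

def normal_lpdf(y, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (y - mu)**2 / (2 * sigma**2)

def loglik_of_z(z, beta=0.8, x=1.5, y=1.2, sigma=1.0):
    gamma = 1.0 if z >= 0 else 0.0        # the hard threshold from Approach 2
    return normal_lpdf(y, gamma * beta * x, sigma)

# Finite-difference derivative is exactly zero away from z = 0:
eps = 1e-6
grad_at_2 = (loglik_of_z(2.0 + eps) - loglik_of_z(2.0 - eps)) / (2 * eps)
print(grad_at_2)                              # 0.0
print(loglik_of_z(0.5) == loglik_of_z(3.0))   # True: flat on each side
```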

You'll also find that while you can write these models with
Gibbs sampling, you can't easily sample over the posterior
because it's highly multimodal. Instead, what will happen is
that your sampler will likely get stuck if there are two
correlated variables, sometimes eliminating one and sometimes
the other.

We don't worry much about these priors because the applications
people want for them are usually not fully Bayesian. That is,
they want to use this to eliminate variables. But if you set up
the proper Bayesian inference, you'll find there's still some
posterior weight on gamma_j = 1, and so you can't drop the variable
itself. At least not for full Bayes.

Aki's worked on some approaches to variable selection in
some more complex models:

https://users.aalto.fi/~ave/

Or, you can just put a reasonable prior on them and let them fall
where they fall. They might shrink close to zero. And if they
do shrink close to zero, you might be able to eliminate them because
they'll have no effect on model predictions.

- Bob

Jonah Gabry

Oct 13, 2016, 7:54:15 PM
to Stan users mailing list
To add on a bit to Bob's point at the end of his response, if predictive performance is a goal then I think a good regularizing prior that shrinks large coefficients will tend to be more beneficial than setting a whole bunch of small coefficients to zero. Of course trimming down the number of variables in a model can also be nice for other reasons.

Jonah 

Haziq Jamil

Oct 14, 2016, 3:36:48 AM
to Stan users mailing list
Thanks both for your responses.

Without getting into too much of a discussion on the merits of these types of models and their applicability in real life situations, I am interested in these particular models as it ties in with some work I've been doing lately. I was hoping to change the MCMC engine from JAGS to Stan, but I guess there doesn't seem to be a way to do this at the moment.

Haziq

Bob Carpenter

Oct 14, 2016, 11:45:20 AM
to stan-...@googlegroups.com
You can only fit a hard spike-and-slab variable selection model
in Stan with small numbers of variables.

It's also very hard to fit in JAGS or other Gibbs samplers.
They will tend to get stuck in modes and not mix well on the
Bernoulli variate indicating if a variable is selected or not.
The true posterior is not going to be 100% on or off (that is,
full Bayes won't do variable selection), and there are likely
to be many modes. You can see that analytically if you write out
the posterior for the value of a coefficient.

There's a vast MCMC literature on the problems posed by
combinatorial models like variable selection, with Ising spin models
making up the bulk of that because they have real physical
analogues.

You will be able to diagnose this using multiple chains if
you give them diffuse initializations. I can't recall how
JAGS initializes, but I think it may be deterministically by
default. Masanao and Yu-Sun and Andrew built dealing with multiple
chains into R2jags, but I can't recall how inits work.

- Bob

Stephen Martin

Oct 14, 2016, 12:12:17 PM
to Stan users mailing list
Just curious, but could you just say all gamma_p ~ beta(.5,.5) or something? It'd estimate one model and, I think, drop out any variables that seem to just contribute noise to the likelihood.

Stephen Martin

Oct 14, 2016, 2:48:35 PM
to Stan users mailing list
library(rstan)  # assuming rstan is loaded for stan()

x1 <- rnorm(100,0,1)
x2 <- rnorm(100,0,1)
x3 <- rnorm(100,0,1)
y <- .8*x1 + rnorm(100,0,.64/.3 - .64)

stanModel <- '
data {
    vector[100] x1;
    vector[100] x2;
    vector[100] x3;
    vector[100] y;
}
parameters {
    real beta0;
    real<lower=0> sigma;
    vector[3] beta;
    vector<lower=0,upper=1>[3] gamma;
}
model {
    beta0 ~ normal(0,1);
    sigma ~ cauchy(0,1);
    beta ~ normal(0,1);
    gamma ~ beta(.5,.5);
    y ~ normal(beta0 + gamma[1]*beta[1]*x1 + gamma[2]*beta[2]*x2 + gamma[3]*beta[3]*x3, sigma);
}
'

stanOut <- stan(model_code = stanModel, data = list(x1=x1,x2=x2,x3=x3,y=y))


This worked for me. It just placed high probability on beta[1] and low probabilities on beta[2] and [3]. Output for gamma[1:3]: .81, .27, .28

Bob Carpenter

Oct 14, 2016, 3:02:35 PM
to stan-...@googlegroups.com
Thanks for the reproducible example. It let me run it to
illustrate that the posteriors are going to be very wide on
gamma:

           mean se_mean   sd    2.5%    25%    50%    75%  97.5% n_eff Rhat
beta0     -0.07    0.00 0.14   -0.35  -0.16  -0.07   0.03   0.22  3115    1
sigma      1.53    0.00 0.11    1.33   1.45   1.52   1.59   1.75  3244    1
beta[1]    1.19    0.01 0.42    0.64   0.91   1.09   1.36   2.34   864    1
beta[2]   -0.26    0.02 0.70   -1.72  -0.60  -0.23   0.05   1.32  1415    1
beta[3]    0.32    0.02 0.69   -1.19   0.00   0.29   0.68   1.80  1586    1
gamma[1]   0.78    0.01 0.19    0.36   0.65   0.83   0.95   1.00  1206    1
gamma[2]   0.30    0.01 0.31    0.00   0.04   0.18   0.51   0.99  1645    1
gamma[3]   0.34    0.01 0.32    0.00   0.06   0.22   0.56   0.99  1652    1
lp__     -98.41    0.07 2.28 -103.81 -99.71 -98.12 -96.78 -94.90  1195    1

The problem you're going to find is that the beta[i] and gamma[i] are
highly correlated because only their product is identified by the likelihood.
So you get a typically banana-shaped posterior in beta[1] and gamma[1] and
a funnel-like posterior in the others.
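A tiny illustration of that identification issue (my own sketch with arbitrary numbers): pairs (beta, gamma) with the same product give exactly the same likelihood contribution, so the likelihood is constant along the ridge beta * gamma = c.

```python
# Sketch: the likelihood only sees the product gamma*beta, so (beta, gamma)
# pairs with the same product are indistinguishable to the data.
import math

def normal_lpdf(y, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (y - mu)**2 / (2 * sigma**2)

x, y, sigma = 1.5, 1.2, 1.0
ll_a = normal_lpdf(y, 0.8 * 1.0 * x, sigma)   # beta = 0.8, gamma = 1.0
ll_b = normal_lpdf(y, 1.6 * 0.5 * x, sigma)   # beta = 1.6, gamma = 0.5
print(ll_a == ll_b)  # True: same product, same likelihood
```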

But in any case, even the posterior means for gamma are such that they'll
have their corresponding betas contributing to the final likelihood.

- Bob

Stephen Martin

Oct 14, 2016, 3:18:09 PM
to Stan users mailing list
Right, it's certainly not perfect, and the betas will be overestimated (you'd have to multiply the posterior estimates by the gamma values, I think), but intuitively it seems like crappy predictors will get lower gamma values and better predictors higher ones. One could probably play with the gamma priors to strongly encourage near-0 or near-1 values (e.g., using a very wide logistic prior and transforming it to a probability; that samples better than extreme inverted beta distributions like beta(.001,.001)). Once you do that, you'd get gamma values closer to .98 and .02, essentially dropping the predictor from the model, I think.

Using beta(.001,.001):
(columns: mean, se_mean, sd, four posterior quantiles, then 80%, n_eff, Rhat)
beta0      0.05042475 0.003452536 0.14751143 -2.457484e-01  5.506800e-02  9.057310e-02  1.302242e-01   0.17417156 1825.4708 0.9999324
sigma      1.52741694 0.002459355 0.11032715  1.332410e+00  1.521540e+00  1.547494e+00  1.580244e+00   1.61793032 2012.4374 1.0000588
beta[1]    0.99345021 0.009443226 0.19914098  6.661747e-01  9.801828e-01  1.028649e+00  1.073121e+00   1.13122852  444.7137 1.0117517
beta[2]   -0.03553213 0.029565265 0.96682366 -1.958262e+00 -4.824715e-02  1.187498e-01  3.979566e-01   0.74802429 1069.3773 1.0052308
beta[3]   -0.07575499 0.024849034 0.92273891 -1.977893e+00 -8.798679e-02  3.344441e-02  2.332172e-01   0.61431356 1378.9186 1.0024048
gamma[1]   0.97946337 0.004261209 0.07765288  7.054986e-01  1.000000e+00  1.000000e+00  1.000000e+00   1.00000000  332.0852 1.0151699
gamma[2]   0.12710659 0.013083718 0.31782193  2.220446e-16  7.117862e-08  3.653075e-06  2.036401e-04   0.01335559  590.0725 1.0028787
gamma[3]   0.18242550 0.017428924 0.37162498  4.440892e-16  4.275198e-07  3.644472e-05  2.087207e-03   0.15881213  454.6411 1.0097139
lp__     -94.57190124 0.042277194 1.63023954 -9.854124e+01 -9.423520e+01 -9.389229e+01 -9.354567e+01 -93.19930131 1486.9300 1.0027967

To me, that would give fairly strong evidence for b1 and fairly strong evidence against b2 and b3.

Bob Carpenter

Oct 14, 2016, 3:53:50 PM
to stan-...@googlegroups.com

> On Oct 14, 2016, at 3:18 PM, Stephen Martin <hwki...@gmail.com> wrote:
>
> Right, it's certainly not perfect, and the betas will be incorrectly estimated (overestimated --- you'd have to multiply the posterior estimates by the gamma value I think),

That's right. And you get the implied prior on that product from the
independent priors on beta and gamma.



> but intuitively it seems like crappy predictors will have lower gamma values, and better predictors higher values.



> Could probably play with the gamma priors to highly, highly encourage the model to force a near-0 or near-1 probability value (e.g., using very wide logistic as a prior and converting the logistic to probability; that samples better than using high-valued, inverted beta distributions like beta(.001,.001)). Once you do that, you'd have gamma values closer to .98 and .02, essentially dropping the predictor from the model, I think.

As soon as you start making the priors very sharp to push the gamma
toward 0 and 1, you'll get problems with sampling due to energy wells
(local density maxima) that are too deep to climb out of with HMC (or
Gibbs).

This is good for thinking about variable selection. Which variable
do you select if two variables x[i] and x[j] are highly correlated?
It depends on amount of regularization and how much better you'll do
at predicting y. Consider the limit where x[i] and x[j] are identical.
You only need one predictor. But the posterior of which one you get isn't
identified. When there are multiple things like this going on, the combinatorics
add up to make it pretty much impossible to solve this problem tractably
in the general case.
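The limiting case Bob mentions can be seen in a few lines (my own sketch with made-up data): with two identical predictors, only the sum of their coefficients is identified, and any split of that sum fits the data equally well.

```python
# Sketch: duplicate predictor => only b1 + b2 is identified by the data;
# every split of the sum gives the same residual sum of squares.
x = [0.5, -1.0, 2.0, 0.3]
y = [1.0, -2.0, 4.0, 0.6]   # y = 2*x exactly, for illustration

def rss(b1, b2):
    # residual sum of squares with two copies of the same predictor
    return sum((yi - (b1 + b2) * xi) ** 2 for xi, yi in zip(x, y))

print(rss(2.0, 0.0), rss(0.0, 2.0), rss(1.5, 0.5))  # all 0.0
```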

With the following, you can see that the posterior density is very skewed, with half the draws for gamma[2] and gamma[3] being very close to 0 and others being much higher, leading to the final posterior means of 0.13 and 0.18.

- Bob

Haziq Jamil

Oct 14, 2016, 4:45:48 PM
to Stan users mailing list


On Friday, 14 October 2016 16:45:20 UTC+1, Bob Carpenter wrote:
> You can only fit a hard spike-and-slab variable selection model
> in Stan with small numbers of variables.
>
> It's also very hard to fit in JAGS or other Gibbs samplers.
> They will tend to get stuck in modes and not mix well on the
> Bernoulli variate indicating if a variable is selected or not.
> The true posterior is not going to be 100% on or off (that is,
> full Bayes won't do variable selection), and there are likely
> to be many modes. You can see that analytically if you write out
> the posterior for the value of a coefficient.


In a Gibbs sampling procedure for the original model above, the conditional density for the betas is Gaussian, and Bernoulli for the gammas. I'm not sure what you mean by multimodal. Also, when you say the Bernoulli variates get stuck and don't mix well, e.g. gamma_j keeps being sampled as 1, 1, 1, 1, ... with some occasional zeroes, then surely the model is telling us that this variable is important and shouldn't be thrown out? If you mean the betas don't mix well, then sure, that's a problem. But the Gibbs conditionals seem to be nice distributions, so I'm failing to see how this can be an issue.

The posterior means of the gamma_j samples are interpreted as the posterior inclusion probabilities for each variable X_j (Ntzoufras, 2011). So yes, agreed that full Bayes doesn't turn variables 100% on or off, but inference on variable or model selection is done based on probabilities, and one could choose the model with the highest posterior probability or the median probability model (Barbieri and Berger, 2004).
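In code, that inference looks something like this (my own sketch with hypothetical gamma draws; the threshold rule is the Barbieri-Berger median probability model):

```python
# Sketch (hypothetical draws): posterior inclusion probabilities are the
# column means of the gamma draws; the median probability model keeps
# variables with inclusion probability >= 0.5.
gamma_draws = [
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
    [1, 0, 1],
]
n = len(gamma_draws)
p = len(gamma_draws[0])

incl_probs = [sum(draw[j] for draw in gamma_draws) / n for j in range(p)]
median_model = [int(q >= 0.5) for q in incl_probs]

print(incl_probs)    # [1.0, 0.25, 0.75]
print(median_model)  # [1, 0, 1]
```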

Haziq Jamil

Oct 14, 2016, 5:02:26 PM
to Stan users mailing list
Stephen, thanks for the example. It is an interesting alternative to the original model. Initially I did try making the gammas continuous using Uniform(0,1) priors, but that gave terrible results. Your suggestion of a horseshoe-like prior is closer to the original model, since the Bernoulli(0.5) on the gammas is supposed to give this a priori indifference to being selected in the model or not.

What I find attractive about the original Kuo & Mallick model is the ability to do subset selection, something which is not possible with continuous gammas, I don't think. Sure, we can get estimates for the coefficients as beta * gamma, but if that's the interest then probably a reasonable prior on beta would do the trick (as Bob pointed out in the first reply to my original post). Sometimes there is a desire to know the most "reasonable" parsimonious and interpretable model - a goal which is achieved through model selection.

Bob Carpenter

Oct 21, 2016, 3:01:44 PM
to stan-...@googlegroups.com

> On Oct 14, 2016, at 4:45 PM, Haziq Jamil <haziq...@gmail.com> wrote:
>
>
>
> On Friday, 14 October 2016 16:45:20 UTC+1, Bob Carpenter wrote:
> You can only fit a hard spike-and-slab variable selection model
> in Stan with small numbers of variables.
>
> It's also very hard to fit in JAGS or other Gibbs samplers.
> They will tend to get stuck in modes and not mix well on the
> Bernoulli variate indicating if a variable is selected or not.
> The true posterior is not going to be 100% on or off (that is,
> full Bayes won't do variable selection), and there are likely
> to be many modes. You can see that analytically if you write out
> the posterior for the value of a coefficient.
>
>
> In a Gibbs sampling procedure of the original model above, the conditional density for the betas is Gaussian, and Bernoulli for the gammas. I'm not sure what you mean by multimodal?

The usual---the posterior will have local optima, based
on which of the variables is selected.

> Also, when you say the Bernoulli variates get stuck and not mix well, e.g. gamma_j keeps on being sampled as 1, 1, 1, 1, ... with some occasional zeroes, then surely the model is telling us that this variable is important, don't throw it out?

What happens is that it can do that because it gets stuck, not
because Pr[z[n] = 1 | data] = 1 in the posterior.

> If you mean the betas don't mix well, then sure it's a problem. But the Gibbs conditionals seem to be nice distributions, I'm failing to see how it can be an issue.

:-) Gibbs is scale invariant but not rotation invariant.
When you introduce correlations among your parameters (as
in the strong correlation between the variables being
selected), then it can fail. There really is a huge literature
on this as I said before. It's mainly about the kinds of
phase changes that happen among the modes.
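A toy demonstration of this failure mode (my own sketch, not from the thread): a coordinate-wise Gibbs sampler on a bivariate normal mixes dramatically worse as the correlation grows, because its updates can only move along the axes.

```python
# Sketch: Gibbs on a bivariate standard normal with correlation rho.
# Conditionals: x | y ~ N(rho*y, 1 - rho^2), and symmetrically for y.
# Lag-1 autocorrelation of the x-chain grows sharply with rho.
import random

def gibbs_lag1_autocorr(rho, iters=20000, seed=1):
    rng = random.Random(seed)
    x = y = 0.0
    xs = []
    cond_sd = (1 - rho**2) ** 0.5
    for _ in range(iters):
        x = rng.gauss(rho * y, cond_sd)
        y = rng.gauss(rho * x, cond_sd)
        xs.append(x)
    m = sum(xs) / len(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    return num / den

print(gibbs_lag1_autocorr(0.0) < 0.1 < gibbs_lag1_autocorr(0.99))  # True
```

With rho = 0 the draws are essentially independent; with rho = 0.99 the lag-1 autocorrelation is near rho^2, i.e. the chain barely moves per sweep.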

> The posterior mean of the gamma_j samples are interpreted as the posterior inclusion probabilities for each variable X_j (Ntzoufras, 2011). So yes agreed that Bayes doesn't turn variables 100% on or off, but inference on variable or model selection is done based on probabilities, and one could choose the model with the highest posterior probability or median probability models (Barbieri and Berger, 2004).

Assuming you can put a prior on models that makes sense.
Usually it doesn't, and all this model comparison and model
weighting stuff is pretty useless.

And your choices of which variables are on or off are not
independent. So you can't just look at the marginals. You
get another combinatorial problem there.

> There's a vast MCMC literature on the problems posed by
> combinatorial models like variable selection, with Ising spin models
> making up the bulk of that because they have real physical
> analogues.

I would like to emphasize this point. Seriously, check out
the Ising models as they have the exact same problems you're
going to run into.

> You will be able to diagnose this using multiple chains if
> you give them diffuse initializations. I can't recall how
> JAGS initializes, but I think it may be deterministically by
> default. Masanao and Yu-Sun and Andrew built dealing with multiple
> chains into R2jags, but I can't recall how inits work.

Ditto.

- Bob

Bob Carpenter

Oct 21, 2016, 3:06:02 PM
to stan-...@googlegroups.com

> On Oct 14, 2016, at 5:02 PM, Haziq Jamil <haziq...@gmail.com> wrote:
>
> Stephen, thanks for the example. It is an interesting alternative to the original model. Initially I did try making the gammas continuous using Uniform(0,1) priors but that gave terrible results. Your suggestion of a horseshoe-like prior is closer to the original model, since the Bernoulli(0.5) on the gammas are supposed to give this apriori indifference on being selected in the model or not.
>
> What I find attractive about the original Kuo & Mallick model is the ability to do subset selection, something which is not possible with continuous gammas, I don't think. Sure we can get estimates for the coefficients as beta * gamma, but if that's the interest then probably a reasonable prior on beta would do the trick (as Bob pointed in the first reply to my original post). Sometimes, there is a desire to know what's the most "reasonable" parsimonious and interpretable model - a goal which is achieved through model selection.

Only if you have proper priors on the models, proper priors in
the models, and you've accounted for all possible models.


There's a long discussion on this in Gelman et al.'s Bayesian
Data Analysis book in the chapter on model comparison. And lots
of other literature.

You can also check out their most recent paper, which just
came out today:

http://andrewgelman.com/2016/10/21/practical-loo/

There the focus is on predictive accuracy more than selection based
on model complexity.

- Bob
