Before discussing why what Marcel Jonker wants to do is
equivalent to spike-and-slab priors and why spike-and-slab
priors are inefficient to implement in Stan, let me point
everyone to what Andrew has to say on spike-and-slab priors
for Bayesian inference:
http://andrewgelman.com/2012/09/prior-distributions-for-regression-coefficients/
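For background on why spike-and-slab is awkward in Stan: the prior mixes a point mass (or very narrow "spike") at zero with a diffuse "slab", governed by a discrete inclusion indicator per coefficient. Stan's HMC sampler has no discrete parameters, so a true point-mass spike cannot be expressed at all; the usual workaround is a continuous relaxation (a narrow normal spike) with the indicator marginalized out, which leaves a strongly multimodal posterior that HMC explores poorly. A minimal numerical sketch of that marginalization (in Python rather than Stan; all names and hyperparameter values are illustrative):

```python
import math

def spike_slab_logpdf(beta, slab_prob=0.5, spike_sd=0.01, slab_sd=2.0):
    """Log density of a coefficient under a continuous spike-and-slab prior.

    The discrete inclusion indicator is marginalized out:
      p(beta) = (1 - slab_prob) * Normal(beta | 0, spike_sd)
              + slab_prob       * Normal(beta | 0, slab_sd)
    This marginalized form is what Stan requires, since HMC cannot sample
    discrete parameters.
    """
    def normal_logpdf(x, sd):
        return -0.5 * math.log(2 * math.pi * sd ** 2) - 0.5 * (x / sd) ** 2

    # log-sum-exp over the two mixture components for numerical stability
    a = math.log(1 - slab_prob) + normal_logpdf(beta, spike_sd)
    b = math.log(slab_prob) + normal_logpdf(beta, slab_sd)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))
```

The log-sum-exp keeps the mixture stable when one component's density underflows; the multimodality that makes sampling hard is visible in how sharply the density drops as beta moves off zero.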
--
I'm happy with all these options (except the spike/slab).
If users want coefs that are exactly zero, I think we should think of that as an add-on. For example, I could imagine having Stan take parameters whose posteriors are tight near zero, then setting them to be exactly 0, and then going on from there to run with constraints. That could be worth doing, but I think it should be an explicit part of the processing, not something that gets done by the prior.
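To make the post-hoc step concrete, here is a toy sketch (my own illustration, not an actual Stan workflow; all names and numbers are invented): scan posterior summaries from a first fit, pin coefficients whose posteriors are tight near zero to exactly 0, then refit with those constraints.

```python
# Stand-ins for a first Stan fit's posterior summaries; the coefficient
# names and values are made up for illustration.
posterior_summary = {
    "beta_stress": {"mean": 0.42, "sd": 0.05},
    "beta_income": {"mean": 0.004, "sd": 0.003},
    "beta_age":    {"mean": -0.31, "sd": 0.04},
}

def tight_near_zero(stats, tol=0.01):
    # "tight near zero": both the center and the spread are small
    return abs(stats["mean"]) < tol and stats["sd"] < tol

fixed_to_zero = [name for name, s in posterior_summary.items()
                 if tight_near_zero(s)]
print(fixed_to_zero)  # ['beta_income']
# A second fit would then constrain these coefficients to exactly 0
# and re-estimate the remaining parameters.
```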
Well, I would cautiously suggest there is pragmatic value (if not theoretical) even in cases where there are no computational constraints.
A relatively commonplace example from health psychology (my field) would be examining the effects of stress on health.
Suppose there are:
- 5 continuous stress measures (e.g., domains of family, friends, work, financial, health).
- 10 other sociodemographic factors
Based on previous research it is plausible that:
1) The effects of stress are non-linear
2) Stress measures may interact with each other and
3) Stress measures may interact with the sociodemographic factors
If you only check up to quadratic effects, that still yields: 5 + 10 + 50 = 65 interaction terms. Each focal stress measure is involved in 15 interactions. This yields the undesirable result that the threshold for “negligible” must, in a way, depend on the number of interactions: even if each individual interaction is for all intents and purposes ignorable, across the 15 the combined effects may be large enough that reporting the conditional effect of the stress measure is not the best estimate of its “overall” effect. I suppose the correct approach in this case would be to leave all the coefficients as small and non-zero, and marginalize over them to report the overall effect of a stress measure, provided they are all sufficiently small (if not, the marginal has perhaps little value). Although conceptually trivial, it is a hassle to actually implement for multiple variables.
Josh
> On Jul 17, 2016, at 10:05 PM, Joshua Wiley <jwiley...@gmail.com> wrote:
>
> Well, I would cautiously suggest there is pragmatic value (if not theoretical) even in cases where there are no computational constraints.
>
> A relatively commonplace example from health psychology (my field) would be examining the effects of stress on health.
>
> Suppose there are:
>
> - 5 continuous stress measures (e.g., domains of family, friends, work, financial, health).
>
> - 10 other sociodemographic factors
>
> Based on previous research it is plausible that:
>
> 1) The effects of stress are non-linear
>
> 2) Stress measures may interact with each other and
>
> 3) Stress measures may interact with the sociodemographic factors
>
> If you only check up to quadratic effects, that still yields: 5 + 10 + 50 = 65 interaction terms.
I don't follow the arithmetic here.
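For what it's worth, one breakdown that reproduces the figures (an editorial guess; the original message doesn't spell it out) is 5 quadratic terms, C(5,2) = 10 stress-by-stress interactions, and 5 × 10 = 50 stress-by-demographic interactions:

```python
import math

n_stress, n_demo = 5, 10

quadratic = n_stress                      # each stress measure squared: 5
stress_x_stress = math.comb(n_stress, 2)  # pairwise stress interactions: 10
stress_x_demo = n_stress * n_demo         # stress-by-demographic: 50
total = quadratic + stress_x_stress + stress_x_demo

# Terms involving any one focal stress measure:
# its own quadratic + the 4 other stress measures + the 10 demographics
per_focal = 1 + (n_stress - 1) + n_demo

print(total, per_focal)  # 65 15
```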
> Each focal stress measure is involved in 15 interactions. This yields the undesirable result that the threshold for “negligible” must, in a way, depend on the number of interactions: even if each individual interaction is for all intents and purposes ignorable, across the 15 the combined effects may be large enough that reporting the conditional effect of the stress measure is not the best estimate of its “overall” effect. I suppose the correct approach in this case would be to leave all the coefficients as small and non-zero, and marginalize over them to report the overall effect of a stress measure, provided they are all sufficiently small (if not, the marginal has perhaps little value).
That is the standard Bayesian approach to posterior predictive
inference. The point is that you average predictions over posterior
uncertainty rather than making point estimates of parameters and then
using the point estimates for prediction. The latter's the norm in most
machine learning applications with which I'm familiar.
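A toy illustration of the difference (the posterior draws below are purely illustrative, not real Stan output): with a prediction that is nonlinear in the parameter, averaging predictions over posterior draws and predicting from a plugged-in point estimate give different answers.

```python
import math
import random

random.seed(0)

# Stand-in posterior draws for a slope; in practice these would come from
# an actual Stan fit. The N(0.5, 0.2) shape here is an assumption.
draws = [random.gauss(0.5, 0.2) for _ in range(4000)]

def predict(beta, x):
    # deliberately nonlinear in the parameter, so that plug-in and
    # posterior-averaged predictions can differ
    return math.exp(beta * x)

x_new = 2.0

# Full Bayes: average the prediction over the posterior draws
pp_mean = sum(predict(b, x_new) for b in draws) / len(draws)

# Plug-in: collapse to a point estimate first, then predict once
beta_hat = sum(draws) / len(draws)
plug_in = predict(beta_hat, x_new)

# Because exp is convex, the posterior-averaged prediction exceeds the
# plug-in prediction (Jensen's inequality): the two are not the same.
```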
It's hard to talk about what is correct without having
an application in mind. For instance, do you want to make
predictions, estimate "significance", or present as simple
a model as possible?
> Although conceptually trivial, it is a hassle to actually implement for multiple variables.
Why? I'd think with just 65 it'd be trivial. Doing anything
other than full Bayes isn't supported by Stan, so doing
some kind of variable selection winds up being more work,
though it leads to simpler models in terms of number of parameters.
- Bob