variable selection / spike-and-slab priors for regression coeffs


Bob Carpenter

Jan 11, 2013, 4:14:43 PM
to stan-...@googlegroups.com
Before discussing why what Marcel Jonker wants to do is
equivalent to spike-and-slab priors and why spike-and-slab
priors are inefficient to implement in Stan, let me point
everyone to what Andrew has to say on spike-and-slab priors
for Bayesian inference:

http://andrewgelman.com/2012/09/prior-distributions-for-regression-coefficients/

Below is a direct reply with an explanation of what's
needed in Stan.

On 1/6/13 7:12 AM, M.F. Jonker wrote:
...
> I posted a simpler form of this model on the STAN mailing list, but I
> wasn't sure whether the derivations would significantly change for the more complicated models.
> The confusing stuff, however, is mostly in the data transformations.

> If you just disregard the data transformations and assume only a single set of
> political dummies subject to Bayesian variable selection (as you suggested), would that
> be a solution? See the attached files.

Thanks for the simpler model. I'll try to generalize
a bit. The basic idea is to do variable selection for
a regression model by including boolean indicator variables.
Marcel sketched out something that looked like this:

parameters {
  vector[K] beta_raw;          // coefficient values
  int<lower=0,upper=1> ind[K]; // inclusion indicators
  ...
}
transformed parameters {
  vector[K] beta;              // coefficient value or 0
  for (k in 1:K)
    beta[k] <- ind[k] * beta_raw[k];
  ...
}
model {
  for (k in 1:K) ind[k] ~ ...      // indicator priors
  for (k in 1:K) beta_raw[k] ~ ... // coefficient priors
  for (n in 1:N) y[n] ~ ... x[n] beta ...
}

This codes up what is effectively a spike-and-slab prior.

The current version of Stan (1.1.0) does not support
discrete parameters, so this model won't compile. But even
if it did, this kind of model would be problematic for the
way Stan evaluates models. The problem is that if ind[k] = 0,
then beta_raw[k] shows up only in its prior. So on any iteration
where ind[k] = 0, beta_raw[k] wanders off according to its prior,
whereas on iterations where ind[k] = 1, it shows up in the
regression term through beta. This would almost certainly lead
to pretty serious convergence issues.

It's certainly possible to sum out the discrete parameter array ind
-- just sum the likelihood terms over the possible values of ind,
weighting each by its probability in the current model. But there
are 2^K possible values for ind here, so this won't be efficient
except for a very small number of variables K. It'd look like this:

{
  vector[K] ind;  // local indicator vector
  real acc;       // running log-sum over indicator configurations
  acc <- negative_infinity();
  for (i1 in 0:1) {     // loop over values for indicator 1
    ind[1] <- i1;
    for (i2 in 0:1) {
      ind[2] <- i2;
      ...
      for (iK in 0:1) {
        ind[K] <- iK;
        // log Pr[ind|...] plus the log likelihood over all N data points
        acc <- log_sum_exp(acc, log Pr[ind|...] + LL(beta .* ind));
      }
      ...
    }
  }
  lp__ <- lp__ + acc;
}

log Pr[ind|...] is the log probability of the indicator values and
LL(beta .* ind) is shorthand for the log likelihood evaluated with
coefficients equal to the elementwise product of the raw coefficients
beta and the selection indicator vector ind. log_sum_exp(a,b)
= log(exp(a) + exp(b)), computed in a way that stabilizes the
arithmetic.

Note that all the data needs to be visited for each possible value
of ind, so the likelihood takes on the order of O(2^K * N) steps.
So for large K, it's going to be very slow.
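To make that concrete, here's a toy version of the same marginalization in Python rather than Stan (the function names and the tiny data set are mine, purely for illustration): enumerate all 2^K indicator configurations, add each configuration's log prior to the log likelihood of all N points, and combine with log-sum-exp.

```python
import itertools
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(v) for v in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(v - m) for v in xs))

def normal_lpdf(y, mu, sigma=1.0):
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (y - mu) ** 2 / (2.0 * sigma ** 2))

def log_marginal(y, x, beta, p_inc):
    """log of the sum over all 2^K indicator vectors ind of
    Pr[ind] * prod_n Normal(y[n] | x[n] . (beta .* ind), 1),
    i.e., the O(2^K * N) marginalization described above."""
    terms = []
    for ind in itertools.product([0, 1], repeat=len(beta)):
        # independent Bernoulli(p_inc[k]) prior on each indicator
        lp = sum(math.log(p if b else 1.0 - p) for b, p in zip(ind, p_inc))
        for yn, xn in zip(y, x):
            mu = sum(xk * bk * ik for xk, bk, ik in zip(xn, beta, ind))
            lp += normal_lpdf(yn, mu)
        terms.append(lp)
    return log_sum_exp(terms)

# toy data: K = 2 predictors, N = 3 observations (made up for illustration)
y = [1.0, 0.5, -0.2]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(log_marginal(y, x, beta=[0.8, -0.3], p_inc=[0.5, 0.5]))
```

Each of the 2^K terms visits all N observations, which is exactly where the O(2^K * N) cost comes from.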

Unlike the direct selection-based implementation Marcel sketched,
this approach never disconnects the raw coefficients from the
likelihood. So it should actually work to sample, even if it's slow.

- Bob

Ben Goodrich

Jan 11, 2013, 4:45:23 PM
to stan-...@googlegroups.com
On Friday, January 11, 2013 4:14:43 PM UTC-5, Bob Carpenter wrote:
Before discussing why what Marcel Jonker wants to do is
equivalent to spike-and-slab priors and why spike-and-slab
priors are inefficient to implement in Stan, let me point
everyone to what Andrew has to say on spike-and-slab priors
for Bayesian inference:

http://andrewgelman.com/2012/09/prior-distributions-for-regression-coefficients/
 
What do we think of double exponential priors on coefficients in Stan? You ought to be able to get good geometry by

-- specifying standard normal priors on the relevant parameters
-- transforming those with the normal CDF (approximation)
-- transforming the result to a double exponential via its quantile function

Of course, that never yields exactly zero coefficient values, but the prior density is fairly spiky at zero (the mean, median, and mode are all zero).
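A quick sanity check of that chain of transformations, in Python rather than Stan (the function names are mine; in Stan the CDF step would presumably use the Phi_approx approximation Ben alludes to):

```python
import math
import random

def normal_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def laplace_quantile(u, b=1.0):
    # inverse CDF of the Laplace (double exponential), location 0, scale b
    return -b * math.copysign(1.0, u - 0.5) * math.log(1.0 - 2.0 * abs(u - 0.5))

def normal_to_laplace(z, b=1.0):
    # the proposed chain: standard normal draw -> its CDF -> Laplace quantile
    return laplace_quantile(normal_cdf(z), b)

random.seed(1)
draws = [normal_to_laplace(random.gauss(0.0, 1.0)) for _ in range(20000)]
# for Laplace(0, 1), E|X| equals the scale, 1; the sample mean should be close
print(sum(abs(d) for d in draws) / len(draws))
```

With 20,000 seeded standard-normal draws the mean absolute value comes out near 1, as expected for a unit-scale double exponential.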

Ben

Matt Hoffman

Jan 11, 2013, 5:18:29 PM
to stan-...@googlegroups.com
That's a nice way of getting rid of the non-differentiability
of the Laplace distribution at zero.

The unit-variance Laplace/double-exponential assigns almost exactly
the same amount of mass to the region around 0 as a standard normal
does, though. It's heavier-tailed than a normal, but it won't give you
super-small coefficients.
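This point is easy to check numerically (Python; assuming unit variance for both distributions, so the Laplace scale is 1/sqrt(2)). The double exponential puts somewhat more mass near zero than the normal, but the same order of magnitude, nothing like a spike that puts strictly positive probability at zero itself:

```python
import math

def normal_mass_near_zero(eps):
    # P(|X| < eps) for X ~ Normal(0, 1)
    return math.erf(eps / math.sqrt(2.0))

def laplace_mass_near_zero(eps, b=1.0 / math.sqrt(2.0)):
    # P(|X| < eps) for X ~ Laplace(0, b); b = 1/sqrt(2) gives unit variance
    return 1.0 - math.exp(-eps / b)

for eps in (0.1, 0.5, 1.0):
    print(eps, normal_mass_near_zero(eps), laplace_mass_near_zero(eps))
```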

Matt

Bob Carpenter

Jan 11, 2013, 5:53:11 PM
to stan-...@googlegroups.com
I am curious what Andrew thinks. The double-exponential
(Laplace) prior (aka L1 regularization or lasso) is all the
rage in the machine learning literature for variable selection
(sparsity induction) purposes. I know that he and Aleks
Jakulin and crew settled on the Cauchy to use as a prior
for regression coefficients in the flat (non-hierarchical) case.

Gelman, Jakulin, Pittau, and Su. 2008. A weakly informative
default prior distribution for logistic and other regression
models. Annals of Applied Statistics.

http://www.stat.columbia.edu/~gelman/research/published/priors11.pdf

Just to point out to the poor users confused by
all this inverse-CDF sample transformation business,
you can still use a double-exponential prior just
by specifying it in Stan. It just won't be as
efficient.

- Bob

Andrew Gelman

Jan 11, 2013, 5:54:33 PM
to stan-...@googlegroups.com
I'm happy with all these options (except the spike/slab). If users want coefs that are exactly zero, I think we should think of that as an add-on. For example, I could imagine having Stan take parameters whose posteriors are tight near zero, set them to be exactly 0, and then go on from there to run with constraints. That could be worth doing, but I think it should be an explicit part of the processing, not something that gets done by the prior.
But, yes, I'm happy with priors such as double-exponential or t.
A

terrance savitsky

Jan 11, 2013, 7:57:23 PM
to stan-...@googlegroups.com
I don't know about the sampling geometry of Stan, but maybe the best choice (in terms of selection performance) among scale-mixture approaches, where the resulting fat-tailed prior on the regression coefficients is absolutely continuous with respect to Lebesgue measure, is the horseshoe (Carvalho, Polson, Scott 2010). Sometimes a researcher arrives at variable selection in the case of p >> n under some prior expectation of sparsity, and those cases may not be handled well by the scale mixtures, because they place 0 measure on the exclusion of a predictor.

The author of the original note mentioned something about high correlations among the predictors. I'm thinking about the same problem and wonder if I might employ a factor-analytic model on the predictors and then regress the response on the resulting smaller number of factors using the horseshoe prior. My purpose is not prediction but inference on the association of predictors to the response, so the factor definitions are meaningful, rather than just a computational device. This means that I have to perform a 2-step or separate estimation, because I'd otherwise be making inference on the unidentified factor labels in a joint estimation of factors and coefficients in a single step. I suppose one could use a DP (as a stick-breaking mixture in Stan) in lieu of a factor-analytic (or ICA) approach.

I guess my point is that some of us using Stan have p >> n problems on which we would like to employ a variable selection approach, but we will first have to do some pre-processing to summarize the information in the predictors.
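For readers who haven't seen it, the horseshoe mentioned above is the scale mixture beta ~ normal(0, tau * lambda) with lambda ~ half-Cauchy(0, 1). A quick prior simulation in Python (my own sketch, not code from the paper) shows the behavior being described: lots of mass very near zero plus very fat tails, while staying absolutely continuous:

```python
import math
import random

def horseshoe_prior_draw(tau=1.0):
    """One prior draw: lambda ~ half-Cauchy(0, 1) via the inverse CDF,
    then beta ~ Normal(0, tau * lambda)."""
    lam = abs(math.tan(math.pi * (random.random() - 0.5)))
    return random.gauss(0.0, tau * lam)

random.seed(7)
draws = [horseshoe_prior_draw() for _ in range(50000)]
near_zero = sum(abs(d) < 0.1 for d in draws) / len(draws)
far_out = sum(abs(d) > 10.0 for d in draws) / len(draws)
# a standard normal would give about 0.08 and essentially 0.0, respectively
print(near_zero, far_out)
```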


--
Thank you, Terrance Savitsky

Andrew Gelman

Jan 11, 2013, 8:03:58 PM
to stan-...@googlegroups.com
A multi-stage approach (screening, then analysis) might be necessary for reasons of speed.  I think the way to understand this better is to go with a specific example of interest.
A


Marcel Jonker

Jan 14, 2013, 9:07:40 AM
to stan-...@googlegroups.com, gel...@stat.columbia.edu
Hi Bob (and others),

Thank you very much for this topic! Based on the initial post, summing out the indicator variables seems infeasible for anything except a small number of indicator variables. Accordingly, in my particular case, the options comprise looking into some of the mentioned priors, forgetting about variable selection altogether, or letting BUGS run for a substantial amount of time. At the moment, the latter seems very attractive: in contrast to Bob's sketch of the problem, the BUGS models include informative pseudo-priors to ensure that the betas do not wander off too much when the indicator variables are zero. Furthermore, the models compile, all chains converge from dispersed starting points (albeit very slowly), and most importantly, I get sensible results in terms of time spent in different models and in terms of parameter values.

I'm aware that the pseudo-priors require some tuning (which I did based on prior runs of several sub-models without variable selection). I'd also imagine that this approach wouldn't work well for strongly correlated variables, whose values change significantly when others are included in and excluded from the model specification. But in this particular case, the BUGS models seem to work well. So I guess my question is whether it's all about efficiency (which is why I turned to Stan in the first place, but Stan cannot estimate these models yet) or whether the approach should preferably not be used for publications at all?

All the best,
Marcel

Bob Carpenter

Jan 14, 2013, 2:21:33 PM
to stan-...@googlegroups.com

On 1/14/13 9:07 AM, Marcel Jonker wrote:
...
Thanks, Marcel, for posting back to the list. Here's my response:

Go ahead and post this answer back to the mailing list.
We want to keep the discussions as open as possible.
Even we don't believe Stan's the best solution for
every problem!

What I'd suggest is seeing if the predictive inferences or
fits you care about are affected by the difference between,
say, a Cauchy prior, a Laplace/double-exponential prior, and
a spike-and-slab prior. If there's no difference in inferences
you care about, then I'd suggest going with the simplest form
of the model. You can always talk about effects being fit
near zero and thus not affecting inference (Andrew's preferred
approach) instead of their being fully eliminated (the spike-and-slab
or selection approach).

If there is a difference, and you can motivate the spike-and-slab
inferences, there's your answer. And the paper will be stronger
for the comparison. (Sorry that outside suggestions always involve
more work rather than less.)

As an aside (it won't help solve your selection problem), you
can encode what you call "pseudopriors" in Stan. You just
need to make sure they always contribute to the model by always
including their sampling statements, rather than only including
them in one branch of execution.

- Bob

Joshua Wiley

Jul 17, 2016, 1:35:52 AM
to Stan users mailing list, gel...@stat.columbia.edu

On Saturday, January 12, 2013 at 9:54:33 AM UTC+11, Andrew Gelman wrote:
I'm happy with all these options (except the spike/slab).  
 
If users want coefs that are exactly zero, I think we should think of that as an add-on.  For example, I could imagine having Stan take parameters whose posteriors are tight near zero, and then setting then to be exactly 0, and then going on from there to run with constraints.  That could be worth doing, but I think it should be an explicit part of the processing, not something that gets done by the prior.

Just wondering if there was ever any follow-up on this?  I tried to look through the manual (2.10.0) but did not see anything.  Particularly in the case of models where the predictors are of substantive interest, it would be very convenient to be able to shrink very small coefficients, say for interactions, to zero, so that the effects of the individual variables making up the interaction have a simpler interpretation.  I have been playing with the horseshoe prior, but it is inconvenient to have even very small non-zero interaction terms when trying to present the major paths from a model.

Thanks, and sorry for resurrecting a multi-year-old thread,

Josh

Bob Carpenter

Jul 17, 2016, 1:10:38 PM
to stan-...@googlegroups.com
No, we're not trying to do this within Stan's samplers directly,
because it wouldn't produce the correct Bayesian posterior.

You can code up true spike-and-slab priors by marginalizing
out the is-it-zero indicator, but it requires O(2^N) amount of
work for N parameters, so it's not practical in the usual situations
for which people want to do variable selection.

Aki's been working on the problem as a post-process. I don't know
if there's anything public other than this:

http://arxiv.org/abs/1509.04752

- Bob
> --
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> To post to this group, send email to stan-...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Andrew Gelman

Jul 17, 2016, 4:04:10 PM
to stan-...@googlegroups.com
I disagree with the spike and slab idea. Instead, I think it would be better to do the add-on as I said in my earlier email. It would make sense not to do this in Stan but in R or Python. You fit the Stan model with the horseshoe prior or whatever, then you post-process and set parameters to exactly zero. This could be coded in a program in R or Python that would be a wrapper for the Stan call.

We don't really have a place on the Stan webpage for this sort of wrapper or post-processing. It could be a good idea to have some of these things accessible for users. Right now we have rstanarm and loo, neither of which is part of Stan but both of which can be useful when fitting Stan models.

Joshua Wiley

Jul 17, 2016, 5:32:49 PM
to stan-...@googlegroups.com
_____________________________
From: Andrew Gelman <gel...@stat.columbia.edu>
Sent: Monday, July 18, 2016 06:04
Subject: Re: [stan-users] variable selection / spike-and-slab priors for regression coeffs
To: <stan-...@googlegroups.com>



I disagree with the spike and slab idea. Instead, I think it would be better to do the add-on as I said in my earlier email. It would make sense not to do this in Stan but in R or Python. You fit the Stan model with the horseshoe prior or whatever, then you post-process and set parameters to exactly zero. This could be coded in a program in R or Python that would be a wrapper for the Stan call.


Any papers/pointers on what this would look like?  I don't mind some R programming and sharing that with the community, but I'm not sure how to adjust the remaining coefficients for the fact that one is fixed at zero during post processing.

Cheers,

Josh

Andrew Gelman

Jul 17, 2016, 5:37:32 PM
to stan-...@googlegroups.com
I would not adjust the remaining coefficients.  I'd just fit the model and then have some rule for reporting as zero the coefs that are below some threshold.
If you do want to go back and re-fit the model using the coefs set to 0, you'd need to have an alternative version of your Stan program that enters these as data.  But I don't think that's a good idea.  The whole point is that the coefs aren't _really_ 0, you're just reporting them as "negligible" or 0 for convenience.  So I don't think it would make sense to re-fit the model.
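Sketched in Python (the median-based rule, the names, and the toy draws are all illustrative choices of mine, not an existing Stan feature), this report-as-zero post-process is just a few lines over the posterior draws:

```python
def threshold_report(posterior_draws, threshold):
    """Report a coefficient as 0 when its posterior sits tightly near zero.
    The rule here (posterior median inside [-threshold, threshold]) is one
    illustrative choice of reporting rule, not a recommendation."""
    report = {}
    for name, draws in posterior_draws.items():
        s = sorted(draws)
        median = s[len(s) // 2]
        report[name] = 0.0 if abs(median) < threshold else median
    return report

# toy posterior draws for two coefficients
draws = {
    "beta_small": [0.01, -0.02, 0.015, 0.00, -0.01],
    "beta_big":   [0.48, 0.52, 0.50, 0.47, 0.53],
}
print(threshold_report(draws, threshold=0.1))
```

Note that nothing is refit; the reported zeros are a presentation convention layered on top of the full posterior.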
A

Joshua Wiley

Jul 17, 2016, 5:42:12 PM
to stan-...@googlegroups.com
Thanks for explaining, now I understand what you meant. Thanks!
Josh

Bob Carpenter

Jul 17, 2016, 6:22:15 PM
to stan-...@googlegroups.com
I think it depends on the application. For a lot of
large-scale machine learning, like what I used to do in
natural language processing, there are literally billions
of potential predictors. So in order to get things to fit
into cache (or at least into memory), the number of predictors
is drastically pruned.

It's not like anyone believes the value really is zero in these
cases (just about every aspect of language matters and affects
everything else).

In those cases, for maximum performance, I think it would make sense
to refit. But maybe there's something else that could be done.

I sent the reference to Aki's paper for some tips on how to do this.

- Bob

Andrew Gelman

Jul 17, 2016, 6:31:10 PM
to stan-...@googlegroups.com

> On Jul 17, 2016, at 6:22 PM, Bob Carpenter <ca...@alias-i.com> wrote:
>
> I think it depends on the application. For a lot of
> large-scale machine learning, like what I used to do in
> natural language processing, there are literally billions
> of potential predictors. So in order to get things to fit
> into cache (or at least into memory), the number of predictors
> is drastically pruned.

Yes, good point. In that case any solution would have to be done inside Stan, i.e. it's a research project of its own.


Michael Betancourt

Jul 17, 2016, 8:54:30 PM
to stan-...@googlegroups.com
Right, but such dynamic pruning is inherently a greedy
process and it’s really easy to get stuck in local solutions.
Indeed while there are lots of results showing that these
solvers get to the right answer eventually, they sojourn
through all kinds of suboptimal solutions along the way
and there often isn’t a great indication of when you’ve
gone long enough.

Anyways, the point is that these dynamic algorithms
aren’t really Bayesian. Full Bayes is _not_ a very
sparse approach!

Andrew Gelman

Jul 17, 2016, 8:58:02 PM
to stan-...@googlegroups.com
Sure, but in settings where there are such computational constraints, I think there could be an advantage to working within the Stan framework, using the Stan modeling language and Stan's autodiff, and making use of whatever bits of Bayesian inference come through in an approximate algorithm.
A

Joshua Wiley

Jul 17, 2016, 10:05:32 PM
to stan-...@googlegroups.com

Well, I would cautiously suggest there is pragmatic value (if not theoretical) even in cases where there are no computational constraints.

A relatively commonplace example from health psychology (my field) would be examining the effects of stress on health.

Suppose there are:

- 5 continuous stress measures (e.g., domains of family, friends, work, financial, health)
- 10 other sociodemographic factors

Based on previous research it is plausible that:

1) The effects of stress are non-linear
2) Stress measures may interact with each other and
3) Stress measures may interact with the sociodemographic factors

If you only check up to quadratic effects, that still yields: 5 + 10 + 50 = 65 interaction terms. Each focal stress measure is involved in 15 interactions. This yields the undesirable result that the threshold for “negligible” must depend on the number of interactions, in a way, because even if any one interaction is for all intents and purposes ignorable, the combined set may not be. Otherwise, even if each individual interaction qualifies as negligible, across the 15 the effects may be large enough that reporting the conditional effect of the stress measure may not be the best estimate of its “overall” effect. I suppose the correct approach in this case would be to leave all the coefficients as small and non-zero, and marginalize them out to report the overall effect of a stress measure, if they are all sufficiently small (if not, the marginal has perhaps little value). Although conceptually trivial, it is a hassle to actually implement for multiple variables.

 

Josh

Bob Carpenter

Jul 17, 2016, 10:59:20 PM
to stan-...@googlegroups.com

> On Jul 17, 2016, at 10:05 PM, Joshua Wiley <jwiley...@gmail.com> wrote:
>
> Well, I would cautiously suggest there is pragmatic value (if not theoretical) even in cases where there are not computational constraints.
>
> A relatively common place example from health psychology (my field) would be examining the effects of stress on health.
>
> Suppose there are:
>
> - 5 continuous stress measures (e.g., domains of family, friends, work, financial, health).
>
> - 10 other sociodemographic factors
>
> Based on previous research it is plausible that:
>
> 1) The effects of stress are non-linear
>
> 2) Stress measures may interact with each other and
>
> 3) Stress measures may interact with the sociodemographic factors
>
>
>
> If you only check up to quadratic effects, that still yields: 5 + 10 + 50 = 65 interaction terms.

I don't follow the arithmetic here.

> Each focal stress measure is involved in 15 interactions. This yields the undesirable result that the threshold for “negligible” must depend on the number of interactions, in way, because even if any one interaction is for all intents and purposes ignorable, the combined set may not be. Otherwise, even if each individual interaction qualifies as negligible, across the 15 the effects may be large enough that reporting the conditional effect of the stress measure may not be the best estimate of its “overall” effect. I suppose the correct approach in this case would be to leave all the coefficients as small, non-zero, and marginalize them out to report the overall effect of a stress measure, if they are all sufficiently small (if not the marginal has perhaps little value).

That is the standard Bayesian approach to posterior predictive
inference. The point is that you average predictions over posterior
uncertainty rather than making point estimates of parameters and then
using the point estimates for prediction. The latter's the norm in most
machine learning applications with which I'm familiar.

It's hard to talk about what is correct without having
an application in mind. For instance, do you want to make
predictions, estimate "significance", or present as simple
a model as possible?

> Although conceptually trivial, it is a hassle to actually implement for multiple variables.

Why? I'd think with just 65 it'd be trivial. Doing anything
other than full Bayes isn't supported by Stan, so doing
some kind of variable selection winds up being more work,
though leads to simpler models in terms of number of parameters.

- Bob

Bob Carpenter

Jul 17, 2016, 11:01:32 PM
to stan-...@googlegroups.com

> On Jul 17, 2016, at 8:54 PM, Michael Betancourt <betan...@gmail.com> wrote:
>
> Right, but such dynamic pruning is inherently a greedy
> process and it’s really easy to get stuck in local solutions.
> Indeed while there are lots of results showing that these
> solvers get to the right answer eventually, they sojourn
> through all kinds of suboptimal solutions along the way
> and there often isn’t a great indication of when you’ve
> gone long enough.

I think the usual approach is to just do L1 regularized
regression. Then there's the problem of setting how much
sparsity you want, leading to those path diagrams favored
by Hastie et al. in their books (and their papers on
elastic net).

> Anyways, the point is that these dynamic algorithms
> aren’t really Bayesian. Full Bayes is _not_ a very
> sparse approach!

And that's true even with Laplace priors (the distribution
corresponding to L1 regularization). The posterior isn't
sparse, just the penalized maximum likelihood estimate.
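A one-dimensional toy makes the contrast concrete (Python, brute-force quadrature; my own illustration): for y ~ normal(theta, 1) with a Laplace(0, b) prior, the penalized-ML (MAP) estimate is a soft threshold that is exactly zero for small |y|, while the posterior mean is small but never exactly zero.

```python
import math

def soft_threshold(y, lam):
    # MAP under a Laplace prior with scale 1/lam:
    # argmin over theta of (theta - y)^2 / 2 + lam * |theta|
    return math.copysign(max(abs(y) - lam, 0.0), y)

def posterior_mean(y, b=1.0, lo=-20.0, hi=20.0, n=40001):
    # posterior mean of theta for y ~ Normal(theta, 1), theta ~ Laplace(0, b),
    # by brute-force quadrature on a grid
    h = (hi - lo) / (n - 1)
    num = den = 0.0
    for i in range(n):
        t = lo + i * h
        w = math.exp(-0.5 * (y - t) ** 2 - abs(t) / b)
        num += t * w
        den += w
    return num / den

y = 0.5
print(soft_threshold(y, lam=1.0))  # exactly 0.0: the penalized estimate is sparse
print(posterior_mean(y, b=1.0))    # positive but small: the posterior is not
```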

- Bob

Joshua Wiley

Jul 18, 2016, 12:03:46 AM
to stan-...@googlegroups.com
On Mon, Jul 18, 2016 at 12:59 PM, Bob Carpenter <ca...@alias-i.com> wrote:

> On Jul 17, 2016, at 10:05 PM, Joshua Wiley <jwiley...@gmail.com> wrote:
>
> Well, I would cautiously suggest there is pragmatic value (if not theoretical) even in cases where there are not computational constraints.
>
> A relatively common place example from health psychology (my field) would be examining the effects of stress on health.
>
> Suppose there are:
>
> -         5 continuous stress measures (e.g., domains of family, friends, work, financial, health).
>
> -         10 other sociodemographic factors
>
> Based on previous research it is plausible that:
>
> 1)      The effects of stress are non-linear
>
> 2)      Stress measures may interact with each other and
>
> 3)      Stress measures may interact with the sociodemographic factors
>
>
>
> If you only check up to quadratic effects, that still yields: 5 + 10 + 50 = 65 interaction terms.

I don't follow the arithmetic here.

5 squared terms, 10 2-way interactions among the stress measures, and 50 2-way interactions between stress measures and sociodemographic factors, but the number is not important for the example.
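Spelled out in Python (the variable names are placeholders for the measures described above), the count is:

```python
from itertools import combinations

stress = ["family", "friends", "work", "financial", "health"]
socio = ["sd%d" % i for i in range(1, 11)]  # placeholder names for the 10 factors

quadratic = [(s, s) for s in stress]                      # 5 squared terms
stress_x_stress = list(combinations(stress, 2))           # C(5, 2) = 10 pairs
stress_x_socio = [(s, d) for s in stress for d in socio]  # 5 * 10 = 50 pairs

total = len(quadratic) + len(stress_x_stress) + len(stress_x_socio)
print(total)  # 65

# and each focal stress measure appears in 15 of those terms
focal = "work"
per_focal = sum(1 for term in quadratic + stress_x_stress + stress_x_socio
                if focal in term)
print(per_focal)  # 15
```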
 

> Each focal stress measure is involved in 15 interactions.  This yields the undesirable result that the threshold for “negligible” must depend on the number of interactions, in way, because even if any one interaction is for all intents and purposes ignorable, the combined set may not be.  Otherwise, even if each individual interaction qualifies as negligible, across the 15 the effects may be large enough that reporting the conditional effect of the stress measure may not be the best estimate of its “overall” effect.  I suppose the correct approach in this case would be to leave all the coefficients as small, non-zero, and marginalize them out to report the overall effect of a stress measure, if they are all sufficiently small (if not the marginal has perhaps little value).

That is the standard Bayesian approach to posterior predictive
inference. The point is that you average predictions over posterior
uncertainty rather than making point estimates of parameters and then
using the point estimates for prediction.  The latter's the norm in most
machine learning applications with which I'm familiar.

It's hard to talk about what is correct without having
an application in mind.  For instance, do you want to make
predictions, estimate "significance", or present as simple
a model as possible?

I had in mind presenting the simplest possible model that captures most of the effects.  Each parameter may inform theory and is conceptually interesting, but to help make sense of data, it is helpful to be parsimonious, when that does not lose much.  I did not have in mind "significance" nor prediction where the interest is in the overall predictive accuracy.
 

> Although conceptually trivial, it is a hassle to actually implement for multiple variables.

Why?  I'd think with just 65 it'd be trivial.  Doing anything
other than full Bayes isn't supported by Stan, so doing
some kind of variable selection winds up being more work,
though leads to simpler models in terms of number of parameters.

My apologies, I did not explain well.  Here is a simpler and more explicit example:

X is the predictor of interest and is distributed as N(0, 1)
W is a uniformly distributed integer in [0, 10]
f(Y) = 0 + .45*X + 0*W + .01*X*W is the equation for the outcome, where f() is the link function

The "effect" of X here can be factored as:
(.45 + .01*W)*X

So:
.45 * X | W = 0
.55 * X | W = 10

Since W is uniform, the average marginal effect of X is .50, which is the number I really want to report, not the .45, assuming the interaction terms are sufficiently small.  Unless I am mistaken, for linear outcomes I can get that estimate simply by mean-centering all predictors, but I believe it is more complicated for nonlinear outcomes (e.g., reporting the overall odds ratio or average marginal change in probability from logit-link Bernoulli models).
Perhaps this is much simpler than adding a variable selection step, though.  I had in my head, probably erroneously, that life would be easier if many of the interaction coefficients were true zero instead of just close to zero.
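For the linear link, the averaging described above is a one-liner (Python; coefficients taken from the example above; the logit-link part is an added illustration evaluated at X = 0, where the linear predictor is zero and so p = 0.5):

```python
b_x, b_xw = 0.45, 0.01
w_values = range(11)  # W uniform on the integers 0..10

# linear link: average marginal effect of X is the W-average of (b_x + b_xw * W)
ame = sum(b_x + b_xw * w for w in w_values) / len(w_values)
print(ame)  # 0.45 + 0.01 * 5 = 0.50

# logit link at X = 0: d p / d X = (b_x + b_xw * W) * p * (1 - p), with p = 0.5
ame_prob = sum((b_x + b_xw * w) * 0.5 * (1 - 0.5) for w in w_values) / len(w_values)
print(ame_prob)  # 0.50 * 0.25 = 0.125
```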

Josh


- Bob

Andrew Gelman

Jul 24, 2016, 1:42:52 AM
to stan-...@googlegroups.com
Hi, yes, in this setting I think it makes sense to keep all these variables in the model, rather than setting some of them to 0 and then paying the price later.
A