Regression-coefficient priors that depend on X

Kevin Van Horn

Feb 21, 2016, 7:01:26 PM
to Stan users mailing list
The R2 prior, like Zellner's g prior, depends on the matrix of covariates X. This feature of both priors troubles me, for these reasons:

1. It's not clear to me why one's prior on the regression coefficients should depend on the observed values of the input variables. The problem is clearest when your input variables are all control variables and the regression coefficients can be interpreted causally -- in that case, the input variables are whatever you chose them to be for your experimental design, and clearly contain no information at all about the regression coefficients. The same objection holds for purely observational data when you have a sufficiently complete set of input variables that there is little chance of important confounding variables; the x values occurring in the data tell you nothing about the causal relationship between the input variables and outcome variable.


2. Are such priors even consistent with Bayes' Rule? Assuming independent priors on beta and sigma, and iid observations, we have

    p(beta | y, X)
    is proportional to
    p(beta) * p(y, X | beta)
    = p(beta) * p(X | beta) * p(y | X, beta)
    = p(beta) * PRODUCT(i: 1 <= i <= n: p(X_i | beta)) * p(y | X, beta)

We effectively have an X-dependent prior on beta only if the distribution for X_i depends on beta, in which case we get an (effective) prior of the form

    pi(beta | X)  is proportional to  p(beta) * PRODUCT(i: 1 <= i <= n: p(X_i | beta))

Does either of the R2 prior or Zellner's g-prior fit this form?


3. One possible way around the problem of (2) above is to make the prior for beta depend, not on X itself, but on other parameters that determine the distribution for the vector of input variables. That is, we have

    x_i ~ Distribution_x(psi)
    beta ~ Distribution_b(psi)
    y_i ~ Normal(x_i' * beta, sigma)

The Zellner prior could be viewed as an empirical Bayes approximation to such a model, if

  a) For any covariance matrix Sigma there is some vector psi such that Distribution_x(psi) has Sigma as its covariance matrix;

  b) Distribution_b(psi) depends on psi only via the covariance matrix Sigma corresponding to Distribution_x(psi); and

  c)  (1/n) * X' * X is used as a point estimate of Sigma.

Of course, that leads us to the deficiencies of empirical Bayes -- it's only an approximation to the full Bayesian answer, and it's not obvious how much the approximation alters the results.
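As an illustrative sketch of a model satisfying (a)-(c) (Python, with Distribution_x taken to be multivariate normal and Distribution_b a Zellner-style normal prior; the particular choices of g, sigma, and Sigma here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 3, 100

# psi here is a covariance matrix Sigma for the predictors (condition (a))
A = rng.normal(size=(K, K))
Sigma = A @ A.T + K * np.eye(K)

# Distribution_b depends on psi only through Sigma (condition (b)):
# a Zellner-style prior N(0, (g * sigma^2 / n) * Sigma^{-1})
g, sigma = 25.0, 1.0
beta = rng.multivariate_normal(np.zeros(K), (g * sigma**2 / n) * np.linalg.inv(Sigma))

# Distribution_x(psi): iid multivariate normal rows
X = rng.multivariate_normal(np.zeros(K), Sigma, size=n)
y = X @ beta + rng.normal(scale=sigma, size=n)

# condition (c): the empirical Bayes step replaces Sigma by (1/n) * X'X,
# which approaches Sigma as n grows
Sigma_hat = X.T @ X / n
print(np.round(Sigma_hat - Sigma, 2))
```

Conditioning on the point estimate Sigma_hat instead of integrating over psi is exactly the empirical Bayes shortcut described above.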

Can the R2 prior also be viewed in this manner?

Bob Carpenter

Feb 21, 2016, 7:12:56 PM
to stan-...@googlegroups.com

> On Feb 21, 2016, at 7:01 PM, Kevin Van Horn <ke...@ksvanhorn.com> wrote:
>
> The R2 prior, like Zellner's g prior, depends on the matrix of covariates X. This feature of both priors troubles me, for these reasons:
>
> 1. It's not clear to me why one's prior on the regression coefficients should depend on the observed values of the input variables.

I've always worried about this, too --- the exact same
situation comes up from standardizing predictors. I was
sort of thinking muddily along the lines of your answer (3) --
that with a big enough N, you're approaching the population
distribution close enough for the kind of applied statistics
we do.

It's also come up before with constraints, where if you want

alpha * x[n] + beta > 0

for each data point n in 1:N, then even if you satisfy the
constraint for every point in x, there's no reason to
expect you won't violate the constraint with x[N + 1].
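A quick fake-data check of that point (a Python sketch; the normal population for x and the choice of alpha are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)     # observed predictor values
alpha = 1.0
beta = -x.min() + 0.01      # just enough that alpha * x[n] + beta > 0 in-sample

assert np.all(alpha * x + beta > 0)

# a future x[N + 1] from the same population can still violate the constraint;
# on average about 1/(N+1) of future draws fall below the in-sample minimum
x_new = rng.normal(size=100_000)
frac = np.mean(alpha * x_new + beta <= 0)
print(f"fraction of future points violating the constraint: {frac:.3f}")
```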

...

> 2. Are such priors even consistent with Bayes' Rule?

...

> 3. One possible way around the problem of (2) above is to make the prior for beta depend, not on X itself, but on other parameters that determine the distribution for the vector of input variables. That is, we have
>
> x_i ~ Distribution_x(psi)
> beta ~ Distribution_b(psi)
> y_i ~ Normal(x_i' * beta, sigma)

...

- Bob

Andrew Gelman

Feb 21, 2016, 8:02:32 PM
to stan-...@googlegroups.com
I’ve talked about this with Aleks and Ben, and what I always say is that one can think of this as an approximation to a prior that does not depend on X but depends on the distribution from which X is drawn. I’m happy to discuss this more if you’d like.
A


Ben Goodrich

Feb 21, 2016, 8:25:38 PM
to Stan users mailing list
On Sunday, February 21, 2016 at 7:01:26 PM UTC-5, Kevin Van Horn wrote:
The R2 prior, like Zellner's g prior, depends on the matrix of covariates X. This feature of both priors troubles me, for these reasons:

The R2 prior does not depend on the values in X, only on the number of columns of X. In addition, beta is not a parameter in those models; rather, it is a generated quantity. However, the priors in stan_glm, by default, do depend on X and y, so these issues are worth discussing.

Ben

Kevin Van Horn

Feb 21, 2016, 8:52:32 PM
to Stan users mailing list
"The R2 prior does not depend on the values in X; only the number of columns of X.  In addition, beta is not a parameter in those models; rather it is a generated quantity."

There is a one-to-one, onto mapping between (beta, sigma_eps) and (r^2, u, omega); thus any prior over one defines a prior over the other, via the usual change-of-variables formula. I can choose to consider either (beta, sigma_eps) or (r^2, u, omega) as my model parameters, and since we are talking about a linear regression model, it is most natural to think in terms of (beta, sigma_eps) as the parameters.

Furthermore, when considering the relationship between y_i and X_i, it is the vector of regression coefficients beta that is of interest, whether I choose to consider beta a primitive parameter or a derived parameter. All of the concerns I expressed apply regardless of whether you choose to treat beta as a primitive parameter or as a derived parameter.  And since

    beta = R^{-1} * u * sigma_y * sqrt(r^2 * (n-1))

where R is a function of X, we have a prior on beta that is a function of X.
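To make the X-dependence concrete, here is a small numeric sketch (Python; the dimensions and the values of r^2 and sigma_y are arbitrary) of the mapping from (r^2, u) to beta through the R factor of X:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 50, 3
X = rng.normal(size=(n, K))
Q, R = np.linalg.qr(X)        # thin QR: X = Q * R, so R is a function of X

u = rng.normal(size=K)
u /= np.linalg.norm(u)        # a point on the unit sphere
r2, sigma_y = 0.4, 1.0

# beta = R^{-1} * u * sigma_y * sqrt(r^2 * (n - 1))
beta = np.linalg.solve(R, u) * sigma_y * np.sqrt(r2 * (n - 1))

# the same (r^2, u) with a different X gives a different beta
X2 = rng.normal(size=(n, K))
R2_ = np.linalg.qr(X2)[1]
beta2 = np.linalg.solve(R2_, u) * sigma_y * np.sqrt(r2 * (n - 1))
print(np.round(beta, 2), np.round(beta2, 2))
```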

Bob Carpenter

Feb 21, 2016, 9:05:56 PM
to stan-...@googlegroups.com
That's what I figured --- I think that's like Kevin's (3).

Yet another topic for the Stan book. Do you talk about this
in ARM?

So in the simple 1D case, we'd standardize x[1:N] as

z = (x - mean(x)) / sd(x)

Then we'd use z as the predictor in our model

y ~ normal(alpha + beta * z, sigma);

and include a prior like

beta ~ normal(0, 2)

or something like that.

As N gets large, the sample mean and sd approach their true values,
and at the same time, the effect of the prior also goes down. So
I think what we need is something like guidelines on the size
of N. It'd be easy to generate some fake-data examples.
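A minimal version of such a fake-data example (a Python sketch with a conjugate normal model and sigma treated as known, so the posterior is available in closed form; all the constants are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
beta_true, sigma, prior_sd = 0.8, 1.0, 2.0

for N in (10, 100, 1000):
    x = rng.normal(loc=3.0, scale=2.0, size=N)
    z = (x - x.mean()) / x.std(ddof=1)      # standardized predictor, so z @ z = N - 1
    y = beta_true * z + rng.normal(scale=sigma, size=N)

    # conjugate posterior for beta under the prior beta ~ normal(0, prior_sd)
    precision = z @ z / sigma**2 + 1 / prior_sd**2
    post_mean = (z @ y / sigma**2) / precision
    ols = (z @ y) / (z @ z)                 # flat-prior (least squares) estimate
    print(N, round(post_mean, 3), round(ols, 3), round(post_mean / ols, 4))
```

The shrinkage ratio post_mean / ols equals (N - 1) / ((N - 1) + sigma^2 / prior_sd^2), so the prior's pull toward zero vanishes as N grows, which is the effect the guidelines would need to quantify.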

But then there's standardizing y to think about, too.

- Bob

Andrew Gelman

Feb 21, 2016, 9:06:55 PM
to stan-...@googlegroups.com
It’s all in my nonexistent article on weakly informative priors. We should put this discussion on the priors wiki!

Kevin Van Horn

Feb 21, 2016, 9:28:50 PM
to Stan users mailing list, gel...@stat.columbia.edu
Yes, I'd like to hear your thoughts on this in more detail. I think I can see how to view the R2 prior as based on a point estimate of the distribution for x: E[R] is the right Cholesky factor of n * Sigma, where Sigma is the covariance matrix for x. The questions that remain for me are these:

1. Point (1) in my original post, especially when dealing with data from a designed experiment.

2. How much are the inference results affected by (effectively) using a point estimate of Sigma instead of going full Bayesian and making Sigma a model parameter to be inferred?

BTW, I hope I'm not coming across as too critical with all these questions. I will likely be training other people in using rstanarm, and I want to make sure that I understand it very well myself before doing so.

Ben Goodrich

Feb 21, 2016, 9:49:01 PM
to Stan users mailing list
On Sunday, February 21, 2016 at 9:28:50 PM UTC-5, Kevin Van Horn wrote:
BTW, I hope I'm not coming across as too critical with all these questions. I will likely be training other people in using rstanarm, and I want to make sure that I understand it very well myself before doing so.

It's all good. Everything should be questioned, particularly with rstanarm because it is new and because a lot of the users of it will not be questioning the default priors.

With an experiment or some other situation where you have informative priors about the regression coefficients, stan_glm with family = gaussian may be more appropriate. The stan_lm function is intended more for the situation where your beliefs about the regression coefficients are centered at zero with fat tails. One way to look at it is to imagine the correlation matrix among the K predictors and the outcome before you observe the values of either. If you have an LKJ(eta) prior on that correlation matrix, then you have a Beta(eta, K/2) prior on the conditional variance of Y | X when all variables are standardized. Conversely, you have a Beta(K/2, eta) prior on the R^2, and you then need to choose eta. Equivalently, you could say

R^2 = 1 - (sigma_epsilon)^2 / (sigma_Y)^2

but I don't like to write it that way because it makes it look like sigma_epsilon is primitive and the R^2 is derived, when actually the code does the reverse:

sigma_epsilon = sigma_Y * sqrt(1 - R^2)

But either way is perfectly good for thinking about the prior location of R^2 under a Beta(K/2, eta) distribution, which implies a particular value of eta. Also, for the reasons you mentioned about the concerns with empirical Bayes, sigma_Y is a primitive to be estimated rather than conditioned on.
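A small sanity check of that parameterization (a Python sketch; K and the prior location for R^2 are made up, and "location" is taken here to mean the prior mean):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
r2_location = 0.5      # prior guess at the location (mean) of R^2

# Beta(K/2, eta) has mean (K/2) / (K/2 + eta); solve for eta
eta = (K / 2) * (1 - r2_location) / r2_location

r2 = rng.beta(K / 2, eta, size=100_000)
sigma_y = 1.0
sigma_eps = sigma_y * np.sqrt(1 - r2)   # residual sd derived from each R^2 draw

print(round(eta, 2), round(r2.mean(), 3), round(sigma_eps.mean(), 3))
```

This is the direction the code works in: r^2 is primitive, sigma_epsilon is derived.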

Ben

Bob Carpenter

Feb 21, 2016, 11:06:56 PM
to stan-...@googlegroups.com, gel...@stat.columbia.edu
This question is much more about broader Bayesian methodology
in setting default priors. So ask away, please. We reserve the
right not to answer.

My intuition is that the answer to (2) is not much with reasonable
amounts of data (point estimates get more accurate and you get
more data to overwhelm the unit-scaled prior). I think Andrew needs
to run the experiments or do the math to figure out the answer.
The answer would make a nice methodology paper.

- Bob

Andrew Gelman

Feb 21, 2016, 11:08:33 PM
to Bob Carpenter, stan-...@googlegroups.com
And at this point my blog lag is so long that I can almost get something into print faster by submitting it as a journal article!

Kevin Van Horn

Feb 21, 2016, 11:51:16 PM
to Stan users mailing list
Thanks for the tip on using stan_glm for a wider selection of priors; I hadn't even thought of using it for just plain old linear regression.


"The stan_lm function is more intended for the situation where your beliefs about the regression coefficients are centered at zero with fat tails."

Fat tails? I'm not seeing that. The equation

    R * beta = u * sigma_y * sqrt(r^2 * (n-1))

implies that the regression coefficients are bounded, doesn't it?

Ben Goodrich

Feb 22, 2016, 1:36:01 AM
to Stan users mailing list
On Sunday, February 21, 2016 at 11:51:16 PM UTC-5, Kevin Van Horn wrote:
"The stan_lm function is more intended for the situation where your beliefs about the regression coefficients are centered at zero with fat tails."

Fat tails? I'm not seeing that. The equation

    R * beta = u * sigma_y * sqrt(r^2 * (n-1))

implies that the regression coefficients are bounded, doesn't it?

theta = R * beta is bounded but beta = R^{-1} * theta isn't. For a given R = Q'X, the shape can differ, but marginalizing over X, you get a distribution where the existence of the variance, skewness, kurtosis, etc. depends on whether eta is sufficiently large. You can draw from the distribution of standardized regression coefficients in question by putting this

    functions {
      vector std_beta_rng(int K, real eta) {
        matrix[K+1, K+1] Lambda;
        Lambda <- lkj_corr_rng(K + 1, eta);
        return inverse_spd(Lambda[1:K, 1:K]) * Lambda[1:K, K+1];
      }
    }
    model {}

into a file called prior.stan and then calling

    library(rstan)
    expose_stan_functions("prior.stan")
    draws <- replicate(100000, std_beta_rng(K = 5, eta = 1.25))
    plot(density(draws[1,]), main = "Standardized Coefficient",
         xlab = expression(beta), las = 1)

In this case, almost 90% of the draws are between -1 and 1, but you occasionally get a draw past 10 in absolute value.

Ben

Ben Goodrich

Feb 22, 2016, 2:08:29 AM
to Stan users mailing list
On Sunday, February 21, 2016 at 8:52:32 PM UTC-5, Kevin Van Horn wrote:
"The R2 prior does not depend on the values in X; only the number of columns of X.  In addition, beta is not a parameter in those models; rather it is a generated quantity."

There is a one-to-one, onto mapping between (beta, sigma_eps) and (r^2, u, omega); thus any prior over one defines a prior over the other, via the usual change-of-variables formula. I can choose to consider either (beta, sigma_eps) or (r^2, u, omega) as my model parameters, and since we are talking about a linear regression model, it is most natural to think in terms of (beta, sigma_eps) as the parameters.

Furthermore, when considering the relationship between y_i and X_i, it is the vector of regression coefficients beta that is of interest, whether I choose to consider beta a primitive parameter or a derived parameter. All of the concerns I expressed apply regardless of whether you choose to treat beta as a primitive parameter or as a derived parameter.  And since

    beta = R^{-1} * u * sigma_y * sqrt(r^2 * (n-1))

where R is a function of X, we have a prior on beta that is a function of X.

You can have a distribution of prior beliefs about beta that is implied by your prior beliefs about the parameters in the parameters block, but that doesn't affect what Bayes' rule implies for those parameters:

f(alpha, u, sigma_y, r^2 | y, X = QR, eta) \propto f(alpha) * f(u) * f(r^2 | K / 2, eta) * f(sigma_y) * f(y | Q, u, r^2, sigma_y)

The beta distribution for r^2 with shape parameters K / 2 and eta does not vary according to the values in X, only its number of columns, K. In stan_glm(), the default priors do depend on the standard deviations of the columns of X (which I don't mind so much) and the standard deviation of y (which I would like to get rid of but it is difficult to come up with a reasonable alternative).

There has to be some joint prior distribution for the alternative primitive parameters beta and sigma_epsilon that implies the new generated quantities alpha, u, sigma_y, and r^2 have the same distribution as in the posterior distribution above. However, that doesn't mean the two distributions are equally easy to sample from in Stan. We do reparameterizations in rstanarm and Stan generally all the time, usually for computational benefits at the cost of readability of the .stan files. But once you have the posterior distribution of the quantities of interest, the interpretation of them is the same.

In this case, it also makes it easier for the user. It is hard to get people to thoughtfully specify multivariate distributions, and it is hard to get people to specify anything other than location and sometimes scale parameters. With stan_lm, all that is needed is some prior information about the location of r^2, and it can infer eta.

Ben
 