Inthe previous chapter, we introduced Bayesian decision making using posterior probabilities and a variety of loss functions. We discussed how to minimize the expected loss for hypothesis testing. Moreover, we instroduced the concept of Bayes factors and gave some examples on how Bayes factors can be used in Bayesian hypothesis testing for comparison of two means. We also discussed how to choose appropriate and robust priors. When there is no conjugacy, we applied Markov Chain Monte Carlo simulation to approximate the posterior distributions of parameters of interest.
In this chapter, we will apply Bayesian inference methods to linear regression. We will first apply Bayesian statistics to simple linear regression models, then generalize the results to multiple linear regression models. We will see when using the reference prior, the posterior means, posterior standard deviations, and credible intervals of the coefficients coincide with the counterparts in the frequentist ordinary least square (OLS) linear regression models. However, using the Bayesian framework, we can now interpret credible intervals as the probabilities of the coefficients lying in such intervals.
In this section, we will turn to Bayesian inference in simple linear regressions. We will use the reference prior distribution on coefficients, which will provide a connection between the frequentist solutions and Bayesian answers. This provides a baseline analysis for comparisons with more informative prior distributions. To illustrate the ideas, we will use an example of predicting body fat.
Obtaining accurate measurements of body fat is expensive and not easy to be done. Instead, predictive models that predict the percentage of body fat which use readily available measurements such as abdominal circumference are easy to use and inexpensive. We will apply a simple linear regression to predict body fat using abdominal circumference as an example to illustrate the Bayesian approach of linear regression. The data set bodyfat can be found from the library BAS.
The figure below shows the percentage body fat obtained from under water weighing and the abdominal circumference measurements for 252 men. To predict body fat, the line overlayed on the scatter plot illustrates the best fitting ordinary least squares (OLS) line obtained with the lm function in R.
Each of the residuals, which provide an estimate of the fitting error, is equal to \(\hat\epsilon_i = y_i - \haty_i\), the difference between the observed value \(y_i\) and the fitted value \(\haty_i = \hat\alpha + \hat\betax_i\), where \(x_i\) is the abdominal circumference for the \(i\)th male. \(\hat\epsilon_i\) is used for diagnostics as well as estimating the constant variance in the assumption of the model \(\sigma^2\) via the mean squared error (MSE):\[ \hat\sigma^2 = \frac1n-2\sum_i^n (y_i-\haty_i)^2 = \frac1n-2\sum_i^n \hat\epsilon_i^2. \]Here the degrees of freedom \(n-2\) are the number of observations adjusted for the number of parameters (which is 2) that we estimated in the regression. The MSE, \(\hat\sigma^2\), may be calculated through squaring the residuals of the output of bodyfat.lm.
If this model is correct, the residuals and fitted values should be uncorrelated, and the expected value of the residuals is zero. We apply the scatterplot of residuals versus fitted values, which provides an additional visual check of the model adequacy.
With the exception of one observation for the individual with the largest fitted value, the residual plot suggests that this linear regression is a reasonable approximation. The case number of the observation with the largest fitted value can be obtained using the which function in R. Further examination of the data frame shows that this case also has the largest waist measurement Abdomen. This may be our potential outlier and we will have more discussion on outliers in Section 6.2.
Furthermore, we can check the normal probability plot of the residuals for the assumption of normally distributed errors. We see that only Case 39, the one with the largest waist measurement, is exceptionally away from the normal quantile.
We may construct the confidence intervals of \(\alpha\) and \(\beta\) using the \(t\)-statistics\[t_\alpha^\ast = \frac\alpha - \hat\alpha\textse_\alpha,\qquad \qquad t_\beta^\ast = \frac\beta-\hat\beta\textse_\beta.\]
The Bayesian model starts with the same model as the classical frequentist approach:\[ y_i = \alpha + \beta x_i + \epsilon_i,\quad i = 1,\cdots, n. \]with the assumption that the errors, \(\epsilon_i\), are independent and identically distributed as normal random variables with mean zero and constant variance \(\sigma^2\). This assumption is exactly the same as in the classical inference case for testing and constructing confidence intervals for \(\alpha\) and \(\beta\).
We first consider the case under the reference prior, which is our standard noninformative prior. Using the reference prior, we will obtain familiar distributions as the posterior distributions of \(\alpha\), \(\beta\), and \(\sigma^2\), which gives the analogue to the frequentist results. Here we assume the joint prior distribution of \(\alpha,\ \beta\), and \(\sigma^2\) to be proportional to the inverse of \(\sigma^2\)
Since the credible intervals are numerically the same as the confidence intervals, we can use the lm function to obtain the OLS estimates and construct the credible intervals of \(\alpha\) and \(\beta\)
The confint function provides 95% confidence intervals. Under the reference prior, they are equivalent to the 95% credible intervals. The code below extracts them and relabels the output as the Bayesian results.
These intervals coincide with the confidence intervals from the frequentist approach. The primary difference is the interpretation. For example, based on the data, we believe that there is a 95% chance that body fat will increase by 5.75% up to 6.88% for every additional 10 centimeter increase in the waist circumference.
where\[\textS_Y^2 =\hat\sigma^2+\hat\sigma^2\left(\frac1n+\frac(x_n+1-\barx)^2\textS_xx\right) = \hat\sigma^2\left(1+\frac1n+\frac(x_n+1-\barx)^2\textS_xx\right).\]
The variance for predicting a new observation \(y_n+1\) has an extra \(\hat\sigma^2\) which comes from the uncertainty of a new observation about the mean \(\mu_Y\) estimated by the regression line.
While we expect the majority of the data will be within the prediction intervals (the short dashed grey lines), Case 39 seems to be well below the interval. We next use Bayesian methods in Section 6.2 to calculate the probability that this case is abnormal or is an outlier by falling more than \(k\) standard deviations from either side of the mean.
Except from the noninformative reference prior, we may also consider using a more general semi-conjugate prior distribution of \(\alpha\), \(\beta\), and \(\sigma^2\) when there is information available about the parameters.
One can see that the reference prior is the limiting case of this conjugate prior we impose. We usually use Gibbs sampling to approximate the joint posterior distribution instead of using the result directly, especially when we have more regression coefficients in multiple linear regression models. We omit the derivation of the posterior distributions due to the heavy use of advanced linear algebra. One can refer to Hoff (2009) for more details.
Based on any prior information we have for the model, we can also impose other priors and assumptions on \(\alpha\), \(\beta\), and \(\sigma^2\) to get different Bayesian results. Most of these priors will not form any conjugacy and will require us to use simulation methods such as Markov Chain Monte Carlo (MCMC) for approximations. We will introduce the general idea of MCMC in Chapter 8.
We will also use the following quantities derived from the formula of \(\barx\), \(\bary\), \(\hat\alpha\), and \(\hat\beta\)\[\beginaligned& \sum_i^n (x_i-\barx) = 0 \\& \sum_i^n (y_i-\bary) = 0 \\& \sum_i^n (y_i - \haty_i) = \sum_i^n (y_i - (\hat\alpha + \hat\beta x_i)) = 0\\& \sum_i^n (x_i-\barx)(y_i - \haty_i) = \sum_i^n (x_i-\barx)(y_i-\bary-\hat\beta(x_i-\barx)) = \sum_i^n (x_i-\barx)(y_i-\bary)-\hat\beta\sum_i^n(x_i-\barx)^2 = 0\\& \sum_i^n x_i^2 = \sum_i^n (x_i-\barx)^2 + n\barx^2 = \textS_xx+n\barx^2\endaligned\]
We first further simplify the numerator inside the exponential function in the formula of \(p^*(\alpha, \beta, \sigma^2y_1,\cdots,y_n)\):\[\beginaligned& \sum_i^n \left(y_i - \alpha - \beta x_i\right)^2 \\= & \sum_i^n \left(y_i - \hat\alpha - \hat\betax_i - (\alpha - \hat\alpha) - (\beta - \hat\beta)x_i\right)^2 \\= & \sum_i^n \left(y_i - \hat\alpha - \hat\betax_i\right)^2 + \sum_i^n (\alpha - \hat\alpha)^2 + \sum_i^n (\beta-\hat\beta)^2(x_i)^2 \\ & - 2\sum_i^n (\alpha - \hat\alpha)(y_i-\hat\alpha-\hat\betax_i) - 2\sum_i^n (\beta-\hat\beta)(x_i)(y_i-\hat\alpha-\hat\betax_i) + 2\sum_i^n(\alpha - \hat\alpha)(\beta-\hat\beta)(x_i)\\= & \textSSE + n(\alpha-\hat\alpha)^2 + (\beta-\hat\beta)^2\sum_i^n x_i^2 - 2(\alpha-\hat\alpha)\sum_i^n (y_i-\haty_i) -2(\beta-\hat\beta)\sum_i^n x_i(y_i-\haty_i)+2(\alpha-\hat\alpha)(\beta-\hat\beta)(n\barx)\endaligned\]
Finally, we use the quantity that \(\displaystyle \sum_i^n x_i^2 = \sum_i^n(x_i-\barx)^2+ n\barx^2\) to combine the terms \(n(\alpha-\hat\alpha)^2\), \(2\displaystyle (\alpha-\hat\alpha)(\beta-\hat\beta)\sum_i^n x_i\), and \(\displaystyle (\beta-\hat\beta)^2\sum_i^n x_i^2\) together.
\[\beginaligned& \sum_i^n (y_i-\alpha-\beta x_i)^2 \\= & \textSSE + n(\alpha-\hat\alpha)^2 +(\beta-\hat\beta)^2\sum_i^n (x_i-\barx)^2 + (\beta-\hat\beta)^2 (n\barx^2) +2(\alpha-\hat\alpha)(\beta-\hat\beta)(n\barx)\\= & \textSSE + (\beta-\hat\beta)^2\textS_xx + n\left[(\alpha-\hat\alpha) +(\beta-\hat\beta)\barx\right]^2\endaligned\]
3a8082e126