I am wondering how I can interpret the estimated coefficient for variable B. I thought I could run a simple regression analysis, but I would also like to get an estimated effect of variable A while B is zero (which is what the coefficient for variable A represents, if I understand correctly).
The way you are interpreting the coefficients is not quite right. The general interpretation of the coefficient on a dummy variable in a multiple regression is "the expected (or average) difference in the dependent variable between those with $1$ and those with $0$ values of that dummy variable, holding other independent variables constant."
Variable A can be present (i.e., 1) only when Variable B is present (1). I am wondering how I can interpret the estimated coefficient for variable B, because the coefficient for B represents the presence of B while A is 0, which logically does not make sense.
Technically, dummy variables are dichotomous, quantitative variables. Their range of values is small; they can take on only two quantitative values. As a practical matter, regression results are easiest to interpret when dummy variables are limited to two specific values, 1 or 0. Typically, 1 represents the presence of a qualitative attribute, and 0 represents the absence.
The number of dummy variables required to represent a particular categorical variable depends on the number of values that the categorical variable can assume. To represent a categorical variable that can assume k different values, a researcher would need to define k - 1 dummy variables.
For example, suppose we are interested in political affiliation, a categorical variable that might assume three values - Republican, Democrat, or Independent. We could represent political affiliation with two dummy variables:

X1 = 1 if the voter is Republican, 0 otherwise
X2 = 1 if the voter is Democrat, 0 otherwise
In this example, notice that we don't have to create a dummy variable to represent the "Independent" category of political affiliation. If X1 equals zero and X2 equals zero, we know the voter is neither Republican nor Democrat. Therefore, the voter must be Independent.
When defining dummy variables, a common mistake is to define too many variables. If a categorical variable can take on k values, it is tempting to define k dummy variables. Resist this urge. Remember, you only need k - 1 dummy variables.
A kth dummy variable is redundant; it carries no new information. And it creates a severe multicollinearity problem for the analysis. Using k dummy variables when only k - 1 dummy variables are required is known as the dummy variable trap. Avoid this trap!
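The redundancy behind the dummy variable trap is easy to see numerically. The sketch below uses made-up party codes (the labels are assumptions, not from the text) and shows that when every observation belongs to exactly one of k categories, the k full dummy columns sum to the intercept column in every row, which is perfect collinearity:

```python
# Hypothetical sketch of the dummy variable trap: with k = 3 parties,
# three full dummy columns are linearly dependent on the intercept column.
# Party codes are assumed: 0 = Republican, 1 = Democrat, 2 = Independent.
parties = [0, 1, 2, 0, 1, 2, 0, 1]

intercept = [1] * len(parties)  # constant column in the design matrix
d_rep = [1 if p == 0 else 0 for p in parties]
d_dem = [1 if p == 1 else 0 for p in parties]
d_ind = [1 if p == 2 else 0 for p in parties]

# Each voter belongs to exactly one party, so the three dummies sum to 1
# in every row -- identical to the intercept column (perfect collinearity).
row_sums = [r + d + i for r, d, i in zip(d_rep, d_dem, d_ind)]
print(row_sums == intercept)  # True: the k-th dummy carries no new information
```

Dropping any one of the three dummies (making its category the reference group) removes the dependence.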
The value of the categorical variable that is not represented explicitly by a dummy variable is called the reference group. In this example, the reference group consists of Independent voters.
In analysis, each dummy variable is compared with the reference group. In this example, a positive regression coefficient means that income is higher for the group identified by the dummy variable than for the reference group; a negative regression coefficient means that income is lower. If the regression coefficient is statistically significant, the income discrepancy with the reference group is also statistically significant.
In this section, we work through a simple example to illustrate the use of dummy variables in regression analysis. The example begins with two independent variables - one quantitative and one categorical. Notice that once the categorical variable is expressed in dummy form, the analysis proceeds in routine fashion. The dummy variable is treated just like any other quantitative variable.
The first thing we need to do is to express gender as one or more dummy variables. How many dummy variables will we need to fully capture all of the information inherent in the categorical variable Gender? To answer that question, we look at the number of values (k) Gender can assume. We will need k - 1 dummy variables to represent Gender. Since Gender can assume two values (male or female), we will only need one dummy variable to represent Gender.
Note that X1 identifies male students explicitly. Non-male students are the reference group. This was an arbitrary choice. The analysis works just as well if you use X1 to identify female students and make non-female students the reference group.
At this point, we conduct a routine regression analysis. No special tweaks are required to handle the dummy variable. So, we begin by specifying our regression equation. For this problem, the equation is:

Test score = b0 + b1 * IQ + b2 * X1
Values for IQ and X1 are known inputs from the data table. The only unknowns on the right side of the equation are the regression coefficients, which we will estimate through least-squares regression.
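As a minimal sketch of what least-squares estimation does here, the code below fits the same form of model on a small made-up dataset (the IQ, gender, and score values are illustrative assumptions, not the values from the text's data table), by solving the normal equations (X'X)b = X'y directly:

```python
# Hypothetical data: IQ, a gender dummy (1 = male, 0 = female), and test score.
# These numbers are illustrative only, not the data table from the text.
iq = [110, 120, 100, 130, 115, 105]
x1 = [1, 1, 0, 0, 1, 0]
score = [80, 88, 70, 85, 84, 72]

# Design matrix with columns [1, IQ, X1], then the normal equations (X'X)b = X'y.
X = [[1.0, q, d] for q, d in zip(iq, x1)]
XtX = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
Xty = [sum(X[r][i] * score[r] for r in range(len(X))) for i in range(3)]

def solve(A, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

b0, b1, b2 = solve(XtX, Xty)  # intercept, IQ coefficient, dummy coefficient
print(f"score = {b0:.2f} + {b1:.2f}*IQ + {b2:.2f}*X1")
```

The dummy column X1 is handled exactly like the quantitative column IQ, which is the point of the section: once coded, a dummy is just another regressor.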
The first task in our analysis is to assign values to coefficients in our regression equation. Excel does all the hard work behind the scenes, and displays the result in a regression coefficients table:
The coefficient of multiple determination is 0.810. For our sample problem, this means 81% of test score variation can be explained by IQ and by gender. Translation: Our equation fits the data pretty well.
Before we conduct those tests, however, we need to assess multicollinearity between independent variables. If multicollinearity is high, significance tests on regression coefficients can be misleading. But if multicollinearity is low, the same tests can be informative.
To measure multicollinearity for this problem, we can try to predict IQ based on Gender. That is, we regress IQ against Gender. The resulting coefficient of multiple determination (R2k) is an indicator of multicollinearity. When R2k is greater than 0.75, multicollinearity is a problem.
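This check can be sketched with the same kind of made-up data as before (the values are assumptions for illustration). With a single 0/1 predictor, the R2k from regressing IQ on the gender dummy reduces to the squared correlation between the two columns:

```python
# Hedged sketch: assess multicollinearity by regressing IQ on the gender
# dummy. With one 0/1 predictor, R^2_k is the squared correlation between
# the two columns. The data below are hypothetical.
iq = [110, 120, 100, 130, 115, 105]
x1 = [1, 1, 0, 0, 1, 0]

n = len(iq)
mean_iq, mean_x1 = sum(iq) / n, sum(x1) / n
cov = sum((q - mean_iq) * (d - mean_x1) for q, d in zip(iq, x1))
ss_iq = sum((q - mean_iq) ** 2 for q in iq)
ss_x1 = sum((d - mean_x1) ** 2 for d in x1)

r2k = cov ** 2 / (ss_iq * ss_x1)  # coefficient of determination
print(f"R2k = {r2k:.3f}")         # a value above 0.75 would signal a problem
```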
With multiple regression, there is more than one independent variable; so it is natural to ask whether a particular independent variable contributes significantly to the regression after effects of other variables are taken into account. The answer to this question can be found in the regression coefficients table:
The regression coefficients table shows the following information for each coefficient: its value, its standard error, a t-statistic, and the significance of the t-statistic. In this example, the t-statistics for IQ and gender are both statistically significant at the 0.05 level. This means that IQ predicts test score beyond chance levels, even after the effect of gender is taken into account. And gender predicts test score beyond chance levels, even after the effect of IQ is taken into account.
The regression coefficient for gender provides a measure of the difference between the group identified by the dummy variable (males) and the group that serves as a reference (females). Here, the regression coefficient for gender is 7. This suggests that, after effects of IQ are taken into account, males will score 7 points higher on the test than the reference group (females). And, because the regression coefficient for gender is statistically significant, we interpret this difference as a real effect - not a chance artifact.
We will continue to use the elemapi2v2 data set we used in Lessons 1 and 2 of this seminar. Recall that the variable api00 is a measure of the school's academic performance. The variable yr_rnd is a nominal variable that is coded 0 if the school is not year round and 1 if it is year round. The variable meals is the percentage of students in the school who are receiving state-sponsored free meals and can be used as a proxy for socioeconomic status. It was broken into 3 categories (to make equally sized groups), creating the variable mealcat.
The reference group here is Dummy3, which is also the dummy variable indicating the third meal category. For additional information on dummy coding, take a look at Section 4.1.1 in our page SAS Seminar: Analyzing and Visualizing Interactions.
It may be surprising to note that this regression analysis with a single dummy variable is the same as doing an independent t-test comparing the mean api00 for the year-round schools with the non year-round schools (see below). You can see that the t-value below is the same as the t-value for yr_rnd in the regression above.
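The equivalence noted above is easy to verify by hand: with a single 0/1 dummy as the only predictor, the OLS slope cov(x, y)/var(x) equals the difference between the two group means. The api00-style numbers below are made up for illustration, not taken from the elemapi2v2 data:

```python
# Sketch: regressing an outcome on a single 0/1 dummy yields a slope equal
# to the difference between the two group means (the t-test's mean difference).
# The values are hypothetical, not from the elemapi2v2 data set.
yr_rnd = [0, 0, 0, 1, 1, 1]
api00 = [700, 720, 680, 540, 560, 550]

grp0 = [y for d, y in zip(yr_rnd, api00) if d == 0]  # non-year-round schools
grp1 = [y for d, y in zip(yr_rnd, api00) if d == 1]  # year-round schools
mean_diff = sum(grp1) / len(grp1) - sum(grp0) / len(grp0)

# OLS slope for a single predictor: cov(x, y) / var(x).
n = len(api00)
mx, my = sum(yr_rnd) / n, sum(api00) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(yr_rnd, api00))
         / sum((x - mx) ** 2 for x in yr_rnd))

print(slope, mean_diff)  # the regression coefficient is the mean difference
```

Whether the sign matches the t-test output depends only on which group is coded 1 and which group the t-test subtracts from which, as the text notes.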
You can see that the Mean Difference of 160.506 is exactly the same as the coefficient in the simple linear regression except that the sign is reversed. This can easily be changed if we define Group 1 to be year round schools and Group 2 to be non-year round schools.
In SPSS, a variable after the BY keyword is a Fixed Factor (or categorical variable) and a variable after the WITH statement is a Covariate (or a continuous variable). Here we are only interested in Fixed Factors. The output we obtain from running the code is:
The Parameter Estimates table tells us the differences in the predicted scores from the respective category to the reference category. The term [mealcat=1] is the additional increase in predicted api00 scores for the first category compared to the third category. Similarly, the term [mealcat=2] is the additional increase in predicted api00 scores for the second category compared to the third category. This makes sense given that we expect higher api00 scores for lower percent free meals at the school.
From this table we can see that there are a total of five dummy variables, 2 dummies for Year Round and Not Year Round, and 3 dummies for each of the three meal categories. The variables Dummy2 (Not Year Round) and Dummy5 (Third Meal Category) are redundant and hence excluded from our model. Additionally, these are the reference groups for Year Round and Meal Category. By putting both variables in our Factor list, SPSS is internally creating five dummy variables for us and purposely excluding Dummy2 and Dummy5 (green highlights).
Because this model has only main effects (no interactions) you can interpret [yr_rnd2=1] as the difference between the year round and non-year round schools (year round schools have a lower predicted api00 holding mealcat constant). The coefficient for [mealcat=1] is the difference in predicted api00 between the first and third meal categories, and [mealcat=2] is the difference in predicted api00 between the second and third meal categories holding yr_rnd2 constant.
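The main-effects interpretation above can be sketched as simple addition: a cell's predicted api00 is the intercept (the reference cell: non-year-round, third meal category) plus whichever dummy coefficients apply. The coefficient values below are made up for illustration, not the seminar's actual SPSS output:

```python
# Illustrative sketch with hypothetical coefficients (not the real output):
# in a main-effects-only model, predictions are intercept plus applicable
# dummy coefficients.
coef = {
    "intercept": 521.0,   # reference cell: non-year-round, mealcat = 3
    "yr_rnd2=1": -42.0,   # year-round vs. non-year-round
    "mealcat=1": 166.0,   # first vs. third meal category
    "mealcat=2": 70.0,    # second vs. third meal category
}

def predict(year_round, mealcat):
    """Predicted api00 for a cell, built from the dummy coefficients."""
    y = coef["intercept"]
    if year_round:
        y += coef["yr_rnd2=1"]
    if mealcat == 1:
        y += coef["mealcat=1"]
    elif mealcat == 2:
        y += coef["mealcat=2"]
    return y

print(predict(False, 3))  # 521.0: the reference cell reproduces the intercept
print(predict(True, 1))   # 645.0: 521.0 - 42.0 + 166.0
```

Because there are no interaction terms, the year-round offset is the same within every meal category, which is exactly why each coefficient can be read as a difference "holding the other variable constant."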