We want to perform a linear regression of the police confidence score against sex, which is a binary categorical variable with two possible values (which we can see are 1 = Male and 2 = Female if we check the Values cell in the sex row in Variable View). However, before we begin our linear regression, we need to recode the values of Male and Female. Why must we do this?
The codes 1 and 2 are assigned to the two gender categories simply to label them; the numbers themselves carry no quantitative meaning. However, linear regression assumes that the values of all independent, or explanatory, variables are meaningful numerical data points. So, if we were to enter the variable sex into a linear regression model as it stands, the coded values of the two gender categories would be interpreted as actual numerical values. This would give results that do not make sense because, for example, the sex Female does not have a value of 2.
A dummy variable is a variable created to assign a numerical value to the levels of a categorical variable. Each dummy variable represents one category of the explanatory variable and is coded 1 if the case falls in that category and 0 if not. For example, in the dummy variable for Female, all cases in which the respondent is female are coded as 1 and all other cases, in which the respondent is male, are coded as 0. This allows us to enter sex into the regression as a numerical variable. (Remember, these numbers are just indicators.)
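A minimal sketch of this recoding in Python with pandas (the data are invented; the names sex and sex1 follow the text):

```python
import pandas as pd

# Hypothetical extract: sex is coded 1 = Male, 2 = Female, as in Variable View.
df = pd.DataFrame({"sex": [1, 2, 2, 1, 2]})

# Dummy for Female: 1 if the respondent is female, 0 otherwise
# (named sex1 to match the variable referred to later in the text).
df["sex1"] = (df["sex"] == 2).astype(int)

print(df)
```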
Our sample of data has shown us that, on average, female respondents reported a police confidence score that is .436 points lower than that of male respondents. We want to know whether this is a statistically significant effect in the population from which the sample was taken. To do this, we carry out a hypothesis test to determine whether or not b (the coefficient for females) is different from zero in the population. If the coefficient could plausibly be zero, then there is no statistically significant difference between males and females.
SPSS calculates a t statistic and a corresponding p-value for each of the coefficients in the model. These can be seen in the Coefficients output table. The t statistic measures how far the estimated coefficient lies from zero, in units of its standard error: it is calculated by dividing the coefficient by the standard error. If the standard error is small relative to the coefficient (making the t statistic relatively large), the coefficient is likely to differ from zero in the population.
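In symbols, where $\hat{b}$ is the estimated coefficient:

$$ t = \frac{\hat{b}}{\operatorname{SE}(\hat{b})} $$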
The p-value is in the column labelled Sig. As in all hypothesis tests, if the p-value is less than 0.05, then the variable is significant at the 5% level. That is, we would have evidence to reject the null hypothesis and conclude that b is different from zero.
In this example, t = -10.417 with a corresponding p-value reported as 0.000 (i.e. p < 0.001). This means that the chance of observing a difference between males and females as large as the one we have calculated, if there were really no difference in the population, is very small indeed. Therefore, we have evidence to conclude that sex1 is a significant predictor of policeconf1 in the population.
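As a quick sanity check (the standard error itself is not quoted in this excerpt, so this is just the value implied by the reported figures):

$$ \operatorname{SE}(\hat{b}) \approx \frac{\lvert -0.436 \rvert}{10.417} \approx 0.042 $$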
In regression analysis, a dummy variable (also known as an indicator variable or just a dummy) is one that takes a binary value (0 or 1) to indicate the absence or presence of some categorical effect that may be expected to shift the outcome.[1] For example, if we were studying the relationship between biological sex and income, we could use a dummy variable to represent the sex of each individual in the study. The variable could take on a value of 1 for males and 0 for females (or vice versa). In machine learning this is known as one-hot encoding.
Dummy variables are commonly used in regression analysis to represent categorical variables that have more than two levels, such as education level or occupation. In this case, multiple dummy variables would be created to represent each level of the variable, and only one dummy variable would take on a value of 1 for each observation. Dummy variables are useful because they allow us to include categorical variables in our analysis, which would otherwise be difficult to include due to their non-numeric nature. They can also help us to control for confounding factors and improve the validity of our results.
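As an illustration, a hedged sketch of one-hot encoding a multi-level variable in Python with pandas (the education levels are invented for the example):

```python
import pandas as pd

# Hypothetical categorical variable with more than two levels.
df = pd.DataFrame({"education": ["primary", "secondary", "degree", "secondary"]})

# One-hot encoding: one dummy per level; exactly one dummy equals 1 per row.
dummies = pd.get_dummies(df["education"], prefix="edu", dtype=int)
print(dummies)
```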
As with any addition of variables to a model, the addition of dummy variables will increase the within-sample model fit (coefficient of determination), but at a cost of fewer degrees of freedom and loss of generality of the model (out of sample model fit). Too many dummy variables result in a model that does not provide any general conclusions.
Dummy variables are useful in various cases. For example, in econometric time series analysis, dummy variables may be used to indicate the occurrence of wars, or major strikes. It could thus be thought of as a Boolean, i.e., a truth value represented as the numerical value 0 or 1 (as is sometimes done in computer programming).
Dummy variables may be extended to more complex cases. For example, seasonal effects may be captured by creating dummy variables for each of the seasons: D1 = 1 if the observation is for summer, and equals zero otherwise; D2 = 1 if and only if autumn, otherwise zero; D3 = 1 if and only if winter, otherwise zero; and D4 = 1 if and only if spring, otherwise zero. In the panel data fixed effects estimator, dummies are created for each of the units in cross-sectional data (e.g. firms or countries) or for each period in a pooled time series. However, in such regressions either the constant term has to be removed, or one of the dummies has to be removed, making its category the base category against which the others are assessed, for the following reason:
If dummy variables for all categories were included, their sum would equal 1 for all observations, which is identical to and hence perfectly correlated with the vector-of-ones variable whose coefficient is the constant term; if the vector-of-ones variable were also present, this would result in perfect multicollinearity,[2] so that the matrix inversion in the estimation algorithm would be impossible. This is referred to as the dummy variable trap.
Thus, when we have an intercept in the regression model and we want to avoid perfect multicollinearity, we create only one dummy to encode a categorical variable that has two categories.
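A small numerical illustration of the trap, using the seasonal dummies from above (the data are invented; NumPy's matrix rank stands in for the estimation algorithm's matrix inversion):

```python
import numpy as np

# Hypothetical: eight observations, two per season, one-hot encoded
# into the four seasonal dummies D1..D4 described above.
seasons = np.array([0, 1, 2, 3, 0, 1, 2, 3])
dummies = np.eye(4)[seasons]

ones = np.ones((8, 1))                      # intercept (vector of ones)
X_trap = np.hstack([ones, dummies])         # intercept + ALL four dummies
X_ok = np.hstack([ones, dummies[:, 1:]])    # drop D1 as the base category

# With all four dummies the dummy columns sum to the intercept column, so
# the matrix is rank deficient and X'X cannot be inverted (the dummy trap).
print(np.linalg.matrix_rank(X_trap))  # 4, but X_trap has 5 columns
print(np.linalg.matrix_rank(X_ok))    # 4 == number of columns: invertible
```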
To provide an example, let us suppose the individuals in our sample fall into five levels of wealth: poorest, poorer, middle, richer and richest. We are interested in understanding the relation between the total number of children born in a family and its wealth level.
We can create five dummy variables, called poorest, poorer, middle, richer and richest. The variable poorest takes the value 1 for individuals in the poorest wealth level and 0 otherwise. The variable poorer takes the value 1 for individuals in the poorer wealth level and 0 otherwise. Similarly, we construct the other variables. We can take two approaches when regressing the total number of children born in a family on wealth level: drop the intercept and include all five dummies, or keep the intercept and include only four dummies, treating the omitted category as the base.
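A minimal sketch of the second approach in Python with statsmodels (the data are invented, since the original dataset is not reproduced in this excerpt):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for the wealth data described above.
df = pd.DataFrame({
    "children": [4, 3, 3, 2, 2, 1, 3, 2],
    "wealth": ["poorest", "poorest", "poorer", "middle",
               "richer", "richest", "poorer", "middle"],
})

# C() expands wealth into dummies and omits one level (the base category),
# so the intercept can stay in the model without the dummy variable trap.
model = smf.ols("children ~ C(wealth)", data=df).fit()
print(model.params)
```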
B. Dummy Dependent Variable: OLS regressions are not very informative when the dependent variable is categorical. To handle such situations, one needs to implement one of the following regression techniques depending on the exact nature of the categorical dependent variable.
Do keep in mind that the independent variables can be continuous or categorical when running any of the models below. There is no need for the independent variables to be binary just because the dependent variable is.
As an example, if we have data on the weight and mileage of 22 foreign and 52 domestic automobiles, we may wish to fit a logit model explaining whether a car is foreign on the basis of its weight and mileage.
Here the dependent variable foreign takes the value 1 if the car is foreign and 0 if it is domestic. The regressors weight and mpg are ordinary continuous variables denoting the weight and mileage of the car respectively.
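The original dataset is not reproduced here, so the following Python sketch simulates data of the same shape (74 cars; variable names foreign, weight and mpg as in the text) and fits the logit with statsmodels:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 74  # 22 foreign + 52 domestic in the example described above

# Simulated weights (lbs) and mileage (mpg); lighter, higher-mpg cars
# are made more likely to be foreign, roughly mimicking the real data.
weight = rng.normal(3000, 600, n).round()
mpg = rng.normal(21, 5, n).round()
logits = -0.002 * (weight - 3000) + 0.15 * (mpg - 21) - 0.9
foreign = rng.binomial(1, 1 / (1 + np.exp(-logits)))

df = pd.DataFrame({"foreign": foreign, "weight": weight, "mpg": mpg})
logit_model = smf.logit("foreign ~ weight + mpg", data=df).fit()
print(logit_model.summary())
```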
One must be cautious when interpreting the odds ratio of the constant/intercept term. Usually, this odds ratio represents the baseline odds of the model when all predictor variables are set to zero. However, one must verify that a value of zero for all predictors actually makes sense before adopting this interpretation. For example, a weight of zero does not make sense for a car in the above example, so the odds ratio estimate for the intercept term carries no meaning here.
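Continuing the sketch above, odds ratios are obtained by exponentiating the logit coefficients:

```python
import numpy as np

# exp(b) turns log-odds coefficients into odds ratios. The Intercept row is
# the baseline odds at weight = 0 and mpg = 0, which is not meaningful here.
print(np.exp(logit_model.params))
```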
Note: Both the Logit and Probit models are suitable when the dependent variable is binary or dichotomous. When the dependent variable has more than two categories, one needs to implement either a multinomial logistic regression or an ordered logistic regression, discussed below.
(iii) Multinomial Logit: In a multinomial logit model, the number of outcomes that the dependent variable can accommodate is greater than two. This is the main difference between the multinomial and the ordinary logit. However, the multinomial logit is only appropriate when the categories of the dependent variable have no genuine ordering (if they are ordered, one needs to run an Ordered Logit regression instead).
The above command allows Stata to arbitrarily choose which outcome to use as the base outcome. If one wants to specify the base outcome, it can be done by adding the base() option. Suppose we want to compare with being out of the labor force rather than with being a full-time worker. In this case the base outcome is 0, and to implement it in Stata we will run the following command:
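The Stata commands themselves are not reproduced in this excerpt. As a hedged stand-in, here is a multinomial logit on invented labor-force data in Python with statsmodels, which by default uses the lowest-coded outcome (0, out of the labor force) as the base:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: 0 = out of labor force, 1 = part-time, 2 = full-time
# (codings assumed for illustration only).
df = pd.DataFrame({
    "status": [0, 1, 2, 2, 0, 1, 2, 0, 2, 1, 0, 2],
    "age":    [61, 34, 41, 29, 67, 45, 38, 23, 52, 31, 70, 44],
})

# statsmodels reports one equation per non-base outcome, each relative to
# the base outcome 0; Stata's mlogit base() option picks this explicitly.
mnl = smf.mnlogit("status ~ age", data=df).fit()
print(mnl.summary())
```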
The dataset I am working with is an automotive dataset with different vehicle sales. My assignment is to do a regression analysis on the dataset using Tableau. I wanted to do a regression analysis of the variable vehicle use (personal) against gross margin %. And another regression analysis of car category (SUV) against gross margin %. I want to compare the r^2 value between both of these to determine which variable explains the gross margin % better.
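A minimal sketch of that comparison in Python with statsmodels (column names are invented, since the dataset itself is not shown):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for the vehicle sales data described above.
df = pd.DataFrame({
    "gross_margin_pct": [12.5, 8.0, 15.2, 9.1, 11.0, 14.3, 7.5, 10.8],
    "personal_use": [1, 0, 1, 0, 0, 1, 0, 1],  # vehicle use == personal
    "is_suv":       [0, 0, 1, 1, 0, 1, 0, 0],  # car category == SUV
})

# One simple regression per dummy; compare R^2 to see which explains more.
m_use = smf.ols("gross_margin_pct ~ personal_use", data=df).fit()
m_suv = smf.ols("gross_margin_pct ~ is_suv", data=df).fit()
print(m_use.rsquared, m_suv.rsquared)
```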