Stata Regression With Dummy Variables

Rosella Brain

Aug 5, 2024, 8:12:25 AM

to gesylpelim

If I want a dummy for all levels of size except for a comparison group or base level, I do not need to create 4 dummies. Using [U] factor variables, I may type . summarize i.size or use an estimator
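As a rough illustration of the factor-variable notation, here is a minimal sketch. It assumes Stata's bundled auto dataset, with rep78 standing in for the size variable and price as an arbitrary outcome; neither comes from the original post.

```stata
* Sketch only: rep78 plays the role of "size" here
sysuse auto, clear

* Summarize an indicator for every level of rep78 except the base level
summarize i.rep78

* The same notation works directly inside an estimator
regress price i.rep78

* ib#. picks a different base level (here, level 3) without creating any variables
regress price ib3.rep78
```

No dummy variables are added to the dataset in any of these commands; Stata builds the indicators on the fly.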

To provide an example, let us suppose our sample of individuals has five levels of wealth: poorest, poorer, middle, richer and richest. We are interested in understanding the relation between the total number of children born in a family and the family's wealth level. (The data can be found here.)

We can create 5 dummy variables, called poorest, poorer, middle, richer and richest. The variable poorest takes the value 1 for individuals in the poorest wealth level and 0 otherwise. The variable poorer takes the value 1 for individuals in the poorer wealth level and 0 otherwise. The other variables are constructed in the same way. We can take two approaches while regressing total number of children born in a family on wealth levels:
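Two common ways to put the five dummies into a regression without perfect collinearity are to leave one dummy out, or to keep all five and drop the constant. The sketch below shows both. It assumes a numeric variable wealth coded 1 to 5 from poorest to richest and an outcome called children; both names are hypothetical.

```stata
* Hypothetical variables: wealth (1 = poorest ... 5 = richest), children (count)
* Create one 0/1 dummy per wealth level: w1 ... w5
tabulate wealth, generate(w)
rename (w1 w2 w3 w4 w5) (poorest poorer middle richer richest)

* Option 1: omit one dummy; poorest becomes the comparison group
regress children poorer middle richer richest

* Option 2: keep all five dummies and suppress the constant;
* each coefficient is then that group's mean number of children
regress children poorest poorer middle richer richest, noconstant
```

Including all five dummies together with a constant would make Stata drop one of them automatically because of collinearity.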

B. Dummy Dependent Variable: OLS regressions are not very informative when the dependent variable is categorical. To handle such situations, one needs to implement one of the following regression techniques depending on the exact nature of the categorical dependent variable.

Do keep in mind that the independent variables can be continuous or categorical while running any of the models below. There is no need for the independent variables to be binary just because the dependent variable is binary.

As an example, if we have data on weight and mileage of 22 foreign and 52 domestic automobiles, we may wish to fit a logit model explaining whether a car is foreign or not on the basis of its weight and mileage. (The data can be found here.)

Here the dependent variable foreign takes the value 1 if the car is foreign and 0 if it is domestic. The regressors weight and mpg are usual continuous variables and denote the weight and mileage of the car respectively.
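A minimal sketch of this model, assuming the data in question are Stata's bundled auto dataset (which has the variables foreign, weight and mpg):

```stata
* Load the example data shipped with Stata
sysuse auto, clear

* Check the split between domestic (0) and foreign (1) cars
tabulate foreign

* Logit of foreign on weight and mileage; coefficients are in log-odds units
logit foreign weight mpg
```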

One must be cautious when interpreting the odds ratio of the constant/intercept term. Usually, this odds ratio represents the baseline odds of the model when all predictor variables are set to zero. However, one must verify that a zero value for all predictors actually makes sense before continuing with this interpretation. For example, a weight of zero for a car does not make sense in the above example, and so the odds ratio estimate for the intercept term here does not carry any meaning.
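One common workaround, not part of the original discussion, is to center the predictors so that zero corresponds to an average car. The sketch below again assumes the bundled auto dataset; weight_c and mpg_c are hypothetical names for the centered variables.

```stata
sysuse auto, clear

* Odds ratios on the raw scale: _cons is the odds at weight = 0 and mpg = 0,
* which is not a meaningful car
logit foreign weight mpg, or

* Center each predictor at its sample mean
summarize weight, meanonly
generate weight_c = weight - r(mean)
summarize mpg, meanonly
generate mpg_c = mpg - r(mean)

* _cons is now the odds of being foreign for a car of average weight and mileage
logit foreign weight_c mpg_c, or
```

Centering changes only the intercept; the odds ratios for weight and mpg are the same in both fits.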

Note: Both the Logit and Probit models are suitable when the dependent variable is binary or dichotomous. When the dependent variable has more than two categories, one needs to implement either a multinomial logistic regression or an ordered logistic regression, discussed below.

(iii) Multinomial Logit: In a multinomial logit model, the dependent variable can take more than two outcomes. This is the main difference between the multinomial and the ordinary logit. However, multinomial logit is appropriate only when the categories of the dependent variable have no genuine ordering; for ordered categories one needs to run an Ordered Logit regression.

The above command lets Stata choose which outcome to use as the base outcome; by default, mlogit uses the most frequent outcome. If one wants to specify the base outcome, it can be done by adding the base() option (short for baseoutcome()). Suppose we want to compare against being out of the labor force rather than being a full-time worker. In this case the base outcome is 0, and to implement it in Stata we will run the following command:
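The variable names used in the original example are not available here, so the following is only a sketch with hypothetical names: labor_status is assumed to be coded 0 = out of the labor force, 1 = part-time, 2 = full-time, and age and educ are placeholder regressors.

```stata
* Hypothetical variables: labor_status (0/1/2), age, educ
* Default: Stata picks the base outcome itself
mlogit labor_status age educ

* Force outcome 0 (out of the labor force) to be the base outcome
mlogit labor_status age educ, baseoutcome(0)
```

Each reported set of coefficients then compares one of the remaining outcomes with outcome 0.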

Please note: This page makes use of the program xi3, which is no longer being maintained and has been removed from our archives. References to xi3 will be left on this page because they illustrate specific principles of coding categorical variables.

In the previous two chapters, we have focused on regression analyses using continuous variables. However, it is possible to include categorical predictors in a regression analysis, but it requires some extra work in performing the analysis and extra work in properly interpreting the results. This chapter will illustrate how you can use Stata for including categorical predictors in your analysis and describe how to interpret the results of such analyses. Stata has some great tools that really ease the process of including categorical variables in your regression analysis, and we will emphasize the use of these timesaving tools.

The variable meals is the percentage of students who are receiving state sponsored free meals and can be used as an indicator of poverty. This was broken into 3 categories (to make equally sized groups) creating the variable mealcat. The codebook information for mealcat is shown below.

It may be surprising to note that this regression analysis with a single dummy variable is the same as doing a t-test comparing the mean api00 for the year-round schools with the non year-round schools (see below). You can see that the t value below is the same as the t value for yr_rnd in the regression above. This is because B_yr_rnd compares the year-rounds and non year-rounds (since the coefficient is mean(year round)-mean(non year-round)).
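The equivalence can be checked with two commands. This sketch assumes the dataset containing api00 and the 0/1 variable yr_rnd is already in memory.

```stata
* Regression with a single 0/1 dummy: the coefficient on yr_rnd is
* mean(year-round) - mean(non year-round)
regress api00 yr_rnd

* Two-sample t-test with equal variances (the default) on the same groups
ttest api00, by(yr_rnd)
```

ttest reports the difference as group 0 minus group 1, so its t statistic may carry the opposite sign from the regression's, with the same magnitude.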

A categorical predictor variable does not have to be coded 0/1 to be used in a regression model. It is easier to understand and interpret the results from a model with dummy variables, but the results from a variable coded 1/2 yield essentially the same results.

Note that you can use 0/1 or 1/2 coding and the results for the coefficient come out the same, but the interpretation of the constant in the regression equation is different. It is often easier to interpret the estimates for 0/1 coding.
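A quick way to see this for yourself is to recode the dummy and refit. In this sketch, yr_rnd12 is a hypothetical name for the 1/2 version of yr_rnd, and the dataset with api00 and yr_rnd is assumed to be in memory.

```stata
* Hypothetical 1/2 recoding of the 0/1 variable yr_rnd
generate yr_rnd12 = yr_rnd + 1

* 0/1 coding: _cons is the mean of api00 for the group coded 0
regress api00 yr_rnd

* 1/2 coding: same slope, but _cons is now the prediction at yr_rnd12 = 0,
* a value that no school actually has
regress api00 yr_rnd12
```

The constant in the second model equals the first model's constant minus the slope.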

In summary, these results indicate that the api00 scores are significantly different for the schools depending on the type of school, year-round school vs. non year-round school. Non year-round schools have significantly higher API scores than year-round schools. Based on the regression results, non year-round schools have scores that are 160.5 points higher than year-round schools.

But this looks only at the linear effect of mealcat on api00, and mealcat is not an interval variable. Instead, you will want to code the variable so that all the information concerning the three levels is accounted for. You can dummy code mealcat like this.

The interpretation of the coefficients is much like that for the binary variables. Group 1 is the omitted group, so _cons is the mean for group 1. The coefficient for mealcat2 is the mean for group 2 minus the mean of the omitted group (group 1). And the coefficient for mealcat3 is the mean of group 3 minus the mean of group 1. You can verify this by comparing the coefficients with the means of the groups.
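A minimal sketch of the dummy coding and the check against the group means, assuming the dataset with api00 and mealcat is in memory:

```stata
* Create one 0/1 dummy per level: mealcat1, mealcat2, mealcat3
tabulate mealcat, generate(mealcat)

* Omit mealcat1 so that group 1 is the reference group
regress api00 mealcat2 mealcat3

* Group means of api00, for comparison with _cons and the coefficients
tabulate mealcat, summarize(api00)
```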

As you can see, the results are the same as in the prior analysis. If we want to test the overall effect of mealcat we use the test command as shown below, which also gives us the same results as we found using the dummy variables mealcat2 and mealcat3.
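The overall test can be sketched as follows, assuming the dataset with api00 and mealcat is in memory. The xi prefix names its generated dummies _Imealcat_2 and _Imealcat_3; the second pair of commands is the factor-variable equivalent available in newer versions of Stata.

```stata
* xi creates _Imealcat_2 and _Imealcat_3, omitting the first group by default
xi: regress api00 i.mealcat

* Joint test that both mealcat coefficients are zero
test _Imealcat_2 _Imealcat_3

* Same model and test with factor variables, no xi prefix needed
regress api00 i.mealcat
testparm i.mealcat
```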

With group 3 omitted, the constant is now the mean of group 3 and mealcat1 is group1-group3 and mealcat2 is group2-group3. We see that both of these coefficients are significant, indicating that group 1 is significantly different from group 3 and group 2 is significantly different from group 3.

When we use the xi command, how can we choose which group is the omitted group? By default, the first group is omitted, but say we want group 3 to be omitted. We can use the char command as shown below to tell Stata that we want the third group to be the omitted group for the variable mealcat.
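A minimal sketch of this step, assuming the dataset with api00 and mealcat is in memory. The last line is the factor-variable alternative, which needs neither char nor xi.

```stata
* Tell xi to omit level 3 of mealcat from now on
char mealcat[omit] 3

* xi now creates _Imealcat_1 and _Imealcat_2, with group 3 as the reference
xi: regress api00 i.mealcat

* Factor-variable equivalent: ib3. sets level 3 as the base
regress api00 ib3.mealcat
```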

It is generally very convenient to use dummy coding but that is not the only kind of coding that can be used. As you have seen, when you use dummy coding one of the groups becomes the reference group and all of the other groups are compared to that group. This may not be the most interesting set of comparisons.

Say you want to compare group 1 with groups 2 and 3, and for a second comparison compare group 2 with group 3. You need to generate a coding scheme that forms these 2 comparisons. We will illustrate this using a Stata program, xi3 (an enhanced version of xi), that will create the variables you would need for such comparisons (as well as a variety of other common comparisons).

Using the coding scheme provided by xi3, we were able to form perhaps more interesting tests than those provided by dummy coding. The xi3 program can create variables according to other coding schemes, as well as custom coding schemes that you create; see help xi3 and Chapter 5 for more information.
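Since xi3 is no longer maintained, one possible substitute is Stata's built-in contrast command with the Helmert operator h., which compares each level with the mean of the later levels. This is a swapped-in technique rather than the xi3 approach itself, and it assumes a version of Stata that supports factor variables and contrast, with api00 and mealcat in memory.

```stata
* Fit the model with ordinary factor-variable dummy coding
regress api00 i.mealcat

* Helmert contrasts: group 1 vs. the mean of groups 2 and 3,
* then group 2 vs. group 3
contrast h.mealcat
```

The contrasts answer the same two questions, though the estimates may be scaled differently from the coefficients that xi3's generated variables would give.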

The coefficient for _Imealcat_1 is the predicted difference between cell1 and cell3. Since this model only has main effects, it is also the predicted difference between cell4 and cell6. Likewise, B_Imealcat_2 is the predicted difference between cell2 and cell3, and also the predicted difference between cell5 and cell6.