Regression Analysis With Dummy Variables Excel

Kelsi Corsi

Aug 4, 2024, 7:02:35 PM

to stocacknowsa

Newbie question here: I'm doing dummy variable regression using JMP for a school project. When I compared the output against Excel's, I realized the coefficients are completely different even though the t-stats are the same.

1) I can't seem to find the standard error of estimate for the overall model after running the regression (I can find only the standard error for each independent variable). Can you point me in the right direction for this?

2) Let's say I have coefficient estimates generated from one regression, and those generated from another regression. Am I able to run a t-test on the coefficient estimates in JMP? If so, do you have any advice on how I can approach that?

Hi Mark,

Thanks for getting back so promptly.

If you refer to the attached images, that's the analysis output after running a regression. So my question is: is there a way for me to find the "total standard error of estimate", if there is such a thing? Currently, the only standard errors I can see are those of the individual coefficient estimates.
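For reference, the quantity usually called the standard error of estimate is the square root of the residual sum of squares divided by the residual degrees of freedom, sqrt(SSE / (n - p)). JMP normally reports it as Root Mean Square Error in the Summary of Fit table, and Excel's regression output labels it Standard Error under Regression Statistics. Below is a minimal sketch of the calculation; the function name and the data are made up for illustration, and the fitted values are assumed to come from some already-fitted two-parameter model.

```python
import math

def standard_error_of_estimate(observed, fitted, n_params):
    """Return sqrt(SSE / (n - p)), the residual standard error of a fitted model."""
    if len(observed) != len(fitted):
        raise ValueError("observed and fitted must have the same length")
    dof = len(observed) - n_params  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need more observations than estimated parameters")
    sse = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    return math.sqrt(sse / dof)

# Hypothetical responses and fitted values (intercept + one slope, so n_params=2)
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
see = standard_error_of_estimate(y, y_hat, n_params=2)
print(round(see, 4))
```

Note that n_params counts the intercept as well as every slope or dummy coefficient.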

Thanks. I understand where you're coming from.

Another thing: is there a way to find the standard error of the beta coefficient for each independent variable in the Fit Model's Analysis? I understand this is the method used to find the beta coefficient:

Excel is not more correct. The two are just different, and the difference comes from the parameterization of the linear model. The choice is mathematically arbitrary, but each way has its advantages. JMP uses effect parameterization, which is supposed to make the parameter estimates more interpretable. The estimates for the levels of a factor sum to zero, so the intercept is an overall average response rather than the mean of a reference level. The parameter for a given level is the change from the intercept. So an estimate of -0.55112 for COO - USA = 0 means that the mean for that level is 5.8482381 - 0.55112 = 5.2971181. The plots are meant to help interpret the selected model.
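To make the parameterization point concrete, here is a small sketch that fits the same made-up data twice: once with a 0/1 dummy column (the usual spreadsheet approach) and once with a +1/-1 effect-coded column. The data, the helper names, and the choice of which level gets +1 are assumptions for illustration only; this is not JMP's internal code.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical two-level factor (1 = USA, 0 = not USA) and response
is_usa = [1, 1, 1, 0, 0, 0]
y = [6.0, 7.0, 8.0, 4.0, 5.0, 6.0]

dummy_X = [[1.0, float(u)] for u in is_usa]              # intercept + 0/1 column
effect_X = [[1.0, 1.0 if u else -1.0] for u in is_usa]   # intercept + +1/-1 column

b_dummy = ols(dummy_X, y)    # [mean of non-USA, USA mean minus non-USA mean]
b_effect = ols(effect_X, y)  # [average of the two level means, USA mean minus that average]

fit_dummy = [sum(b * x for b, x in zip(b_dummy, r)) for r in dummy_X]
fit_effect = [sum(b * x for b, x in zip(b_effect, r)) for r in effect_X]
print(b_dummy, b_effect)
```

Both fits give the same predicted values; only the meaning of the coefficients changes. With two levels, the effect-coded coefficient is half the 0/1 coefficient, which is consistent with seeing different estimates but matching t-statistics.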

Since you are using numeric values with the correct nominal modeling type, JMP is correctly estimating the effect of these categorical predictors. Excel, though, might interpret the numeric code as a continuous variable, since it has no equivalent of the modeling type, which is fundamental to statistical modeling and testing. That model could lead to a very different estimate, test, and interpretation.
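A rough illustration of why this matters, using made-up data with a three-level code: treating the code as continuous forces a single straight-line trend through the levels, while treating it as nominal lets each level have its own mean. The helper names are hypothetical, and the nominal fit below relies on the fact that a model with one categorical predictor reproduces the level means.

```python
def simple_slope_fit(x, y):
    """Fit y = a + b*x by least squares and return the fitted values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return [a + b * xi for xi in x]

def level_mean_fit(codes, y):
    """Fit a separate mean per level, as a single nominal term would."""
    groups = {}
    for c, yi in zip(codes, y):
        groups.setdefault(c, []).append(yi)
    means = {c: sum(v) / len(v) for c, v in groups.items()}
    return [means[c] for c in codes]

def sse(y, fitted):
    """Residual sum of squares."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

# Hypothetical numeric codes for three categories, and a response
codes = [1, 1, 2, 2, 3, 3]
y = [4.0, 6.0, 9.0, 11.0, 5.0, 7.0]

continuous_fit = simple_slope_fit(codes, y)  # code treated as a number
nominal_fit = level_mean_fit(codes, y)       # code treated as a label

print(sse(y, continuous_fit), sse(y, nominal_fit))
```

In this example the middle category has the highest mean, which a single slope cannot represent, so the continuous fit leaves a much larger residual sum of squares.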

Just want to take this chance to ask too: Do the negative estimates mean that the presence of those variables reduces preference (assuming preference is the LHS variable and it is coded as low = not preferred and high = preferred), relative to the baseline?

I do have one last question: is there an option for me to factor in clustered correlation? The issue is that my dummy variables are all treated as independent entries even though some of them come from the same respondent, which means I need to account for that.

I cannot picture your study and its data well enough to answer your latest question. (Note: I didn't say 'last' question.) What do you mean by 'factor in clustered correlation?' I know that it is related to 'dummy variables are all treated as independent entries' and 'done by the same respondent.'

A row in a JMP data table represents all the values that make up one observation. That is, you know the value of every variable (independent and dependent) for a complete record. Are you saying that there is another variable, Respondent, that identifies the observation, that respondents contribute more than one observation, and you want to account for Respondent variability in the response? If so, enter this variable in your data table and add it as a term in Fit Model. Be sure to select it, click the red triangle for Attributes, and select Random Effect.

Second, create data columns as I described for the factors (not separate dummy variable columns for each factor level) so that each factor uses one data column and enter the value directly for each row.

Now select Analyze > Fit Model. Select the response column and click Y. Select the factor columns and click Add or use Cross or a macro to add interaction effects to your model. Select the Respondent effect (not the data column), click the red triangle next to Attributes and select Random Effect. That change is all you have to do. You are simply distinguishing the type of effect of Respondent (random) from that of the other factors (fixed).

Creating a dummy variable in Excel is a fundamental data processing task, essential for statistical analysis and modeling. This guide provides step-by-step instructions on how to transform categorical data into the numerical format required by many analytical procedures.

Understanding the nuances of dummy variable creation can streamline data prep activities. However, we'll also explore why using a platform like Sourcetable can simplify this process even more than traditional Excel methods.

Excel's IF() function is the primary tool for creating dummy variables. This function assigns a value of 1 or 0 based on a specified condition, effectively converting categorical data into a binary numerical format suitable for regression models.
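As a rough equivalent outside the spreadsheet, the same logic as a formula such as =IF(A2="USA",1,0) can be sketched in Python. The column contents and the function name here are hypothetical. One difference to keep in mind: Excel's = comparison on text ignores case, whereas Python's == does not.

```python
def to_dummy(values, target):
    """Return 1 where a value equals target and 0 elsewhere."""
    return [1 if v == target else 0 for v in values]

# Hypothetical categorical column
country = ["USA", "Japan", "USA", "Germany"]
usa_dummy = to_dummy(country, "USA")
print(usa_dummy)
```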

Discover the key differences between Excel and Sourcetable. Excel, a long-standing leader in spreadsheet software, excels in data manipulation and complex calculations. Sourcetable, on the other hand, streamlines data consolidation from multiple sources, enabling efficient data querying within a familiar spreadsheet interface.

Sourcetable introduces an innovative AI copilot feature, distinguishing it from Excel. This AI assistant aids users in crafting formulas, generating templates, and providing support through a conversational chat interface, enhancing productivity and user experience.

While Excel requires manual integration of data, Sourcetable automates the process, allowing users to focus on analysis rather than data preparation. This seamless integration positions Sourcetable as a compelling choice for those needing to amalgamate data from various platforms.

Choose Sourcetable for its AI-driven assistance and data integration capabilities, or opt for Excel's robust calculation functions. Your decision should align with your specific data management and analysis needs.

Technically, dummy variables are dichotomous, quantitative variables. Their range of values is small; they can take on only two quantitative values. As a practical matter, regression results are easiest to interpret when dummy variables are limited to two specific values, 1 or 0. Typically, 1 represents the presence of a qualitative attribute, and 0 represents the absence.

The number of dummy variables required to represent a particular categorical variable depends on the number of values that the categorical variable can assume. To represent a categorical variable that can assume k different values, a researcher would need to define k - 1 dummy variables.

For example, suppose we are interested in political affiliation, a categorical variable that might assume three values - Republican, Democrat, or Independent. We could represent political affiliation with two dummy variables:

X1 = 1 if Republican; X1 = 0 otherwise.
X2 = 1 if Democrat; X2 = 0 otherwise.

In this example, notice that we don't have to create a dummy variable to represent the "Independent" category of political affiliation. If X1 equals zero and X2 equals zero, we know the voter is neither Republican nor Democrat. Therefore, the voter must be Independent.
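The k - 1 encoding can be sketched as follows, assuming X1 flags Republican voters and X2 flags Democrat voters. The voter list and the function name are made up for illustration.

```python
def encode(values, levels, reference):
    """Build one 0/1 column per level except the reference level (k - 1 columns)."""
    kept = [lvl for lvl in levels if lvl != reference]
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in kept}

# Hypothetical sample of voters
voters = ["Republican", "Democrat", "Independent", "Democrat"]
levels = ["Republican", "Democrat", "Independent"]

dummies = encode(voters, levels, reference="Independent")
x1, x2 = dummies["Republican"], dummies["Democrat"]

# A voter with X1 = 0 and X2 = 0 must belong to the reference group
independents = [a == 0 and b == 0 for a, b in zip(x1, x2)]
print(x1, x2, independents)
```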

When defining dummy variables, a common mistake is to define too many variables. If a categorical variable can take on k values, it is tempting to define k dummy variables. Resist this urge. Remember, you only need k - 1 dummy variables.

A kth dummy variable is redundant; it carries no new information. And it creates a severe multicollinearity problem for the analysis. Using k dummy variables when only k - 1 dummy variables are required is known as the dummy variable trap. Avoid this trap!
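One way to see the trap numerically: when the model has an intercept, the k dummy columns add up to the intercept column, so the matrix X'X becomes singular and the least-squares equations have no unique solution. The sketch below checks this with a hand-rolled determinant on made-up data; the helper names and the 1e-12 tolerance are arbitrary choices for the illustration.

```python
def gram(rows):
    """Return X'X for a design matrix given as a list of rows."""
    p = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]

def determinant(matrix, tol=1e-12):
    """Determinant by Gaussian elimination; returns 0.0 when a pivot is below tol."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det

def design(values, dummy_levels):
    """Intercept column followed by one 0/1 column per listed level."""
    return [[1.0] + [1.0 if v == lvl else 0.0 for lvl in dummy_levels] for v in values]

# Hypothetical sample of voters
voters = ["Republican", "Democrat", "Independent", "Democrat", "Republican", "Independent"]
levels = ["Republican", "Democrat", "Independent"]

trap = design(voters, levels)       # intercept + k dummies
safe = design(voters, levels[:-1])  # intercept + k - 1 dummies

# In the trap design, the dummy columns sum to the intercept column in every row
dummies_sum_to_intercept = all(sum(row[1:]) == row[0] for row in trap)

det_trap = determinant(gram(trap))
det_safe = determinant(gram(safe))
print(dummies_sum_to_intercept, det_trap, det_safe)
```

A zero determinant for the k-dummy design and a nonzero one for the k - 1 design is what perfect multicollinearity looks like in practice.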

The value of the categorical variable that is not represented explicitly by a dummy variable is called the reference group. In this example, the reference group consists of Independent voters.

In analysis, each dummy variable is compared with the reference group. In this example, supposing income is the dependent variable, a positive regression coefficient means that income is higher for the political affiliation represented by that dummy variable than for the reference group; a negative regression coefficient means that income is lower. If the regression coefficient is statistically significant, the income difference from the reference group is also statistically significant.

In this section, we work through a simple example to illustrate the use of dummy variables in regression analysis. The example begins with two independent variables - one quantitative and one categorical. Notice that once the categorical variable is expressed in dummy form, the analysis proceeds in routine fashion. The dummy variable is treated just like any other quantitative variable.