Dummy Variable Regression Model

Gaynelle Alnutt

Aug 5, 2024, 12:04:01 AM

to teibertprivan

In regression analysis, a dummy variable (also known as an indicator variable or just a dummy) is one that takes a binary value (0 or 1) to indicate the absence or presence of some categorical effect that may be expected to shift the outcome.[1] For example, if we were studying the relationship between biological sex and income, we could use a dummy variable to represent the sex of each individual in the study. The variable could take on a value of 1 for males and 0 for females (or vice versa). In machine learning this is known as one-hot encoding.

Dummy variables are commonly used in regression analysis to represent categorical variables that have more than two levels, such as education level or occupation. In this case, multiple dummy variables would be created to represent each level of the variable, and only one dummy variable would take on a value of 1 for each observation. Dummy variables are useful because they allow us to include categorical variables in our analysis, which would otherwise be difficult to include due to their non-numeric nature. They can also help us to control for confounding factors and improve the validity of our results.
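
As a minimal sketch of the encoding described above, the following Python snippet builds one 0/1 column per level of a categorical variable. The function name make_dummies and the education data are hypothetical and only illustrate the idea.

```python
def make_dummies(values, levels=None):
    """Return one 0/1 column per level of a categorical variable."""
    if levels is None:
        levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Hypothetical education levels for five observations (illustrative only).
education = ["primary", "secondary", "tertiary", "secondary", "primary"]
dummies = make_dummies(education)

for level, column in dummies.items():
    print(level, column)

# For every observation, exactly one dummy takes the value 1.
row_sums = [sum(col[i] for col in dummies.values()) for i in range(len(education))]
print(row_sums)
```

Before such columns go into a regression that also has a constant term, one of them is normally dropped to serve as the base category.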

As with any addition of variables to a model, the addition of dummy variables will increase the within-sample model fit (coefficient of determination), but at a cost of fewer degrees of freedom and loss of generality of the model (out of sample model fit). Too many dummy variables result in a model that does not provide any general conclusions.

Dummy variables are useful in various cases. For example, in econometric time series analysis, dummy variables may be used to indicate the occurrence of wars or major strikes. A dummy variable can thus be thought of as a Boolean, i.e., a truth value represented as the numerical value 0 or 1 (as is sometimes done in computer programming).

Dummy variables may be extended to more complex cases. For example, seasonal effects may be captured by creating dummy variables for each of the seasons: D1=1 if the observation is for summer, and equals zero otherwise; D2=1 if and only if autumn, otherwise equals zero; D3=1 if and only if winter, otherwise equals zero; and D4=1 if and only if spring, otherwise equals zero. In the panel data fixed effects estimator, dummies are created for each of the units in cross-sectional data (e.g. firms or countries) or for each of the periods in a pooled time series. However, in such regressions either the constant term has to be removed, or one of the dummies has to be removed, making it the base category against which the others are assessed, for the following reason:

If dummy variables for all categories were included, their sum would equal 1 for all observations, which is identical to and hence perfectly correlated with the vector-of-ones variable whose coefficient is the constant term; if the vector-of-ones variable were also present, this would result in perfect multicollinearity,[2] so that the matrix inversion in the estimation algorithm would be impossible. This is referred to as the dummy variable trap.
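
The trap can be checked numerically. The sketch below uses a small hypothetical sample of seasons and an illustrative helper, matrix_rank; it compares the rank of a design matrix holding a constant plus all four seasonal dummies with one where spring is dropped as the base category.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical observations (illustrative only).
seasons = ["summer", "autumn", "winter", "spring", "summer", "winter"]
levels = ["summer", "autumn", "winter", "spring"]

# Constant plus all four dummies D1..D4: five columns.
full = [[1] + [1 if s == lv else 0 for lv in levels] for s in seasons]
# Constant plus D1..D3, with spring as the base category: four columns.
reduced = [[1] + [1 if s == lv else 0 for lv in levels[:3]] for s in seasons]

rank_full = matrix_rank(full)
rank_reduced = matrix_rank(reduced)
print(len(full[0]), rank_full)        # columns vs. rank, full set of dummies
print(len(reduced[0]), rank_reduced)  # columns vs. rank, one dummy dropped
```

The full matrix has more columns than its rank, so its columns are linearly dependent; dropping one dummy restores full column rank.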

Regression involving dummy variables (taking the value 0 or 1) can be used to determine the effect of a dummy variable on a quantitative variable. This study aims to determine a regression model that describes the relationship between students' GPA (IPK) and their national examination score (UN, a quantitative variable) together with four dummy variables: the status of the senior high or vocational school (SMA), the accreditation level of the school (AK), the regency in which the school is located (KAB), and the university admission pathway (PT). The population consists of all mathematics education students who enrolled in the 2019/2020 academic year at one of the universities in Palangka Raya, Central Kalimantan. The study was conducted on the whole population, which has 59 students. The researchers collected data using an online survey (Google Forms). The data were analysed descriptively (tables and graphs) and inferentially (dummy regression analysis). The results show that UN, SMA, and AK have a significant effect on IPK, whereas the other two dummy variables are not significant. The best regression model is .

Journal Matematika dan Statistika serta Aplikasinya (MSA) by Mathematics Department Universitas Islam Negeri Alauddin Makassar is licensed under a Creative Commons Attribution 4.0 International License.

A dummy variable is a variable used to quantify a qualitative variable (for example: sex, race, religion, changes in government policy, differences in circumstances, and so on). A dummy variable is a categorical variable that is presumed to affect a continuous variable. Dummy variables are also often called binary, categorical, or dichotomous variables (in Indonesian, also "variabel boneka"). A dummy variable has only two values, 1 and 0, and is denoted by the symbol D. The dummy takes the value 1 (D=1) for one category and zero (D=0) for the other.

Dummy variables are used to examine how the classifications within a sample affect the estimated parameters. Dummy variables also serve to quantify qualitative variables.
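
A small worked example may help show what the coefficient on D means. In a regression of a continuous outcome on a constant and a single dummy, the intercept equals the mean of the D=0 group and the coefficient on D equals the difference between the two group means. The sketch below uses made-up numbers and a hypothetical helper, simple_ols, to verify this.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for one regressor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: D is the dummy, y the continuous outcome.
d = [0, 0, 0, 1, 1, 1]
y = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

intercept, slope = simple_ols(d, y)

mean_d0 = sum(yi for di, yi in zip(d, y) if di == 0) / d.count(0)
mean_d1 = sum(yi for di, yi in zip(d, y) if di == 1) / d.count(1)

print(intercept, mean_d0)        # intercept vs. mean of the D=0 group
print(slope, mean_d1 - mean_d0)  # coefficient vs. difference in group means
```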

Multinomial logistic regression (often just called "multinomial regression") is used to predict a nominal dependent variable given one or more independent variables. It is sometimes considered an extension of binomial logistic regression to allow for a dependent variable with more than two categories. As with other types of regression, multinomial logistic regression can have nominal and/or continuous independent variables and can have interactions between independent variables to predict the dependent variable.
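
To make the model concrete, here is a minimal sketch of how predicted category probabilities are formed under the usual baseline-category formulation, in which the reference category has all coefficients fixed at zero. The category labels, coefficient values, and the function name multinomial_probabilities are assumptions made for illustration, not output from any fitted model.

```python
import math


def multinomial_probabilities(x, coefficients, reference):
    """Category probabilities under a baseline-category logit model.

    coefficients maps each non-reference category to (intercept, [slopes]).
    The reference category has a linear score fixed at zero.
    """
    scores = {reference: 0.0}
    for category, (intercept, slopes) in coefficients.items():
        scores[category] = intercept + sum(b * xi for b, xi in zip(slopes, x))
    denom = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / denom for c, s in scores.items()}


# Hypothetical coefficients for a three-category outcome with reference "A".
coefs = {
    "B": (0.5, [0.2, -0.1]),
    "C": (-0.3, [0.4, 0.05]),
}
probs = multinomial_probabilities([1.0, 2.0], coefs, reference="A")
print(probs)
print(sum(probs.values()))
```

With k outcome categories there are k-1 sets of coefficients, one for each comparison against the reference category.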

This "quick start" guide shows you how to carry out a multinomial logistic regression using SPSS Statistics and explains some of the tables that are generated by SPSS Statistics. However, before we introduce you to this procedure, you need to understand the different assumptions that your data must meet in order for a multinomial logistic regression to give you a valid result. We discuss these assumptions next.

When you choose to analyse your data using multinomial logistic regression, part of the process involves checking to make sure that the data you want to analyse can actually be analysed using multinomial logistic regression. You need to do this because it is only appropriate to use multinomial logistic regression if your data "passes" six assumptions that are required for multinomial logistic regression to give you a valid result. In practice, checking for these six assumptions just adds a little bit more time to your analysis, requiring you to click a few more buttons in SPSS Statistics when performing your analysis, as well as think a little bit more about your data, but it is not a difficult task.

In the section, Procedure, we illustrate the SPSS Statistics procedure to perform a multinomial logistic regression assuming that no assumptions have been violated. First, we introduce the example that is used in this guide.

A researcher wanted to understand whether the political party that a person votes for can be predicted from a belief in whether tax is too high and from the person's income (i.e., salary). Therefore, the political party the participants last voted for was recorded in the politics variable, which had three options: "Conservatives", "Labour" and "Liberal Democrats". When presented with the statement, "tax is too high in this country", participants had four options of how to respond: "Strongly Disagree", "Disagree", "Agree" or "Strongly Agree". Their responses were stored in the variable tax_too_high. The researcher also asked participants for their annual income, which was recorded in the income variable. As such, in variable terms, a multinomial logistic regression was run to predict politics from tax_too_high and income.

The six steps below show you how to analyse your data using a multinomial logistic regression in SPSS Statistics when none of the six assumptions in the previous section, Assumptions, have been violated. At the end of these six steps, we show you how to interpret the results from your multinomial logistic regression.

SPSS Statistics will generate quite a few tables of output for a multinomial logistic regression analysis. In this section, we show you some of the tables required to understand your results from the multinomial logistic regression procedure, assuming that no assumptions have been violated.

The "Final" row presents information on whether all the coefficients of the model are zero (i.e., whether any of the coefficients are statistically significant). Another way to consider this result is whether the variables you added statistically significantly improve the model compared to the intercept alone (i.e., with no variables added). You can see from the "Sig." column that p = .027, which means that the full model statistically significantly predicts the dependent variable better than the intercept-only model.
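
The comparison of the full model with the intercept-only model is a likelihood ratio test. As a rough sketch, the statistic is minus twice the difference in log-likelihoods and is referred to a chi-square distribution whose degrees of freedom equal the number of added parameters. The log-likelihood values and degrees of freedom below are made up for illustration and do not reproduce the p = .027 reported above. The helper chi2_sf_even_df uses a closed form that only holds for even degrees of freedom.

```python
import math


def likelihood_ratio_statistic(ll_null, ll_full):
    """Minus twice the log-likelihood difference (intercept-only vs. full model)."""
    return -2.0 * (ll_null - ll_full)


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Hypothetical log-likelihoods and degrees of freedom (illustrative only).
ll_null, ll_full, df = -100.0, -95.0, 4
statistic = likelihood_ratio_statistic(ll_null, ll_full)
p_value = chi2_sf_even_df(statistic, df)
print(statistic, p_value)
```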

In multinomial logistic regression you can also consider measures that are similar to R² in ordinary least-squares linear regression, which is the proportion of variance that can be explained by the model. In multinomial logistic regression, however, these are pseudo R² measures and there is more than one, although none are easily interpretable. Nonetheless, they are calculated and reported in the Pseudo R-Square table.

SPSS Statistics calculates the Cox and Snell, Nagelkerke and McFadden pseudo R² measures. Of much greater importance are the results presented in the Likelihood Ratio Tests table.
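
For reference, the three pseudo R-square measures named above can all be computed from the log-likelihoods of the intercept-only and full models. The sketch below uses their usual textbook definitions; the function name pseudo_r_squared, the input values, and the assumption that SPSS Statistics follows these same definitions are not taken from the guide itself.

```python
import math


def pseudo_r_squared(ll_null, ll_full, n):
    """Cox and Snell, Nagelkerke and McFadden measures from log-likelihoods.

    ll_null: log-likelihood of the intercept-only model
    ll_full: log-likelihood of the fitted model
    n:       number of observations
    """
    mcfadden = 1.0 - ll_full / ll_null
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_full) / n)
    # Nagelkerke rescales Cox and Snell by its maximum attainable value.
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    nagelkerke = cox_snell / max_cox_snell
    return {"cox_snell": cox_snell, "nagelkerke": nagelkerke, "mcfadden": mcfadden}


# Hypothetical inputs (illustrative only).
measures = pseudo_r_squared(ll_null=-100.0, ll_full=-95.0, n=100)
for name, value in measures.items():
    print(name, round(value, 4))
```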

This table presents the parameter estimates (also known as the coefficients of the model). As you can see, each dummy variable created for the tax_too_high variable has its own coefficient. However, there is no overall statistical significance value. This was presented in the previous table (i.e., the Likelihood Ratio Tests table). As there were three categories of the dependent variable, you can see that there are two sets of logistic regression coefficients (sometimes called two logits). The first set of coefficients is found in the "Lib" row (representing the comparison of the Liberal Democrats category to the reference category, Labour). The second set of coefficients is found in the "Con" row (this time representing the comparison of the Conservatives category to the reference category, Labour). You can see that "income" is not statistically significant in either set of coefficients (p = .532 and p = .508, respectively; the "Sig." column).