Statistics 2 For Dummies


Baudilio Eliason

Aug 4, 2024, 8:08:15 PM
to wildhobbdela
In regression analysis, a dummy variable (also known as an indicator variable or just a dummy) is one that takes a binary value (0 or 1) to indicate the absence or presence of some categorical effect that may be expected to shift the outcome.[1] For example, if we were studying the relationship between biological sex and income, we could use a dummy variable to represent the sex of each individual in the study. The variable could take on a value of 1 for males and 0 for females (or vice versa). In machine learning this is known as one-hot encoding.

Dummy variables are commonly used in regression analysis to represent categorical variables that have more than two levels, such as education level or occupation. In this case, multiple dummy variables would be created to represent each level of the variable, and only one dummy variable would take on a value of 1 for each observation. Dummy variables are useful because they allow us to include categorical variables in our analysis, which would otherwise be difficult to include due to their non-numeric nature. They can also help us to control for confounding factors and improve the validity of our results.
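To make this concrete, here is a minimal sketch in Python using pandas; the column names and values are invented for illustration:

import pandas as pd

# Hypothetical data: income plus a three-level categorical attribute.
df = pd.DataFrame({
    "income": [42000, 55000, 61000, 38000],
    "education": ["high_school", "bachelor", "master", "high_school"],
})

# One dummy column per level; exactly one dummy equals 1 per observation.
dummies = pd.get_dummies(df["education"], prefix="edu", dtype=int)
print(pd.concat([df["income"], dummies], axis=1))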


As with any addition of variables to a model, the addition of dummy variables will increase the within-sample model fit (coefficient of determination), but at a cost of fewer degrees of freedom and loss of generality of the model (out of sample model fit). Too many dummy variables result in a model that does not provide any general conclusions.


Dummy variables are useful in various cases. For example, in econometric time series analysis, dummy variables may be used to indicate the occurrence of wars or major strikes. A dummy variable can thus be thought of as a Boolean, i.e., a truth value represented as the numerical value 0 or 1 (as is sometimes done in computer programming).


Dummy variables may be extended to more complex cases. For example, seasonal effects may be captured by creating a dummy variable for each season: D1 = 1 if the observation is for summer and 0 otherwise; D2 = 1 if and only if autumn, otherwise 0; D3 = 1 if and only if winter, otherwise 0; and D4 = 1 if and only if spring, otherwise 0. In the panel-data fixed-effects estimator, dummies are created for each of the units in cross-sectional data (e.g. firms or countries) or for each period in a pooled time series. However, in such regressions either the constant term has to be removed or one of the dummies has to be removed, making its category the base category against which the others are assessed, for the following reason:


If dummy variables for all categories were included, their sum would equal 1 for all observations, which is identical to and hence perfectly correlated with the vector-of-ones variable whose coefficient is the constant term; if the vector-of-ones variable were also present, this would result in perfect multicollinearity,[2] so that the matrix inversion in the estimation algorithm would be impossible. This is referred to as the dummy variable trap.
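As a small numerical sketch of both ideas (the seasonal dummies and the trap), with invented season codes and NumPy standing in for the estimation algebra:

import numpy as np

# Invented data: season code per observation (0=summer, 1=autumn, 2=winter, 3=spring).
seasons = np.array([0, 1, 2, 3, 0, 1, 2, 3])
D = np.eye(4)[seasons]          # D1..D4: one dummy per season, exactly one 1 per row

ones = np.ones(len(seasons))    # the vector-of-ones (constant-term) variable
X_trap = np.column_stack([ones, D])         # intercept plus all k dummies
X_ok   = np.column_stack([ones, D[:, 1:]])  # drop D1: summer becomes the base category

# The dummies sum to 1 in every row, duplicating the intercept column,
# so X_trap is rank-deficient and X'X cannot be inverted.
print(np.linalg.matrix_rank(X_trap), "of", X_trap.shape[1], "columns")  # 4 of 5
print(np.linalg.matrix_rank(X_ok),   "of", X_ok.shape[1],   "columns")  # 4 of 4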


Being able to make the connections between statistical techniques and formulas is perhaps even more important. It builds confidence when tackling statistical problems and solidifies your strategies for completing statistical projects.


When designing a study, the sample size is an important consideration because the larger the sample size, the more data you have, and the more precise your results will be (assuming high-quality data). If you know the level of precision you want (that is, your desired margin of error), you can calculate the sample size needed to achieve it.
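As a sketch of that calculation for estimating a mean, using the normal-approximation formula n = (z * sigma / E)^2 rounded up, with invented numbers and an assumed known population standard deviation:

import math
from statistics import NormalDist

def sample_size_for_mean(margin_of_error, sigma, confidence=0.95):
    # z-value for the desired confidence level, e.g. ~1.96 for 95%
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin_of_error) ** 2)

# E.g. desired margin of error of 2 units, assumed population sd of 15:
print(sample_size_for_mean(margin_of_error=2.0, sigma=15.0))  # 217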


In statistics, a confidence interval is an educated guess about some characteristic of the population. A confidence interval contains an initial estimate plus or minus a margin of error (the amount by which you expect your results to vary, if a different sample were taken).
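For instance, a rough sketch of a 95% confidence interval for a mean (normal approximation, made-up sample data):

from statistics import NormalDist, mean, stdev
import math

data = [4.1, 5.0, 4.7, 5.3, 4.9, 5.1, 4.6, 5.2]   # made-up sample
xbar, s, n = mean(data), stdev(data), len(data)
z = NormalDist().inv_cdf(0.975)       # ~1.96 for 95% confidence
moe = z * s / math.sqrt(n)            # the margin of error
print(f"{xbar:.2f} +/- {moe:.2f}")    # initial estimate plus or minus the margin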


To test a statistical hypothesis, you take a sample, collect data, form a statistic, standardize it to form a test statistic (so it can be interpreted on a standard scale), and decide whether the test statistic refutes the claim.
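As a rough sketch of that recipe (a one-sample z-test with made-up data and, for simplicity, an assumed known standard deviation):

from statistics import NormalDist, mean
import math

data = [4.1, 5.0, 4.7, 5.3, 4.9, 5.1, 4.6, 5.2]   # made-up sample
mu0, sigma = 5.0, 0.4          # H0: mu = 5.0; sigma assumed known for simplicity

# Standardize the sample mean into a test statistic on the standard normal scale.
z = (mean(data) - mu0) / (sigma / math.sqrt(len(data)))
p = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value
print(f"z = {z:.2f}, p = {p:.3f}")       # a small p-value refutes the claim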


Deborah J. Rumsey, PhD, is an Auxiliary Professor and Statistics Education Specialist at The Ohio State University. She is the author of Statistics For Dummies, Statistics II For Dummies, Statistics Workbook For Dummies, and Probability For Dummies.


Stymied by statistics? No fear: this friendly guide offers clear, practical explanations of statistical ideas, techniques, formulas, and calculations, with lots of examples that show you how these concepts apply to your everyday life.


Statistics For Dummies shows you how to interpret and critique graphs and charts, determine the odds with probability, guesstimate with confidence using confidence intervals, set up and carry out a hypothesis test, compute statistical formulas, and more.


Summing up my understanding of the topic: 'dummy coding' is usually understood as coding a nominal attribute with K possible values as K-1 binary dummies. Using all K values would cause redundancy and would have a negative impact on, e.g., logistic regression, as far as I have learned it. That far, everything's clear to me.


2) An issue arises as soon as I consider attribute selection. The left-out attribute value is implicitly represented by the case where all dummies are zero, as long as all of the dummies are actually used in the model; but if one dummy is missing (because it was not selected during attribute selection), that value is no longer clearly represented. The issue is much easier to understand with the sketch I uploaded. How can that issue be treated?


WEKA output: the Logistic algorithm was run on the UCI dataset German Credit, where the possible values of the first attribute are A11, A12, A13, A14. All of them are included in the logistic regression model. (Attached: -089out9.png)


The output is generally easier to read, interpret, and use when you use k dummies instead of k-1 dummies. I figure that is why everybody seems to actually use k dummies. But yes, as the k values sum up to 1, there exists a correlation that may cause problems. But correlations in data sets are common; you will never completely get rid of them!
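For illustration, here is how the two encodings look in Python with pandas; the attribute values mirror the German Credit example above, and drop_first=True gives the k-1 coding with the dropped level as base category:

import pandas as pd

s = pd.Series(["A11", "A12", "A13", "A14", "A11"], name="checking_status")
print(pd.get_dummies(s, dtype=int))                   # k dummies, as in the WEKA output
print(pd.get_dummies(s, dtype=int, drop_first=True))  # k-1 dummies; A11 becomes the base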


You really should be using weighting, or consider more advanced algorithms that can handle such data. In fact, the dummy variables can cause just as much trouble, because they are binary, and oh so many algorithms (e.g. k-means) don't make much sense on binary variables.


As for the decision tree: don't perform feature selection on your output attribute... Plus, as a decision tree already selects features, it does not make sense to do all this anyway; leave it to the decision tree to decide which attribute to use for splitting. This way, it can learn dependencies, too.


A Z-score can be calculated to assess the significance of the D-statistic. I will not explain the mathematical underpinnings of the Z-score. All you need to know is that a Z-score bigger than 3 or smaller than -3 can be interpreted as a significant result. Interested readers can check Durand et al. (2011) for more information.
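For readers who want to experiment, here is a minimal Python sketch of the D-statistic and a jackknife-based Z-score; the per-block ABBA/BABA counts below are invented for illustration, and Durand et al. (2011) describe the full procedure:

import numpy as np

abba = np.array([120, 98, 134, 110, 125])   # invented ABBA counts per genomic block
baba = np.array([ 80, 75,  90,  85,  88])   # invented BABA counts per genomic block

def d_stat(a, b):
    return (a.sum() - b.sum()) / (a.sum() + b.sum())

D = d_stat(abba, baba)

# Delete-one-block jackknife to estimate the standard error of D.
jack = np.array([d_stat(np.delete(abba, i), np.delete(baba, i))
                 for i in range(len(abba))])
se = np.sqrt((len(jack) - 1) / len(jack) * ((jack - jack.mean()) ** 2).sum())

print(f"D = {D:.3f}, Z = {D / se:.1f}")     # |Z| > 3 is read as significant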


The figure below illustrates the D-statistic with an example from my own work (see Ottenburghs et al. (2017) for more details). Comparing the genomes of four goose species reveals that Cackling Goose (Branta hutchinsii) and Canada Goose (B. canadensis) share more derived alleles than expected by chance. The resulting positive D-statistic suggests introgression between these species, which is not that surprising because there is a hybrid zone between these geese.


This new statistic seems promising for studies that could not sample an appropriate outgroup. However, one should not take this method at face value. A significant D3-statistic does not automatically mean that there has been introgression. Other evolutionary processes can influence this statistic (similar to the classic D-statistic). For example, population structure in the ancestor can produce deviations in the number of discordant topologies. Or introgression might come from unsampled or extinct species. Therefore, it is important to complement these statistics with other analyses to quantify introgression.


Up until then, for all of human history, calculating the square root or logarithm of a number was a pain. It required looking up values in bulky tables (available only in libraries) or a long sequence of steps and a lot of calculation by hand, eating up time and paper on a regular basis. With the advent of the Texas Instruments SR-10 pocket LED calculator, however, all that work and sweat equity could be done instantaneously at the push of a button.


The specific marriage that Kabala is talking about is a ChatGPT plugin provided by Wolfram Research, a longtime leader in computational technology. Wolfram launched the first version of its flagship program Mathematica (now Wolfram Language) in 1988, which has ever since been a mainstay in the arsenal of STEM students thanks to its built-in libraries for many areas of technical computing, including machine learning, statistics, data analysis, visualizations, plotting functions, and much more.


By asking the Wolfram GPT in plain English to develop code for a problem, students are immediately exposed to the syntax (grammar) of the programming language without the necessity of looking up its rules and documentation, a major source of frustration in the past. Being exposed to mostly good code generated by Wolfram GPT allows students to assimilate the computer language effortlessly, in a process analogous to learning a foreign language through immersion.
