Data Analysis Software Stata

Sherley

Aug 3, 2024, 6:07:27 PM

to drapensekel

Stata is a commercial statistical software package widely used by quantitative social scientists (e.g. economists, sociologists, political scientists). It has an extensive collection of commands that can be used to easily accomplish practically any manipulation and analysis of data that one would need. It also allows relatively easy access to programming features. You do not need to know a programming language to start using Stata, although an understanding of basic programming concepts is helpful.

Why should you work with do-files? A do-file contains every command that you used for your project, from the very first step (loading data) to the very last (exporting your results). It documents every step you took in the process of manipulating and analyzing data. If you need to modify or repeat certain steps, you simply edit the do-file and run it again instead of redoing everything by hand.
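As a rough illustration of that workflow, here is a minimal do-file sketch. The file name example.do, the log name example.log, and the variable price_k are made up for this example; auto is the example dataset that ships with Stata.

```stata
* example.do (hypothetical file name): load, manipulate, analyze, record
clear all
capture log close
* Keep a plain-text record of every command and its output
log using example.log, replace text

* Step 1: load the data (example dataset installed with Stata)
sysuse auto, clear

* Step 2: manipulate the data
generate double price_k = price/1000
label variable price_k "Price (thousands of USD)"

* Step 3: analyze
regress price_k mpg weight

log close
```

Running `do example.do` repeats all three steps, so a change to step 2 only requires editing one line and rerunning the file.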

Data Analysis Using Stata, Third Edition has been completely revamped to reflect the capabilities of Stata 12. This book will appeal to those just learning statistics and Stata, as well as to the many users who are switching to Stata from other packages. Throughout the book, Kohler and Kreuter show examples using data from the German Socio-Economic Panel, a large survey of households containing demographic, income, employment, and other key information.

Data Analysis Using Stata, Third Edition has been structured so that it can be used as a self-study course or as a textbook in an introductory data analysis or statistics course. It will appeal to students and academic researchers in all the social sciences.

The pages below contain examples (often hypothetical) illustrating the application of different statistical analysis techniques using different statistical packages. Each page provides a handful of examples of when the analysis might be used, along with sample data, an example analysis and an explanation of the output, followed by references for more information. These pages merely introduce the essence of the technique and do not provide a comprehensive description of how to use it.

The combination of topics and packages reflects questions that are often asked in our statistical consulting. As such, it heavily reflects the demand from our clients at walk-in consulting, not the demand of readers from around the world. Many worthy topics will not be covered because they are not reflected in questions by our clients. Also, not all analysis techniques will be covered in all packages, again largely determined by client demand. If an analysis is not shown in a particular package, this does not imply that the package cannot do the analysis; it may simply mean that the analysis is not commonly done in that package by our clients.

The purpose of this workshop is to explore some issues in the analysis of survey data using Stata 17. Before we begin, you will want to be sure that your copy of Stata is up-to-date. To do this, please type

Regular procedures in statistical software (that is, procedures not designed for survey data) analyze data as if the data were collected using simple random sampling. For experimental and quasi-experimental designs, this is exactly what we want. However, very few surveys use a simple random sample to collect data. Not only is it nearly impossible to do so, but it is not as efficient (either financially or statistically) as other sampling methods. When any sampling method other than simple random sampling is used, we usually need to use survey data analysis software to take into account the differences between the design that was used to collect the data and simple random sampling. This is because the sampling design affects both the calculation of the point estimates and the standard errors of those estimates.

If you ignore the sampling design, e.g., if you assume simple random sampling when another type of sampling design was used, both the point estimates and their standard errors will likely be calculated incorrectly. The sampling weight will affect the calculation of the point estimate, and the stratification and/or clustering will affect the calculation of the standard errors. Ignoring the clustering will likely lead to standard errors that are underestimated, possibly leading to results that seem to be statistically significant when, in fact, they are not. The difference in point estimates and standard errors obtained using non-survey software and survey software with the design properly specified will vary from data set to data set, and even between analyses using the same data set. While it may be possible to get reasonably accurate results using non-survey software, there is no practical way to know beforehand how far off the results from non-survey software will be.
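To make the effect on the point estimate concrete, here is a small sketch meant to be run as a do-file. The eight observations, the variable names (stratum, psu, wt, y) and the weights are invented for illustration; only mean, svyset and svy: mean are actual Stata commands.

```stata
* Toy data (made up): 2 strata, 2 PSUs per stratum, unequal weights
clear
input stratum psu wt y
1 1 10 4
1 1 10 6
1 2 10 5
1 2 10 7
2 3 30 10
2 3 30 12
2 4 30 11
2 4 30 15
end

* Analysis that ignores the design (treats the data as a simple random sample)
mean y
scalar m_srs = _b[y]

* Declare the design, then use the svy prefix
svyset psu [pweight = wt], strata(stratum)
svy: mean y
scalar m_svy = _b[y]

display "unweighted mean = " scalar(m_srs) "   design-based mean = " scalar(m_svy)
```

Because the second stratum carries three times the weight of the first, the design-based mean is pulled toward the larger values in that stratum, while the unweighted mean treats all eight rows equally.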

Most people do not conduct their own surveys. Rather, they use survey data that some agency or company collected and made available to the public. The documentation must be read carefully to find out what kind of sampling design was used to collect the data. This is very important because many of the estimates and standard errors are calculated differently for the different sampling designs. Hence, if you mis-specify the sampling design, the point estimates and standard errors will likely be wrong.

Sampling weights: There are several types of weights that can be associated with a survey. Perhaps the most common is the sampling weight. A sampling weight is a probability weight that has had one or more adjustments made to it. Both a sampling weight and a probability weight are used to weight the sample back to the population from which the sample was drawn. By definition, a probability weight is the inverse of the probability of being included in the sample due to the sampling design (except for a certainty PSU, see below). The probability weight, called a pweight in Stata, is calculated as N/n, where N = the number of elements in the population and n = the number of elements in the sample. For example, if a population has 10 elements and 3 are sampled at random with replacement, then the probability weight would be 10/3 = 3.33. In a two-stage design, the probability weight is calculated as (1/f1)(1/f2), where f1 and f2 are the sampling fractions for the first and second stages; that is, the inverse of the sampling fraction for the first stage is multiplied by the inverse of the sampling fraction for the second stage. Under many sampling plans, the sum of the probability weights will equal the number of elements in the population.
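The arithmetic can be checked directly in Stata. This is only a sketch: the scalar names are arbitrary, and the two-stage sampling fractions (5 of 50 PSUs, then 20 of 200 elements per sampled PSU) are assumed numbers, not taken from any real survey.

```stata
* One-stage example from the text: population of 10, sample of 3
scalar pop_N  = 10
scalar samp_n = 3
scalar pw = scalar(pop_N)/scalar(samp_n)
display "pweight = " %5.2f scalar(pw)
* Each of the 3 sampled elements carries this weight, so the weights sum to 10
display "sum of weights = " scalar(samp_n)*scalar(pw)

* Two-stage example (hypothetical sampling fractions)
scalar f1 = 5/50
scalar f2 = 20/200
* Inverse of stage-1 fraction times inverse of stage-2 fraction
scalar pw2 = (1/scalar(f1))*(1/scalar(f2))
display "two-stage pweight = " scalar(pw2)
```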

Strata: Stratification is a method of breaking up the population into different groups, often by demographic variables such as gender, race or SES. Each element in the population must belong to one, and only one, stratum. Once the strata have been defined, samples are taken from each stratum as if it were independent of all of the other strata. For example, if a sample is to be stratified on gender, men and women would be sampled independently of one another. This means that the probability weights for men will likely be different from the probability weights for women. In most cases, you need to have two or more PSUs in each stratum. The purpose of stratification is to reduce the standard error of the estimates, and stratification works most effectively when the variance of the dependent variable is smaller within the strata than in the sample as a whole.
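The requirement of two or more PSUs per stratum can be checked with svydescribe once the design has been declared. The sketch below uses invented data in which the third stratum has a single PSU; the variable names are placeholders, and singleunit(centered) is only one of the options svyset offers for this situation, so treat the choice as an assumption rather than a recommendation.

```stata
* Toy data (made up): stratum 3 contains only one PSU
clear
input stratum psu wt y
1 1 10 4
1 2 10 6
2 3 30 10
2 4 30 12
3 5 20 8
end

svyset psu [pweight = wt], strata(stratum)
* Lists the number of PSUs and observations in each stratum
svydescribe

* With a single-PSU stratum, Stata cannot compute the standard error by default
svy: mean y

* One possible workaround: center the lone PSU at the overall mean
svyset psu [pweight = wt], strata(stratum) singleunit(centered)
svy: mean y
```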

PSU: This is the primary sampling unit. This is the first unit that is sampled in the design. For example, school districts from California may be sampled and then schools within districts may be sampled. The school district would be the PSU. If states from the US were sampled, and then school districts from within each state, and then schools from within each district, then states would be the PSU. One does not need to use the same sampling method at all levels of sampling. For example, probability-proportional-to-size sampling may be used at level 1 (to select states), while cluster sampling is used at level 2 (to select school districts). In the case of a simple random sample, the PSUs and the elementary units are the same. In general, accounting for the clustering in the data (i.e., using the PSUs) will increase the standard errors of the point estimates. Conversely, ignoring the PSUs will tend to yield standard errors that are too small, leading to false positives when doing significance tests.
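A small sketch of the effect of clustering on standard errors follows. The data are invented so that values within a PSU are very similar, and all weights are set to 1 for simplicity. Declaring _n as the sampling unit tells Stata to treat every observation as its own PSU, which is how the "ignore the clustering" case is represented here.

```stata
* Toy data (made up): 4 PSUs, observations within a PSU are alike
clear
input psu y
1 1
1 2
2 9
2 10
3 2
3 3
4 10
4 11
end
generate wt = 1

* Ignore the clustering: every observation is treated as its own PSU
svyset _n [pweight = wt]
svy: mean y
scalar se_ignore = _se[y]

* Account for the clustering
svyset psu [pweight = wt]
svy: mean y
scalar se_cluster = _se[y]

display "SE ignoring PSUs = " %6.3f scalar(se_ignore) "   SE using PSUs = " %6.3f scalar(se_cluster)
```

The point estimate is the same in both runs; only the standard error changes, and in this example it is larger once the PSUs are declared.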

Replicate weights: Replicate weights are a series of weight variables that are used to correct the standard errors for the sampling plan. They serve the same function as the PSU and strata variables (which are used with Taylor series linearization) to correct the standard errors of the estimates for the sampling design. Many public use data sets are now being released with replicate weights instead of PSUs and strata in an effort to more securely protect the identity of the respondents. In theory, the same standard errors will be obtained using either the PSU and strata or the replicate weights. There are different ways of creating replicate weights; the method used is determined by the sampling plan. The most common are balanced repeated replication (BRR) and jackknife replicate weights. You will need to read the documentation for the survey data set carefully to learn what type of replicate weight is included in the data set; specifying the wrong type of replicate weight will likely lead to incorrect standard errors. For more information on replicate weights, please see Stata Library: Replicate Weights and Appendix D of the WesVar Manual by Westat, Inc. Several statistical packages, including Stata, SAS, R, Mplus, SUDAAN and WesVar, allow the use of replicate weights.
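In Stata, replicate weights are declared in svyset in place of the PSU and strata variables. The sketch below assumes an internet connection and uses nhanes2brr, one of the example datasets StataCorp distributes through webuse; the variable names finalwgt and brr_* are those I recall from that dataset, so check them with describe before relying on this.

```stata
* Example dataset with BRR replicate weights (downloaded by webuse)
webuse nhanes2brr, clear

* Declare the final weight and the BRR replicate weights; no PSU or strata needed
svyset [pweight = finalwgt], brrweight(brr_*) vce(brr)

* Standard errors now come from the replicate weights
svy: mean height weight
```

For a data set released with jackknife replicate weights, the corresponding svyset options are jkrweight() and vce(jackknife); which one applies depends on the documentation of the survey.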