This guide discusses techniques to explore data using Stata. To explore data, we usually need to know the format of the variables, summary statistics, crosstabs, frequencies, and so on. We will provide Stata commands for all of these tasks. We will use a built-in Stata dataset throughout this guide, which we can load by typing the following code in the Stata Command window:
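The loading command itself is missing from this copy of the guide. As a minimal sketch, assuming the dataset is nlsw88.dta (an assumption: it ships with Stata and contains the variables used later, such as age, race, married, collgrad, union, and wage), it could look like this:

```stata
* Assumption: the guide's built-in dataset is nlsw88.dta, which ships with Stata
sysuse nlsw88, clear
```

The clear option drops any data already in memory, so save unsaved work first.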
Note: To practice the commands using your data, you have to open your data from your working directory. You can do it using the point and click technique. For instance, to open a Stata dataset, which is stored as a .dta file, click on
A basic description of your dataset is vital to understanding the nature of your data. It lists the variables and lets you explore their storage types, labels, and formats ("display format"). We will use the describe command to get this description.
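A short sketch of describe, assuming the nlsw88.dta example data (an assumption, since the guide's own code is not shown here):

```stata
* Assumption: nlsw88.dta stands in for the guide's built-in dataset
sysuse nlsw88, clear

* Storage type, display format, and labels for selected variables
describe age wage

* The same information for every variable in the dataset
describe
```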
We will use the summarize command in Stata to get the basic summary statistics of the variables. The command can be abbreviated to su. summarize provides basic statistics of your data (number of observations, mean, standard deviation, minimum, and maximum) and helps us understand the essential characteristics of the variables.
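For illustration, again assuming the nlsw88.dta example data, the full and abbreviated forms could be run as follows:

```stata
* Assumption: nlsw88.dta stands in for the guide's built-in dataset
sysuse nlsw88, clear

* Basic summary statistics for every variable
summarize

* su is an abbreviation of summarize; here limited to one variable
su wage
```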
The "summarize, detail" command gives a more comprehensive overview of the statistical properties and distribution of each variable in the dataset. To get the detailed set of summary statistics for each of the variables in the dataset, type:
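A minimal sketch of the detailed form, assuming the nlsw88.dta example data:

```stata
* Assumption: nlsw88.dta stands in for the guide's built-in dataset
sysuse nlsw88, clear

* Percentiles, variance, skewness, and kurtosis for every variable
summarize, detail
```

Adding a variable list (for example, summarize wage, detail) restricts the output to those variables.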
We can get the summary statistics for a particular variable subject to a condition. For instance, if we want the summary statistics of the variable 'age' only for people who are 'married', type:
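One way to write this condition, assuming the nlsw88.dta example data, where married is coded 1 for married and 0 for single:

```stata
* Assumption: nlsw88.dta, in which married == 1 marks married respondents
sysuse nlsw88, clear

* Summary statistics of age restricted to married respondents
summarize age if married == 1
```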
If we want the summary statistics for a set of variables, we type su followed by the names of the variables. For instance, to get the summary statistics for the variables 'age', 'race', 'occupation', 'union', and 'wage', type:
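A sketch of that command, assuming the nlsw88.dta example data:

```stata
* Assumption: nlsw88.dta stands in for the guide's built-in dataset
sysuse nlsw88, clear

* Summary statistics for a chosen list of variables
su age race occupation union wage
```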
Grouped summary statistics: We can get the summary statistics separately for different groups within a variable. For instance, if we want the summary statistics of the 'grade' and 'wage' variables for each group of the 'occupation' variable, we type:
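One possible form uses the bysort prefix, which sorts the data by the grouping variable and then repeats the command for each group. The sketch assumes the nlsw88.dta example data:

```stata
* Assumption: nlsw88.dta stands in for the guide's built-in dataset
sysuse nlsw88, clear

* Sort by occupation, then summarize grade and wage within each occupation
bysort occupation: summarize grade wage
```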
The codebook command in Stata is a valuable tool to get detailed information about the variables in a dataset. It provides information on variable names, value labels, data types, summary statistics, and other relevant details. Follow the steps below to apply the codebook command.
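The steps themselves are not reproduced in this copy. As a rough sketch, assuming the nlsw88.dta example data, codebook can be run on one variable or on the whole dataset:

```stata
* Assumption: nlsw88.dta stands in for the guide's built-in dataset
sysuse nlsw88, clear

* Detailed report for a single variable
codebook race

* Detailed report for every variable in the dataset
codebook
```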
Crosstab: Crosstabulation is useful if we want to see the joint distribution of two variables in a dataset. To get the crosstabulation of the categorical variables "race" and "collgrad", type:
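A sketch of the two-way table, assuming the nlsw88.dta example data. The row option is an optional extra that is not mentioned in the text:

```stata
* Assumption: nlsw88.dta stands in for the guide's built-in dataset
sysuse nlsw88, clear

* Two-way table of frequencies
tabulate race collgrad

* Same table with row percentages added (optional extra)
tabulate race collgrad, row
```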
To use tabstat, type the command name (tabstat) followed by the variable names and the statistics() option (abbreviated s()) specifying the statistics we want to check. For instance, if we want the summary statistics for a list of variables ("age", "married", "collgrad", "south", "c_city", "union", and "wage"), type:
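The exact statistics requested in the original are not shown, so the list below (mean, sd, min, max) is an assumption chosen for illustration, as is the nlsw88.dta example data:

```stata
* Assumption: nlsw88.dta stands in for the guide's built-in dataset
sysuse nlsw88, clear

* statistics() names the statistics to report; this list is illustrative
* columns(statistics) puts one statistic per column and one variable per row
tabstat age married collgrad south c_city union wage, ///
    statistics(mean sd min max) columns(statistics)
```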
Like many things in Stata, there is no single right way to get the program to display descriptives. Below, we will explore two methods. The first of these is the summarize command and the second is tabstat.
To obtain the most basic descriptives (N, mean, std. deviation, min, and max), the command is simply summarize [varname(s)]. If you do not specify a variable, Stata will print them all. The format for summarize is a list:
The summarize command has no by() option of its own (although it can be run under the by prefix), so you may want to use tabstat to get descriptives split by gender, or by gender and grade, in a single table. The tabstat command is one of the rare cases where it is probably best and certainly fastest to use the menu. Therefore, go to Statistics => Summaries, Tables, and Tests => Other tables => Compact table of summary statistics.
Selecting Compact table of summary statistics brings up the tabstat menu. Use the first drop-down menu to select a variable to summarize. The menu immediately under it allows you to select a grouping variable when the box next to it is checked. Finally, at the bottom is a series of drop-down menus with a variety of other statistics available. Check a box to activate a menu, and select your statistic of choice. All of the statistics that you select will be displayed in a single table.
Additionally, the by/if/in tab will repeat the command for additional categorical variables (by) or allow you to run the command only on cases satisfying some condition (if) or on a range of observations (in). The Weights tab is for cases where the data is weighted. Finally, the Options tab allows you to change the appearance of the table and save the list of summary statistics. Execute the command by clicking OK.
Once you have submitted the command, it will appear in the Review window. You can always click on it again to place it back in the Command window, change the options, and resubmit it. Using the menu even once can be a great way to get a feel for this command's syntax.
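For reference, the command the menu produces has roughly the following shape. This is a sketch only: it assumes the nlsw88.dta example data, with wage and race standing in for the summarized and grouping variables, which are not the variables of the dataset described above.

```stata
* Assumption: nlsw88.dta, with wage and race as stand-in variables
sysuse nlsw88, clear

* by() splits the table by a grouping variable; statistics() picks the columns
tabstat wage, by(race) statistics(n mean sd min max)
```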
Getting the descriptive statistics in Stata is quick for one or multiple variables. Descriptive statistics are measures we can use to learn more about the distribution of observations in variables for analysis, transforming variables, and reporting. Each descriptive statistic has its own formula that we will not be covering in this guide, but we will walk through the interpretation of each.
by can also be used as an option in other commands, such as those that create graphics. For example, if we wanted to examine histograms of mpg by the make of the car, we would use by() as an option. The make of car does not have to be sorted for this command.
The by statement will give us descriptives for all levels of the by variable (i.e., both foreign and domestic). Suppose we just want the descriptives for one level of the by variable. We can use the if statement for that. For foreign cars (i.e., foreign == 1):
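A minimal sketch of by() as a graph option, assuming the auto.dta example data that ships with Stata. It also assumes that foreign (domestic versus foreign) is the intended grouping variable, because the make variable in auto.dta takes a different value for every car.

```stata
* Assumption: auto.dta, grouped by foreign rather than the per-car make variable
sysuse auto, clear

* One histogram panel of mpg per level of foreign; no prior sort is needed
histogram mpg, by(foreign)
```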
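Both forms can be sketched as follows, assuming the auto.dta example data and mpg as the variable of interest:

```stata
* Assumption: auto.dta, where foreign == 1 marks foreign cars
sysuse auto, clear

* Descriptives for every level of foreign
bysort foreign: summarize mpg

* Descriptives for foreign cars only
summarize mpg if foreign == 1
```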
As a helpful hint for any of these processes, if your variables are labeled (showing the label instead of the numeric value) and you need to find the numeric values to examine levels of the variable, you can use the nolabel option.
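For instance, assuming the auto.dta example data, the nolabel option on tabulate shows the underlying numeric codes:

```stata
* Assumption: auto.dta, where foreign carries a value label
sysuse auto, clear

* With labels: rows show the label text
tabulate foreign

* With nolabel: rows show the numeric codes 0 and 1
tabulate foreign, nolabel
```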
In Stata 17, we introduced the new collect suite of commands for creating and customizing tables and the etable command for easily creating and exporting a table of estimation results. Stata 18 offers another new command, dtable, that easily builds and exports a table of descriptive statistics, often called Table 1 in publications. Now generating tables of descriptive statistics for both categorical and continuous variables is easier than ever. It is worth mentioning that the twin commands etable and dtable are both built on the collect framework we introduced in Stata 17, so they share a lot of properties.
By default, dtable reports sample size for the dataset, means and standard deviations for continuous variables, and frequencies and percentages for categorical variables. But we can request other descriptive statistics such as medians and interquartile ranges. We can even specify different statistics for different variables in the same table. Before we move to a more advanced example, I want to show you the dialog box of dtable.
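As a minimal sketch of the default behavior (Stata 18 or later), assuming the nlsw88.dta example data rather than the dataset used later in this post: continuous variables are listed as they are, and categorical variables take the i. prefix.

```stata
* Assumption: nlsw88.dta example data; requires Stata 18 or later
sysuse nlsw88, clear

* Means and SDs for age and wage, frequencies and percentages for race,
* reported separately by college-graduate status
dtable age wage i.race, by(collgrad)
```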
It is a good idea to browse through the tabs in the dialog box to get familiar with this command. It is a great way to explore what we can do using dtable. I want to highlight three tabs and leave the others for you to explore.
For an example, we will load the Modified Bangkok IDU Preparatory Study data provided in Zeng, Mao, and Lin (2016). We may want to try specifying customized statistics and tests for different variables instead of generating the default table. Here I used the dialog box (mainly the three tabs I mentioned above) to easily build the table, and the corresponding syntax is displayed in the output below.
You may notice we have added a column of customized tests to compare the variables across the groups. The tests can only be included when there is a by variable specified. The specific tests we choose for different variables are mentioned clearly in the notes (before the table) because we have specified the by() suboption testnotes.
Looking at the above table, we may want to make improvements in its appearance. For example, we may want to show the subgroup sample sizes and proportions in the column header instead of in the first row. We may also want to increase or decrease the number of decimals reported for some statistics. We may want to change the display format for min and max values to "min-max" and put this into parentheses, and we may want to put proportions into parentheses as well. All of these changes can be done by options of dtable without additional coding. Here is the modified syntax of dtable and the output.
As shown by the above example, we can export the table to our document using the export() option if we like how it looks right now. Here is a list of all the supported file types to export our tables:
The table above looks nice. But I will demonstrate how to make some additional changes not directly available with dtable. Because dtable is implemented using collect, we can use the collect suite of commands to further manage tables that were created using dtable and to edit them in various ways. By the way, collect commands require a little effort at the beginning to become familiar with all the tools, but I believe you will master the skills and love to use this suite of commands to create any tables you need after a little bit of practice. If you would like to learn about collect, you can view our reference manual of Customizable Tables and Collected Results.
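As a small, hedged illustration of that workflow, the sketch below builds a table with dtable and then inspects it with two collect commands. It assumes the nlsw88.dta example data rather than the study data used above, and it does not reproduce the specific edits made in this post.

```stata
* Assumption: nlsw88.dta example data; requires Stata 18 or later
sysuse nlsw88, clear

* dtable leaves its results in a collection that collect can work on
dtable age wage i.race, by(collgrad)

* List the dimensions of the current collection
collect dims

* Redisplay the table, for example after applying collect style edits
collect preview
```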