Stata Bad Serial Number

Verline Wesolowski

Aug 5, 2024, 1:43:44 PM

to letanreman

In this example we sort the observations by all of the variables. Then we use all of the variables in the by statement and set n equal to _N, the total number of observations that are identical. Finally, we list the observations for which n is greater than 1, thereby identifying the duplicate observations.
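The steps described above can be sketched as follows. This is only an illustration: the toy dataset and the variable names id and score are made up, and n is the counter variable named in the text.

```stata
* Hypothetical toy data: the two rows with id 2 are exact duplicates
clear
input id score
1 10
2 20
2 20
3 30
end

* bysort sorts by every variable, then n stores the size of each
* group of identical observations
bysort id score: generate n = _N

* Observations with n > 1 appear more than once in the data
list if n > 1
```

Stata also ships a built-in duplicates command (for example, duplicates report) that covers the same ground without creating a helper variable.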

This simple, elegant, and obvious solution to shuffling data will play an important part in the solution to drawing observations without replacement. I have already more than hinted at the solution when I showed you your hand and mine.

Drawing without replacement is exactly the same problem as dealing cards. The solution to the physical card problem is to shuffle the cards and then draw the top cards. The solution to randomly selecting n from N observations is to put the N observations in random order and keep the first n of them.
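A minimal sketch of the shuffle-and-keep idea, assuming a made-up dataset of N = 100 observations from which we want n = 10. The seed value and the variable names id and u are arbitrary choices for illustration.

```stata
* Hypothetical data: 100 observations identified by id
clear
set obs 100
generate id = _n

* Shuffle: attach a random number to each observation and sort on it
set seed 12345
generate double u = runiform()
sort u

* Deal the top cards: keep the first 10 observations
keep in 1/10
list id
```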

You might wonder if we would ever need three random numbers. It is very unlikely. p, the probability of no problem, equals 1 to at least 5 digits for N=500,000. Of course, the chances of duplication are always nonzero. If you are concerned about this problem, you could add an assert to the code to verify that the two random numbers together do uniquely identify the observations.
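One way such a check might look is sketched below. The dataset size, the seed, and the names u1 and u2 are assumptions made for the example, not the author's original code.

```stata
* Hypothetical data: 1,000 observations
clear
set obs 1000
generate id = _n

* Two random numbers, stored as doubles, define the random order
set seed 12345
generate double u1 = runiform()
generate double u2 = runiform()
sort u1 u2

* Stop with an error if any (u1, u2) pair occurs more than once
by u1 u2: assert _N == 1

* Keep the first 50 observations of the shuffled data
keep in 1/50
```

If the assert ever fails, the do-file stops before any observations are dropped, so the ambiguity cannot pass unnoticed.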

In the generation of random numbers in all of the above, note that I am storing them as doubles. For the reproducibility issue, that is important. As I mentioned in part 1, the 32-bit random numbers that runiform() produces will be rounded if forced into 23-bit floats.

We have discussed drawing n observations without replacement from N observations. The number of observations selected has been fixed. Say instead we wanted to draw a 10% random sample, meaning that we independently allow each observation to have a 10% chance of appearing in our sample. In that case, the final number of observations is expected to be 0.1*N, but it may (and probably will) vary from that. The basic solution for drawing a 10% random sample is to keep each observation with probability 0.1.
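A short sketch of that approach, assuming a made-up dataset of 1,000 observations and an arbitrary seed:

```stata
* Hypothetical data: 1,000 observations
clear
set obs 1000
generate id = _n

* Each observation survives independently with probability 0.10
set seed 12345
keep if runiform() <= 0.10

* The count should be near 100 but will usually not equal it exactly
count
```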

In Stata, the format command can insert commas into a large numeric value to make it easier to read. For example, suppose a numeric variable gnp has a value of 12345678.9012, which is 13 characters long (including the decimal point). To insert commas into the displayed value, use the format command with a format of the form %xx.yyfc.

This displays the variable as 12,345,678.9012. The f in the format indicates a real number, while c inserts commas after every three digits, starting from the decimal point and going left. You can also use g to indicate a general format, and s to indicate a string. The value xx in the format %xx.yy should be equal to or greater than the maximum character length of a value, including commas and a period; otherwise, the format will not work as intended.
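As an illustration, the sketch below builds a one-observation dataset holding the gnp value from the text and applies a comma format. The width 16 is an assumed choice that leaves room for the 15 characters of 12,345,678.9012.

```stata
* One observation holding the example value, stored as a double
clear
set obs 1
generate double gnp = 12345678.9012

* Width 16, 4 decimal places, fixed format (f) with commas (c)
format gnp %16.4fc
list gnp
```

The format changes only how the value is displayed; the stored number is untouched.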

The procedure is to first store a number of models and then apply esttab to these stored estimation sets to compose a regression table. The main difference between esttab and estout is that esttab produces a fully formatted table right away.
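A minimal sketch of that workflow, assuming the user-written estout package is installed (ssc install estout) and using the auto dataset that ships with Stata. The two regressions are arbitrary models chosen for illustration.

```stata
* Example data shipped with Stata
sysuse auto, clear

* Store two models
eststo clear
eststo: quietly regress price weight mpg
eststo: quietly regress price weight mpg foreign

* Compose a regression table from the stored estimates
esttab
```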

The default in esttab is to display raw point estimates along with t statistics and to print the number of observations in the table footer. The t statistics can be replaced by, e.g., standard errors, and the adjusted R-squared can be added, by specifying the se and ar2 options.
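For instance, under the same assumptions as before (estout installed, auto dataset, arbitrary models), the call might look like this:

```stata
sysuse auto, clear
eststo clear
eststo: quietly regress price weight mpg
eststo: quietly regress price weight mpg foreign

* se: standard errors instead of t statistics; ar2: adjusted R-squared
esttab, se ar2
```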

esttab has sensible default settings for numerical display formats. For example, t-statistics are printed using two decimal places and R-squared measures are printed using three decimal places. For point estimates and, for example, standard errors an adaptive display format is used where the number of displayed decimal places depends on the scale of the statistic to be printed (the default format is a3; see below).

The format applied to a certain statistic can be changed by adding the appropriate display format specification in parentheses. For example, you can increase precision for the point estimates and display p-values and the R-squared using four decimal places.
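A sketch of such a call, again assuming estout is installed and using arbitrary models on the auto dataset. The a6 format for the point estimates is one possible choice of higher precision.

```stata
sysuse auto, clear
eststo clear
eststo: quietly regress price weight mpg
eststo: quietly regress price weight mpg foreign

* b(a6): adaptive format with more digits for the point estimates
* p(4), r2(4): p-values and R-squared with four decimal places
esttab, b(a6) p(4) r2(4)
```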

Depending on whether the plain option is specified or not, esttab uses two different variants of the CSV format. By default, that is, if plain is omitted, the contents of the table cells are enclosed in double quotes preceded by an equal sign (i.e., ="..."). This prevents Excel from trying to interpret the contents of the cells and, therefore, preserves formatting elements such as parentheses around t-statistics. One drawback of this approach is, however, that the displayed numbers cannot directly be used for further calculations in Excel. Hence, if the purpose of exporting the estimates is to do additional computations in Excel, specify the plain option. In this case, the table cells are enclosed in double quotes without the equal sign, and Excel will interpret the contents as numbers.
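The plain variant could be produced along these lines. The file name example.csv is a placeholder, the models are arbitrary, and estout is assumed to be installed.

```stata
sysuse auto, clear
eststo clear
eststo: quietly regress price weight mpg
eststo: quietly regress price weight mpg foreign

* Write the table to a CSV file in the current working directory;
* plain drops the ="..." wrapping so Excel reads the cells as numbers
esttab using example.csv, replace plain
```

Dropping plain from the last line gives the default variant that preserves the formatting instead.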

If you know a bit of RTF you can also include RTF commands to achieve specific effects, although you have to be careful not to break the document (most importantly, do not introduce unmatched curly braces). Useful are, for example, "{\b ...}" for boldface and "{\i ...}" for italics. A very helpful reference is the "RTF Pocket Guide" by Sean M. Burke (O'Reilly).
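As a hedged illustration, RTF commands can be placed in the table title. The file name example.rtf and the title text are made up, the models are arbitrary, and estout is assumed to be installed.

```stata
sysuse auto, clear
eststo clear
eststo: quietly regress price weight mpg
eststo: quietly regress price weight mpg foreign

* {\b ...} sets boldface and {\i ...} sets italics; every opening
* brace is matched by a closing brace
esttab using example.rtf, replace title({\b Table 1.} {\i Price regressions})
```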

There are times we need to do some repetitive tasks in the process of data preparation, analysis, or presentation. For instance, we may need to compute a set of variables in the same manner, rename or create a series of variables, or repetitively recode values of a number of variables. In this post, we show a few simple example "loops" using the Stata commands foreach, local and forvalues to handle some common repetitive tasks.

Consider a sample dataset of monthly average temperatures for three years. We enter the data by hand using the input command. First enter input year mtemp1-mtemp12 in the Command window. Next copy-and-paste each row of temperatures for each year into the Command window (one row at a time; do not include the leading line number) and press Enter. When finished, enter end and press Enter.
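In a do-file the same entry might look like the sketch below. The temperature values are invented for illustration (degrees Celsius) and are not the original post's data; the years 2013 to 2015 are assumed from the later discussion of 2013 and 2014.

```stata
* Made-up monthly average temperatures in Celsius for three years
clear
input year mtemp1-mtemp12
2013  1  3  7 12 17 22 25 24 20 14  8  3
2014  2  4  8 13 18 23 26 25 21 15  9  4
2015  0  2  6 11 16 21 24 23 19 13  7  2
end
list
```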

Suppose we want to convert each monthly temperature from Celsius to Fahrenheit. Writing a separate generate command for every variable takes a lot of typing. Alternatively, we can use the foreach command to achieve the same goal: we tell Stata to do the same thing (the computation: c*9/5+32) for each of the variables in the varlist mtemp1 to mtemp12.

Note that braces must be specified with foreach. The opening brace has to be on the same line as the foreach, and the closing brace must be on a line by itself. It's crucial to close loops properly, especially if you have one or more loops nested in another loop.
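A sketch of such a loop, using invented Celsius values. The new variable names fmtemp1 to fmtemp12 follow the naming used later in the text; the loop variable v is an arbitrary name.

```stata
* Made-up monthly average temperatures in Celsius
clear
input year mtemp1-mtemp12
2013  1  3  7 12 17 22 25 24 20 14  8  3
2014  2  4  8 13 18 23 26 25 21 15  9  4
2015  0  2  6 11 16 21 24 23 19 13  7  2
end

* For each variable in the varlist, create a Fahrenheit version
* named with an f prefix; the opening brace sits on the foreach line
foreach v of varlist mtemp1-mtemp12 {
    generate f`v' = `v'*9/5 + 32
}
list year fmtemp1 fmtemp12
```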

The previous example was a rather simple repetitive task which can be handled solely by the foreach command. Here we introduce another command, local, which is used a lot with commands like foreach to deal with repetitive tasks that are more complex. The local command is a way of defining a macro in Stata. A Stata macro has a name and contents, and the contents can hold multiple elements.
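Two small illustrations, with macro names (i and months) chosen here only for the example. Run the lines together from a do-file, because a local macro exists only within the session or do-file that defines it.

```stata
* A macro named i whose contents are a single element
local i 1
display `i'

* A macro named months whose contents are three elements
local months jan feb mar
display "`months'"

* Macros are substituted into text before the command runs
display "mtemp`i'"
```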

Take the temperature dataset we created as an example. Let's say we want to rename variables mtemp1-mtemp12 as mtempjan-mtempdec. We can do so by combining a local macro that holds the month names with a foreach loop.

We recommend running display to see how the macro looks before actually applying the defined macro to tasks like changing variable names, just to make sure you don't accidentally change them to undesired names or cause errors. The display line is not strictly necessary, however.
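One way the renaming loop might be written is sketched below. The data values are invented, and the macro names months, i, and m are choices made for this example.

```stata
* Made-up monthly average temperatures in Celsius
clear
input year mtemp1-mtemp12
2013  1  3  7 12 17 22 25 24 20 14  8  3
2014  2  4  8 13 18 23 26 25 21 15  9  4
2015  0  2  6 11 16 21 24 23 19 13  7  2
end

local months jan feb mar apr may jun jul aug sep oct nov dec
local i = 1
foreach m of local months {
    * Preview the change before applying it
    display "mtemp`i' -> mtemp`m'"
    rename mtemp`i' mtemp`m'
    * Move on to the next numbered variable
    local ++i
}
describe
```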

The forvalues command is also useful for repetitive tasks. Consider the same temperature dataset we created. Suppose we would like to generate twelve dummy variables (warm1-warm12) to indicate whether the monthly average temperature is higher than the one in the previous year. For example, we code warm1 for the year 2014 as 1 if the value of fmtemp1 for 2014 is higher than the value for 2013. We code all the warm variables as 99 for 2013, since that year has no reference to compare against.
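A sketch of how this could be done with forvalues. The Fahrenheit values are invented so that 2014 is warmer than 2013 in every month and 2015 is cooler than 2014; coding the non-warm months as 0 is an assumption, since the text specifies only the values 1 and 99.

```stata
* Made-up monthly average temperatures in Fahrenheit
clear
input year fmtemp1-fmtemp12
2013 30 34 42 52 62 71 76 74 67 56 45 35
2014 32 36 44 54 64 73 78 76 69 58 47 37
2015 31 35 43 53 63 72 77 75 68 57 46 36
end

* The comparison with [_n-1] relies on the data being sorted by year
sort year
forvalues i = 1/12 {
    generate warm`i' = 0
    replace warm`i' = 1 if fmtemp`i' > fmtemp`i'[_n-1]
    * 2013 has no previous year to compare against
    replace warm`i' = 99 if year == 2013
}
list year warm1 warm12
```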

Sometimes, for whatever reason, string variables need to be categorical and categorical variables need to be strings. In Stata this is often true because most analysis commands cannot use string variables. However, anticipating that this may be problematic, Stata offers various commands to change string variables into categorical variables and vice versa.

The first case most often occurs when importing data from another source. Sometimes, for whatever reason, Stata stores a numeric categorical variable as a string variable. The easiest way to tell if this is the case is to look at the Variables window. If a variable is a string, the Type will be str followed by some number. If, for example, you had a gender variable consisting of ones and zeroes that was encoded as str1, so that every value is a number stored as text, you could use the destring command. If you want to replace the existing variable, the command is

destring [varname], replace

This will replace the existing specified variable with the same data but now in a nonstring format. (The replace option is required; without it or a generate() option, destring stops with an error.) If you prefer to retain the existing variable, you can generate a new variable that is a nonstring version of the existing variable. To do this, type

generate [new variable name] = real([string variable])

In my example, this would look like

generate sex2 = real(sex)

This command would create a new variable called sex2 that contains the data from my original variable (sex) stored in a numeric format.
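Both routes can be tried on a tiny made-up dataset. The variable name sex and its values are illustrative only.

```stata
* Hypothetical data: numeric codes stored as text (type str1)
clear
input str1 sex
"1"
"0"
"1"
end

* Keep the original and add a numeric copy
generate sex2 = real(sex)

* Or convert the original variable in place
destring sex, replace

describe
```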

Both of these commands have a reverse: in the first case, tostring [varname], replace will convert the variable back to a string, and in the second, generate name = string([numeric variable]) will generate a new string variable with the same data as the numeric variable specified, but not saved in a numeric format.
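A short sketch of the reverse direction, again with made-up data and illustrative variable names (sex, sexstr):

```stata
* Hypothetical numeric data
clear
input sex
1
0
end

* New string variable holding the same values as text
generate sexstr = string(sex)

* Or convert the original variable to a string in place
tostring sex, replace

describe
```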

The above will only work if all of the data is numeric. However, sometimes it's not. In a case where your string variables are in fact strings (e.g., "female" instead of "1"), you have to tell Stata to encode the string data. Running encode will cause Stata to make a new numeric categorical variable whose value labels correspond to the old string values. If you do this, be aware that Stata is case sensitive; female, Female and FEmale will be treated as three different categories. Encode is a slightly more complicated command, requiring an option, generate([newvariablename]). Continuing the gender example, the full command would look something like this:

encode gender, generate(sex)

This would cause Stata to generate a new variable called "sex" that contains numeric categories based on the old variable (called "gender"). However, if you browse the new variable it will look the same, because Stata displays the labels (not the raw numbers). The only visual clue that something is different is that the text will now be blue instead of the red used for strings (in the default color scheme). The opposite of encode is decode. The decode command has the same syntax as the encode command, but generates a string variable based on the labels of a numeric categorical variable.
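The round trip can be sketched on a small made-up dataset. The variable name gender2 is a hypothetical name for the decoded copy; "Female" is included to show the case sensitivity mentioned above.

```stata
* Hypothetical data: genuine text values, with inconsistent case
clear
input str6 gender
"female"
"male"
"Female"
end

* Numeric categorical variable with value labels taken from the strings
encode gender, generate(sex)

* nolabel shows the underlying codes; "female" and "Female" get
* different codes because the match is case sensitive
list gender sex, nolabel

* Back to a string variable built from the value labels
decode sex, generate(gender2)
```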