In this example we sort the observations by all of the variables. Then we use all of the variables in the by statement and set n equal to _N, the total number of observations that are identical. Finally, we list the observations for which n is greater than 1, thereby identifying the duplicate observations.
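A minimal sketch of these three steps, assuming a small hypothetical dataset with the variables id and score (the names and values are illustrative, not the original example data):

```stata
* Hypothetical data with one duplicated observation
clear
input id score
1 10
2 20
2 20
3 30
end

* Step 1: sort by all of the variables
sort id score

* Step 2: within each group of identical rows, store the group size _N in n
by id score: generate n = _N

* Step 3: list the observations that occur more than once
list if n > 1
```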
In a survey dataset I have a string variable (type: str244) with qualitative responses. I want to count the number of characters in each response/string and generate a new variable containing this number.
There is no egen function because there has long been a function in the strict sense to do this. In recent versions of Stata, the function is called strlen(), but the older name length() continues to work:
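A short sketch of that approach, assuming the string variable is called response (a hypothetical name standing in for the str244 survey variable):

```stata
* Hypothetical data: "response" stands in for the survey variable
clear
input str40 response
"Too expensive"
"I liked the service"
""
end

* strlen() returns the length of each string
generate len = strlen(response)
list response len
```

Note that strlen() counts bytes. In Stata 14 and later, ustrlen() counts Unicode characters instead, which matters if the responses contain non-ASCII text.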
where the are included simply for readability. My variable names look like those above, but I don't know the naming pattern ahead of time. I can get the length of each vector as ng = (c(k) - 1)/3, so if I could refer to variables by column, it would be straightforward to write a loop to find Euclidean distances, e.g., from v to nv.
I know there are other ways to go about calculating the Euclidean distance (like reshaping the data; or extracting the varnames from 2/(ng+1) and doing a foreach loop on them), but if there is a way to reference variables by column number, I would like to know it.
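One possible way to refer to a variable by its position, offered as a sketch rather than a definitive answer, is to expand the full variable list into a local macro and pick out the k-th word. The example assumes Stata's bundled auto dataset; the macro names allvars and third are illustrative:

```stata
sysuse auto, clear

* Expand _all into the full list of variable names, in dataset order
unab allvars : _all

* Pick the third name from that list
local third : word 3 of `allvars'
display "`third'"

* The macro can then be used wherever a variable name is expected
summarize `third'
```

Inside a loop, the position can itself be a macro, for example `local v : word `i' of `allvars''`.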
In Stata, the format command inserts commas into a large numeric value to make it easier to read. For example, suppose a numeric variable gnp has a value of 12345678.9012, which is 13 characters long (including the decimal point). To insert commas into the numeric value, use this command:
This displays the variable as 12,345,678.9012. The f in the format indicates a real number, while c inserts commas after every three digits, starting from the decimal point and going left. You can also use g to indicate a general format, and s to indicate a string. The width xx in the format %xx.yy should be equal to or longer than the maximum character length of a value, including commas and a period; otherwise, the value will not be displayed as intended.
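A sketch of such a command, assuming a width of 16 and four decimal places (the one-observation dataset is built here only so the example runs on its own):

```stata
* Build a one-observation dataset holding the example value
clear
set obs 1
generate double gnp = 12345678.9012

* Width 16, 4 decimal places, fixed format (f) with commas (c)
format gnp %16.4fc
list gnp
```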
The procedure is to first store a number of models and then apply esttab to these stored estimation sets to compose a regression table. The main difference between esttab and estout is that esttab produces a fully formatted table right away. Example:
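A minimal sketch of this workflow, assuming the estout package is installed (for example via ssc install estout) and using Stata's bundled auto dataset; the two models are illustrative:

```stata
sysuse auto, clear

* Store two models
eststo clear
eststo: regress price weight mpg
eststo: regress price weight mpg foreign

* Compose a regression table from the stored estimates
esttab
```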
The default in esttab is to display raw point estimates along with t statistics and to print the number of observations in the table footer. To replace the t-statistics by, e.g., standard errors and add the adjusted R-squared, type:
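A sketch using the se and ar2 options, with illustrative models stored first so the example runs on its own (estout package assumed installed):

```stata
sysuse auto, clear
eststo clear
eststo: regress price weight mpg
eststo: regress price weight mpg foreign

* Standard errors instead of t-statistics, plus adjusted R-squared
esttab, se ar2
```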
esttab has sensible default settings for numerical display formats. For example, t-statistics are printed using two decimal places and R-squared measures are printed using three decimal places. For point estimates and, for example, standard errors, an adaptive display format is used where the number of displayed decimal places depends on the scale of the statistic to be printed (the default format is a3; see below).
The format applied to a certain statistic can be changed by adding the appropriate display format specification in parentheses. For example, to increase precision for the point estimates and display p-values and the R-squared using four decimal places, type:
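A sketch of such a call, assuming the same illustrative stored models as before; b(a6) requests an adaptive format with six significant digits, while p(4) and r2(4) request four decimal places:

```stata
sysuse auto, clear
eststo clear
eststo: regress price weight mpg
eststo: regress price weight mpg foreign

* Higher precision for coefficients; p-values and R-squared with 4 decimals
esttab, b(a6) p(4) r2(4) nostar wide
```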
Depending on whether the plain option is specified or not, esttab uses two different variants of the CSV format. By default, that is, if plain is omitted, the contents of the table cells are enclosed in double quotes preceded by an equal sign (i.e., ="..."). This prevents Excel from trying to interpret the contents of the cells and, therefore, preserves formatting elements such as parentheses around t-statistics. One drawback of this approach is, however, that the displayed numbers cannot directly be used for further calculations in Excel. Hence, if the purpose of exporting the estimates is to do additional computations in Excel, specify the plain option. In this case, the table cells are enclosed in double quotes without the equal sign, and Excel will interpret the contents as numbers. Example:
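A sketch of a plain CSV export, assuming illustrative stored models and a hypothetical output file name example.csv in the current working directory:

```stata
sysuse auto, clear
eststo clear
eststo: regress price weight mpg
eststo: regress price weight mpg foreign

* plain: cells are written without the leading equal sign
esttab using example.csv, replace wide plain
```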
If you know a bit of RTF, you can also include RTF commands to achieve specific effects, although you have to be careful not to break the document (most importantly, do not introduce unmatched curly braces). Useful are, for example, "\b ..." for boldface and "\i ..." for italics. A very helpful reference is the "RTF Pocket Guide" by Sean M. Burke (O'Reilly). Example:
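A sketch of embedding RTF commands in the table title, assuming illustrative stored models, a hypothetical file name example.rtf, and an invented title text; note the matched curly braces around each RTF group:

```stata
sysuse auto, clear
eststo clear
eststo: regress price weight mpg
eststo: regress price weight mpg foreign

* {\b ...} sets boldface and {\i ...} sets italics inside the title
esttab using example.rtf, replace label ///
    title({\b Table 1.} {\i Price regressions})
```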
There are times we need to do some repetitive tasks in the process of data preparation, analysis, or presentation. For instance, we may need to compute a set of variables in the same manner, rename or create a series of variables, or repetitively recode values of a number of variables. In this post, we show a few simple example "loops" using the Stata commands foreach, local and forvalues to handle some common repetitive tasks.
Consider this sample dataset of monthly average temperature for three years. Below we enter the data by hand using the input command. First enter input year mtemp1-mtemp12 in the Command window. Next, copy and paste each row of temperatures for each year into the Command window (one row at a time; do not include the leading line number) and press Enter. When finished, enter end and press Enter.
However, this takes a lot of typing. Alternatively, we can use the foreach command to achieve the same goal. In the following code, we tell Stata to do the same thing (the computation: c*9/5+32) for each of the variables in the varlist: mtemp1 to mtemp12.
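A sketch of such a loop. The temperature values below are placeholders generated only so the example runs on its own; they are not the original data. The new variables are named fmtemp1 to fmtemp12:

```stata
* Placeholder dataset: three years, twelve monthly Celsius values each
clear
set obs 3
generate year = 2012 + _n
forvalues i = 1/12 {
    generate mtemp`i' = `i' + _n
}

* Convert each Celsius variable to Fahrenheit
foreach v of varlist mtemp1-mtemp12 {
    generate f`v' = `v'*9/5 + 32
}
list year mtemp1 fmtemp1
```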
Note that braces must be specified with foreach. The opening brace has to be on the same line as the foreach, and the closing brace must be on a line by itself. It's crucial to close loops properly, especially if you have one or more loops nested in another loop.
The previous example was a rather simple repetitive task that can be handled solely by the foreach command. Here we introduce another command, local, which is used a lot with commands like foreach to deal with repetitive tasks that are more complex. The local command is a way of defining a macro in Stata. A Stata macro can contain multiple elements; it has a name and contents. Consider the following two examples:
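Two illustrative macro definitions (the original examples are not reproduced here; the names months and total are assumptions). Local macros only exist within the run in which they are defined, so these lines should be run together, for instance from a do-file:

```stata
* Example 1: a macro whose contents are a list of words
local months jan feb mar
display "`months'"

* Example 2: a macro whose contents are the result of an expression
local total = 2 + 3
display `total'
```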
Take the temperature dataset we created as an example. Let's say we want to rename variables mtemp1-mtemp12 as mtempjan-mtempdec. We can do so by tweaking the code in the previous example a little.
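A sketch of the renaming loop, assuming a placeholder version of the temperature dataset (the values are generated only so the example runs on its own) and a counter macro i that tracks the month number:

```stata
* Placeholder dataset with mtemp1-mtemp12
clear
set obs 3
generate year = 2012 + _n
forvalues i = 1/12 {
    generate mtemp`i' = `i' + _n
}

* Pair each month abbreviation with its month number and rename
local months jan feb mar apr may jun jul aug sep oct nov dec
local i = 1
foreach m of local months {
    display "mtemp`i' -> mtemp`m'"   // check the mapping before renaming
    rename mtemp`i' mtemp`m'
    local ++i
}
describe mtemp*
```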
We recommend running display to see how the macro looks before actually applying it to tasks like changing variable names, so that you don't accidentally rename variables to something undesired or cause errors. However, the display line is not necessary in this case.
The forvalues command is also useful for repetitive tasks. Consider the same temperature dataset we created. Suppose we would like to generate twelve dummy variables (warm1-warm12) to indicate whether the monthly average temperature is higher than the one in the previous year. For example, we code warm1 for the year 2014 as 1 if the value of fmtemp1 for 2014 is higher than the value for 2013. We code all the warm variables as 99 for 2013, since that year has no reference year to compare against.
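A sketch of this loop, assuming one observation per year with 2013 as the first year. The fmtemp values are placeholders generated only so the example runs on its own:

```stata
* Placeholder dataset with fmtemp1-fmtemp12 for 2013-2015
clear
set obs 3
generate year = 2012 + _n
forvalues i = 1/12 {
    generate fmtemp`i' = 50 + 2*`i' + mod(_n + `i', 3)
}

sort year
forvalues i = 1/12 {
    * Start at 99, which remains the value for the first year (2013)
    generate warm`i' = 99
    * For later years: 1 if warmer than the same month a year earlier, else 0
    replace warm`i' = (fmtemp`i' > fmtemp`i'[_n-1]) if _n > 1
}
list year fmtemp1 warm1
```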
I recently had a project where I had to manually adjust over 100 variable names in a .csv file because of the way data for loop and merge are stored. After this manipulation I read the file into Stata and reshaped from wide to long so it could be used in Vocalize, but some sort of reshaping (or stacking) seems important for analyses in any program. It seems to me that a simple fix would be to allow users to pick how data are stored when using loop and merge (e.g., whether they want the leading number to instead appear at the end). This would give everyone a much more efficient experience. I'll illustrate an example of the problem in greater detail below. Does anyone have workarounds or ideas? I also posted in the product idea section.
For the purposes of an example, let's say we have a multiselect item where students identify the schools they have attended from a list of four.
With the second item set up like this:
The way these data are stored is 1_[var], 2_[var], etc. at the front of the variable based on the position of the identifier in the loop. See below:
However, Stata, SPSS, and Python (and probably most, if not all, other data manipulation tools) will not read in variables with leading numbers. Moreover, for Stata at least, reshaping from wide to long works much better if the identifier appears at the end of the variable, like [var]1 for School 1 instead of 1_[var].
It's the same in the data file export:
Thus, as it stands, one has to manually adjust the variable names for each school-by-variable combination -- removing the "1_" and adding a "1" at the end and so on. So if you had, say, a survey of 50 items and 150 schools, that's 7,500 manual manipulations before one can even read these data into Stata.
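One possible workaround in Stata, offered only as a sketch under stated assumptions: read the file with the header row kept as data (for a real file, something like import delimited with varnames(nonames) and stringcols(_all)), then build each new name from the header text. The tiny dataset and the names id and attend below are hypothetical stand-ins for such an import:

```stata
* Stand-in for a CSV imported with the header row as the first observation
clear
input str10 v1 str10 v2 str10 v3
"id" "1_attend" "2_attend"
"7"  "1"        "0"
end

foreach v of varlist _all {
    local old = `v'[1]                      // header text for this column
    if regexm("`old'", "^([0-9]+)_(.+)") {
        local new = regexs(2) + regexs(1)   // 1_attend becomes attend1
    }
    else {
        local new "`old'"
    }
    rename `v' `new'
}

drop in 1              // remove the header row
destring, replace      // convert the string columns back to numeric

* With the loop number at the end, wide-to-long reshaping is direct
reshape long attend, i(id) j(school)
list
```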
I called technical support and they asked around and no one had a good solution. I think this would save everyone a lot of time if it were possible:
Solution: allow users to choose where and how they want the loop identifier denoted in the survey data.
If you download the data in SPSS format it puts an "A" in front of the numbers, so it isn't an issue. I've never used Stata, but I believe it has the ability to import SPSS data.
For Python, if you are using an API call and returning the data in json format it isn't an issue...use json.load.
I hear you, but that's not quite the point. Should we all have to use SPSS? Also, as I mentioned, I want more flexibility than just a leading letter, I want to be able to make it appear after the variable.
It seems that I would need Stata 16 (I have 14) to be able to import using SPSS, but then I would need to make some adjustments (I believe) in Stata, but at least I would be able to manipulate a little better.
I appreciate you bringing up a possible workaround! I just wanted to note that it might not work for me in this case and maybe not for others either.
I adjusted my heading title to leave out SPSS. I had a colleague tell me that he was struggling with SPSS and this too.