While survey and social science researchers are well versed in traditional modeling approaches such as multiple regression or logistic regression, contemporary nonparametric techniques offer greater flexibility in terms of model form and distributional assumptions. Classification and regression trees (CARTs) and random forests are two such methods that are increasingly applied within the survey research context for creating nonresponse adjustments and for estimating propensity scores used in responsive/adaptive survey designs. Both methods can be used for regression or classification tasks and offer researchers and practitioners strong alternatives to the more classical approaches. CARTs and random forests can be applied when typical statistical distributional assumptions are unlikely to be satisfied and can incorporate interactions automatically. CART models can be estimated in the presence of missing data, and random forest methods can adapt to the complexity of the dataset and can be estimated even when the number of predictors is large relative to the sample size. This article provides an accessible description of both methods and illustrates their use by developing models that predict survey response from a collection of demographic variables known for both respondents and nonrespondents.
If you have ever used the popular chi-square automatic interaction detection (CHAID) (Kass 1980) method for predicting survey response or other market segmentation, you have been building tree-based models. Classification and regression trees (CARTs) (Breiman et al. 1984) represent another type of tree-based method for classification or prediction. Like CHAID, CART models can be applied to both categorical and continuous outcomes, but CART models extend the capabilities of CHAID models by allowing both categorical and continuous predictors.
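To make the tree-building step concrete, the minimal sketch below fits a classification tree in R with the rpart package. The data frame `svy` and the variables `responded`, `age`, `region`, and `income` are hypothetical placeholders for a file of sampled cases with known demographics and a 0/1 response indicator.

```r
# Minimal sketch: fitting a classification tree (CART) in R with rpart.
# The data frame `svy` and its variables are hypothetical placeholders.
library(rpart)

# responded: 0/1 indicator of survey response; predictors may be
# categorical (region) or continuous (age, income).
fit_tree <- rpart(factor(responded) ~ age + region + income,
                  data = svy,
                  method = "class",   # classification tree
                  control = rpart.control(cp = 0.01, minbucket = 20))

# Predicted probability of response (propensity) for each sampled case
tree_propensity <- predict(fit_tree, newdata = svy, type = "prob")[, "1"]

# The fitted tree can be printed or plotted to inspect the splits
print(fit_tree)
plot(fit_tree); text(fit_tree, use.n = TRUE)
```

The complexity parameter (cp) and minimum bucket size shown here are illustrative defaults; in practice they are typically tuned, for example through cross-validation.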
The application of CARTs to various aspects of the survey process has grown steadily in the past decade. For example, McCarthy and Earp (2009) used classification trees to investigate factors related to survey reporting errors. Garber (2009) used classification trees to predict eligibility of units included in a master mailing list for a survey targeting farms. Burgette and Reiter (2010) used regression trees as part of a multiple imputation strategy for continuous health-related survey outcomes such as birth weight. Phipps and Toth (2012) applied regression trees to data from the Occupational Employment Statistics Survey to estimate response propensities for sampled establishments. They also used a second regression tree to examine the potential of nonresponse bias in reported wages.
Developed by Breiman (2001), random forests are ensemble-based methods that generate estimates by combining the results from a collection (i.e., the ensemble) of classification or regression trees. More specifically, if the outcome of interest is continuous, then a random forest model produces an estimate of the outcome by averaging the estimates derived from a series of regression trees. On the other hand, if the outcome is binary, a random forest generates an estimate defined as the level that is predicted most often across a collection of classification trees. By combining results across an ensemble of trees, random forests avoid the overfitting tendency of any single tree and generate predictions with lower variance compared to those obtained from a single tree (Breiman 2001; James et al. 2013). Each tree in the forest is grown using an independent bootstrap sample that is the same size as the original dataset and is selected with replacement from it. While not as commonly used for this purpose, response propensities can be estimated from random forests as the fraction of trees in the forest that predict a returned survey for a given address (see, for example, Buskirk and Kolenikov 2015). We note that the more common approach with binary outcomes is for the random forest to generate an estimated class for each sampled case (e.g., respondent or not).
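As an illustration of this idea, the sketch below fits a random forest with the randomForest package (one of the R packages highlighted in Table 3) and recovers response propensities as the fraction of trees voting for a response. It reuses the hypothetical `svy` data frame from the earlier CART sketch.

```r
# Minimal sketch: random forest response propensities in R.
# Assumes the hypothetical `svy` data frame from the CART example above.
library(randomForest)

set.seed(2024)  # each tree uses a random bootstrap sample and random splits

rf_fit <- randomForest(factor(responded) ~ age + region + income,
                       data = svy,
                       ntree = 500)  # number of trees in the forest

# Predicted class (respondent vs. nonrespondent): majority vote across trees
rf_class <- predict(rf_fit, newdata = svy, type = "response")

# Estimated response propensity: fraction of trees predicting a response
rf_propensity <- predict(rf_fit, newdata = svy, type = "prob")[, "1"]
```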
The overall prediction error of a random forest is a bounded function of the number of trees that generally stabilizes as the forest grows, meaning that after a certain number of trees, the additional reduction in error from adding more trees to the forest becomes negligible (Breiman 2001). However, it is also entirely possible for a smaller forest to produce accuracy rates similar to those of a larger forest (Goldstein, Polley, and Briggs 2011). While the value of mtry (the number of predictors randomly selected as candidate splitting variables at each node) can impact the overall prediction accuracy of the forest, studies have indicated that the overall results tend to be fairly robust, with similar performance being achieved across a fairly wide range of values (Pal 2005). For continuous outcomes, it has been shown in practice that prediction error rates can be reduced by using values of the node size parameter larger than the default (Segal 2004). More details about random forest construction are provided in Figure 2. Popular packages for implementing random forest models in R are highlighted in Table 3.
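The sketch below, again using the hypothetical `svy` data, shows how these tuning parameters (ntree, mtry, and nodesize) are supplied to randomForest and how the out-of-bag error can be tracked as trees are added to the forest.

```r
# Sketch: varying the main random forest tuning parameters.
#   ntree    = number of trees in the forest
#   mtry     = number of predictors considered at each split
#   nodesize = minimum size of terminal nodes
rf_tuned <- randomForest(factor(responded) ~ age + region + income,
                         data = svy,
                         ntree = 1000,
                         mtry = 2,
                         nodesize = 5)

# Out-of-bag error rate after 1, 2, ..., ntree trees; this typically levels
# off, so a smaller forest may perform about as well as a larger one.
plot(rf_tuned$err.rate[, "OOB"], type = "l",
     xlab = "Number of trees", ylab = "OOB error rate")
```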
As mentioned previously, the fact that random forests create estimates by aggregating over a series of trees generally implies less overfitting than a single tree model. Moreover, since each tree is grown from a bootstrap sample taken with replacement, the cases left out of each sample (the so-called out-of-bag cases) provide an internally valid and nearly unbiased estimate of performance. Unlike single tree models, which are easy to visualize, random forests are not easily visualized; however, they can produce a ranking of variable importance for each possible predictor that can easily be displayed graphically. Other major advantages and disadvantages of random forests are provided in Table 4.
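For example, the short sketch below (still assuming the hypothetical `svy` data) requests permutation-based importance measures from randomForest and displays them graphically.

```r
# Sketch: extracting and plotting variable importance from a fitted forest
rf_imp <- randomForest(factor(responded) ~ age + region + income,
                       data = svy,
                       importance = TRUE)  # compute permutation importance

importance(rf_imp)   # importance measures for each predictor
varImpPlot(rf_imp)   # graphical ranking of the predictors
```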
The use of random forest models in survey research has not been as common as that of tree-based models, but it has been increasing steadily over the past five years. For example, Caiola and Reiter (2010) illustrated how random forests could be used to generate partially synthetic categorical data using data from the 2000 U.S. Current Population Survey. Buskirk, West, and Burks (2013) investigated the use of random forests for estimating response propensities that were then applied to sampled units in subsequent cross-sectional surveys at later time points. Earp et al. (2014) investigated the use of a random forest-like ensemble of trees for evaluating nonresponse bias for establishment surveys. Buskirk and Kolenikov (2015) compared logistic regression and random forest models for nonresponse adjustments to sampling weights based on propensity scores.
Generally speaking, the random forest model outperformed both the tree and logistic regression models on a majority of the metrics, and both the forest and tree models outperformed the logistic regression model on all metrics. In particular, both the random forest and tree models were more specific than the logistic regression model (i.e., higher correct detection of nonrespondents) and had sensitivity values that were between 7 and 10 percentage points higher (i.e., higher correct detection of respondents). A similar gap between the forest and tree models and the logistic regression model was also observed for the area under the ROC curve. Because the binary outcome was simulated through a series of probit models involving nonlinear and interaction terms, we would expect lower performance from the main-effects logistic regression. It is also important to note that, owing to their nonparametric nature, the forest and tree models were able to approximate these more complex, nonlinear probit models and create predictions with a relatively high level of accuracy and performance without having to specify the shape/structure of the underlying survey outcome model.
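For readers who want to reproduce this type of comparison, the sketch below computes sensitivity, specificity, and the area under the ROC curve from a vector of predicted propensities. The outcome and propensity vectors are hypothetical stand-ins (here taken from the earlier random forest sketch), and the pROC package is only one of several options for the AUC calculation.

```r
# Sketch: sensitivity, specificity, and AUC for predicted response propensities.
# `observed` is the 0/1 response indicator; `propensity` holds a model's
# predicted response probabilities (both hypothetical here).
observed   <- svy$responded
propensity <- rf_propensity

predicted <- as.integer(propensity >= 0.5)   # classify at a 0.5 cutoff

conf <- table(Predicted = factor(predicted, levels = c(0, 1)),
              Observed  = factor(observed,  levels = c(0, 1)))

sensitivity <- conf["1", "1"] / sum(conf[, "1"])  # correct detection of respondents
specificity <- conf["0", "0"] / sum(conf[, "0"])  # correct detection of nonrespondents

# Area under the ROC curve (one option: the pROC package)
library(pROC)
auc_value <- auc(roc(observed, propensity))
```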
The first algorithm for random decision forests was created in 1995 by Tin Kam Ho using the random subspace method, which, in Ho's formulation, is a way to implement the "stochastic discrimination" approach to classification proposed by Eugene Kleinberg.
An extension of the algorithm was developed by Leo Breiman and Adele Cutler, who registered "Random Forests" as a trademark in 2006 (as of 2019, owned by Minitab, Inc.). The extension combines Breiman's "bagging" idea with the random selection of features, introduced first by Ho and later independently by Amit and Geman, in order to construct a collection of decision trees with controlled variance.
The general method of random decision forests was first proposed by Salzberg and Heath in 1993, with a method that used a randomized decision tree algorithm to generate multiple different trees and then combine them using majority voting. This idea was developed further by Ho in 1995. Ho established that forests of trees splitting with oblique hyperplanes can gain accuracy as they grow without suffering from overtraining, as long as the forests are randomly restricted to be sensitive to only selected feature dimensions. Subsequent work along the same lines concluded that other splitting methods behave similarly, as long as they are randomly forced to be insensitive to some feature dimensions. Note that this observation, that a more complex classifier (a larger forest) becomes more accurate nearly monotonically, is in sharp contrast to the common belief that the complexity of a classifier can only grow to a certain level of accuracy before being hurt by overfitting. The explanation of the forest method's resistance to overtraining can be found in Kleinberg's theory of stochastic discrimination.
The early development of Breiman's notion of random forests was influenced by the work of Amit and Geman, who introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single tree. The idea of random subspace selection from Ho was also influential in the design of random forests: in this method, a forest of trees is grown, and variation among the trees is introduced by projecting the training data into a randomly chosen subspace before fitting each tree or each node. Finally, the idea of randomized node optimization, where the decision at each node is selected by a randomized procedure rather than by deterministic optimization, was first introduced by Thomas G. Dietterich.