Fulltext:
https://www.dropbox.com/s/lrqb9ajthk0o6tv/2015-dougherty.pdf
/
http://sci-hub.org/downloads/1b56/dougherty2015.pdf
> While we commend Au et al. on a rigorous meta-analysis, we contend that their analysis insufficiently addresses these issues. For example, while Au et al. relied on well-established null hypothesis significance testing (NHST) methods for meta-analysis, two well-known limitations of the NHST framework are that it tends to overstate evidence for the alternative hypothesis and does not permit one to evaluate the relative probability that the null hypothesis is in fact true. In the context of the WM training literature, both of these problems are especially salient because the primary issue of debate is whether working memory training is effective at all. This implies a need to evaluate the degree to which the data support the alternative hypothesis relative to the null, and is most easily addressed within a Bayesian approach.
Especially if you use an informative prior for how often intelligence
interventions fail to boost scores on IQ tests, and how often IQ test
score increases turn out not to load on the latent g factor.
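(To gesture at the arithmetic — a toy posterior-odds calculation in R,
with every prior probability invented purely for illustration:)

    # Toy calculation: posterior odds = prior odds * Bayes factor.
    # The two prior probabilities below are made up, not estimates.
    prior_success  <- 0.10 * 0.50  # P(boosts IQ scores) * P(gain is on g | boost)
    prior_odds     <- prior_success / (1 - prior_success)  # ~1:19 against
    bf             <- 152          # the aggregate meta-analytic BF reported below
    prior_odds * bf                # ~8:1 -- much less 'decisive' than 152:1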
> Au et al. (2015) made an excellent attempt to reduce the potential influence of publication bias, with many studies included from nonpublished reports. The selection of studies to be included in the analysis appears to have been thorough and fair.
lol no.
> Au et al. (2015) presented effect sizes for 24 individual comparisons drawn from 20 papers. The aggregate weighted effect size across these 24 comparisons was 0.24. They also evaluated several possible mediators, including whether the studies used an active control (N = 12) or a passive control (N = 12), which yielded effect sizes of 0.06 and 0.44, respectively. Although Au et al. reported this effect as significant, they concluded that type of control group did not moderate the effect. This strikes us as an odd conclusion given that the magnitudes of these effect sizes differ considerably. Au et al.'s conclusion was based on a comparison between the control groups for active and passive studies, not by comparing the control groups to the treatment condition. The comparison of control groups while ignoring the training groups isn't particularly informative regarding the effect of training, since the effects of training can only be assessed relative to the control. In this regard, it is interesting to note that the effect size for the training condition amongst active-control studies (d = 0.25) is actually numerically smaller than the effect size amongst the control participants in the passive control studies (d = 0.28). The question is: Do these effect sizes provide evidence for training effectiveness?
Indeed. But in Au's defense, they never disputed that the inflation of
the passive control groups exists (as it obviously does, just looking
at the chart); they claimed that an interaction with non-American
samples is what drives it.
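(The right moderator test compares treatment-vs-control effects across
design types, not control groups against each other; in frequentist
terms that is just a meta-regression, e.g. with the metafor package in
R. The dataset and column names below are hypothetical, since Au et
al.'s data isn't public:)

    library(metafor)
    # One row per comparison; 'g'/'vg' = treatment-vs-control Hedges' g and
    # its sampling variance; 'control' = active/passive; 'usa' = TRUE/FALSE.
    # ('dat' and all column names are hypothetical.)
    m <- rma(yi = g, vi = vg, mods = ~ control + usa, data = dat)
    summary(m)  # moderator coefficients test whether design/country shift the effect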
> The first step of our analysis involves transforming the effect sizes presented in Fig. 3 of Au et al. to their corresponding t values using t = g / sqrt(1/n1 + 1/n2), where g is the measure of effect size and n1 and n2 are the sample sizes for two independent groups used in the effect-size calculations. We then computed the default Bayes factor (BF) corresponding to each t statistic using the ttestBF function in the BayesFactor package in R (Morey, Rouder, & Jamil, 2014; R Core Team, 2014) as well as the meta-analytic Bayes factor using the meta.ttestBF function. For all analyses, we set the scale factor on effect size to r = 1 and used a one-sided interval, which places the mass of the prior on effects greater than zero. The one-sided test is a reasonable assumption under the hypothesis that training should lead to improvements in Gf. Importantly, even large modifications to the prior distribution do not alter our conclusions in any substantive way, nor does using a two-sided null interval.
>
> ...As should be evident from Fig. 1 and Table 1, few of the individual studies provide particularly strong evidence for either the null or the alternative. Yet, looking across the entirety of the results, a curious pattern is obvious. First, 11 of the 12 effect sizes for the passive control studies are positive, whereas only 6 of the 12 effect sizes are positive for the active-control studies. Second, when these effect sizes are evaluated in terms of the Bayes factor, the majority of the individual studies favor the null hypothesis, including 6 of the 12 passive-control studies. These individual results using the BF roughly mirror the conclusions drawn from the significance tests, though the BF illustrates that the bulk of the studies show evidence for the null. However, these individual comparisons do not capitalize on a major strength of meta-analytic techniques, which is the ability to aggregate across studies to overcome the sample size problem. Moving on to the meta-analytic results, here the results diverge somewhat from the conclusions garnered from the individual studies. First, ignoring the type of control, the odds in favor of the alternative hypothesis are 152:1. This qualifies as 'decisive' evidence according to Jeffreys' (1961) scheme. Figure 2, which provides the BFs conditioned on the use of passive- versus active-control groups, paints a much different picture. While the Bayes factor for the passive control studies is a whopping 13,241:1 in favor of the alternative, the Bayes factor for the active control studies is a more modest 7.7:1 in favor of the null.
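Their pipeline is straightforward to reproduce from the forest-plot
numbers. A minimal sketch with the BayesFactor package — note I use
ttest.tstat(), which accepts a t statistic directly (ttestBF() proper
wants raw data), and the g and n values are illustrative, not taken
from any particular study:

    library(BayesFactor)
    # t statistic for two independent groups, from Hedges' g:
    # t = g / sqrt(1/n1 + 1/n2)
    g  <- 0.25; n1 <- 20; n2 <- 20                 # illustrative values only
    t  <- g / sqrt(1/n1 + 1/n2)
    # Default JZS Bayes factor, one-sided (effect > 0), scale r = 1:
    ttest.tstat(t = t, n1 = n1, n2 = n2,
                nullInterval = c(0, Inf), rscale = 1, simple = TRUE)
    # Meta-analytic BF, pooling t statistics across comparisons:
    ts <- c(1.1, -0.3, 0.8)                        # fake study-level t's
    meta.ttestBF(t = ts, n1 = c(20, 25, 15), n2 = c(20, 24, 16),
                 nullInterval = c(0, Inf), rscale = 1)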
A hypothesis-testing framework is ugly though, and makes it harder to
examine Au et al's international claim, so:
> Thus, we conducted a series of follow-up analyses using hierarchical Bayesian modeling, in which we modeled the effect sizes as a function of control group type (passive vs. active) as well as an additive effect of both control group type and country of origin (USA vs. non-USA). Au et al. (2015) identified country of origin as an important moderator variable, with studies conducted within the USA yielding a small nonsignificant effect size and studies conducted outside the USA resulting in a moderate significant effect size – an effect that Au et al. hypothesized could be due to differences in motivation or compliance between USA and non-USA subjects. The inclusion of country of origin in our analysis allowed us to control for a potentially important source of variability that Au et al. (2015) felt was theoretically justified. As we illustrate, inclusion of this variable in the Bayesian model reveals that the only estimated effect sizes that are different from zero are those based on non-USA passive-control studies. Furthermore, the estimated effect size for the active-control studies within the USA shrinks to essentially zero...Strikingly, even when the prior distribution is set such that the effect of training is assumed to be large, there is still no evidence that n-back training leads to improvements on Gf measures. While this model estimates that the median effect size amongst the active control studies is slightly above zero, this small positive effect is essentially eliminated when country of origin is added as a predictor in the model, as shown in Figs. 5 and 6. Importantly, the three international studies using active controls fail to yield a reliable positive effect. Furthermore, at the aggregate level the only effect sizes for which the HDI does not include zero are those based on studies conducted outside the USA that use passive control designs.
>
> ...In fact, the mere size of the BF for the passive control studies should be enough to warrant a critical eye to those studies, especially given the a priori uncertainty surrounding the question of whether WM training can improve Gf. This leaves us with the 12 active control studies, for which (a) the Bayes factors for the individual studies overwhelmingly favor the null, (b) the meta-analytic BF favors the null, (c) the estimated effect sizes are not different from zero, and (d) half of the studies show raw effect sizes indicating a negative effect of transfer.
>
> ...The hierarchical Bayesian models suggest a two-factor model for explaining training effects: One factor is the type of experimental design used by the researcher (active vs. passive control) and the other is country of origin of the study (USA vs. non-USA). We submit that the discrepancy between the active and passive controls is consistent with a placebo effect, and we suspect that the effect of country of origin reflects idiosyncratic differences in experimental methods between the USA and non-USA studies. Setting aside specific causal mechanisms for the observed pattern of effect sizes, it is clear that the data reflect two separate data-generating processes, neither of which can be attributed to n-back training...It should be noted, however, that choice of experimental design also covaried with whether the study was conducted within the USA or outside the USA. Most of the studies using active controls were conducted within the USA, whereas the majority of the studies conducted outside of the USA used passive controls. While this leaves open the possibility that cultural differences are driving the difference between the active and passive studies, we doubt cultural differences would account for the 7-fold increase in the training effect, especially since the non-USA studies were primarily conducted in Westernized cultures (e.g., Europe).
Much as expected.
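Their hierarchical model is also easy to approximate with off-the-shelf
tools. A sketch of a Bayesian random-effects meta-regression using the
brms package (my formulation, not necessarily Dougherty et al.'s exact
model; 'dat' and its columns are hypothetical, as before):

    library(brms)
    # 'g'/'se_g' = per-comparison effect size and its standard error;
    # 'control', 'usa', 'study' as in the earlier sketch (all hypothetical).
    m <- brm(g | se(se_g) ~ control + usa + (1 | study),
             data = dat, family = gaussian(),
             prior = prior(normal(0, 1), class = "b"))
    posterior_summary(m)  # 95% credible intervals (the paper reports HDIs);
                          # check which cells exclude zero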
> The authors thank Jacky Au and Susan Jaeggi for sharing their data and for providing details of their analysis.
Almost a year later, Au still hasn't sent me the data, incidentally.
--
gwern
http://www.gwern.net