What does it take to show intelligence gains? Hayes et al 2014


Gwern Branwen

Oct 23, 2014, 3:06:57 PM
to N-back
"Do We Really Become Smarter When Our Fluid-Intelligence Test Scores
Improve?", Hayes et al 2014
http://memory.osu.edu/_static/pubs/Hayes.etal.Intelligence.pdf

> Recent reports of training-induced gains on fluid intelligence tests have fueled an explosion of interest in cognitive training - now a billion-dollar industry. The interpretation of these results is questionable because score *gains* can be dominated by factors that play marginal roles in the scores themselves, and because intelligence gain is not the only possible explanation for the observed control-adjusted far transfer across tasks. Here we present novel evidence that the test score gains used to measure the efficacy of cognitive training may reflect strategy refinement instead of intelligence gains. A novel scanpath analysis of eye movement data from 35 participants solving Raven's Advanced Progressive Matrices on two separate sessions indicated that one-third of the variance of score gains could be attributed to test-taking strategy alone, as revealed by characteristic changes in eye-fixation patterns. When the strategic contaminant was partialled out, the residual score gains were no longer significant. These results are compatible with established theories of skill acquisition suggesting that procedural knowledge tacitly acquired during training can later be utilized at posttest. Our novel method and result not only underline a reason to be wary of purported intelligence gains, but also provide a way forward for testing for them in the future.
>
> Can intelligence be improved with training? For the most part, the numerous training methods attempted through the years have yielded disappointing results for healthy adults (e.g., Detterman & Sternberg, 1982). Nonetheless, if an effective training method could be designed, it would have immense practical implications. Therefore, when Jaeggi, Buschkuehl, Jonides, and Perrig (2008) recently published some encouraging experimental results, they were greeted with remarkable enthusiasm. Cognitive enhancement is now a billion-dollar industry ("Brain sells," 2013). Millions of customers buy "brain building" games and subscribe to "mental gyms" on-line where they perform various "cognitive workouts" in the hope of raising their IQ (Hurley, 2012). Hundreds of millions of dollars are being invested in educational (e.g., Cogmed, http://www.cogmed.com), military, and commercial programs (e.g., Lumosity, http://www.lumosity.com) on the assumption that intelligence can be improved through training. But can it really? Given the massive societal resources that are at stake and the checkered track record of similar initiatives in the past (e.g., Detterman & Sternberg, 1982; Melby-Lervåg & Hulme, 2013; Owen et al., 2010), this claim must be evaluated very carefully.
>
> - Detterman, D. K., & Sternberg, R. J. (Eds.). (1982). _How and how much can intelligence be increased?_ Mahwah, NJ: Erlbaum.
> - Melby-Lervåg, M., & Hulme, C. (2013). "Is Working Memory Training Effective? A Meta-Analytic Review" http://www.apa.org/pubs/journals/releases/dev-49-2-270.pdf . _Developmental Psychology_, 49(2), 270-291.
> - Owen, A. M., Hampshire, A., Grahn, J. A., Stenton, R., Dajani, S., Burns, A. S., . . . Ballard, C. G. (2010). "Putting brain training to the test" http://uwo.ca/bmi/owenlab/pdf/2010-Owen-Nature-Putting%20brain%20training%20to%20the%20test.pdf . _Nature_, 465(7299), 775-778.
>
> ...One goal of this article is to point out how methodologically challenging it is to measure the *change* of a latent variable.
>
> ...This hypothesis is simple and elegant but the methodology for testing it empirically is fraught with difficulties because an objective method for measuring *Gf gains* is required. The commonly used test-retest method is seriously flawed. The overwhelming majority of studies use test-retest score gains to measure Gf gains. This practice is based on the misleading intuition that if a test such as Raven's APM is a valid measure of Gf, then a *gain* in the score on this test is a valid measure of *Gf gain*. This is not necessarily true because, in addition to Gf, the scores reflect non-Gf factors such as visuospatial ability, motivation, and test-taking strategy. The latter factors - and hence the test scores - can improve while Gf itself remains stable. Indeed, Raven's APM scores increase significantly on repeated testing without any targeted training (e.g., Bors & Vigneau, 2003; Bors & Forrin, 1995; Denney & Heidrich, 1990). Worse, a large meta-analysis of 64 test-retest studies (te Nijenhuis, van Vianen, & van der Flier, 2007) indicates a strong *negative* correlation between score gains and the g loadings of test items. To control for such "mere retest" effects, the common practice in the field is to compare the score gains in the treatment group to those in an untreated control group. Cognitive enhancement advocates (e.g., Jaeggi et al., 2008) acknowledge the interpretive problems of unadjusted score gains but assume that control-adjusted gains necessarily measure real gains in Gf. As we argue below, however, this assumption is incorrect because the adjustment does not guarantee validity either.
>
> - Bors, D. A., & Vigneau, F. (2003). "The effect of practice on Raven's advanced progressive matrices" https://pdf.yt/d/NtcZNg9J9uo24uOH / https://dl.dropboxusercontent.com/u/5317066/DNB/2003-bors.pdf / http://libgen.org/scimag/get.php?doi=10.1016%2Fs1041-6080%2803%2900015-3 . _Learning and Individual Differences_, 13(4), 291-312.
> - Bors, D. A., & Forrin, B. (1995). "Age, speed of information processing, recall, and fluid intelligence" http://jtoomim.org/brain-training/age,%20speed%20of%20information%20processing,recall%20and%20fluid%20intelligence.pdf . _Intelligence_, 20(3), 229-248.
> - Denney, N. W., & Heidrich, S. M. (1990). "Training effects on Raven's progressive matrices in young, middle-aged, and elderly adults" https://pdf.yt/d/1JJEk_PooLNixWBK / https://dl.dropboxusercontent.com/u/5317066/DNB/1990-denney.pdf . _Psychology and Aging_, 5(1), 144-145.
> - te Nijenhuis, J., van Vianen, A. E. M., & van der Flier, H. (2007). "Score gains on g-loaded tests: No g" http://emilkirkegaard.dk/en/wp-content/uploads/Score-gains-on-g-loaded-tests-No-g.pdf . _Intelligence_, 35, 283-300.
>
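The te Nijenhuis et al result - retest gains *negatively* correlated with g loadings - falls straight out of a toy simulation in which practice improves only the non-g component of each item. Everything below (loadings, effect sizes, the absence of measurement noise) is invented for illustration and is not taken from the meta-analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_items = 500, 40

# Each item loads on g to a different degree; the remainder is an
# item-specific (strategic/visuospatial) component.
g_loading = rng.uniform(0.2, 0.9, n_items)
g = rng.normal(0, 1, n_people)
specific = rng.normal(0, 1, (n_people, n_items))

def scores(practice_boost):
    # Practice improves only the non-g component of each item;
    # measurement noise is omitted for clarity.
    return (g[:, None] * g_loading
            + np.sqrt(1 - g_loading**2) * (specific + practice_boost))

pre, post = scores(0.0), scores(0.5)    # g itself is unchanged at retest
item_gain = (post - pre).mean(axis=0)   # mean retest gain per item
r = np.corrcoef(g_loading, item_gain)[0, 1]
print(f"correlation(g loading, item gain) = {r:.2f}")  # strongly negative
```

The more an item depends on g, the less room practice has to move it, so item-level gains anti-correlate with g loadings even though nobody's g changed.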
> These methodological difficulties can be illustrated by analogy with athletics. In a classic study of motor skill learning (Hatze, 1976), an athlete practiced kicking a target as rapidly as possible. His performance improved at first and then plateaued. However, after seeing a film about kicking technique, the athlete immediately improved his time considerably and with additional practice was able to reach a much higher asymptote. For our purposes, this illustrates the relationships between the following three variables. The first is kicking time, which was the only objective measurement. The second variable is general athletic ability, which includes factors such as cardiovascular capacity, agility, muscle strength, and so forth. The third is kicking technique - the optimal way to execute a kick so as to minimize kicking time, all else being equal. Importantly, because the kicking time reflects a mixture of athletic ability and technique, gains in kicking time can occur without any change in athletic ability. Indeed, watching a movie could not have changed the strength or agility of the participant in Hatze's (1976) experiment. Analogously, gains in test scores can occur without any change in "brainpower" factors such as WM capacity or processing speed.
>
> - Hatze, H. (1976). "Biomechanical aspects of a successful motion optimization". In P. V. Komi (Ed.), _Biomechanics V-B_ (pp. 7-17). Baltimore, MD: University Park Press.
>
> This brings us to the central topic of transfer across tasks. The most widely used inference pattern in the cognitive enhancement literature is to infer gains in Gf on the basis of control-adjusted gains in test scores. This inference pattern logically requires the auxiliary assumption that *only* Gf can transfer across tasks. Few cognitive-enhancement advocates would endorse such a strong claim, and the more cautious authors explicitly disavow it, often near the end of their Discussion sections (e.g., Morrison & Chein, 2011, p. 58). But without this assumption, there is no logically necessary link from the observed control-adjusted score gains to the theoretical conclusion of Gf gains. Why not? Because *non-Gf-related factors can transfer across tasks too.*
>
> The athletic analogy can easily be extended to illustrate this. Suppose that instead of watching a movie, the athlete in Hatze's (1976) experiment practiced a seemingly unrelated task such as high jump. The problem is that tasks that seem unrelated on the surface can still share critical technical components. For example, the approach of the high jump may actually be as important as the takeoff. It requires the right amount of speed and the correct number of strides - factors that affect kicking too. So, if an athlete practices high jump for many hours and then can kick a ball faster than before, is this because the jumping practice improved the explosive power of their leg muscles? Or is it because it provided an opportunity to learn to control the approach better? In other words, was there transfer of athletic ability, of technical components, or both? These possibilities cannot be differentiated on the basis of measured gains in kicking speed alone.
> Analogously, a control-adjusted gain on an intelligence test may stem from genuine Gf transfer from the training task, from transfer of some non-Gf-related component(s), or from a combination thereof.
>
> There are two complementary ways to marshal more data to test whether WM training improves Gf. The first is to assess Gf not with a single test but with a broad battery of multiple tests. The second approach is to use tools from cognitive psychology to open the black box and investigate the actual processes that determine the test scores and the gains thereof. In this article we follow the second approach. The topic of multiple tests is introduced only briefly here and will be discussed in more detail later. This literature is in active development and the results are still tentative. Two emerging patterns are particularly relevant to the present analysis. First, when a battery of multiple Gf tests was administered before and after WM training, strong inter-test correlations were found as expected, and yet only some tests showed a significant control-adjusted transfer effect (Colom et al., 2013; Harrison et al., 2013; Jaeggi, Buschkuehl, Shah, & Jonides, 2014; Stephenson & Halpern, 2013). This selectivity of transfer highlights that test *scores* and *gains* can index distinct aspects of the variability across individuals. The high inter-test correlation presumably reflects the shared Gf loading of *scores*, whereas the dissociable *gains* suggest plasticity in one or more non-Gf-related factors.
>
> Recently we (Hayes et al., 2011) demonstrated that approximately 40% of the variance of Raven's APM scores across participants can be predicted on the basis of individual differences in eye-fixation patterns. Critical for this success was a novel data-processing algorithm called Successor Representation Scanpath Analysis (SRSA, Hayes et al., 2011) that captures the statistical regularities of scanpath sequences of arbitrary lengths...
> ...Importantly, the SRs are interpretable: Different test-taking strategies give rise to characteristic SR patterns that can be traced in the human data (Figure 2). SRSA thus provides unprecedented insight into the role of strategic processing in matrix reasoning tests. Our goal in this article is to apply this powerful new tool to investigate whether strategy refinement can account for the test-retest improvement of Raven scores.
> The answer is a clear yes. We observed a highly significant practice effect, replicating published results (Bors & Vigneau, 2003; Denney & Heidrich, 1990). Approximately 30% of the variance of score gains across participants could be predicted on the basis of individual differences in the changes in eye-fixation patterns as captured by SRSA...Moreover, when the strategy-related variance was partialled out, the residual score gains were no longer significant, even in the high-improvement subgroup. This indicates that strategy refinement is a powerful determinant of score gains - it controls a major portion of the variance and can change the substantive conclusion of an experiment. Consequently, it must be considered carefully when interpreting score gains on Raven's APM and similar matrix-based tests.
>
> - Hayes, T. R., Petrov, A. A., & Sederberg, P. B. (2011). "A novel method for analyzing sequential eye movements reveals strategic influence on Raven's Advanced Progressive Matrices" http://www.journalofvision.org/content/11/10/10.long / http://t.alexpetrov.com/pub/jov11a/HayesPetrovSederberg11-JoV.pdf . _Journal of Vision_, 11(10), 1-11.
>
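For readers unfamiliar with successor representations: SRSA learns, for each area of interest (AOI) on the screen, a discounted prediction of which AOIs will be fixated next, via the standard temporal-difference update. A toy sketch of that core update - the two-AOI scanpath and the alpha/gamma values here are invented for illustration, whereas the real analysis uses the Raven's AOI layout and fits its parameters to the eye-tracking data:

```python
import numpy as np

def successor_representation(scanpath, n_aois, alpha=0.1, gamma=0.9):
    """TD-style successor representation of a fixation sequence.

    M[i, j] estimates the discounted expected number of future visits
    to AOI j, given that the current fixation is on AOI i.
    """
    M = np.zeros((n_aois, n_aois))
    for s, s_next in zip(scanpath, scanpath[1:]):
        onehot = np.zeros(n_aois)
        onehot[s_next] = 1.0
        M[s] += alpha * (onehot + gamma * M[s_next] - M[s])
    return M

# Toy scanpath over 2 AOIs: 0 = problem matrix, 1 = response area.
# A "constructive matching" strategy dwells on the matrix before
# checking answers, so the SR row for AOI 0 is dominated by AOI 0.
scanpath = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]
M = successor_representation(scanpath, n_aois=2)
print(np.round(M, 3))
```

In the published method, the entries of M (one matrix per trial, averaged per participant) become the features from which scores - and, in this paper, score *gains* - are predicted.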
> ...It is important to dispel a tempting interpretive mistake that arises at this point. For concreteness, let us assume that 60% of the variance in Raven's scores is attributable to Gf, whereas less than 10% is attributable to visuospatial ability. One might argue on the basis of these figures that the main Gf component dwarfs the visuospatial "contamination." This is the rationale for the widespread acceptance of Raven's APM as a unidimensional measure of Gf (Raven et al., 1998). However, these figures apply to Raven's scores across individuals, whereas the dependent measure in WM training studies is the difference between two scores for the same individual. If Gf is a stable latent variable, it will contribute equally to the pre- and posttest scores and this contribution, no matter how large, will cancel out in the subtraction. Therefore, *the variance of the score gains can have a radically different composition than the variance of the scores themselves.*
> ...This illustrates a general limitation of score gains - they can lead to fallacious conclusions and hence must be interpreted with great caution. Some prominent methodologists have even advised against their use altogether: "Gain scores are rarely useful, no matter how they may be adjusted or refined. . . . Investigators who ask questions regarding gain scores would ordinarily be better advised to frame their questions in other ways" (Cronbach & Furby, 1970, p. 80).
> ...This methodological imperative is gradually being acknowledged in the field and there is a growing number of studies that administer multiple tests (Colom et al., 2013; Harrison et al., 2013; Jaeggi et al., 2014, 2011; Schmiedek et al., 2010; Stephenson & Halpern, 2013; von Bastian & Oberauer, 2013).
>
> - Cronbach, L. J., & Furby, L. (1970). "How should we measure 'change' - or should we?" https://pdf.yt/d/YDkta3Cxn-sALifN / https://dl.dropboxusercontent.com/u/5317066/DNB/1970-cronbach.pdf . _Psychological Bulletin_, 74(1), 68-80.
>
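The italicized point is easy to verify numerically: let Gf dominate the score variance but hold it fixed across sessions, and it contributes nothing at all to the gain variance. A minimal sketch, with all variances and effect sizes invented:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

gf = rng.normal(0, 1.0, n)             # stable latent ability
strat_pre = rng.normal(0, 0.3, n)      # test-taking strategy, session 1
strat_gain = rng.normal(0.4, 0.3, n)   # strategy refinement by session 2
noise = lambda: rng.normal(0, 0.3, n)  # fresh measurement error per session

pre = gf + strat_pre + noise()
post = gf + (strat_pre + strat_gain) + noise()
gain = post - pre                      # gf cancels exactly in the subtraction

print(f"Gf share of score variance: {np.var(gf) / np.var(pre):.0%}")
print(f"Gf share of gain variance:  {np.corrcoef(gf, gain)[0, 1]**2:.0%}")
```

The scores are an excellent Gf measure; the gains, composed entirely of strategy change and noise, measure no Gf at all.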
> Fortunately, quantitative psychologists have developed sophisticated methods for analyzing learning and change at the latent level. A test of the training effect on Gf can be realized by using a bifactor model (Yung, Thissen, & McLeod, 1999) with Gf as the general dimension. The model must guarantee that the nature of the latent variable does not change from pretest to posttest and that the training effect is an effect on this general dimension. One method that guarantees this is the Multiple-indicator multiple-cause (MIMIC) model (Goldberger, 1972) with pretest-versus-posttest as an external covariate of the general dimension that is shared by pretest and posttest. The same modeling framework also makes it possible to estimate effects on more specific latent variables and to isolate a strategy-specific effect from a genuine effect on Gf. The Latent difference score model (McArdle & Nesselroade, 1994) is based on similar principles and has similar virtues. It has already been applied successfully to cognitive enhancement data (Schmiedek et al., 2010). A second approach to guarantee comparability between pretest and posttest is to analyze the data at the level of individual test items instead of aggregate scores. Item response theory (De Boeck & Wilson, 2004) can then be used to impose constraints on the item parameters at pretest and posttest. This approach is developed in Embretson's (1991) model of learning and change.
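As a concrete (and much simplified) version of the item-level approach: simulate Rasch responses at pretest and posttest, constrain the item difficulties to be equal across sessions, and estimate the ability change as a single shift parameter by joint maximum likelihood. This is a sketch of the general idea only, not Embretson's (1991) model or any published fitting procedure; all parameter values are invented:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(2)
n_people, n_items = 300, 30
theta = rng.normal(0, 1, n_people)   # latent ability (stable across sessions)
b = rng.normal(0, 1, n_items)        # item difficulties (same test twice)
delta_true = 0.5                     # true ability shift at posttest

def simulate(shift):
    p = expit(theta[:, None] + shift - b[None, :])
    return (rng.random((n_people, n_items)) < p).astype(float)

X = [simulate(0.0), simulate(delta_true)]   # pretest, posttest responses

def negloglik(params):
    """Joint Rasch likelihood with item difficulties shared across
    sessions and one session-shift parameter; returns (nll, gradient)."""
    th = params[:n_people]
    bb = params[n_people:-1]
    bb = bb - bb.mean()              # identification: mean difficulty = 0
    delta = params[-1]
    nll, g_th = 0.0, np.zeros(n_people)
    g_bb, g_delta = np.zeros(n_items), 0.0
    for s, shift in enumerate((0.0, delta)):
        p = expit(th[:, None] + shift - bb[None, :])
        nll -= np.sum(X[s] * np.log(p + 1e-12)
                      + (1 - X[s]) * np.log(1 - p + 1e-12))
        G = p - X[s]                 # d(nll)/d(logit)
        g_th += G.sum(axis=1)
        g_bb -= G.sum(axis=0)
        if s == 1:
            g_delta = G.sum()
    g_bb -= g_bb.mean()              # chain rule through the centering
    return nll, np.concatenate([g_th, g_bb, [g_delta]])

x0 = np.zeros(n_people + n_items + 1)
res = minimize(negloglik, x0, jac=True, method="L-BFGS-B")
delta_hat = res.x[-1]
print(f"estimated session shift: {delta_hat:.2f} (true: {delta_true})")
```

The equality constraint on the item difficulties is what guarantees that pretest and posttest measure the same construct; in a real analysis the shift would then be tested against zero, e.g. with a likelihood-ratio test against the model with the shift fixed at 0.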

--
gwern
http://www.gwern.net