What does it take to show IQ gains? te Nijenhuis et al 2007 & te Nijenhuis et al 2014


Gwern Branwen

Aug 7, 2014, 3:46:52 PM
to N-back, Elijah Armstrong
Summary: "hollow" gains are common, in which training or motivational
effects cause apparent increases on an IQ test without any gain in
underlying ability, especially on single-measure IQ tests; see
http://www.pnas.org/content/105/19/6791.full , Nutley 2011, and
Shipstead, Redick, & Engle 2012 on the methodological points here.
This would explain the apparent increases from n-backing, although one
could not demonstrate it from most n-back studies, which use a single
measure like the RAPM or a subset of a wider battery like the WAIS.
One might be able to demonstrate the hollowness using Thompson
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0063614#close ,
Colom et al 2010
http://jtoomim.org/brain-training/Improvement%20in%20working%20memory%20is%20not%20related%20to%20increased%20intelligence%20scores.pdf ,
or Colom et al 2013 http://www.gwern.net/docs/dnb/2013-colom.pdf

"Score gains on g-loaded tests: No g", te Nijenhuis et al 2007
http://emilkirkegaard.dk/en/wp-content/uploads/Score-gains-on-g-loaded-tests-No-g.pdf

> IQ scores provide the best general predictor of success in education, job training, and work. However, there are many ways in which IQ scores can be increased, for instance by means of retesting or participation in learning potential training programs. What is the nature of these score gains? Jensen [Jensen, A.R. (1998a). _The g factor: The science of mental ability_] argued that the effects of cognitive interventions on abilities can be explained in terms of Carroll's three-stratum hierarchical factor model. We tested his hypothesis using test–retest data from various Dutch, British, and American IQ test batteries combined into a meta-analysis and learning potential data from South Africa using Raven's Progressive Matrices. The meta-analysis of 64 test–retest studies using IQ batteries (total N = 26,990) yielded a correlation between _g_ loadings and score gains of −1.00, meaning there is no _g_ saturation in score gains. The learning potential study showed that: (1) the correlation between score gains and the _g_ loadedness of item scores is −.39, (2) the _g_ loadedness of item scores decreases after a mediated intervention training, and (3) low-g participants increased their scores more than high-g participants. So, our results support Jensen's hypothesis. The generalizability of test scores resides predominantly in the _g_ component, while the test-specific ability component and the narrow ability component are virtually non-generalizable. As the score gains are not related to g, the generalizable _g_ component decreases and, as it is not unlikely that the training itself is not g-loaded, it is easy to understand why the score gains did not generalize to scores on other cognitive tests and to g-loaded external criteria.
>
> ...IQ test scores can be increased by various forms of training. Kulik, Bangert-Drowns, and Kulik's (1984) meta-analysis on test preparation studies resulted in effect sizes on intelligence tests for practice and additional coaching of 0.25 S.D. and 0.51 S.D., respectively. Dynamic testing (Grigorenko & Sternberg, 1998) focuses on what children learn in a special training in an attempt to go beyond IQ scores. A general finding is that scores go up by 0.5 to 0.7 S.D. after a dynamic training (Swanson & Lussier, 2001). Ericsson and Lehmann (1996) report immense score increases after intensive training, for instance on a memory task very similar to the subtest Forward Digit Span of the WISC.
>
> ...Thus, there is an increase in narrow abilities or test-specific ability that is independent of g. Test-specific ability is defined as that part of a given test's true-score variance that is not common to any other test; i.e., it lacks the power to predict performance on any other tasks except those that are highly similar. Gains on test specificities are therefore not generalizable, but ‘empty’ or ‘hollow’. Only the g component is highly generalizable. Jensen (1998a, ch. 10) gives various examples of empty score gains, including a detailed analysis of the Milwaukee project, claiming IQ scores rose, but not _g_ scores. Another example of empty score gains is given by Christian, Bachman, and Morrison (2001) who state that increases due to schooling show very little transfer across domains.
>
> ...What do we find after repeated test taking? In a classic study by Fleishman and Hempel (1955) as subjects were repeatedly given the same psychomotor tests, the _g_ loading of the tests gradually decreased and each task's specificity increased. Neubauer and Freudenthaler (1994) showed that after 9 h of practice the g loading of a modestly complex intelligence test dropped from .46 to .39. Te Nijenhuis, Voskuijl, and Schijve (2001) showed that after various forms of test preparation the _g_ loadedness of their test battery decreased from .53 to .49...In the first study, Jensen (1998a, ch. 10) analyzed the effect of practice on the General Aptitude Test Battery (GATB). He found negative correlations ranging from −.11 to −.86 between effect sizes on practice and the tests' _g_ loadings. Therefore, the gains were largest on the least cognitively complex tests. In the second study, te Nijenhuis et al. (2001) found a small correlation of −.08 for test practice, and large negative correlations of −.87 for both of their test coaching conditions...In a third study (Coyle, 2006), factor analysis demonstrated that the change in aptitude test scores had a zero loading on the g factor.
>
> ...To test whether there is a negative correlation between _g_ loading of tests and score gains, we carried out a meta-analysis of all test–retest studies of Dutch, British, and American test batteries available in the Netherlands. All studies were simple practice studies (no intervention such as additional coaching took place) and used well-validated tests.
>
> ...In the present study, we corrected for five artifacts that alter the value of outcome measures listed by Hunter and Schmidt (1990): (1) sampling error, (2) reliability of the vector of _g_ loadings, (3) reliability of the vector of score gains, (4) restriction of range of _g_ loadings, and (5) deviation from perfect construct validity.
>
> ...A large-scale meta-analysis of 64 test–retest studies shows that after corrections for several artifacts there is an estimated true correlation of −1.06 between g loading of tests and score gains and virtually all of the variance in observed correlations is attributable to these artifacts. As several artifacts explain virtually all the variance in the effect sizes, other dimensions on which the studies differ, such as age of the test takers, test–retest interval, test used, average-IQ samples, or samples with learning problems, play no role at all...A correlation of −1.06 falls outside the range of acceptable values of a correlation, but one has to make a distinction between the meta-analytical estimate of the true correlation between _g_ and _d_, and the true correlation between _g_ and _d_. We interpret the value of −1.06 for the meta-analytical estimate as meaning that the true correlation between _g_ and _d_ is −1.00. A correlation of −1.00 means that there is an inverse relationship between _g_ and score gains.
>
> ...Both the Dutch training and the South African training took 3 h, but whereas in the Dutch training the focus was on two different test formats, the South African training dealt only with one test format. The test training by Lloyd and Pidgeon (1961) took even less time, namely two half-hour segments, each focusing on one test format. The effect sizes in all studies were roughly comparable. This suggests that the methodologies employed by te Nijenhuis et al. and Lloyd and Pidgeon were more efficient than those used by Skuy et al. It is possible that the components of the mediation training that are not present in the other two training formats are not effective in raising test scores and could therefore be left out. If true, it might be possible to increase the scores on the RSPM by one S.D. with a relatively simple 1-h training.
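The method of correlated vectors and the artifact corrections described above can be sketched numerically. The following is a minimal illustration, not the paper's data: the subtest g-loadings, gain effect sizes, and reliability values are all hypothetical. It also shows how disattenuating an observed correlation can push the estimate outside [−1, 1], just as the paper's meta-analytic estimate of −1.06 is interpreted as a true correlation of −1.00.

```python
import math

# Hypothetical subtest g-loadings and practice-gain effect sizes (d).
# In the pattern the paper reports, gains are largest on the least
# g-loaded subtests, so r(g, d) is strongly negative.
g_loadings = [0.75, 0.68, 0.62, 0.55, 0.48, 0.41, 0.35]
gains_d    = [0.10, 0.15, 0.22, 0.28, 0.35, 0.42, 0.48]

def pearson(xs, ys):
    """Pearson correlation of two equal-length vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r_observed = pearson(g_loadings, gains_d)

# Hunter & Schmidt-style corrections: disattenuate for the (hypothetical)
# reliabilities of the two vectors, then for deviation from perfect
# construct validity of g (the papers use a conservative .90).
rel_g, rel_d, construct_validity = 0.85, 0.80, 0.90
r_corrected = r_observed / (math.sqrt(rel_g * rel_d) * construct_validity)

print(f"observed  r(g, d) = {r_observed:+.2f}")
print(f"corrected r(g, d) = {r_corrected:+.2f}")
```

With these made-up vectors the observed correlation is already near −1, and the corrections push the point estimate below −1; in the meta-analysis such an out-of-range estimate is read as a true correlation of −1.00.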

"Are Headstart gains on the g factor? A meta-analysis", te Nijenhuis
et al 2014 https://pdf.yt/d/i5PsjJ3sNIKHL-JG /
https://dl.dropboxusercontent.com/u/85192141/2014-nijenhuis.pdf

> Headstart studies of compensatory education tend to show impressive gains on IQ scores for children from low-quality environments. However, are these gains on the _g_ factor of intelligence? We report a meta-analysis of the correlation between Headstart gains on the subtests of IQ batteries and the _g_ loadings of these same subtests (K = 8 studies, total N = 602). A meta-analytic sample-weighted correlation of −.51 was found, which became −.80 after corrections for measurement error. We conclude that the pattern in Headstart gains on subtests of an IQ battery is highly similar to the pattern in test–retest gains and is hollow with respect to g. So, Headstart leads to gains in IQ scores, but not to gains in g. We discuss this finding in relation to the Flynn effect, training effects, and heritability.
>
> ...Much research in the past three decades has been centered on the Flynn effect, e.g. the recent special issue in Intelligence (Thompson, 2013). The nature of the effect is hotly debated. Some authors, like Lynn (2013), believe it to be a real increase in intelligence, citing, among other things, the similar rise in height as evidence. Many non-specialists similarly treat the Flynn effect as a real increase in intelligence (e.g. Somin, 2013). Hypothesized causes for a real increase include: better nutrition (Flynn, 1987; Lynn, 2006), heterosis (i.e. outbreeding, Mingroni, 2007), improvement in hygiene (Eppig, Fincher, & Thornhill, 2010), and reduced lead poisoning (Nevin, 2000).
>
> An alternate explanation posits that the effect has little or nothing to do with general intelligence, or g, itself. Jensen (1998, p. 143) invented the method of correlated vectors to check whether a phenomenon has to do with the underlying latent variable of interest, i.e. g, or whether it has to do with the non-g variance. Other researchers have since called phenomena that show a positive relation to the _g_ loading of subtests “Jensen effects” (e.g. Colom, Juan-Espinosa, & García, 2001; Rushton, 1998). Wholly or partly genetically influenced variables, such as subtest heritabilities (Rushton & Jensen, 2010), dysgenic fertility (Woodley & Meisenberg, 2013), fluctuating asymmetry (Prokosch, Yeo, & Miller, 2005), brain size (Rushton & Ankney, 2009), inbreeding depression (Jensen, 1998), and reaction times (Jensen, 1998) have been shown to be Jensen effects.
>
> On the other hand, environmental variables seem to be negative Jensen effects. te Nijenhuis and van der Flier (2013) reported a meta-analysis of the Flynn effect which yielded a negative Jensen effect of −.38 (after corrections). Moreover, in a newer study, Woodley, te Nijenhuis, Must, and Must (2014) reexamined one of the datasets in this meta-analysis and found that if one corrects for increased guessing at the harder items (the Brand effect) then the negative Jensen effect came even closer to −1 at −.82, indicating that the gains may be more hollow with respect to _g_ than previously thought (see also Flynn, te Nijenhuis, & Metzen, 2014). In a related study, te Nijenhuis, van Vianen, and van der Flier (2007) reported a meta-analysis of 64 studies (total N = 26,990) on score gains from test training yielding a negative Jensen effect of −1.0 (after corrections). Score gains from training are theoretically interesting because they present a clear case that one can increase the proxy (or manifest variable), IQ, without increasing the underlying latent variable of interest, g. Whatever causes the Flynn effect gains, it seems likely this effect is similarly mostly hollow with respect to g; it represents no large gain in g. Accordingly, we have not seen the substantial increase in the number of geniuses in Western countries that we could expect to result from a mean increase in _g_ of a standard deviation or more (Jensen, 1987, pp. 445–446). As Herrnstein and Murray (1994, p. 364) point out, a mere 3 IQ point increase in _g_ would make a large difference on the tails of the distribution. For instance, it would increase the number of people above IQ = 130, often taken as the threshold of giftedness, by 68% (from 2.3% to 3.6%). An increase of one or more SD in _g_ could not possibly be overlooked.
>
> - te Nijenhuis, J., van Vianen, A. E., & van der Flier, H. (2007). "Score gains on g-loaded tests: No g". _Intelligence_, 35, 283–300 http://emilkirkegaard.dk/en/wp-content/uploads/Score-gains-on-g-loaded-tests-No-g.pdf
>
> ...Several meta-analyses of Headstart studies showed that children in the program outscored children in control groups (Caruso et al., 1982; Ramey, Bryant, & Suarez, 1985; Nelson, Westhues, & MacLeod, 2003; see also Protzko, Aronson, & Blair, 2013). However, no one, to our knowledge, has yet carried out an analysis to see if the gains are a Jensen effect. ...Spitz (1986) reviewed most of the literature on the attempts to increase intelligence and his conclusions were also mostly negative. He mentions (p. 103) that in the Perry Preschool Program, the teachers seemed to focus on teaching material that was similar to the content of subtests of the IQ tests, so-called “teaching to the test”. It is not unlikely that highly comparable practices were present in many other programs, including Headstart.
>
> ...In the widely accepted model in Figure 1, U_n is the variance specific to each subtest, V_n. The teaching-to-the-test hypothesis can be clearly stated in terms of the model. According to the hypothesis, when one trains test takers on the exact subtests or subtests very similar to those used in a test, the resultant effect is on the U_n factors in the model (and maybe somewhat on the group factors F_n), but there is no increase in the latent variable _g_. If one assumes that test takers are taught comparably on all the subtests, then this leads directly to the prediction that any resultant training effect should have a strong negative correlation with the _g_ loading of the subtests. This is because, for each V_n, the greater the influence of U_n, the smaller the influence of _g_ (through the group factors). If ability in U_n is increased, it will be higher on the V_n's where _g_ has a smaller influence, that is, that are less _g_-loaded (see also Jensen, 1998, pp. 336–337).
>
> This leads us to the present study. The goal was to determine whether the gains from Headstart are similar to training effects, with a strong negative Jensen effect, or whether they are genuine increases in _g_, in which case they should show a strong Jensen effect.
>
> ...To identify studies for inclusion in the meta-analysis, both electronic and manual searches for studies that contained cognitive ability data of Headstart children or adults who participated in a Headstart program as a child were conducted in 2007...Studies that reported IQ scores of Headstart children, preschool, and kindergarten children were included in the meta-analysis. We used the term “Headstart” in a generic sense, so it included preschool and kindergarten children as well. For a study to be included in the meta-analysis two criteria had to be met: First, to get a reliable estimate of the true correlation between Headstart gains and the _g_ loadings the cognitive batteries had to have a minimum of seven subtests; second, well-validated tests had to be used. The general inclusion rules were applied and yielded six papers which resulted in eight correlations between _g_ and _d_ (Headstart gains).
>
> Psychometric meta-analysis is based on the principle that there are artifacts in every dataset and that most of these artifacts can be corrected. In the present meta-analyses we corrected for five artifacts that alter the value of outcome measures listed by Hunter and Schmidt (2004). These are: (1) sampling error, (2) reliability of the vector of _g_ loadings, (3) reliability of the vector of Headstart gains (d), (4) restriction of range of _g_ loadings, and (5) deviation from perfect construct validity.
>
> Table 2 lists the results of the psychometric meta-analysis of the eight data points. The estimated true correlation has a value of −.72, and artifacts explain 71% of the variance in the observed correlations. Finally, a correction for deviation from perfect construct validity in _g_ was made, using a conservative value of .90. This resulted in a value of −.80 for the final estimated true negative Jensen effect.
>
> ...Results were strongly in line with the prediction that Headstart involves a lot of teaching to the test, so that the gains would be strongly at the level of the specific or group factors. The gains involve mostly the non-g variance, which means that they were mostly hollow in terms of _g_. The final estimated true correlation of −.80, rather than a correlation of exactly −1.0, need not mean that there was some gain in _g_. It might instead indicate that the teachers did not give equal amounts of training to activities related to each subtest. The finding that the IQ gains from Headstart were mostly on the non-_g_ variance might explain why IQ gains from such programs fade with time (Brody, 1992). IQ tests given to people of different ages do not have the same items, as items that are useful for discriminating between small children are generally too easy for adults (Jensen, 1980). If one trains young children on the specific factors U_1, U_2, and U_3, and one later tests the same group with another test battery with the specific factors U_4, U_5, and U_6 then the earlier training would be irrelevant (barring any near-transfer effects), and therefore any IQ gain would vanish. Alternatively, one might view the fading of IQ gains in light of the repeated finding that heritability increases with age, or equivalently, environmentality decreases with age (Plomin, DeFries, Knopik, & Neiderhiser, 2013).
>
> ...Indeed, the study of any phenomenon's relation to IQ scores could benefit from applying the method of correlated vectors. This is as true for compensatory education and dual n-back training (e.g., Jaeggi et al., 2010; but see Chooi & Thompson, 2012) as it is for fluoride poisoning (Choi, Sun, Zhang, & Grandjean, 2012) and myopia (Saw et al., 2004). The fact that the literature on intelligence focuses so much on manifest variables (i.e. IQ) leads to confusion in the press when phenomena such as the Flynn effect are reported, as well as when people observe that they can make their IQ scores go up by taking a test more than once. The only remedy is to focus on latent traits and always report the _g_ loading.
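The tail-of-the-distribution point from Herrnstein and Murray quoted above is easy to check under textbook assumptions (normally distributed IQ, mean 100, SD 15; both are assumptions of this sketch, not stated in the excerpt). Note that this recomputation gives a relative increase closer to 58% than the quoted 68%; the exact figure is sensitive to the cutoff and SD assumed.

```python
import math

def fraction_above(threshold, mean, sd=15.0):
    """Upper-tail probability of a normal IQ distribution."""
    z = (threshold - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

before = fraction_above(130, 100)  # share above the giftedness cutoff
after  = fraction_above(130, 103)  # same cutoff after a 3-point mean shift

print(f"before: {before:.1%}, after: {after:.1%}, "
      f"relative increase: {(after - before) / before:.0%}")
```

The qualitative point survives either way: a small shift in the mean produces a disproportionately large change in the extreme tail.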

--
gwern
http://www.gwern.net

Jabba Dabba

Aug 13, 2014, 2:54:31 AM
to brain-t...@googlegroups.com, elijah.l....@gmail.com, gw...@gwern.net
Hmm, this is something I've thought about: at what point do we stop saying that the intelligence gains are hollow?

Suppose we do intensive training for every task on the IQ test; suppose we train for every factor in the Cattell-Horn-Carroll model.

If it is indeed possible that every single component is trainable, then you'd end up with an increase in IQ that would be indistinguishable from if the person really had that IQ in the first place.

So if it looks like a duck, walks like a duck...

Green

Aug 16, 2014, 10:48:20 PM
to brain-t...@googlegroups.com, elijah.l....@gmail.com, gw...@gwern.net

 
> If it is indeed possible that every single component is trainable, then you'd end up with an increase in IQ that would be indistinguishable from if the person really had that IQ in the first place.
>
> So if it looks like a duck, walks like a duck...

Or you end up better at IQ tests without improving at anything else. You can't train for every conceivable intellectual task that could be used to measure intelligence.
