http://wmlabs.psy.unipd.it/Publication/zavagnin/Carretti%20et%20al._2012_Gains%20in%20language%20comprehension%20relating%20to%20working%20memory%20training%20in%20healthy%20older%20adults.pdf
"Gains in language comprehension relating to working memory training
in healthy older adults"; Carretti, Borella, et al 2012 (linked in the
research thread).
It doesn't use n-back and so is of limited interest to us, but I
thought the IQ results afforded an interesting example of an otherwise
abstruse statistical maxim: "The difference between “statistically
significant” and “not statistically significant” is not in itself
necessarily statistically significant"
http://andrewgelman.com/2005/06/14/the_difference/
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.65.3470
> It is common in applied research–in the last couple of weeks, I have seen this mistake made in a talk by a leading political scientist and a paper by a psychologist–to compare two effects, from two different analyses, one of which is statistically significant and one which is not, and then to try to interpret/explain the difference. Without any recognition that the difference itself was not statistically significant.
>
> Let me explain. Consider two experiments, one giving an estimated effect of 25 (with a standard error of 10) and the other with an estimate of 10 (with a standard error of 10). The first is highly statistically significant (with a p-value of 1.2%) and the second is clearly not statistically significant (with an estimate that is no bigger than its s.e.).
>
> What about the difference? The difference is 15 (with a s.e. of sqrt(10^2+10^2)=14.1), which is clearly not statistically significant! (The z-score is only 1.1.)
In other words, one of the results may have been 'significant' enough
to allow one to 'reject' the null hypothesis, and the other result not
'significant' enough; but each of those tests addresses only its own
original null hypothesis (e.g. x==0, or y==0), and not the new null
hypothesis that you *actually* care about, which is x==y.
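The arithmetic in Gelman's example is easy to check. Here is a minimal
sketch in Python, assuming normally distributed estimates and two-sided
tests (which is what reproduces his 1.2% figure); the function and
variable names are my own, not from the blog post or the paper:

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z-score under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

def z_and_p(estimate, se):
    """z-score and two-sided p-value for the null 'true effect == 0'."""
    z = estimate / se
    return z, two_sided_p(z)

# Gelman's two hypothetical experiments: (estimate, standard error).
est_a, se_a = 25.0, 10.0
est_b, se_b = 10.0, 10.0

z_a, p_a = z_and_p(est_a, se_a)   # null: a == 0
z_b, p_b = z_and_p(est_b, se_b)   # null: b == 0

# The comparison we actually care about: null a == b.
# For independent estimates the variances add.
diff = est_a - est_b
se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
z_diff, p_diff = z_and_p(diff, se_diff)

for label, z, p in [("A vs 0", z_a, p_a),
                    ("B vs 0", z_b, p_b),
                    ("A vs B", z_diff, p_diff)]:
    print(f"{label}: z = {z:.2f}, p = {p:.3f}")
```

The first test comes out around p = 0.012, the second around 0.32, and
the test of the difference around 0.29: one 'significant' result, one
not, and a difference between them that is nowhere near significant.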
Specifically, from the Carretti paper:
> Fluid intelligence: Cattell. Performance improved from pretest to posttest (MDiff. = −2.81, p < .05) and follow-up (MDiff. = −2.5, p < .001), with no difference between the latter two. Although the interaction was only marginally significant, post hoc comparisons indicated that the trained participants performed better at posttest (p < .001) and follow-up (p < .001) than at pretest, and their better performance was maintained from posttest to follow-up. No significant differences emerged for the control group. No statistically significant differences emerged between the groups in any session, however.
In other words: pooled across groups, scores improved from the
pre-test to the post-test, and that improvement was 'statistically
significant'; in this case, the null hypothesis is 'the scores the
second time == the first time', which we reject. No surprise there:
we know about things like test-retest effects, so it would be odd if
scores didn't improve on retesting. Broken down by group, the post hoc
comparisons find the training group's gain 'significant' and the
control group's not. However, we don't really care about any of that.
What we care about is a different null hypothesis: that the training
group improved no more than the control group. *This* null hypothesis
turns out to be unrejectable (the training group has higher means, but
the larger standard deviation of the control group wipes out the
significance); hence we are left with a statistically-significant
improvement in one group and a non-significant one in the other, whose
difference is itself not statistically-significant.
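To see how a noisier comparison group can erase a between-group
difference, here is a rough sketch using Welch's t statistic on summary
statistics. Every number below is made up for illustration and is NOT
taken from Carretti et al; the function name is my own, and I make no
claim that this is the test the paper ran:

```python
import math

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for the null hypothesis mean_a == mean_b (unequal variances)."""
    var_a = sd_a ** 2 / n_a   # squared standard error of mean_a
    var_b = sd_b ** 2 / n_b   # squared standard error of mean_b
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t, df

# HYPOTHETICAL gain scores (posttest minus pretest), not from the paper:
# training group gains 3 points on average, control group gains 1.
# Same means in both scenarios; only the control group's SD changes.
t_tight, df_tight = welch_t(3.0, 3.0, 20, 1.0, 3.0, 20)  # control SD = 3
t_noisy, df_noisy = welch_t(3.0, 3.0, 20, 1.0, 6.0, 20)  # control SD = 6

print(f"control SD 3: t = {t_tight:.2f}, df = {df_tight:.1f}")
print(f"control SD 6: t = {t_noisy:.2f}, df = {df_noisy:.1f}")
```

With the same 2-point gap in mean gains, doubling the control group's
standard deviation drops t from a bit above 2 to about 1.3, which is
the kind of thing that turns a 'significant' group difference into a
non-significant one.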
--
gwern
http://www.gwern.net