Discussion 2

Stuart Daman

Feb 3, 2010, 9:25:23 PM
to socialneuro780
So, this week we're going nuts over voodoo correlations in social
neuroscience. I found the readings particularly interesting (I focused
on the main Vul et al. article) because they helped me to understand
how fMRI research is carried out. I now understand what voxels are,
and find it important to emphasize that there are bazillions of them.
The primary concern that Vul et al. describe and highlight is
something they call the "nonindependence error". It took several
readings to understand what this is, and I don't claim to understand
all of its ins and outs, but I feel that it was put best
towards the end of the article: "Suppose an author reported that a
questionnaire measure was correlated with some target behavioral
measure at r = .85 and that he or she arrived at this number by
separately computing the correlation between the target measure and
each of the items on the questionnaire and reporting just the average
of the highest correlated questionnaire items." (p. 285)

I hope that all of us, who probably use questionnaires regularly,
immediately recognize that this is problematic. The only situation I
can think of in which such a procedure would be useful is purely
exploratory data analysis, i.e., not the main analyses. Furthermore,
even in that situation, we would most likely report the correlations
for all items, not only the high ones, and we probably would not
average across only the high ones. In fMRI research, the problem is
that this is done with voxels, the small gridded volumes into which a
brain image is divided. The aggregation problem in fMRI research is
that there are literally tens or hundreds of thousands of voxels in a
single fMRI image (not to mention the rather large number of images
taken, and then again the number of participants, who do not have
spatially identical images because of individual differences in brain
size, sulcus/gyrus locations, etc.). So, one can quickly see that
inflation of the Type I error (alpha) rate should be a major concern
in such research.
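
Just to make the scale of the multiple-comparisons worry concrete, here is a quick back-of-the-envelope check (my own illustration, not anything from the article): even if every single voxel were pure noise, testing each one at an uncorrected alpha of .05 would still flag thousands of them.

# rough arithmetic sketch; the 100,000 figure is the ballpark voxel
# count mentioned above, not a number taken from any particular study
n_voxels = 100_000        # hypothetical whole-brain voxel count
alpha = 0.05              # uncorrected per-voxel threshold
expected_false_positives = n_voxels * alpha
print(expected_false_positives)   # 5000.0 "significant" voxels by chance alone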

However, this is not even the core of the nonindependence error. In
the case of fMRI research, the core of the problem lies in how voxels
are selected for analysis. So, we have, let's say, 100,000 voxels
(aggregated across images and participants, or somehow otherwise
disregarding that level of the analysis). It is not at all practical
to analyze a correlation matrix with that many variables, so we need
to select a subset of them. Specific voxels or groups of voxels are
chosen in one of two ways: more erroneously, the ones exhibiting the
strongest relationship with some other variable (the IV or predictor)
are used, or they are selected based on anatomical locations that are
theoretically expected to be related to that variable. In either
case, these are then the only voxels reported and/or averaged across
for the reported effects. As Vul et al. said, when looking at some
100,000 voxels, this is like "selecting noise that exhibits the
effect being searched for" (p. 279). They're fishing!
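
To convince myself of the fishing point, I threw together a tiny simulation (my own sketch in Python, not code from Vul et al.): the behavioral score and every voxel are pure random noise, yet if you keep only the voxels that happen to correlate strongly with behavior and then average within that hand-picked set, you get an impressive-looking correlation out of nothing.

import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_voxels = 20, 10_000

behavior = rng.standard_normal(n_subjects)                 # no real signal
voxels = rng.standard_normal((n_subjects, n_voxels))       # no real signal

# Pearson correlation of each voxel with the behavioral measure
voxels_z = (voxels - voxels.mean(axis=0)) / voxels.std(axis=0)
behavior_z = (behavior - behavior.mean()) / behavior.std()
r = voxels_z.T @ behavior_z / n_subjects

# nonindependent step: keep only voxels that cleared a high threshold,
# then report the average correlation within that hand-picked set
selected = r[np.abs(r) > 0.6]
print(len(selected), np.abs(selected).mean())
# typically a few dozen voxels with mean |r| above .6 -- all of it noise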

Now, for the feedback part. I am glad that this has given me a better
understanding of fMRI research, but it has also put a big hole in my
trust in its results. It makes me somewhat glad that I am not deeply
familiar with such research and do not cite it in my own projects and
ideas, because now I would have to call it into question. I'll be
honest and say that I have not yet thoroughly reviewed the responses
to this paper, but I hope that those reading the replies get the
opportunity to explain to me what is wrong with these criticisms, or
to express their own lack of faith in the defenses put forth in the
replies.

beka strock

Feb 3, 2010, 9:28:42 PM
to socialn...@googlegroups.com

Discussion Paper Wk2:  Vul et al. Response Summary

Diener (2009) introduces the Vul, Harris, Winkielman, and Pashler (2009) original article by explaining the intent and practice of pre-releasing the proposed article on the internet and giving others a chance to respond prior to official publication in a journal.  He also addresses the questionable use of survey data from non-consenting individuals, as well as the article's perceived tone and accuracy, both of which were controversial aspects of the article.

Vul and colleagues (2009) responded to the criticisms of their initial argument by addressing the following: the categorization of others' procedures, the magnitude of correlation inflation, the attribution of the inflation, the interpretation of effect size, statistical error calculation methods, and the reporting of effect sizes.  Vul and colleagues (2009) reinforce their interpretation of others' methods by claiming that no disagreement was voiced by the surveyed authors, and they note that even dissenting authors agree that the correlations are inflated.  The authors defend their claim that the presentation and interpretation of such inflated statistics is commonly misused and misunderstood, giving additional examples.  The magnitude of the inflation error was emphasized again, and one reanalyzed dataset was cited as an example supporting their claim.  Vul and colleagues (2009) emphasize the impact an overestimated effect size can have on the interpretation of results, which they argue is the cause of frequent and egregious misinterpretations.

They also address questions about the calculation of average sample sizes, the chosen scope of the review, the relevance (or lack thereof) of missing correlations, the definition of replications, and theoretical range restrictions, and finally return to their main point of nonindependence.  They review standard sample sizes in previous studies, concluding that the actual numbers of subjects would cause even worse inflation than their estimated example simulations, due to their own reasoning as well as the additional concerns of commentator Yarkoni (2009) regarding increased inflation in small samples.  Vul and colleagues (2009) also dismiss critiques of scope and missing correlations as irrelevant, even going so far as to recalculate with the additions that had previously been omitted.  The authors dispute the examples of so-called replications that were used as evidence against them, due to the limited nature and narrow applicability of the replications cited.  Vul and colleagues (2009) also dismiss claims that they did not account for range restriction, and they reemphasize the low probability of the perfect reliability in measures that would be needed to obtain correlations above the "upper bounds" they propose.  Finally, they discuss the pervasiveness of what they termed the "nonindependence" problem in large-data-analysis situations in various other academic fields.

Camille Barnes

Feb 3, 2010, 9:46:39 PM
to socialneuro780
Camille: Social Neuro Response Week 2
The Lazar response article comes from the point of view of a
statistician. Its main point is that the statistical issues these
fMRI studies face are not a matter of their being fMRI studies, but
of the fact that there is so much data. The issues are actually a
result of these massive data sets. The statistics that we use were
not developed to handle such large amounts of data, so any study with
a massive data set faces similar statistical issues.
Lazar did report that the nonindependence error is a real problem,
and stated that it is not a new issue, but rather the same problem as
selection bias or the file drawer problem in meta-analysis. So
basically it is like selecting participants who score particularly
high on a selection criterion, or disproportionately using only what
happens to fall above a specific significance threshold in the
analysis. Lazar does not agree with the current methods of measuring
activation in brain regions of interest, and is currently working on
more objective ways to measure this so that selection biases are
avoided.
In conclusion, Lazar notes that Vul et al. may be a little harsh,
since it is not the intention of scientists to distort the data in
this manner, but agrees that it is important that better statistical
methods be developed for handling large data sets.

Jennifer Vosilla

Feb 4, 2010, 12:52:01 AM
to socialn...@googlegroups.com

Jen V.: Discussion Week #2

Alright, so I also had the main Vul article, "Puzzlingly High Correlations in fMRI…."  Basically, the authors strongly criticized correlations greater than .8 as being impossibly high, stating that even if the instruments used yielded perfect measurements, such high correlations would still be questionable.  To investigate this concern, they surveyed the authors of 55 articles (though only 53 answered), asking questions about the specific methodology used in their research.

Vul and colleagues focused on what they refer to as the "nonindependence error".  From what I understood, it depends on how the subset of voxels used in the overall correlation is selected.  My favorite example was actually the one using temperature readings at a weather station to predict changes in stocks, at the bottom of pages 278-279.  The nonindependence error would occur if, in such an example, the researchers computed separate correlations between the weather readings and each stock and then chose only the subset of stocks that yielded high correlations, averaging them to get the correlation of -.87.  As the authors say, "Of the 3,315 stocks assessed, some were sure to be correlated with the Adak Island temperature measurements simply by chance".  My understanding is that there are two ways to avoid the nonindependence error by changing how the subset of voxels is selected.  The authors suggest either selecting voxels without the behavioral data ever being in the room, so to speak, or selecting voxels by finding the desired correlations with the behavioral measure using only half of the data collected, so that the other half can be used to estimate the effect (see the sketch below).
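
Here is a rough sketch of that split-half idea as I understood it (my own toy code, not anything from the article): the first half of the subjects is used only to pick the voxels, and the untouched second half is used to estimate the correlation. With pure noise data, the held-out estimate stays near zero instead of being inflated.

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n_subjects, n_voxels = 40, 5_000
behavior = rng.standard_normal(n_subjects)                 # pure noise again
voxels = rng.standard_normal((n_subjects, n_voxels))

half = n_subjects // 2
train, test = slice(0, half), slice(half, None)

# selection step uses ONLY the first half of subjects
r_train = np.array([pearsonr(voxels[train, v], behavior[train])[0]
                    for v in range(n_voxels)])
chosen = np.argsort(np.abs(r_train))[-10:]                 # the 10 "best" voxels

# estimation step uses ONLY the held-out half
roi_signal = voxels[test][:, chosen].mean(axis=1)          # average chosen voxels
r_test, _ = pearsonr(roi_signal, behavior[test])
print(r_test)   # hovers near zero, as it should when there is no true effect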

            Overall, I thought the article was well-written and informative.  Before reading this article, I had no idea what a voxel was or how the correlations between brain activity and behavioral measures were found.  The authors may be harsh, but I believe they raise valid points about the methodological problems and the fact that in the publications reviewed, the authors failed to adequately explain their voxel selection process.

David Dinwiddie

Feb 4, 2010, 9:31:47 AM
to socialn...@googlegroups.com
For the discussion I am focusing on the commentary by Nichols and Poline. They criticized Vul et al. for overstating their point. The authors say that Vul's argument can be broken into two different issues, and each of these issues is already widely known in the neuroimaging community. The first is the multiple testing problem. The field has spent 20 years working on this problem and has finally come to a consensus on appropriate methods; in the accepted methods, false positives are controlled for. The reported t values are effectively maximum t-scores and should be interpreted as such. The second issue that Vul brings up is that neuroimaging studies have very confusing or even incomplete methods sections. The authors partly blame this confusion on a lack of understanding on the readers' part. Nichols and Poline believe that Vul argues for a focus on measuring effect size at an assumed, known location while throwing out their inferential methods completely. They believe this focus should be investigated, but that this does not mean the current methods should be discarded.

Lindsay Morton

Feb 4, 2010, 10:19:34 AM
to socialn...@googlegroups.com

The chapter by Johnston, Kim, and Whalen (2009) provided an overview of fMRI and related techniques.  It was presented in an easy-to-understand format, and combined with the Vul et al. (2009) article, I believe I now have a much better understanding of what exactly BOLD measurements and voxels are.  I think it's important to point out that fMRI is an indirect approximation of dendritic activity averaged over a particular brain region and that fMRI gives no information about the amount of neural activation.  Importantly, I think we all need to recognize that fMRI cannot be used to look at communication across brain areas (i.e., it can only examine localized changes in oxygenation) and that faster processes, such as perception, are better captured by EEG/ERP procedures.

The chapter also called for the use of complementary techniques, such as perfusion imaging/ASL and diffusion tensor imaging (DTI), in tandem with fMRI.  Although ASL is still being refined, it offers several improvements over fMRI, such as the ability to directly compare data collected from different participants and information collected over multiple experimental sessions.  DTI also provides researchers with a way to compare across brain regions, which, given the known interconnectivity of neural processing, seems like a necessary procedure.  After reading Vul et al. (2009) and the commentaries, I believe even more strongly that such methods are important.  As we found out, fMRI research is based on correlations, which means that we cannot make causal statements about which brain regions are responsible for specific processes.  I believe that the use of multiple techniques working together to answer fundamental social-affective questions will help us move forward in this field.


Switching gears slightly, I was assigned to focus on the Lieberman, Berkman, and Wager (2009) critique.  I think it is interesting to point out that, in the Vul, Harris, Winkielman, and Pashler (2009b) response to the commentaries, this was the one that was selectively slammed.  Basically, Lieberman et al. stated that the survey used by Vul et al. (2009a) did not allow researchers to accurately describe the statistical techniques they used.  As you read, Vul et al. originally criticized a large proportion of studies for selecting voxels based on high correlations with the personality measure of interest and then using those selected voxels for their final correlational analysis – which they defined as the "nonindependence error."  Lieberman and colleagues (2009) explained that the second step is a descriptive one, in which the selected voxels' correlations with the personality variable are shown in images.  No further correlations are computed on top of the first correlation, but the way the survey was worded made researchers select an option that made it seem as if they had performed subsequent nonindependent analyses.  In their response, Vul et al. (2009b) state that even if this is the case, researchers often make causal statements based on this "descriptive" procedure.  This point is a good one and goes back to what I highlighted earlier in my discussion – fMRI only involves correlations.  That is, it is invalid to make statements about which brain areas are responsible for psychological processes based on findings from fMRI studies.

In their commentary, Lieberman et al. (2009) also question the way in which the "meta-analysis" in the main article was performed.  Sadly, many of their critiques were based on an earlier version of the Vul et al. (2009a) article, as was pointed out in the rebuttal.  This makes it hard to determine the legitimacy of the comparisons between the independent and nonindependent correlations.  Basically, Lieberman et al. attempted to show that the strengths of correlations based on independent and nonindependent voxel selection were not significantly different.  This was their attempt to demonstrate that nonindependent correlations are not as grossly exaggerated as Vul et al. suppose.  For me, the important issue still lies in how voxel selection is used in the analysis, an idea that is reiterated in Vul et al. (2009b).

Finally, Lieberman et al. (2009) also point out that the "puzzlingly high correlations" are not actually unfounded.  Vul et al. (2009a) stated that, given typical reliabilities, correlations higher than around 0.74 should not be achievable.  They base this upper bound on a reliability of 0.8 for personality measures and 0.7 for fMRI (see the quick check below).  Thus, if reliabilities are higher, which is frequently the case, high correlations can indeed be found and are not statistically impossible.  At the same time, Lieberman et al. (2009) point out that even if the correlations are inflated and corrections showed that they are not as strong in the population, the truth of the matter is that the relationships between these personality and social psychological variables and activation in specific brain regions do exist.  After reading the back-and-forth between these researchers, the take-home point for me was the need to choose brain areas either a priori, on theoretical grounds, or based on previously collected data that are not reused in the main analyses.  The use of complementary techniques, as advocated by Johnston et al. (2009), would also help us draw more concrete conclusions from neurological research.  Overall, these changes in methodology would help to deflect criticism as well as to make social psychological fMRI studies more scientifically grounded.
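
For what it's worth, the ~.74 ceiling follows from the standard correction-for-attenuation formula; here is a quick check using the reliability values Vul et al. assume (values taken from the summary above, not independently verified):

# maximum observable correlation given imperfect reliability of both measures
rel_personality = 0.8          # assumed reliability of the personality measure
rel_fmri = 0.7                 # assumed reliability of the fMRI measure
max_observable_r = (rel_personality * rel_fmri) ** 0.5
print(round(max_observable_r, 2))   # sqrt(.56) ~ 0.75, i.e. the ~.74 ceiling cited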


Jenny Perella

Feb 4, 2010, 4:57:52 PM
to socialneuro780
Yarkoni, T. (2009). Big correlations in little studies: Inflated fMRI
correlations reflect low statistical power – commentary on Vul et al.
(2009)

Yarkoni argues that Vul's primary conclusion is correct (that r
values are inflated), but that this is because of a combination of
small sample sizes and strict alpha-correction levels, not
nonindependence. Yarkoni argues that the inflation is probably even
worse than Vul et al. suggest.

According to the author, the nonindependence error cannot be the
primary source of inflated r values; r values are inflated even when
we engage in "independent analysis", i.e., when the procedure used to
identify the ROIs is completely independent of the activation levels
observed within those ROIs. To demonstrate, Yarkoni proposes a
hypothetical study scenario in which the true population correlation
is .4 and N = 20. Each ROI-level test is conducted at an alpha level
of p < .005 (which is p < .05, corrected for 10 comparisons). A power
analysis shows that the probability of detecting a significant effect
in each ROI, if it is truly there, is 13%; that's only 1.3 out of 10
ROIs showing a significant effect in this hypothetical sample. In
addition, the average r value within significant ROIs cannot be .4;
the critical r value for N = 20 at p < .005 is .6. So even if the
true population effect size is large (.4), an fMRI study such as this
will inflate significant rs to at least .6. What's more, inflation
gets worse as the number of comparisons increases: 10,000 tests give
an average significant r of .69. Thus, extreme r inflation can occur
even when the analysis of fMRI data is independent.

If even correct (independent) analyses inflate r, what could be the
problem? Yarkoni argues that it is poor power. Yarkoni reminds the
reader that the power to detect a true effect of equal magnitude is
smaller for between-subjects than for within-subjects designs,
perhaps by 5-10%. While low power most commonly leads to Type II
errors (failure to detect a true effect), low power also inflates
significant effect sizes. fMRI studies of 30 or more participants are
not expected to result in massive significant r-value inflations.
(Example: a population effect of .3, with N = 20 and p < .001, will
on average show r = 0.73.)
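
Out of curiosity, I checked Yarkoni's numbers with a little script (my own sketch, so treat the exact values as approximate): it computes the smallest r that can reach significance with N = 20 at a two-tailed p < .005, and then simulates how much the significant rs overestimate a true population correlation of .4 at that threshold.

import numpy as np
from scipy.stats import t, pearsonr

n, alpha = 20, 0.005
t_crit = t.ppf(1 - alpha / 2, df=n - 2)
r_crit = t_crit / np.sqrt(t_crit**2 + (n - 2))
print(round(r_crit, 2))              # ~0.60, the critical r mentioned above

# simulate many studies with a true correlation of .4 and keep only the
# ones that reach significance at the corrected threshold
rng = np.random.default_rng(2)
rho = 0.4
cov = [[1.0, rho], [rho, 1.0]]
significant_rs = []
for _ in range(5_000):
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    r, p = pearsonr(x, y)
    if p < alpha:
        significant_rs.append(r)
print(len(significant_rs) / 5_000)   # power: roughly .13, as in the summary
print(np.mean(significant_rs))       # well above the true .4 -- the inflation
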
Aside from the statistical analyses, Yarkoni offers three arguments
for why brain activity and behavior aren't as highly correlated as is
often reported. First, it is simply not plausible that a complex
behavior (like empathy), measured in response to a not-very-reliable
task, could be explained by the activation of just one part of the
brain. Second, there is no good theoretical reason why we should tend
to see more localized effects in correlational (as opposed to within-
subjects) analyses; this pattern would, however, be predicted if
power were too low. Third, why do large rs tend to appear most often
in small-sample studies? Almost all of the studies Vul examined had
N <= 30, but (according to Yarkoni) no fMRI study with N >= 50
reports r > .8.

Yarkoni concludes by acknowledging that there is no way to know how
much of a problem low power is for fMRI studies, because we can't
know the true population effect sizes. However, the issue is probably
still a serious one. Most fMRI studies probably find only a small
portion of the true effects, greatly inflate these effect sizes, and
"promote a deceptive illusion of highly selective activation".

What to do? While using independent analyses will help reduce r
inflation, the best way is to increase power (by increasing sample
size). Yarkoni recognizes this is difficult because it is very
costly, but offers no better solution. He does suggest, however, that
all researchers should at least perform power analyses prior to any
study and report the results of these analyses in their papers.
Researchers should also include caveats about the interpretation of
correlations, report confidence intervals, and stress their probable
unreliability. For our part as readers, we should be skeptical of any
study that reports a localized relationship between brain activity
and behavior.

