Will Reproducibility Project unearth "an excess of significant findings"?

Roger Giner-Sorolla

May 21, 2012, 9:03:42 AM
to openscienc...@googlegroups.com
Recently, Gregory Francis has had at least two papers applying the methods of Ioannidis & Trikalinos to articles in psychology: one looking at Bem's 2011 JPSP precognition article

Francis, G. (2012). Too good to be true: Publication bias in two prominent studies from experimental psychology. Psychonomic Bulletin & Review, 19, 1–6.

and another looking at the closeness of desirable objects effect from Balcetis and Dunning

Francis, G. (2012). The same old New Look: Publication bias in a study of wishful seeing. i-Perception, 3(3), 176–178.

which elicited a reply and an exchange from the authors (linked from http://i-perception.perceptionweb.com/journal/I/volume/3/article/i0519ic)

In theory this could go on forever (a search reveals he has another one in press at JEP:General) and of course the "hit list" approach doesn't leave us with very firm grounds for discipline-wide generalizations about false-positive bias. My personal reaction is, why go after results one at a time if this so obviously reflects an endemic practice in the field?

Indeed, it occurred to me that we could do a lot better by looking at our sample of the psychology field from the Reproducibility Project. We would have to do power analyses for all studies in each article, of course, but we already have the framework in place. Although I have my own archival project I want to start this summer, I'm wondering if anyone could take on organizing this project, or at least give their opinion as to whether it's worth doing?

Roger Giner-Sorolla

May 21, 2012, 9:17:06 AM
to openscienc...@googlegroups.com
Oh, and some important context: Ioannidis & Trikalinos (2007)

Ioannidis, J. P. A., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials, 4(3), 245–253.

note that for a multi-study article, if all reported results are significant, one can calculate the likelihood that a complete report of all studies done would have produced a similar result, given the studies' power. For example, if 5 studies are run a priori, each at .80 power to detect the effect size eventually found, then even if the aggregate sample effect size is a true estimate of the population effect size, it is likely that at least one study will come out with p > .05, and there is only a ~33% chance that they will all be significant.
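
As a rough check on that arithmetic, here is a minimal Python sketch. It assumes the five studies are independent and that each really has .80 power; the function name is made up for illustration.

```python
def prob_all_significant(powers):
    """Probability that every study is significant, assuming independent studies."""
    prob = 1.0
    for power in powers:
        prob *= power
    return prob


# Five studies, each with .80 power (the example from the post).
powers = [0.80] * 5
p_all = prob_all_significant(powers)
p_at_least_one_miss = 1.0 - p_all

print(f"P(all 5 significant)        = {p_all:.3f}")
print(f"P(at least one p > .05)     = {p_at_least_one_miss:.3f}")
```

The product is 0.8^5, or about 0.33, so a nonsignificant result somewhere in the set is roughly twice as likely as a clean sweep.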

Joachim Vandekerckhove

May 21, 2012, 5:58:57 PM
to openscienc...@googlegroups.com
Hi Roger,

I haven't been active in this group, but I am doing exactly this right now. An RA is collecting all the relevant statistics now and we plan to write a report by the end of the summer. I'm also meeting Greg soon to discuss the implementation. 
I think it would be a worthwhile effort not only in order to provide an overview of observed power (post hoc) in the sample, but also to provide power-based advice to groups aiming to replicate the studies. I'm happy to communicate more on this if there is an interest.

Cheers,
Joachim

Roger Giner-Sorolla

May 22, 2012, 5:29:59 AM
to openscienc...@googlegroups.com
That's great, Joachim. Are you using the Reproducibility Project sample of articles specifically?

And some power-related questions: what are you doing for studies that report no effect size (all too common in Psych Science), and repeated measures studies that report no correlation among DVs, which is needed for ES of the repeated measures ANOVA effects?

Gregory Francis

May 22, 2012, 8:13:20 AM
to Open Science Framework
The main reason to apply the technique to individual findings is that
some scientists care about those individual findings. People who care
about these phenomena should know when the reported studies do not
provide proper evidence for the stated claims.

By the way, I have a letter that just appeared in PNAS

http://www.pnas.org/content/early/recent

The authors' reply strikes me as mostly nonsense, but you can judge
for yourself.

There is already pretty convincing evidence that this kind of bias
exists across the field, but this kind of general characterization
does not seem to have had much impact (not much has changed from
Sterling's observations in the 1950s). Perhaps a more "personal"
approach will be more effective.

By the way, I do not have a "hit list", nor do I think the authors of
the work I have criticized are behaving worse than most researchers.
The problems are difficult and systemic. Most of us are running and
reporting experimental findings incorrectly.

I think someone should apply the approach to the articles in the
Reproducibility Project. If the reported findings are unbelievable,
then I think there is no reason to do the replication (unless you
happen to care about the topic). It's more than one person can do on
their own, but I would be happy to help.

-Greg Francis

On May 21, 9:03 am, Roger Giner-Sorolla <rogersebast...@gmail.com>
wrote:
> Recently, Gregory Francis has had at least two papers applying the methods
> of Ioannidis & Trikalinos to articles in psychology: one looking at Bem's
> 2011 JPSP precognition article
>
> Francis, G. (2012). Too good to be true: Publication bias in two prominent
> studies from experimental psychology. *Psychonomic Bulletin & Review*, *19*,
> 1–6.
>
> and another looking at the closeness of desirable objects effect from
> Balcetis and Dunning
>
> Francis, G. (2012). The same old New Look: Publication bias in a study of
> wishful seeing. *i-Perception*, *3*(3), 176–178.
>
> which elicited a reply and an exchange from the authors (linked from http://i-perception.perceptionweb.com/journal/I/volume/3/article/i0519ic)

Brian Nosek

May 22, 2012, 12:34:43 PM
to openscienc...@googlegroups.com
I think we have a good opportunity to look not just at power, but many features of "standard practice" with our 2008 study sample.  Elizabeth Bartmess and a couple of others are finishing up a simple web form to help study coders complete and submit information about each of the studies.  This ought to help standardize the coding process and lower the bar to making contributions on that coding.  So, many people could make small contributions.

There is a possibility of expanding this coding project to facilitate power investigations (like what Joachim is starting) and others - estimating average effect size, sample size, types of study designs, distributions of p-values, appropriateness of statistical tests, conducting of conceptual and direct replications (the archival project idea that Roger is initiating), etc.  That is, if we do a very good job on a comprehensive coding of 2008 papers, then there might be many projects that could use that same dataset.

Current: The current coding project is focused on coding a few features of a single study from each 2008 article from three journals.  This coding is focused entirely on supporting the Reproducibility Project goals.  

Proposal: We can boost the power of all of the current archival projects by collectively amassing a large dataset of study characteristics.  And, the amassed dataset would allow additional OSC (or independent) investigations.  How about we formally separate the coding project from the Reproducibility Project and expand it to:

(a) coding every study in each article

(b) coding all major features of study design and reporting - sample size, effect size, statistical tests, replication or not, exclusion criteria, hypothesis supported or not, what key information reported and what is not, etc. (many possibilities here, we'd need to meet to discuss what is essential and how to improve coding from present approach)

(c) broaden the journal base so that there can be formal comparisons across subdisciplines - e.g., Journal of Abnormal Psychology, Developmental Psychology, Journal of Cognitive Neuroscience, even outside of psychology/neuroscience if there are folks with relevant expertise and interest in the OSC

Ruben Arslan

May 21, 2012, 1:56:05 PM
to openscienc...@googlegroups.com
Hey all,

I've only lurked on this list before. I greatly enjoy reading about the progress, so
maybe I can contribute a little work myself now. 

To me, this seems like an obvious target for crowdsourcing, because extracting the necessary 
coefficients and making the analytic decisions (how to pool,...) could be 
done quite easily by graduate students (maybe after some required reading). 
It could also be done redundantly, so that we wouldn't have to worry too much about the 
individuals' classifications.

I'd volunteer to do a simple web interface for entering the necessary information and possibly
sending out emails to resolve disagreement in the classifications, if the group
decides that it's a worthy effort. To me it seems interesting enough to offer to put in some work 
and it would be nice to do it without putting too much emphasis on individual actions in a flawed system  :-)

Of course it may be overkill to do it as I suggested, but if it makes things any easier, I'd gladly do it.

Best wishes,
Ruben



--  
Ruben Arslan
Student assistant
Lab: http://www.psychology.hu-berlin.de/profship/perdev
Humboldt-University of Berlin
Unter den Linden 6
10099 Berlin, Germany

Jesse Chandler

May 22, 2012, 5:47:26 PM
to Open Science Framework
Hi Greg,
I think it might actually make a stronger case if we try to reproduce
everything as planned, and then look at the relationship between power
and the probability of replication. I think we would find the
unsurprising result that underpowered findings do not replicate.

You might ask, "Well, why is this worth doing?" Two reasons. First,
part of this project was to look at the overall replicability of the
field, and including these low-powered studies seems like an
important part of this: the whole sampling strategy was designed around
picking a representative sample of what is published. Second, this
approach would allow us to address the argument that you seem to
encounter that "when one uses many different measures, cumulative
power tells us little." In principle, papers could be coded as a
series of direct or conceptual replications, and the probability of
successful replication for the selected experiment regressed on their
cumulative power. It may in fact be the case that it is a weaker
predictor for heterogeneous studies, but that is a different question
from whether this method overall is a useful heuristic to assess the
believability of results.

Joachim Vandekerckhove

May 22, 2012, 6:21:05 PM
to openscienc...@googlegroups.com
Roger:

Yes, the idea was to use the exact same sample -- it seems like a nicely unbiased sample from the literature.

Re: power: Those problems have yet to come up (since we're only just starting this project), but the strategy I have in mind is to take the following steps (in order): 1) try to use other reported statistics to compute the ES and power (often possible with just t statistics, or with F statistics if the means are reported as well); 2) contact authors to fill in blanks; 3) use numerical methods to integrate out the unknown parameter(s), and report best/worst cases. That last step would be basically a Bayesian inference step with a prior over the unknown(s). I'd also provide some code snippets that other people can use to supply their own priors. There are definitely cases where some unknown varying over a very reasonable range has only a marginal effect on the power estimate.
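
Steps 1 and 3 could look roughly like the Python sketch below. It is not the code promised in the post. It assumes a normal approximation to the test statistic, equal variances across the two repeated measures, and a flat prior over the unreported correlation r. All function names and input numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean

_STD = NormalDist()


def d_from_t(t, n1, n2):
    """Cohen's d for an independent-groups t test, from t and the group sizes."""
    return t * sqrt(1.0 / n1 + 1.0 / n2)


def approx_power(ncp, alpha=0.05):
    """Two-sided power, using a normal approximation to the test statistic."""
    z_crit = _STD.inv_cdf(1.0 - alpha / 2.0)
    return _STD.cdf(ncp - z_crit) + _STD.cdf(-ncp - z_crit)


def paired_power_over_r(d, n_pairs, r_values, alpha=0.05):
    """Power of a paired test for each candidate correlation r.

    d is the mean difference in raw-score SD units; the paired effect
    size is d_z = d / sqrt(2 * (1 - r)), assuming equal variances.
    """
    powers = []
    for r in r_values:
        d_z = d / sqrt(2.0 * (1.0 - r))
        powers.append(approx_power(d_z * sqrt(n_pairs), alpha))
    return powers


# Step 1 (hypothetical numbers): recover d from a reported t statistic.
d = d_from_t(t=2.5, n1=20, n2=20)
print(f"d recovered from t: {d:.3f}")

# Step 3 (hypothetical numbers): r was not reported, so average power
# over a flat prior on r in [0.3, 0.7] and report best and worst cases.
r_grid = [0.3 + 0.05 * i for i in range(9)]
powers = paired_power_over_r(d=0.4, n_pairs=30, r_values=r_grid)
print(f"worst case: {min(powers):.3f}")
print(f"best case:  {max(powers):.3f}")
print(f"prior mean: {mean(powers):.3f}")
```

Swapping in a different prior only means changing `r_grid` (or weighting the grid points), which is the kind of hook the post describes.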

Marcus Munafo

May 24, 2012, 4:48:27 AM
to openscienc...@googlegroups.com
The use of observed power may be problematic here, because there's no independent confirmation that the effect size is accurate (it may be over-estimated). Observed power is simply another way of representing the information contained in the observed effect size, sample size and p-value for a given study.

Ioannidis and Trikalinos use the effect size from a meta analysis to anchor their power calculation for individual studies, not the observed power for each study, on the assumption that the meta analysis effect size estimate will be a more accurate estimate of any true population effect.
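
To make that concrete, here is a minimal Python sketch of one possible variant of the excess-significance idea: anchor each study's power on a pooled effect size, sum the powers to get the expected number of significant results, and compare with the observed count using a binomial tail. The two-group design, the normal approximation, the use of mean power in the binomial, and every number in the example are assumptions for illustration, not data from any real literature.

```python
from math import comb, sqrt
from statistics import NormalDist

_STD = NormalDist()


def power_two_group(d, n_per_group, alpha=0.05):
    """Approximate two-sided power of a two-group comparison (normal approx.)."""
    ncp = d * sqrt(n_per_group / 2.0)
    z_crit = _STD.inv_cdf(1.0 - alpha / 2.0)
    return _STD.cdf(ncp - z_crit) + _STD.cdf(-ncp - z_crit)


def excess_significance(pooled_d, group_sizes, n_significant, alpha=0.05):
    """Return (expected significant count, binomial tail probability).

    The tail is P(X >= n_significant) for X ~ Binomial(n, mean power),
    a simplification that treats all studies as having the mean power.
    """
    powers = [power_two_group(pooled_d, n, alpha) for n in group_sizes]
    n = len(powers)
    expected = sum(powers)
    mean_power = expected / n
    p_tail = sum(
        comb(n, k) * mean_power**k * (1.0 - mean_power) ** (n - k)
        for k in range(n_significant, n + 1)
    )
    return expected, p_tail


# Hypothetical example: six studies, all reported as significant.
expected, p_tail = excess_significance(
    pooled_d=0.4, group_sizes=[20, 25, 30, 22, 28, 24], n_significant=6
)
print(f"expected significant: {expected:.2f} of 6")
print(f"P(6 or more significant) = {p_tail:.4f}")
```

A small tail probability says the reported run of significant results is unlikely if the pooled effect size is right and nothing was left out.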

In fact, in the presence of publication bias when an effect is not in fact real, one would expect large studies to be the only ones showing null results to actually get published. All the small studies showing null (or opposite to predicted) results would be censored. This is the rationale of funnel plot methods.

So if publication bias is operating within a field we would probably expect the studies with the *greatest* power to detect a given effect to be the ones likely to show no effect. This would not be reflected in an observed power calculation. An illustration of that can be found in this recent study:

http://www.ncbi.nlm.nih.gov/pubmed/22488255

The Viviani study has 68% power to detect the effect size estimate indicated by the meta-analysis, but only 5% observed power when the effect size estimate from the Viviani study itself is used (because the effect size observed in that study was almost zero, and the p-value 0.99).

The fact that the only adequately powered study in this literature reported no effect (i.e., failed to replicate) is very important. Studies which report no effect will have the lowest observed power by definition, but it's often exactly these studies which tell us that something untoward is going on.
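
The tie between the p-value and observed power can be sketched in a few lines of Python. This assumes a two-sided z test at alpha = .05 (a normal approximation, not the exact test used in any particular study), and the function name is hypothetical.

```python
from statistics import NormalDist

_STD = NormalDist()


def observed_power_from_p(p_value, alpha=0.05):
    """Observed (post hoc) power implied by a two-sided p-value, z-test approx."""
    z_obs = _STD.inv_cdf(1.0 - p_value / 2.0)   # |z| implied by the p-value
    z_crit = _STD.inv_cdf(1.0 - alpha / 2.0)    # critical value for alpha
    return _STD.cdf(z_obs - z_crit) + _STD.cdf(-z_obs - z_crit)


for p in (0.99, 0.50, 0.05, 0.001):
    print(f"p = {p:<5}  observed power = {observed_power_from_p(p):.3f}")
```

Under this approximation a p-value of 0.99 maps to observed power of about 5%, and p = .05 maps to about 50%, which is why observed power carries no information beyond the p-value itself.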

Marcus.
-- 

Marcus Munafò
Professor of Biological Psychology
School of Experimental Psychology
University of Bristol
12a Priory Road
BRISTOL BS8 1TU
United Kingdom

+44.117.9546841	t.
+44.117.9288588	f.

marcus...@bristol.ac.uk

http://www.bris.ac.uk/expsych/people/academic/marcusmunafo.html
http://www.bris.ac.uk/expsych/research/brain/targ/

Gregory Francis

May 25, 2012, 10:57:34 AM
to Open Science Framework
I generally agree with Marcus' descriptions of the relative
plausibility of finding evidence for publication bias based on
observed power versus power based on a pooling of effect sizes. If the
experiments are precise replications, then pooling the effect sizes is
definitely the way to go. However, I think there is still value to
using observed power, as long as one is careful.

For small experiment sets, the observed power analysis gives a huge
benefit of the doubt to the experiments, by supposing that the
reported effect size is valid (this is also true for the pooled effect
size analysis). If there is a bias, the reported effect size probably
grossly overestimates the true effect size. The result is that the
observed power is also an overestimate of true power. What this means
is that the analysis will miss many cases where bias does exist. The
test is very conservative.

There is a different concern when using observed power (I discussed
this a bit in my rebuttal to the Piff reply; see my earlier comment
for the link). If there is no bias, and true power is bigger than 0.5,
then observed power tends to underestimate true power. This means that
a straight application of the approach will report bias where it does
not exist, and this problem gets excessively large as the number of
experiments under consideration increases. It is only a serious
problem for certain methods of choosing the sample size, but even a
worst case scenario (for the power analysis making a false positive
declaration of bias) needs to be considered. This is why I've always
run simulations to verify that the analysis was not likely to produce
a false positive for the experiments under consideration. For larger
experiment sets, one would need to do some kind of correction to try
to compensate for the worst case bias (I've not yet worked out exactly
how to do this).
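
A small simulation can illustrate the underestimation point. This Python sketch is not the simulation code referred to above. It assumes a z test with known variance, no selection of any kind, and a true power of .80; the names and settings are made up for illustration.

```python
import random
from statistics import NormalDist, mean

_STD = NormalDist()
Z_CRIT = _STD.inv_cdf(0.975)  # two-sided alpha = .05


def power_from_ncp(ncp):
    """Two-sided power of a z test with the given noncentrality."""
    return _STD.cdf(ncp - Z_CRIT) + _STD.cdf(-ncp - Z_CRIT)


def simulate_mean_observed_power(true_power, n_sims=20000, seed=1):
    """Average observed power across simulated unbiased experiments."""
    rng = random.Random(seed)
    # Noncentrality that yields the requested true power
    # (ignoring the negligible opposite tail).
    ncp = Z_CRIT + _STD.inv_cdf(true_power)
    # Each experiment's observed z is the true noncentrality plus unit noise;
    # observed power plugs that observed z back into the power function.
    draws = [power_from_ncp(rng.gauss(ncp, 1.0)) for _ in range(n_sims)]
    return mean(draws)


true_power = 0.80
est = simulate_mean_observed_power(true_power)
print(f"true power:          {true_power:.3f}")
print(f"mean observed power: {est:.3f}")

# The shortfall compounds when powers are multiplied across a set of k studies.
for k in (2, 5, 10):
    print(f"k = {k:2d}: true product = {true_power**k:.3f}, "
          f"expected observed product = {est**k:.3f}")
```

In this simplified setting the mean observed power falls below the true power, and the gap between the two products widens as k grows, which is the false-positive risk for large experiment sets.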

-Greg

Gregory Francis

May 25, 2012, 11:23:24 AM
to Open Science Framework
Jesse,

Certainly replication attempts will add some valuable data. I have to
confess that I am not clear exactly what outcome this project hopes to
reveal. I've gleaned a few possibilities, but maybe I misunderstand
some things.

1) Show that many reported phenomena in psychology do not replicate.
I'm pretty sure the project will be successful at this task. Indeed,
if it was not, I would charge bias in the replication attempts. Given
the power values of the original findings, a lot of experiments should
not replicate, regardless of whether the effect is real or not.

2) Show that across the field too many experiments do not replicate
(this seems to reflect the stated project goals). Establishing the
predicted number of replications would seem to require something like
the power analysis I have done, and this might be worthwhile. On the
other hand, I think the result will not be telling us anything new.
Sterling (1955) and Sterling et al. (1994) make a pretty good case for
this already. The problem is that those analyses do not indicate which
experimental findings are biased and which are not. There's plausible
denial for everyone, even though clearly a lot of findings must be
biased.

3) Show that a particular result does not replicate. This is not one
of the stated goals, and the project seems unsuited to do this
effectively. Bem's quote in the recent Nature article by Yong was
correct: a single failure to replicate is unlikely to settle the issue
about whether a finding is real or not. Indeed, a failure to replicate
can sometimes make a finding more believable (by avoiding the
appearance of bias). If the project motivates people to think about
statistics this way, then that would be a very good thing. However, I
suspect the initial reaction will be finger-pointing, accusations, and
denial.

I really do see some benefits to the reproducibility project, but I
fear that the findings will be misunderstood. On the other hand, given
the interest, there is a lot of opportunity for education.

Good luck,

-Greg