On p-hacking


Fiore, Steve

Oct 18, 2017, 9:18:06 PM
to osi20...@googlegroups.com

Hi Glenn - it is not just the social sciences, but they are getting the most attention because they are taking it head-on (like the folks heading up the Center for Open Science).  Below is a thread from the SciSIP listserv where the broader topic of retractions was being discussed.  There are many contributing factors to failures of replication and reproducibility, and p-hacking is just one of them.  John Ioannidis has done a lot of great work unpacking these kinds of problems across science. His paper, published over a decade ago, kind of woke everyone up to the varied problems (see "Why Most Published Research Findings Are False" -- http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124). He started an entire center at Stanford to study this problem (https://metrics.stanford.edu/about-us/bio/john-ioannidis).  Anyway, you can start at the bottom and read up to get a sense of how much attention this is now getting across all of science. This email thread is from over a year ago, and this area of inquiry has only grown since then.


Best,

Steve


--------

Stephen M. Fiore, Ph.D.

Professor, Cognitive Sciences, Department of Philosophy (philosophy.cah.ucf.edu/staff.php?id=134)

Director, Cognitive Sciences Laboratory, Institute for Simulation & Training (http://csl.ist.ucf.edu/)

University of Central Florida

sfi...@ist.ucf.edu




From: Fiore, Steve
Sent: Thursday, June 16, 2016 4:55 PM
To: SCI...@LISTSERV.NSF.GOV
Subject: Re: [scisip] Retracted paper resources?
 

Hi Richard - here is a subset of the various articles, blogs, etc., related to this topic of retraction, replication, and reproducibility that I pulled from emails/discussions I've had with colleagues.  So this is about more than just retraction - it includes issues associated with the 'why' (e.g., p-hacking).  Your student should do a reverse citation search (with Google Scholar) on the first three to see what kind of influence they had (e.g., the article "Retracted Science and the Retraction Index" has been cited 125 times since 2011, so a few of those should be relevant).


Best,

Steve Fiore



The Continued Use of Retracted, Invalid Scientific Literature (1990)

http://jama.jamanetwork.com/article.aspx?articleid=380976


How many scientific papers should be retracted? (2007)

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1866214/

 

Retracted Science and the Retraction Index (2011)

http://iai.asm.org/content/79/10/3855.full

 

Why high-profile journals have more retractions

http://www.nature.com/news/why-high-profile-journals-have-more-retractions-1.15951

 

Retraction policies of top scientific journals ranked by impact factor

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4511053/

 

A Comprehensive Survey of Retracted Articles from the Scholarly Literature

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0044118


----------------------------------- 

Preregistration of clinical trials causes medicines to stop working!

http://chrisblattman.com/2016/03/01/13719/

 

Likelihood of Null Effects of Large NHLBI Clinical Trials Has Increased over Time

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0132382

 

NIH plans to enhance reproducibility

http://www.nature.com/news/policy-nih-plans-to-enhance-reproducibility-1.14586

 

A call for transparent reporting to optimize the predictive value of preclinical research

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3511845/?report=classic

 

Instead of “playing the game” it is time to change the rules: Registered Reports at AIMS Neuroscience and beyond

http://www.aimspress.com/article/10.3934/Neuroscience.2014.1.4/fulltext.html


----------------------------------- 

Significance chasing in research practice: causes, consequences and possible solutions.

http://europepmc.org/abstract/med/25040652

 

The Extent and Consequences of P-Hacking in Science [NOTE: Figure 3. Evidence for p-hacking across scientific disciplines]

http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106

 

Reanalyzing Head et al. (2015): No widespread p-hacking after all?

https://www.authorea.com/users/2013/articles/31568

 

Tests for evidential value and p-hacking across disciplines, using p-values obtained from the Abstract.

https://figshare.com/articles/_Tests_for_evidential_value_and_p_hacking_across_disciplines_using_p_values_obtained_from_the_Abstract_/1335039

 

High Impact = High Statistical Standards? Not Necessarily So

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0056180


The N-Pact Factor: Evaluating the Quality of Empirical Journals with Respect to Sample Size and Statistical Power

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0109019


Deep impact: unintended consequences of journal rank

http://journal.frontiersin.org/article/10.3389/fnhum.2013.00291/full

 

High-impact journals: where newsworthiness trumps methodology

http://blogs.lse.ac.uk/impactofsocialsciences/2013/03/15/high-impact-journals-where-newsworthiness-trumps-methodology/


-----------------------------------

NSF Gets an Earful about Replication

http://funderstorms.wordpress.com/2014/02/25/nsf-gets-an-earful-about-replication


The replication crisis has engulfed economics

https://theconversation.com/the-replication-crisis-has-engulfed-economics-49202

 

Internal conceptual replications do not increase independent replication success

http://link.springer.com/article/10.3758/s13423-016-1030-9

 

Selection Bias, Vote Counting, and Money-Priming Effects: A Comment on Rohrer, Pashler, and Harris (2015) and Vohs (2015)

http://psycnet.apa.org/journals/xge/145/5/655.html

 

Debunked Science: Studies Take Heat In 2011

http://www.npr.org/2011/12/29/144431640/debunked-science-studies-take-heat-in-2011

 

 





From: Science of Science Policy Listserv <SCI...@LISTSERV.NSF.GOV> on behalf of Susan Fitzpatrick <su...@JSMF.ORG>
Sent: Thursday, June 16, 2016 3:41 PM
To: SCI...@LISTSERV.NSF.GOV
Subject: Re: [scisip] Retracted paper resources?
 

Excellent lead – I was going to suggest looking at Ioannidis, as he was and is quite active in this area… as are his colleagues. I had also heard some rumblings that disease advocacy organizations had an interest in this topic (directed funding to encourage attempts at reproducibility), but I cannot recall seeing any formal announcements of collections, databases, or reports. It will be a good project, I suspect – although time might be eaten up with legwork.

 

Susan M. Fitzpatrick, Ph.D.

President, James S. McDonnell Foundation

Visit JSMF forum on academic issues: www.jsmf.org/clothing-the-emperor

SMF blog  www.scientificphilanthropy.com  

smf

 

 

From: Science of Science Policy Listserv [mailto:SCI...@LISTSERV.NSF.GOV] On Behalf Of Brooke Struck
Sent: Thursday, June 16, 2016 1:53 PM
To: SCI...@LISTSERV.NSF.GOV
Subject: Re: [scisip] Retracted paper resources?

 

Hi all,

 

This article is a bit old, and received (still receives) considerable press coverage, so I feel a bit foolish mentioning it. However, in the spirit that what seems obvious is probably in the eyes of the beholder, this article by John Ioannidis is a valuable addition to the reproducibility discussion: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1182327/


I heard him speak in Montreal a few years ago, at which point he mentioned that he was working on a formal tool to assess reproducibility. Anybody know whether that tool ever came to fruition? Josh, such a tool might also help to put some more meat on the bones of your burgeoning R-factor indicator.

 

Best wishes,

Brooke


 

Brooke Struck, Ph.D.

Policy Analyst | Analyste de la politique

Science-Metrix

1335, Mont-Royal E

Montréal, QC  H2J 1Y6 - Canada

 

T. 1.514.495.6505 x.117

T. 1.800.994.4761 x.117

F. 1.514.495.6523

E-mail: brooke...@science-metrix.com

Web:    science-metrix.com

 

Please consider the environment before printing this email.
SVP, pensez à l’environnement avant d’imprimer ce message.

 

From: Science of Science Policy Listserv [mailto:SCI...@LISTSERV.NSF.GOV] On Behalf Of Josh Nicholson
Sent: June 16, 2016 2:24 PM
To: SCI...@LISTSERV.NSF.GOV
Subject: Re: [scisip] Retracted paper resources?

 

Hi Richard,

 

Sounds like an interesting project.  I have some answers that may help, though not with everything.

 

You can find retracted/withdrawn papers by searching PubMed, but I don't think you can tell whether they are voluntary (there are roughly 2,000, I believe).  Even looking at the notices may not be sufficient, as journals typically muddy these to hide their “mistakes.” I would look at the work by Ferric Fang and Arturo Casadevall on this.  They have done a lot of great analyses, which should serve as a good background and starting point.

 

As to identifying work that may be reproducible: I co-authored a piece with Yuri Lazebnik that describes the R-factor, which proposes to identify reproducible/robust work through citation analysis (https://thewinnower.com/papers/1-the-r-factor-a-measure-of-scientific-veracity).  I am working hard on actually bringing this to life!  We have built a corpus of work but now need to automate the process.  I am happy to elaborate more on this if you're interested.
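For a concrete sense of the kind of citation tally involved, here is a minimal Python sketch. It is only an illustration of the general idea (counting citing papers that confirm versus refute a result); the verdict labels, field names, and scoring rule are assumptions made here for illustration, not the published R-factor definition or The Winnower's data model.

```python
# Illustrative sketch only: a simple citation-tally score in the spirit of a
# citation-based reproducibility metric. The "confirms"/"refutes"/"mentions"
# labels are hypothetical, not the published R-factor specification.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CitingPaper:
    title: str
    verdict: str  # "confirms", "refutes", or "mentions" (assumed labels)

def citation_score(citing: List[CitingPaper]) -> Optional[float]:
    """Fraction of substantive (confirming or refuting) citations that confirm."""
    confirms = sum(p.verdict == "confirms" for p in citing)
    refutes = sum(p.verdict == "refutes" for p in citing)
    tested = confirms + refutes
    return confirms / tested if tested else None  # None: never independently tested

print(citation_score([
    CitingPaper("Replication study A", "confirms"),
    CitingPaper("Replication study B", "refutes"),
    CitingPaper("Review article", "mentions"),
]))  # 0.5
```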

 

Last, you may be interested in some backstories behind retracted work by authors themselves. See the Chaff: https://thewinnower.com/discussions

 

Hope that is helpful!

 

Best,

Josh

 

 

On Jun 16, 2016, at 2:04 PM, Conroy, Richard (NIH/NIBIB) [E] <Richard...@NIH.GOV> wrote:

 

Hello everyone – I have an intern working with me over the summer on a project looking at research reproducibility, prompted by the recent interest in this topic, such as Monya Baker’s May article in Nature that Steve Fiore mentioned last month.

 

One aspect of the project is to look at papers that have been withdrawn. From our preliminary look at what has been done previously, there does not appear to be a consistent way to identify papers that have been withdrawn. The folks at Retraction Watch had mentioned constructing a database last year, though we could find nothing more recent. Are people aware of any large-scale analyses of papers that have been withdrawn/retracted, or of datasets that have been constructed? We have a number of questions in this area (e.g., is it possible to distinguish “voluntary” vs. “for cause” withdrawal?) and would appreciate any leads that people have.

 

We are also interested in the following questions and would appreciate any suggestions off-line of resources or people to talk with:

· Have there been any large-scale international comparisons of research reproducibility?

· Have there been any proposed models for identifying research that may not be reproducible?

· Have there been any studies following up on the career consequences of people (beyond the key figure) associated with research that hasn’t been reproduced?

 

Thanks

Richard

 

 

########################################################################

To send to the list, address your message to: SCI...@listserv.nsf.gov

To subscribe to the list: send the text “subscribe SCISIP” to list...@listserv.nsf.gov

To unsubscribe: send the text “unsubscribe SCISIP” to list...@listserv.nsf.gov

 


Glenn Hampson

Oct 18, 2017, 10:17:29 PM
to osi20...@googlegroups.com, rsc...@googlegroups.com

Thanks Steve, Jack (and cc’ing RScomm on my reply),

 

There are some interesting articles in this thread, Steve---thanks. There also seems to be some conflation here between replicability and retraction—kind of apples and oranges (just because a study isn’t replicable doesn’t mean it gets retracted). Replication is a more systemic issue, right? We’re either designing and evaluating research to high standards or we aren’t, as Ioannidis describes in his seminal paper.

 

WRT the retraction crisis, I don’t necessarily agree with Fang’s conclusions, as I wrote in this piece a few years ago: http://nationalscience.org/nsci-focus-areas/science-writing/2013/jumping-off-the-retractions-bandwagon/. But maybe that conclusion isn’t sound anymore, given what we now know about replicability and the ongoing work of Retraction Watch, which has helped to daylight the fuller extent of this issue (i.e., maybe I need to retract my blog post 😊).

 

So, given all the concerns about replicability and the proposals (from NIH and others) to address them, is anything actually happening yet, and where/how? That is, are new study design standards being created, are key studies being redone, how is science adapting, and so on?

 

Cheers,

 

Glenn

 

 

Glenn Hampson
Executive Director
Science Communication Institute (SCI)
Program Director
Open Scholarship Initiative (OSI)


2320 N 137th Street | Seattle, WA 98133
(206) 417-3607 | gham...@nationalscience.org | nationalscience.org

--
As a public and publicly-funded effort, the conversations on this list can be viewed by the public and are archived. To read this group's complete listserv policy (including disclaimer and reuse information), please visit http://osinitiative.org/osi-listservs.
---
You received this message because you are subscribed to the Google Groups "The Open Scholarship Initiative" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osi2016-25+...@googlegroups.com.
To post to this group, send email to osi20...@googlegroups.com.
Visit this group at https://groups.google.com/group/osi2016-25.
For more options, visit https://groups.google.com/d/optout.


Schultz, Jack C.

Oct 19, 2017, 8:12:00 AM
to Glenn Hampson, osi20...@googlegroups.com, rsc...@googlegroups.com
Here’s another take on this, which appeared today:
A statistical fix for the replication crisis in science

Many scientific studies aren’t holding up in further tests. 

Author: Valen E. Johnson, University Distinguished Professor and Department Head of Statistics, Texas A&M University

Disclosure statement: Valen E. Johnson does not work for, consult, own shares in or receive funding from any company or organization that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.


In a trial of a new drug to cure cancer, 44 percent of 50 patients achieved remission after treatment. Without the drug, only 32 percent of previous patients did the same. The new treatment sounds promising, but is it better than the standard?

That question is difficult, so statisticians tend to answer a different question. They look at their results and compute something called a p-value. If the p-value is less than 0.05, the results are “statistically significant” – in other words, unlikely to be caused by just random chance.

The problem is, many statistically significant results aren’t replicating. A treatment that shows promise in one trial doesn’t show any benefit at all when given to the next group of patients. This problem has become so severe that one psychology journal actually banned p-values altogether. 

My colleagues and I have studied this problem, and we think we know what’s causing it. The bar for claiming statistical significance is simply too low. 

Most hypotheses are false

The Open Science Collaboration, a nonprofit organization focused on scientific research, tried to replicate 100 published psychology experiments. While 97 of the initial experiments reported statistically significant findings, only 36 of the replicated studies did. 

Several graduate students and I used these data to estimate the probability that a randomly chosen psychology experiment tested a real effect. We found that only about 7 percent did. In a similar study, economist Anna Dreber and colleagues estimated that only 9 percent of experiments would replicate. 

Both analyses suggest that only about one in 13 new experimental treatments in psychology – and probably many other social sciences – will turn out to be a success. 

This has important implications when interpreting p-values, particularly when they’re close to 0.05. 

The Bayes factor

P-values close to 0.05 are more likely to be due to random chance than most people realize.

To understand the problem, let’s return to our imaginary drug trial. Remember, 22 out of 50 patients on the new drug went into remission, compared to an average of just 16 out of 50 patients on the old treatment. 

The probability of seeing 22 or more successes out of 50 is 0.05 if the new drug is no better than the old. That means the p-value for this experiment is statistically significant. But we want to know whether the new treatment is really an improvement, or if it’s no better than the old way of doing things.
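For readers who want to check that figure, here is a minimal sketch of the calculation in Python (assuming SciPy is available). It simply evaluates the one-sided binomial tail probability for the trial described above, taking the old treatment's 32 percent remission rate as the null success probability.

```python
# Minimal check of the trial's p-value under a binomial model, with the
# old treatment's 32% remission rate as the null success probability.
from scipy.stats import binom

n, k = 50, 22          # patients in the trial, observed remissions
p_null = 0.32          # historical remission rate under the old treatment

# One-sided p-value: probability of 22 or more remissions if the drug is no better
p_value = binom.sf(k - 1, n, p_null)
print(round(p_value, 3))  # roughly 0.05, matching the article's figure
```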

To find out, we need to combine the information contained in the data with the information available before the experiment was conducted, or the “prior odds.” The prior odds reflect factors that are not directly measured in the study. For instance, they might account for the fact that in 10 other trials of similar drugs, none proved to be successful.

If the new drug isn’t any better than the old drug, then statistics tells us that the probability of seeing exactly 22 out of 50 successes in this trial is 0.0235 – relatively low. 

What if the new drug actually is better? We don’t actually know the success rate of the new drug, but a good guess is that it’s close to the observed success rate, 22 out of 50. If we assume that, then the probability of observing exactly 22 out of 50 successes is 0.113 – about five times more likely. (Not nearly 20 times more likely, though, as you might guess if you knew the p-value from the experiment was 0.05.)

This ratio of the probabilities is called the Bayes factor. We can use Bayes theorem to combine the Bayes factor with the prior odds to compute the probability that the new treatment is better. 

[Figure] What’s the probability of observing success in 50 trials? The black curve represents probabilities under the ‘null hypothesis,’ when the new treatment is no better than the old. The red curve represents probabilities when the new treatment is better. The shaded area represents the p-value. In this case, the ratio of the probabilities assigned to 22 successes is A divided by B, or 0.21. (Valen Johnson, CC BY-SA)

For the sake of argument, let’s suppose that only 1 in 13 experimental cancer treatments will turn out to be a success. That’s close to the value we estimated for the psychology experiments.

When we combine these prior odds with the Bayes factor, it turns out that the probability the new treatment is no better than the old is at least 0.71. But the statistically significant p-value of 0.05 suggests exactly the opposite!
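As a sanity check on these numbers, here is a short Python sketch of the same arithmetic (again assuming SciPy). The point alternative at the observed rate of 22/50 follows the article's own simplification; a full Bayesian analysis would instead average over a prior on the unknown success rate.

```python
# Sketch of the Bayes-factor arithmetic behind the "at least 0.71" figure,
# using the article's numbers and its simplifying point-alternative assumption.
from scipy.stats import binom

n, k = 50, 22
lik_null = binom.pmf(k, n, 0.32)   # ~0.0235: exactly 22 remissions if the drug is no better
lik_alt  = binom.pmf(k, n, k / n)  # ~0.113: exactly 22 remissions if the true rate is 44%

bayes_factor   = lik_alt / lik_null   # ~4.8 in favor of the new treatment
prior_odds     = 1 / 12               # "1 in 13" treatments work, i.e. odds of 1 to 12
posterior_odds = prior_odds * bayes_factor
prob_no_better = 1 / (1 + posterior_odds)

print(round(bayes_factor, 1), round(prob_no_better, 2))  # ~4.8 and ~0.71
```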

A new approach

This inconsistency is typical of many scientific studies. It’s particularly common for p-values around 0.05. This explains why such a high proportion of statistically significant results do not replicate. 

So how should we evaluate initial claims of a scientific discovery? In September, my colleagues and I proposed a new idea: Only P-values less than 0.005 should be considered statistically significant. P-values between 0.005 and 0.05 should merely be called suggestive.

In our proposal, statistically significant results are more likely to replicate, even after accounting for the small prior odds that typically pertain to studies in the social, biological and medical sciences. 

What’s more, we think that statistical significance should not serve as a bright-line threshold for publication. Statistically suggestive results – or even results that are largely inconclusive – might also be published, based on whether or not they reported important preliminary evidence regarding the possibility that a new theory might be true. 

On Oct. 11, we presented this idea to a group of statisticians at the ASA Symposium on Statistical Inference in Bethesda, Maryland. Our goal in changing the definition of statistical significance is to restore the intended meaning of this term: that data have provided substantial support for a scientific discovery or treatment effect.

Criticisms of our idea

Not everyone agrees with our proposal, including another group of scientists led by psychologist Daniel Lakens. 

They argue that the definition of Bayes factors is too subjective, and that researchers can make other assumptions that might change their conclusions. In the clinical trial, for example, Lakens might argue that researchers could report the three-month rather than six-month remission rate, if it provided stronger evidence in favor of the new drug. 

Lakens and his group also feel that the estimate that only about one in 13 experiments will replicate is too low. They point out that this estimate does not include effects like p-hacking, a term for when researchers repeatedly analyze their data until they find a statistically significant p-value. 

Instead of raising the bar for statistical significance, the Lakens group thinks that researchers should set and justify their own level of statistical significance before they conduct their experiments.

I disagree with many of the Lakens group’s claims – and, from a purely practical perspective, I feel that their proposal is a nonstarter. Most scientific journals don’t provide a mechanism for researchers to record and justify their choice of p-values before they conduct experiments. More importantly, allowing researchers to set their own evidence thresholds doesn’t seem like a good way to improve the reproducibility of scientific research. 

Lakens’s proposal would only work if journal editors and funding agencies agreed in advance to publish reports of experiments before they have been conducted, based on criteria that the scientists themselves have set. I think this is unlikely to happen anytime in the near future.

Until it does, I recommend that you not trust claims from scientific studies based on p-values near 0.05. Insist on a higher standard.








David Wojick

Oct 19, 2017, 8:41:30 AM
to osi20...@googlegroups.com
I am curious why people think that psychological results should be replicable. This seems to assume that humans are simple systems with no internal variation, which is wildly false. If we think of a psych experiment as a statistical sample of a highly variable population, then sampling theory tells us that getting the same result twice is extremely unlikely. And this is just what we are seeing. In short, people are highly variable.

David

Schultz, Jack C.

Oct 19, 2017, 8:48:22 AM
to David Wojick, osi20...@googlegroups.com
Agreed. But then the question boils down to “what should we believe?”  This problem is particularly acute in psychology for the reasons you cite. But that just makes it more difficult to answer any question with confidence.  Increasing sample sizes would help, but even that might not overcome the variance problem, and would cost a lot more.  

So, if you’re pretty sure that a one-off experiment only captures a fraction of variable human behavior, why should you believe any conclusion or generalization about human behavior?


Jack Schultz


Jack C. Schultz

Sr. Executive Director of Research Development

University of Toledo

573-489-8753

@jackcschultz

https://schultzappel.wordpress.com


David Wojick

Oct 19, 2017, 11:02:25 AM
to osi20...@googlegroups.com
I see no need to believe hypotheses about human behavior that have not been well tested. This is how psych morphs into folk wisdom.

One approach would be to do much more systematic testing of far fewer hypotheses. In particular, it is not a problem of sample size, which can be factored in, but rather that these are typically just convenience samples. Statistical sampling theory is based on probability theory, so it is fundamental that the samples be drawn randomly from the population. No valid conclusion can be drawn from a convenience sample.

The deeper issue is what population is being sampled. Forty sophomores from one school is an absurd convenience sample of the human population, but it might be a reasonable sample of that class. So let's first test truly random samples from small, well-defined populations. If we get a pattern of success, we can then work our way out to broader cases.

Of course this requires lots of separate researchers testing the same hypothesis, which may be counter to the present culture. But testing myriad hypotheses, each on a small convenience sample, is hopeless.

David

susan

Oct 20, 2017, 11:08:57 AM
to The Open Scholarship Initiative
Agreed -- sampling should truly be from "the" population, not just from the population that is handy.

Jason Priem

Nov 3, 2017, 12:38:43 PM
to susan, The Open Scholarship Initiative
As a professor of mine used to say, "psychological science offers us a rich insight into the minds of humans who take Psych 101 and need extra credit."  
j




--
Jason Priem, co-founder 
Impactstory: Share the full story of your research impact
follow at @jasonpriem and @impactstory