Hi Glenn - it is not just the social sciences, but they are getting the most attention because they are taking it head on (like the folks heading up the Center for Open Science). Below is a thread from the SciSIP listserv where the broader topic of retractions was being discussed. There are many contributing factors to failures at replication and reproducibility and p-hacking is just one of them. John Ioannidis has done a lot of great work unpacking these kinds of problems across science. His paper, published over a decade ago, kind of woke everyone up to the varied problems (see "Why Most Published Research Findings Are False" -- http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124). He started an entire center at Stanford to study this problem (https://metrics.stanford.edu/about-us/bio/john-ioannidis). Anyway, you can start at the bottom and read up to get a sense of how much attention this is now getting across all of science. This email was over a year ago, and this area of inquiry has only grown since then.
Best,
Steve
--------
Stephen M. Fiore, Ph.D.
Professor, Cognitive Sciences, Department of Philosophy (philosophy.cah.ucf.edu/staff.php?id=134)
Director, Cognitive Sciences Laboratory, Institute for Simulation & Training (http://csl.ist.ucf.edu/)
University of Central Florida
Hi Richard - here is a subset of the various articles/blogs etc. related to this topic of retraction, replication, and reproducibility that I pulled from emails/discussions I've had with colleagues. So this is about more than just retraction - it includes issues associated with the 'why' (e.g., p-hacking). Your student should do a reverse citation search (with Google Scholar) on these first three to see what kind of influence they had (e.g., the article "Retracted Science and the Retraction Index" has been cited 125 times since 2011, so a few of those should be relevant).
Best,
Steve Fiore
The Continued Use of Retracted, Invalid Scientific Literature (1990)
http://jama.jamanetwork.com/article.aspx?articleid=380976
How many scientific papers should be retracted? (2007)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1866214/
Retracted Science and the Retraction Index (2011)
http://iai.asm.org/content/79/10/3855.full
Why high-profile journals have more retractions
http://www.nature.com/news/why-high-profile-journals-have-more-retractions-1.15951
Retraction policies of top scientific journals ranked by impact factor
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4511053/
A Comprehensive Survey of Retracted Articles from the Scholarly Literature
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0044118
-----------------------------------
Preregistration of clinical trials causes medicines to stop working!
http://chrisblattman.com/2016/03/01/13719/
Likelihood of Null Effects of Large NHLBI Clinical Trials Has Increased over Time
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0132382
NIH plans to enhance reproducibility
http://www.nature.com/news/policy-nih-plans-to-enhance-reproducibility-1.14586
A call for transparent reporting to optimize the predictive value of preclinical research
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3511845/?report=classic
Instead of “playing the game” it is time to change the rules: Registered Reports at AIMS Neuroscience and beyond
http://www.aimspress.com/article/10.3934/Neuroscience.2014.1.4/fulltext.html
-----------------------------------
Significance chasing in research practice: causes, consequences and possible solutions.
http://europepmc.org/abstract/med/25040652
The Extent and Consequences of P-Hacking in Science [NOTE: Figure 3. Evidence for p-hacking across scientific disciplines]
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106
Reanalyzing Head et al. (2015): No widespread p-hacking after all?
https://www.authorea.com/users/2013/articles/31568
Tests for evidential value and p-hacking across disciplines, using p-values obtained from the Abstract.
High Impact = High Statistical Standards? Not Necessarily So
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0056180
The N-Pact Factor: Evaluating the Quality of Empirical Journals with Respect to Sample Size and Statistical Power
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0109019
Deep impact: unintended consequences of journal rank
http://journal.frontiersin.org/article/10.3389/fnhum.2013.00291/full
High-impact journals: where newsworthiness trumps methodology
-----------------------------------
NSF Gets an Earful about Replication
http://funderstorms.wordpress.com/2014/02/25/nsf-gets-an-earful-about-replication
The replication crisis has engulfed economics
https://theconversation.com/the-replication-crisis-has-engulfed-economics-49202
Internal conceptual replications do not increase independent replication success
http://link.springer.com/article/10.3758/s13423-016-1030-9
Selection Bias, Vote Counting, and Money-Priming Effects: A Comment on Rohrer, Pashler, and Harris (2015) and Vohs (2015)
http://psycnet.apa.org/journals/xge/145/5/655.html
Debunked Science: Studies Take Heat In 2011
http://www.npr.org/2011/12/29/144431640/debunked-science-studies-take-heat-in-2011
Excellent lead – I was going to suggest to look at Ioannidis as he was and is quite active in this area… as are his colleagues. I had also heard some rumblings that disease advocate organizations had an interest in this topic (directed funding to encourage attempts at reproducibility) but I cannot recall seeing any formal announcements of collections, databases, or reports. Will be a good project I suspect – although time might be eaten up with legwork.
Susan M. Fitzpatrick, Ph.D.
President, James S. McDonnell Foundation
Visit JSMF forum on academic issues: www.jsmf.org/clothing-the-emperor
SMF blog www.scientificphilanthropy.com
From: Science of Science Policy Listserv [mailto:SCI...@LISTSERV.NSF.GOV]
On Behalf Of Brooke Struck
Sent: Thursday, June 16, 2016 1:53 PM
To: SCI...@LISTSERV.NSF.GOV
Subject: Re: [scisip] Retracted paper resources?
Hi all,
This article is a bit old, and received (still receives) considerable press coverage, so I feel a bit foolish mentioning it. However, in the spirit that what seems obvious is probably in the eyes of the beholder, this article by John Ioannidis is a valuable addition to the reproducibility discussion: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1182327/
I heard him speak in Montreal a few years ago, at which point he mentioned that he was working on a formal tool to assess reproducibility. Anybody know whether that tool ever came to fruition?
Josh, such a tool might also help to put some more meat on the bones of your burgeoning R-factor indicator.
Best wishes,
Brooke
Brooke Struck, Ph.D.
Policy Analyst | Analyste de la politique
Science-Metrix
1335, Mont-Royal E
Montréal, QC H2J 1Y6 - Canada
F. 1.514.495.6523
E-mail: brooke...@science-metrix.com
Web: science-metrix.com
Please consider the environment before printing this email.
SVP, pensez à l’environnement avant d’imprimer ce message.
From: Science of Science Policy Listserv [mailto:SCI...@LISTSERV.NSF.GOV]
On Behalf Of Josh Nicholson
Sent: June 16, 2016 2:24 PM
To: SCI...@LISTSERV.NSF.GOV
Subject: Re: [scisip] Retracted paper resources?
Hi Richard,
Sounds like an interesting project. I have some answers that may be able to help but not all.
You can find retracted/withdrawn papers by looking at PubMed but I don’t think you can tell if they are voluntary (there are about 2k, I believe). Even looking at notices may not be sufficient as journals typically muddy these to hide their “mistakes.” I would look at the work by Ferric Fang and Arturo Casadevall on this. They have done a lot of great analyses and should serve as a good background and starting point.
As to identifying work that may be reproducible: I co-authored a piece with Yuri Lazebnik that describes the R-factor, which proposes to identify work that is reproducible/robust by citation analysis (https://thewinnower.com/papers/1-the-r-factor-a-measure-of-scientific-veracity). I am working hard on actually bringing this to life! We have built a corpus of work but need to try to automate it now. I am happy to elaborate more on this, if you’re interested.
Last, you may be interested in some backstories behind retracted work by authors themselves. See the Chaff: https://thewinnower.com/discussions
Hope that is helpful!
Best,
Josh
On Jun 16, 2016, at 2:04 PM, Conroy, Richard (NIH/NIBIB) [E] <Richard...@NIH.GOV> wrote:
Hello everyone – I have an intern working with me over the summer on a project looking at research reproducibility, prompted by recent interest in this topic such as Monya Baker’s May article in Nature that Steve Fiore mentioned last month.
One aspect of the project is to look at papers that have been withdrawn. From our preliminary look at what has been done previously, there does not appear to be a consistent way to identify papers that have been withdrawn. The folks at Retraction Watch mentioned constructing a database last year, though we could find nothing more recent. Are people aware of any large-scale analyses of papers that have been withdrawn/retracted, or of datasets that have been constructed? We have a number of questions in this area (e.g., is it possible to distinguish “voluntary” from “for cause” withdrawal?) and would appreciate any leads that people have.
We are also interested in the following questions and would appreciate any suggestions off-line of resources or people to talk with:
· Have there been any large-scale international comparisons of research reproducibility?
· Have there been any proposed models for identifying research that may not be reproducible?
· Have there been any studies following up on the career consequences of people (beyond the key figure) associated with research that hasn’t been reproduced?
Thanks
Richard
########################################################################
To send to the list, address your message to: SCI...@listserv.nsf.gov
To subscribe to the list: send the text “subscribe SCISIP” to list...@listserv.nsf.gov
To unsubscribe: send the text “unsubscribe SCISIP” to list...@listserv.nsf.gov
########################################################################
Thanks Steve, Jack (and cc’ing RScomm on my reply),
There are some interesting articles in this thread, Steve. Thanks. There also seems to be some conflation here between replicability and retraction, which is kind of apples and oranges (just because a study isn’t replicable doesn’t mean it gets retracted). Replication is a more systemic issue, right? We’re either designing and evaluating research to high standards or we aren’t, as Ioannidis describes in his seminal paper.
WRT the retraction crisis, I don’t necessarily agree with Fang’s conclusions as I wrote in this piece a few years ago: http://nationalscience.org/nsci-focus-areas/science-writing/2013/jumping-off-the-retractions-bandwagon/. But maybe this conclusion isn’t sound any more given what we now know about replicability, and the ongoing work of Retraction Watch which has helped to daylight the fuller extent of this issue (i.e., maybe I need to retract my blog post 😊).
So, given all the concerns about replicability and the proposals (from NIH and others) to work on this, is anything actually happening yet and where/how?---i.e., are new study design standards being created, are key studies being redone, how is science adapting, and so on?
Cheers,
Glenn
Glenn Hampson
Executive Director
Science Communication Institute (SCI)
Program Director
Open Scholarship Initiative (OSI)
2320 N 137th Street | Seattle, WA 98133
(206) 417-3607 | gham...@nationalscience.org | nationalscience.org
--
As a public and publicly-funded effort, the conversations on this list can be viewed by the public and are archived. To read this group's complete listserv policy (including disclaimer and reuse information), please visit http://osinitiative.org/osi-listservs.
---
You received this message because you are subscribed to the Google Groups "The Open Scholarship Initiative" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osi2016-25+...@googlegroups.com.
To post to this group, send email to osi20...@googlegroups.com.
Visit this group at https://groups.google.com/group/osi2016-25.
For more options, visit https://groups.google.com/d/optout.
University Distinguished Professor and Department Head of Statistics, Texas A&M University
Valen E. Johnson does not work for, consult, own shares in or receive funding from any company or organization that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.
In a trial of a new drug to cure cancer, 44 percent of 50 patients achieved remission after treatment. Without the drug, only 32 percent of previous patients did the same. The new treatment sounds promising, but is it better than the standard?
That question is difficult, so statisticians tend to answer a different question. They look at their results and compute something called a p-value. If the p-value is less than 0.05, the results are “statistically significant” – in other words, unlikely to be caused by just random chance.
The problem is, many statistically significant results aren’t replicating. A treatment that shows promise in one trial doesn’t show any benefit at all when given to the next group of patients. This problem has become so severe that one psychology journal actually banned p-values altogether.
My colleagues and I have studied this problem, and we think we know what’s causing it. The bar for claiming statistical significance is simply too low.
The Open Science Collaboration, a nonprofit organization focused on scientific research, tried to replicate 100 published psychology experiments. While 97 of the initial experiments reported statistically significant findings, only 36 of the replicated studies did.
Several graduate students and I used these data to estimate the probability that a randomly chosen psychology experiment tested a real effect. We found that only about 7 percent did. In a similar study, economist Anna Dreber and colleagues estimated that only 9 percent of experiments would replicate.
Both analyses suggest that only about one in 13 new experimental treatments in psychology – and probably many other social sciences – will turn out to be a success.
This has important implications when interpreting p-values, particularly when they’re close to 0.05.
P-values close to 0.05 are more likely to be due to random chance than most people realize.
To understand the problem, let’s return to our imaginary drug trial. Remember, 22 out of 50 patients on the new drug went into remission, compared to an average of just 16 out of 50 patients on the old treatment.
The probability of seeing 22 or more successes out of 50 is 0.05 if the new drug is no better than the old. That means the p-value for this experiment is statistically significant. But we want to know whether the new treatment is really an improvement, or if it’s no better than the old way of doing things.
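For readers who want to check the 0.05 figure, here is a minimal Python sketch. It assumes each patient's outcome is independent and that the remission rate under the old treatment is exactly 0.32, so that the number of remissions follows a binomial distribution. The function names are illustrative and do not come from the article.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def upper_tail(k, n, p):
    """Probability of k or more successes in n independent trials."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

# Assumption: the old treatment's remission rate is 0.32 (16 of 50).
p_value = upper_tail(22, 50, 0.32)
print(f"one-sided p-value: {p_value:.3f}")
```

The printed value should come out at roughly 0.05, in line with the figure quoted above.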
To find out, we need to combine the information contained in the data with the information available before the experiment was conducted, or the “prior odds.” The prior odds reflect factors that are not directly measured in the study. For instance, they might account for the fact that in 10 other trials of similar drugs, none proved to be successful.
If the new drug isn’t any better than the old drug, then statistics tells us that the probability of seeing exactly 22 out of 50 successes in this trial is 0.0235 – relatively low.
What if the new drug actually is better? We don’t actually know the success rate of the new drug, but a good guess is that it’s close to the observed success rate, 22 out of 50. If we assume that, then the probability of observing exactly 22 out of 50 successes is 0.113 – about five times more likely. (Not nearly 20 times more likely, though, as you might guess if you knew the p-value from the experiment was 0.05.)
This ratio of the probabilities is called the Bayes factor. We can use Bayes’ theorem to combine the Bayes factor with the prior odds to compute the probability that the new treatment is better.
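The two probabilities and their ratio can be reproduced with a short sketch. As in the text, it assumes a binomial model, a remission rate of 0.32 if the new drug is no better, and a rate fixed at the observed 0.44 if it is better. The variable names are illustrative only.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 22 of 50 remissions under each hypothesis.
like_null = binom_pmf(22, 50, 0.32)  # new drug no better than the old
like_alt = binom_pmf(22, 50, 0.44)   # new drug works at the observed rate

# The Bayes factor is the ratio of the two probabilities.
bayes_factor = like_alt / like_null
print(f"null: {like_null:.4f}  alternative: {like_alt:.3f}  "
      f"Bayes factor: {bayes_factor:.1f}")
```

Because the alternative is set to the observed rate, which makes the data as probable as any single rate can, the ratio here is the most favorable case for the new drug under this model.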
For the sake of argument, let’s suppose that only 1 in 13 experimental cancer treatments will turn out to be a success. That’s close to the value we estimated for the psychology experiments.
When we combine these prior odds with the Bayes factor, it turns out that the probability the new treatment is no better than the old is at least 0.71. But the statistically significant p-value of 0.05 suggests exactly the opposite!
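The arithmetic behind the 0.71 figure can be sketched as follows. The sketch assumes the prior probability of a real effect is 1 in 13 and takes the Bayes factor as the ratio of the two probabilities quoted earlier (0.113 and 0.0235). The helper name is hypothetical.

```python
def posterior_prob_null(bayes_factor, prior_prob_alt):
    """Posterior probability that the new treatment is no better.

    bayes_factor: evidence for the alternative relative to the null.
    prior_prob_alt: prior probability that the treatment really works.
    """
    prior_odds_alt = prior_prob_alt / (1 - prior_prob_alt)
    posterior_odds_alt = bayes_factor * prior_odds_alt
    return 1 / (1 + posterior_odds_alt)

# Assumptions taken from the text: ratio of 0.113 to 0.0235, prior of 1 in 13.
bayes_factor = 0.113 / 0.0235
p_null = posterior_prob_null(bayes_factor, 1 / 13)
print(f"probability the new treatment is no better: {p_null:.2f}")
```

With prior odds of 1 to 12 and a Bayes factor near 5, the posterior odds in favor of the drug are only about 0.4 to 1, which is why the probability of no improvement stays above 0.7. The "at least" follows because the Bayes factor used here is the most favorable one for the new drug.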
This inconsistency is typical of many scientific studies. It’s particularly common for p-values around 0.05. This explains why such a high proportion of statistically significant results do not replicate.
So how should we evaluate initial claims of a scientific discovery? In September, my colleagues and I proposed a new idea: Only p-values less than 0.005 should be considered statistically significant. P-values between 0.005 and 0.05 should merely be called suggestive.
In our proposal, statistically significant results are more likely to replicate, even after accounting for the small prior odds that typically pertain to studies in the social, biological and medical sciences.
What’s more, we think that statistical significance should not serve as a bright-line threshold for publication. Statistically suggestive results – or even results that are largely inconclusive – might also be published, based on whether or not they reported important preliminary evidence regarding the possibility that a new theory might be true.
On Oct. 11, we presented this idea to a group of statisticians at the ASA Symposium on Statistical Inference in Bethesda, Maryland. Our goal in changing the definition of statistical significance is to restore the intended meaning of this term: that data have provided substantial support for a scientific discovery or treatment effect.
Not everyone agrees with our proposal, including another group of scientists led by psychologist Daniel Lakens.
They argue that the definition of Bayes factors is too subjective, and that researchers can make other assumptions that might change their conclusions. In the clinical trial, for example, Lakens might argue that researchers could report the three-month rather than six-month remission rate, if it provided stronger evidence in favor of the new drug.
Lakens and his group also feel that the estimate that only about one in 13 experiments will replicate is too low. They point out that this estimate does not include effects like p-hacking, a term for when researchers repeatedly analyze their data until they find a strong p-value.
Instead of raising the bar for statistical significance, the Lakens group thinks that researchers should set and justify their own level of statistical significance before they conduct their experiments.
I disagree with many of the Lakens group’s claims – and, from a purely practical perspective, I feel that their proposal is a nonstarter. Most scientific journals don’t provide a mechanism for researchers to record and justify their choice of p-values before they conduct experiments. More importantly, allowing researchers to set their own evidence thresholds doesn’t seem like a good way to improve the reproducibility of scientific research.
Lakens’s proposal would only work if journal editors and funding agencies agreed in advance to publish reports of experiments that haven’t been conducted based on criteria that scientists themselves have imposed. I think this is unlikely to happen anytime in the near future.
Until it does, I recommend that you not trust claims from scientific studies based on p-values near 0.05. Insist on a higher standard.
Jack Schultz
Jack C. Schultz
Sr. Executive Director of Research Development
University of Toledo
@jackcschultz
https://schultzappel.wordpress.com