Detecting P-hacking

Margarita Lovvorn

Aug 4, 2024, 4:18:25 PM
to omtetuness
In this article, authors Joseph Simmons, Leif Nelson, and Uri Simonsohn propose a way to distinguish truly significant findings from false positives that result from selective reporting and specification searching, i.e., p-hacking.

Simmons, Nelson, and Simonsohn conclude that by examining the distribution of reported p-values one can identify whether selective reporting was used. What do you think about the p-curve? Would you use this tool?




Background. The p-curve is a plot of the distribution of p-values reported in a set of scientific studies. Comparisons between ranges of p-values have been used to evaluate fields of research in terms of the extent to which studies have genuine evidential value, and the extent to which they suffer from bias in the selection of variables and analyses for publication, i.e., p-hacking.

Methods. p-hacking can take various forms. Here we used R code to simulate the use of ghost variables, where an experimenter gathers data on several dependent variables but reports only those with statistically significant effects. We also examined a text-mined dataset used by Head et al. (2015) and assessed its suitability for investigating p-hacking.

Results. We show that when there is ghost p-hacking, the shape of the p-curve depends on whether the dependent variables are intercorrelated. For uncorrelated variables, simulated p-hacked data do not give the "p-hacking bump" just below .05 that is regarded as evidence of p-hacking, though there is a negative skew when the simulated variables are intercorrelated. The way p-curves vary according to features of the underlying data poses problems when automated text mining is used to detect p-values in heterogeneous sets of published papers.

Conclusions. The absence of a bump in the p-curve is not indicative of a lack of p-hacking. Furthermore, while studies with evidential value will usually generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value unless we have a model specific to the type of p-values entered into the analysis. We conclude that it is not feasible to use the p-curve to estimate the extent of p-hacking and evidential value unless there is considerable control over the type of data entered into the analysis. In particular, p-hacking with ghost variables is likely to be missed.
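To see what the ghost-variable scenario described above can look like in practice, here is a minimal simulation sketch (in Python rather than the authors' R code; the sample sizes, number of dependent variables, and number of studies are arbitrary illustrative choices). Each simulated study measures several uncorrelated dependent variables under a true null effect and "reports" only a significant one; the selected p-values are then binned as a rough p-curve.

# Ghost-variable p-hacking under a true null: measure several DVs, report only
# the smallest p-value when it falls below .05, and tally the reported p-values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, n_dvs, n_studies = 30, 5, 20000
reported = []
for _ in range(n_studies):
    # uncorrelated dependent variables, no true group difference
    a = rng.standard_normal((n_per_group, n_dvs))
    b = rng.standard_normal((n_per_group, n_dvs))
    pvals = stats.ttest_ind(a, b).pvalue
    if pvals.min() < 0.05:                      # report only a "significant" DV
        reported.append(pvals.min())

# With uncorrelated DVs the selected p-values spread across the whole 0-.05
# range instead of forming a bump just below .05.
counts, edges = np.histogram(reported, bins=np.arange(0, 0.051, 0.01))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:.2f}-{hi:.2f}: {c}")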


The p-test, introduced by the British statistician Ronald Fisher in the 1920s, assesses whether the results of an experiment are more extreme than what one would expect given the null hypothesis. The smaller this p-value, argued Fisher, the stronger the grounds for doubting the null hypothesis. However, even Fisher never intended the p-test to be a single figure of merit; rather, he saw it as part of a continuous, non-numerical process that combined experimental data with other information to reach a scientific conclusion.


Indeed, the p-test, used alone, has significant drawbacks. To begin with, the customary threshold of p = 0.05 is not a particularly compelling level of evidence. In any event, it is hard to justify rejecting a result whose p-value is 0.051 while accepting as significant one whose p-value is 0.049.


The prevalence of the classic p = 0.05 threshold has led to the egregious practice that Uri Simonsohn of the University of Pennsylvania has termed p-hacking: trying numerous varied hypotheses until one is found that meets the 0.05 level. Note that this is a classic multiple-testing fallacy of statistics: perform enough tests and one is bound to pass any specified level of statistical significance. Such suspicions are justified given the results of a study by Jelte Wicherts of the University of Amsterdam, who found that researchers whose results were close to the p = 0.05 level of significance were less willing to share their original data than were researchers with stronger significance levels (see also this summary from Psychology Today).
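To make the multiple-testing arithmetic concrete, here is a short sketch (the test counts are arbitrary, purely for illustration): for k independent tests of true null hypotheses at the 0.05 level, the chance of at least one spurious "significant" result is 1 - (1 - 0.05)^k.

# Family-wise false positive probability for k independent tests of true nulls.
alpha = 0.05
for k in (1, 5, 10, 20, 60):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:3d} tests: P(at least one 'significant' result) = {p_any:.2f}")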


Along this line, it is clear that a sole focus on p-values can muddle scientific thinking by confusing statistical significance with the size of an effect. For example, a 2013 study of more than 19,000 married persons found that those who had met their spouses online were less likely to divorce; the difference was statistically significant, but the effect itself was small.

Significance also says little about how often "significant" findings are simply false. If, for instance, only 1 percent of the hypotheses being tested are actually true and each test has 80 percent power, then roughly 86 percent of the results declared significant at the 0.05 level will be false positives. Needless to say, a false positive rate of 86% is disastrously high. Yet this is entirely typical of many instances in scientific research where naive usage of p-values leads to surprisingly misleading results.
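Returning to the significance-versus-size point, a small sketch (synthetic data, not the marriage study's; the sample size and effect are made up) shows how a large enough sample makes a negligible difference "significant".

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 1_000_000                               # very large sample per group
group_a = rng.normal(0.00, 1.0, n)          # true means differ by only 0.01 SD
group_b = rng.normal(0.01, 1.0, n)

t, p = stats.ttest_ind(group_a, group_b)
d = (group_b.mean() - group_a.mean()) / np.sqrt((group_a.var() + group_b.var()) / 2)
print(f"p = {p:.2e}, Cohen's d = {d:.3f}")  # highly 'significant', negligible effect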


Good statistical practice, as an essential component of good scientific practice, emphasizes principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean. No single index should substitute for scientific reasoning.


Thus the only real long-term solution is for all scientific researchers and others who perform research work to be rigorously trained in modern statistics and how best to use these tools. Special attention should be paid to showing how statistical tests can mislead when used naively. Note that this education needs to be done not only for students and others entering the research work force, but also for those who are already practitioners in the field. This will not be easy but must be done.


Such considerations bring to mind a historical anecdote about the great Greek mathematician Euclid. According to an ancient account, when Pharaoh Ptolemy I of Egypt grew frustrated at the degree of effort required to master geometry, he asked his tutor Euclid whether there was some easier path. Euclid is said to have replied, "There is no royal road to geometry."


P-hacking, also known as data dredging or data snooping, is a controversial practice in statistics and data analysis that undermines the validity of research findings. It occurs when researchers consciously or unconsciously manipulate their data or statistical analyses until non-significant results become significant.


The issue with p-hacking is its disregard for the principles of hypothesis testing. This practice can lead to an inflated rate of Type I errors, where a true null hypothesis is incorrectly rejected.


When p-hacking is involved, data analysis loses its reliability. This is because p-hacking allows researchers to present a hypothesis as supported by data, even when the evidence is weak or non-existent.


P-hacking takes several forms. All of them, however, involve the misuse of statistical analysis to produce misleading, often false, statistically significant results. Understanding these types can help researchers and analysts avoid falling into their traps and maintain the integrity of their work.


The first form of p-hacking involves multiple testing, where researchers test a wide range of hypotheses on the same data set. Some of these tests will yield statistically significant results by chance alone, leading to false positives. Researchers can mitigate this form of p-hacking by applying Bonferroni correction or other adjustment methods for multiple comparisons.
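A minimal sketch of that adjustment (the p-values below are made up for illustration): with a Bonferroni correction, each of m tests is judged against alpha / m, which keeps the family-wise error rate at or below alpha.

# Bonferroni correction: compare each of m p-values against alpha / m.
alpha = 0.05
p_values = [0.003, 0.012, 0.021, 0.048, 0.300]       # illustrative p-values
m = len(p_values)
threshold = alpha / m

for p in p_values:
    verdict = "significant" if p < threshold else "not significant"
    print(f"p = {p:.3f}: {verdict} at the corrected threshold {threshold:.3f}")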


A second form is optional stopping, where researchers prematurely stop data collection once they observe a significant p-value. This practice can inflate the type I error rate, leading to more false positives than expected under the null hypothesis. To avoid this, researchers should specify their sample size and stick to it.
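The inflation is easy to demonstrate with a small simulation sketch (arbitrary batch sizes and simulation counts; both groups are drawn from the same distribution, so the null is true): the "experimenter" peeks after every batch of observations and stops as soon as p < .05.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, batch, max_n, alpha = 5000, 10, 100, 0.05
false_positives = 0

for _ in range(n_sims):
    a, b = [], []
    for _ in range(max_n // batch):
        a.extend(rng.standard_normal(batch))     # both groups come from the
        b.extend(rng.standard_normal(batch))     # same distribution (true null)
        if stats.ttest_ind(a, b).pvalue < alpha:
            false_positives += 1                 # stop early, declare "significance"
            break

print(f"Type I error with optional stopping: {false_positives / n_sims:.3f} "
      f"(nominal alpha = {alpha})")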


Another form is cherry-picking, where researchers select and report only the most promising results from their analysis while disregarding the rest. This practice skews the perception of the data and the validity of the conclusions. Complete and transparent reporting of all tests conducted can help mitigate this issue.


The fourth type is hypothesizing after the results are known (HARKing). In this scenario, researchers formulate or tweak their hypotheses after examining their data, leading to a confirmation bias that inflates the chance of finding statistically significant results. To avoid HARKing, researchers should pre-register their studies, declaring their hypotheses and planned analyses before examining their data.


The final type is overfitting models. This occurs when researchers create an overly complex model that captures the noise, not just the signal, in the data. Although these models might fit their training data well, they typically perform poorly on new data, leading to ungeneralizable conclusions.
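A small sketch of the overfitting problem (synthetic data and arbitrary model choices): a high-degree polynomial fits the training noise almost perfectly but generalizes worse than a simple straight-line model.

import numpy as np

rng = np.random.default_rng(3)
x_train, x_test = rng.uniform(-1, 1, 20), rng.uniform(-1, 1, 200)
y_train = 2.0 * x_train + rng.normal(0, 0.5, x_train.size)   # true signal is a line
y_test = 2.0 * x_test + rng.normal(0, 0.5, x_test.size)

for degree in (1, 10):                           # simple model vs. overly flexible model
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")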


In a world increasingly relying on data-driven decisions, the implications of p-hacking are profound. False positives can mislead policymakers, businesses, and other stakeholders who rely on research findings to inform their decisions.


P-hacking has influenced the outcome of several well-known scientific research studies, calling into question the validity of their findings. This dubious practice highlights the need for more rigorous standards in data analysis.


Cases like these emphasize the need to acknowledge and prevent p-hacking in scientific research. Without meticulous standards and ethical statistical practices, p-hacking risks compromising the trustworthiness and integrity of scientific discoveries.
