More on meta-psychology


Gwern Branwen

Jan 10, 2013, 12:37:01 PM
to N-back
["Science or Art? How Aesthetic Standards Grease the Way Through the
Publication Bottleneck but Undermine
Science"](http://pps.sagepub.com/content/7/6/562.full), Giner-Sorolla
2012:

> The current crisis in psychological research involves issues of fraud, replication, publication bias, and false positive results. I argue that this crisis follows the failure of widely adopted solutions to psychology’s similar crisis of the 1970s. The untouched root cause is an information-economic one: Too many studies divided by too few publication outlets equals a bottleneck. Articles cannot pass through just by showing theoretical meaning and methodological rigor; their results must appear to support the hypothesis perfectly. Consequently, psychologists must master the art of presenting perfect-looking results just to survive in the profession. This favors aesthetic criteria of presentation in a way that harms science’s search for truth. Shallow standards of statistical perfection distort analyses and undermine the accuracy of cumulative data; narrative expectations encourage dishonesty about the relationship between results and hypotheses; criteria of novelty suppress replication attempts. Concerns about truth in research are emerging in other sciences and may eventually descend on our heads in the form of difficult and insensitive regulations. I suggest a more palatable solution: to open the bottleneck, putting structures in place to reward broader forms of information sharing beyond the exquisite art of present-day journal publication.
>
> ...But crisis is nothing new in psychology. “Crises” of existing practices and ideas in psychology have been declared regularly at least since the time of Wilhelm Wundt (for an overview, see Sturm & Mülberger, 2012, and articles in the associated special issue). Especially relevant to today’s worries is the crisis that peaked about 40 years ago. The 1970s crisis had many facets. For example, in social psychology, mainstays of the field, such as the attitude concept and reliance on lab experiments, fell under question (Rosenthal & Rosnow, 1969; Wicker, 1969). However, other issues concerned all areas of psychology: limitations of null-hypothesis significance testing, bias toward positive results in publication, and the resulting lack of credibility of the standard research article (Elms, 1975; Greenwald, 1975).
>
> Revisiting the 1970s methods crisis gives a certain sense of déjà vu. One key article, by David T. Lykken, appeared in Psychological Bulletin in 1968. It focused on an example pulled arbitrarily from the personality literature. A single study found that eating disorder patients were significantly more likely than others to see frogs in a Rorschach test, which the author interpreted as showing unconscious fear of oral impregnation and anal birth (Sapolsky, 1964). Lykken dissected the frog hypothesis in a wickedly amusing way. But his main point, supported by a survey of colleagues, was that the significant result was not enough to increase their acceptance of the hypothesis. If our research articles give no confidence, Lykken argued, our standards of evidence must be flawed.
>
> Recent critiques of methodology resonate with Lykken’s approach. Most notably, some critiques of Bem’s (2011) precognition studies took their appearance in a top-ranked psychology journal as suggestive of flawed standards of evidence (LeBel & Peters, 2011; Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011). Simmons, Nelson, and Simonsohn (2011) ran intentionally preposterous experiments to support their argument that even false hypotheses often appear true when we selectively use data analysis to ensure positive results. In one experiment involving the Beatles rather than frogs, participants reported significantly lower calendar ages after listening to “When I’m Sixty-Four.”
>
> In another resonance with today, the 1970s debate also questioned the weakness of current practices in the face of outright fraud. In the middle of that decade, Cyril Burt’s findings on the heritability of IQ came under question (Gillie, 1977). Whatever the merits of accusations of fraud against Burt, which have proved controversial across the decades (Mackintosh, 1995; Samelson, 1997), the case led to reflection on how bias against publishing replications weakens the field’s ability to detect fraud (Samelson, 1980; Wong, 1981). A number of writers have recently expressed similar concerns in the face of less controversial examples of fraud and, more generally, implausible or unreliable results (e.g., Ritchie, Wiseman, & French, 2012; Roediger, 2012). Evidently, the measures taken to solve the issues of the 1970s have not been enough to keep them from popping up again.
>
> ...The 1970s crisis, like today’s, also forced reevaluation of the all-or-none Neyman-Pearson significance test as the gold standard of scientific truth (Hurlbert & Lombardi, 2009). After the 1970s, it slowly became acceptable to interpret “marginally significant” results at p < .10, to report exact p values, and to take into account statistical power and effect size (Cohen, 1994; Wilkinson & Task Force on Statistical Inference, 1999). Statistical techniques of meta-analysis were also developed and used, in line with postcrisis pleas for more aggregation of results across studies (Epstein, 1980; Miller & Pollock, 1994b). Making the final word in psychology depend on the outcomes of many labs, instead of just one, is a safeguard against outright fraud. Better yet, it protects against the much more common false-positive biases that arise when positive results are disproportionately rewarded in publishing (Sterling, Rosenbaum, & Weinkam, 1995). To carry out its watchdog role effectively, an aggregate test should include all attempts and all results, be they positive, negative, or inconclusive.
>
> But although aggregate tests are now a part of the research landscape, they do not yet dominate it. Running a meta-analysis is long and painstaking. Although meta-analyses are often rewarded with a slot in a high-impact journal, their absence is rarely seen as a flaw in a midcareer curriculum vitae. Indeed, it might be smarter and faster to focus on making a name by publishing one’s own research. Likewise, meta-analytic validation is not seen as necessary to proclaim an effect reliable. Textbooks, press reports, and narrative reviews often rest conclusions on single influential articles, rather than insisting on a replication across independent labs and multiple contexts. In this climate, it is hard to tell exactly how much evidence there is for the main point made by some well-cited classics. Finally, because the field does not disseminate or evaluate negative results from good-faith replication efforts, meta-analysis can tackle publication bias only indirectly, relying on the good will of researchers to share unpublished data (Rothstein, Sutton, & Borenstein, 2006). Elsewhere in this issue, Bakker, van Dijk, and Wicherts (2012) show that the steps taken by contemporary meta-analyses to gather studies from the “file drawer” are still not enough to defend against the impact of publication bias.
>
> ...In 1979, for example, the journal _Replications in Social Psychology_ began publishing, its mission evident from its title. It put out three volumes before folding. _Representative Research in Social Psychology_ was founded in 1970 and run by graduate students at the University of North Carolina at Chapel Hill, with the aim of publishing studies with good methodology regardless of results (Chamberlin, 2000). It had a longer run, but its last articles seem to have been published in 2006. Today, the online _Journal of Articles in Support of the Null Hypothesis_, founded in 2002, still lives but publishes only one to seven articles a year.
>
> ...Reality, however, should limit the influence of aesthetics on science (Engler, 1990). Dirac only said that a theory’s beauty should encourage persistence in its testing. If empirical results consistently speak against it, it is the theory, not the results, that must be rejected or revised. A highly selective publication market, with no credible alternate outlets for results, puts this standard in jeopardy. Science values a theory that is authentically supported by pleasing, strong, and consistent results, and rightly so. But what if only the most valuable of findings are allowed to be known? What if only scientists who can reliably present such findings are allowed to make a living from science? We can only expect that scientists under the gun will indulge in selective presentation to increase the apparent consistency of their results, even if most resist the temptation of outright fraud. Then, even the most gorgeous looking results become suspect, because the checks and balances that ensure their truth have failed.
>
> ...One recent article has argued, tongue in cheek, that a priori scientific hypothesizing is the most reliable form of precognition because so few psychology papers state hypotheses that turn out to be disconfirmed (Bones, 2012).
>
> ...In most fields of psychological research across the decades, the number of peer-reviewed outlets for publication has not kept up with a parallel increase in the amount of research being done. This phenomenon is described at length by Judson (2004) across all fields of science and is the topic of an economic analysis by Young et al. (2008), focusing on bioscience. Although a precise accounting of the narrowing bottleneck in psychology remains to be done, a good estimate of the rise in research-active people in my subfield comes from attendance at the annual Society for Personality and Social Psychology (SPSP) meeting. From an unexpectedly high figure of 812 at the first meeting in 2000, attendance reached roughly 1,500 in 2003 and 3,500 in 2010, with no sign of reaching a plateau yet, as nearly 4,000 attended the 2012 meeting (SPSP Dialogue, 2012). Most SPSP attendees present research posters or talks that they want to publish. A useful if rough figure might therefore be the ratio of the number of articles published in social and personality psychology journals (ISI Web of Knowledge, 2012; category: “PSYCHOLOGY, SOCIAL”) to SPSP attendees. Just from 2003 to 2010, this ratio has dropped from 1.32 to 0.90 article spaces per head.
>
> Nor must the bottleneck show a narrowing trend over time; the typical journal submission has also evolved through constant selective pressure. Analyses of the aforementioned social–personality psychology journals, JPSP and PSPB, found that their rejection rates remained fairly stable, and upward of 70%, across some 20 years, from 1976 to 1996. But at the same time, the number of pages per article increased by a factor of 2 to 4, and as already noted, the number of studies per article increased (Reis & Stiller, 1992; Sherman, Buddie, Dragan, End, & Finney, 1999). This time span coincides with the development of the main response to the first crisis, the requirement of more studies to confirm initially significant results. Bornmann and Marx (2012) reviewed empirical studies of scientific peer review that lend support to an “Anna Karenina principle” named after Tolstoy’s observation that happy families are all alike. When resources supporting proposals are scarce, conjunction rather than sum rules are used in decision making. In effect, this means that the proposal with nothing wrong with it, rather than the proposal highest in overall excellence, is most likely to succeed. In the peer review process, there eventually comes a time when a journal editor implementing an 85% rejection rate has already discarded all the fatally methodologically flawed manuscripts and still has to choose among a number exceeding the available space. It is here that the “artistic” criteria of novelty and perfection of results can enter in. In a head-to-head competition between papers, the paper testing ideas that are new will be preferred over the paper that confirms—or fails to confirm—existing ideas.
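The "article spaces per head" figure in the excerpt is easy to reproduce. A minimal sketch, with the absolute article counts back-calculated from the quoted ratios and attendance numbers (my inference for illustration, not actual ISI counts):

```python
# Sanity check on the quoted "article spaces per head" ratio.
# SPSP attendance figures are quoted in the excerpt; the article
# counts below are inferred from the quoted ratios (1.32 and 0.90)
# and are assumptions, not figures from ISI Web of Knowledge.
spsp_attendees = {2003: 1500, 2010: 3500}
articles = {2003: 1980, 2010: 3150}  # inferred: ratio * attendees

for year in sorted(spsp_attendees):
    ratio = articles[year] / spsp_attendees[year]
    print(f"{year}: {ratio:.2f} article spaces per attendee")
```

The point of the arithmetic is that even with article output rising in absolute terms, the denominator (research-active people) grew faster, so the per-head bottleneck tightened by roughly a third over those seven years.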

--
gwern
http://www.gwern.net