Why Psychologists’ Food Fight Matters: “Important findings” haven’t been replicated, and science may have to change its ways. By Michelle N. Meyer and Christopher Chabris. Slate, July 31, 2014. [Via Steven Pinker’s Twitter account, where he adds: "Lesson for sci journalists: Stop reporting single studies, no matter how sexy (these are probably false). Report lit reviews, meta-analyses."] Some excerpts:

Psychologists are up in arms over, of all things, the editorial process that led to the recent publication of a special issue of the journal Social Psychology. This may seem like a classic case of ivory tower navel gazing, but its impact extends far beyond academia. The issue attempts to replicate 27 “important findings in social psychology.” Replication—repeating an experiment as closely as possible to see whether you get the same results—is a cornerstone of the scientific method. Replication of experiments is vital not only because it can detect the rare cases of outright fraud, but also because it guards against uncritical acceptance of findings that were actually inadvertent false positives, helps researchers refine experimental techniques, and affirms the existence of new facts that scientific theories must be able to explain.

One of the articles in the special issue reported a failure to replicate a widely publicized 2008 study by Simone Schnall, now tenured at Cambridge University, and her colleagues. In the original study, two experiments measured the effects of people’s thoughts or feelings of cleanliness on the harshness of their moral judgments. In the first experiment, 40 undergraduates were asked to unscramble sentences, with one-half assigned words related to cleanliness (like pure or pristine) and one-half assigned neutral words. In the second experiment, 43 undergraduates watched the truly revolting bathroom scene from the movie Trainspotting, after which one-half were told to wash their hands while the other one-half were not. All subjects in both experiments were then asked to rate the moral wrongness of six hypothetical scenarios, such as falsifying one’s résumé and keeping money from a lost wallet. The researchers found that priming subjects to think about cleanliness had a “substantial” effect on moral judgment: The hand washers and those who unscrambled sentences related to cleanliness judged the scenarios to be less morally wrong than did the other subjects. The implication was that people who feel relatively pure themselves are—without realizing it—less troubled by others’ impurities. The paper was covered by ABC News, the Economist, and the Huffington Post, among other outlets, and has been cited nearly 200 times in the scientific literature.

However, the replicators—David Johnson, Felix Cheung, and Brent Donnellan (two graduate students and their adviser) of Michigan State University—found no such difference, despite testing about four times more subjects than the original studies. [...]

The editor in chief of Social Psychology later agreed to devote a follow-up print issue to responses by the original authors and rejoinders by the replicators, but as Schnall told Science, the entire process made her feel “like a criminal suspect who has no right to a defense and there is no way to win.” The Science article covering the special issue was titled “Replication Effort Provokes Praise—and ‘Bullying’ Charges.” Both there and in her blog post, Schnall said that her work had been “defamed,” endangering both her reputation and her ability to win grants. She feared that by the time her formal response was published, the conversation might have moved on, and her comments would get little attention.

How wrong she was. In countless tweets, Facebook comments, and blog posts, several social psychologists seized upon Schnall’s blog post as a cri de coeur against the rising influence of “replication bullies,” “false positive police,” and “data detectives.” For “speaking truth to power,” Schnall was compared to Rosa Parks. The “replication police” were described as “shameless little bullies,” “self-righteous, self-appointed sheriffs” engaged in a process “clearly not designed to find truth,” “second stringers” who were incapable of making novel contributions of their own to the literature, and—most succinctly—“assholes.” Meanwhile, other commenters stated or strongly implied that Schnall and other original authors whose work fails to replicate had used questionable research practices to achieve sexy, publishable findings. At one point, these insinuations were met with threats of legal action. [...]

Unfortunately, published replications have been distressingly rare in psychology. A 2012 survey of the top 100 psychology journals found that barely 1 percent of papers published since 1900 were purely attempts to reproduce previous findings. Some of the most prestigious journals have maintained explicit policies against replication efforts; for example, the Journal of Personality and Social Psychology published a paper purporting to support the existence of ESP-like “precognition,” but would not publish papers that failed to replicate that (or any other) discovery. Science publishes “technical comments” on its own articles, but only if they are submitted within three months of the original publication, which leaves little time to conduct and document a replication attempt.

The “replication crisis” is not at all unique to social psychology, to psychological science, or even to the social sciences. As Stanford epidemiologist John Ioannidis famously argued almost a decade ago, “Most research findings are false for most research designs and for most fields.” Failures to replicate and other major flaws in published research have since been noted throughout science, including in cancer research, research into the genetics of complex diseases like obesity and heart disease, stem cell research, and studies of the origins of the universe. Earlier this year, the National Institutes of Health stated, “The complex system for ensuring the reproducibility of biomedical research is failing and is in need of restructuring.”

Given the stakes involved and its centrality to the scientific method, it may seem perplexing that replication is the exception rather than the rule. The reasons why are varied, but most come down to the perverse incentives driving research. Scientific journals typically view “positive” findings that announce a novel relationship or support a theoretical claim as more interesting than “negative” findings that say that things are unrelated or that a theory is not supported. The more surprising the positive finding, the better, even though surprising findings are statistically less likely to be accurate. Since journal publications are valuable academic currency, researchers—especially those early in their careers—have strong incentives to conduct original work rather than to replicate the findings of others. Replication efforts that do happen but fail to find the expected effect are usually filed away rather than published. That makes the scientific record look more robust and complete than it is—a phenomenon known as the “file drawer problem.”

The emphasis on positive findings may also partly explain the fact that when original studies are subjected to replication, so many turn out to be false positives. The near-universal preference for counterintuitive, positive findings gives researchers an incentive to manipulate their methods or poke around in their data until a positive finding crops up, a common practice known as “p-hacking” because it can result in p-values, or measures of statistical significance, that make the results look stronger, and therefore more believable, than they really are. [...]
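
To make the mechanics concrete, here is a minimal simulation sketch in Python (ours, not the authors'; the group sizes and number of outcomes are illustrative assumptions): even with no true effect at all, a "researcher" who measures several outcomes and reports whichever one clears p < 0.05 will produce false positives far more often than the nominal 5 percent.

```python
# Minimal p-hacking sketch: no true effect exists, but the "researcher"
# tests several outcome measures and reports the best-looking one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_study(n_per_group=20, n_outcomes=6):
    """Two groups drawn from the SAME distribution, measured on several
    hypothetical outcomes; return the smallest p-value across outcomes."""
    control = rng.normal(size=(n_outcomes, n_per_group))
    treated = rng.normal(size=(n_outcomes, n_per_group))
    pvals = [stats.ttest_ind(treated[i], control[i]).pvalue
             for i in range(n_outcomes)]
    return min(pvals)

n_sims = 5000
honest = sum(one_study(n_outcomes=1) < 0.05 for _ in range(n_sims)) / n_sims
hacked = sum(one_study(n_outcomes=6) < 0.05 for _ in range(n_sims)) / n_sims
print(f"False-positive rate, one pre-specified outcome: {honest:.1%}")  # about 5%
print(f"False-positive rate, best of six outcomes:      {hacked:.1%}")  # roughly 25-30%
```

With six independent outcomes, the chance that at least one crosses p < 0.05 by luck alone is about 1 − 0.95⁶ ≈ 26 percent, which is roughly what the simulation reports.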

The recent special issue of Social Psychology was an unprecedented collective effort by social psychologists to [rectify this situation]—by altering researchers’ and journal editors’ incentives in order to check the robustness of some of the most talked-about findings in their own field. Any researcher who wanted to conduct a replication was invited to preregister: Before collecting any data from subjects, they would submit a proposal detailing precisely how they would repeat the original study and how they would analyze the data. Proposals would be reviewed by other researchers, including the authors of the original studies, and once approved, the study’s results would be published no matter what. Preregistration of the study and analysis procedures should deter p-hacking, guaranteed publication should counteract the file drawer effect, and a requirement of large sample sizes should make it easier to detect small but statistically meaningful effects.
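
As a rough illustration of why the large-sample requirement matters (a hedged sketch with made-up numbers, not the special issue's actual power calculations): by simulation, a two-sample t-test with 20 subjects per group detects a smallish true effect (Cohen's d = 0.3) only around 15 percent of the time, while quadrupling the sample roughly triples the detection rate.

```python
# Rough power-by-simulation sketch; the effect size and sample sizes are
# illustrative assumptions, not figures from the journal's protocol.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def power(n_per_group, d=0.3, alpha=0.05, n_sims=5000):
    """Fraction of simulated studies in which a two-sample t-test
    detects a true standardized effect of size d."""
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(d, 1.0, n_per_group)  # true effect of size d
        if stats.ttest_ind(treated, control).pvalue < alpha:
            hits += 1
    return hits / n_sims

print(f"Power with 20 subjects per group: {power(20):.0%}")  # around 15%
print(f"Power with 80 subjects per group: {power(80):.0%}")  # around 45-50%
```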

The results were sobering. At least 10 of the 27 “important findings” in social psychology were not replicated at all. In the social priming area, only one of seven replications succeeded. [...]

One way to keep things in perspective is to remember that scientific truth is created by the accretion of results over time, not by the splash of a single study. A single failure-to-replicate doesn’t necessarily invalidate a previously reported effect, much less imply fraud on the part of the original researcher—or the replicator. Researchers are most likely to fail to reproduce an effect for mundane reasons, such as insufficiently large sample sizes, innocent errors in procedure or data analysis, and subtle factors about the experimental setting or the subjects tested that alter the effect in question in ways not previously realized.

Caution about single studies should go both ways, though. Too often, a single original study is treated—by the media and even by many in the scientific community—as if it definitively establishes an effect. Publications like Harvard Business Review and idea conferences like TED, both major sources of “thought leadership” for managers and policymakers all over the world, emit a steady stream of these “stats and curiosities.” Presumably, the HBR editors and TED organizers believe this information to be true and actionable. But most novel results should be initially regarded with some skepticism, because they too may have resulted from unreported or unnoticed methodological quirks or errors. Everyone involved should focus their attention on developing a shared evidence base that consists of robust empirical regularities—findings that replicate not just once but routinely—rather than of clever one-off curiosities. [...]

Scholars, especially scientists, are supposed to be skeptical about received wisdom, develop their views based solely on evidence, and remain open to updating those views in light of changing evidence. But as psychologists know better than anyone, scientists are hardly free of human motives that can influence their work, consciously or unconsciously. It’s easy for scholars to become professionally or even personally invested in a hypothesis or conclusion. These biases are addressed partly through the peer review process, and partly through the marketplace of ideas—by letting researchers go where their interest or skepticism takes them, encouraging their methods, data, and results to be made as transparent as possible, and promoting discussion of differing views. The clashes between researchers of different theoretical persuasions that result from these exchanges should of course remain civil; but the exchanges themselves are a perfectly healthy part of the scientific enterprise.

This is part of the reason why we cannot agree with a more recent proposal by Kahneman, who had previously urged social priming researchers to put their house in order. He contributed an essay to the special issue of Social Psychology in which he proposed a rule—to be enforced by reviewers of replication proposals and manuscripts—that authors “be guaranteed a significant role in replications of their work.” Kahneman proposed a specific process by which replicators should consult with original authors, and told Science that in the special issue, “the consultations did not reach the level of author involvement that I recommend.”

Collaboration between opposing sides would probably avoid some ruffled feathers, and in some cases it could be productive in resolving disputes. With respect to the current controversy, given the potential impact of an entire journal issue on the robustness of “important findings,” and the clear desirability of buy-in by a large portion of psychology researchers, it would have been better for everyone if the original authors’ comments had been published alongside the replication papers, rather than left to appear afterward. But consultation or collaboration is not something replicators owe to original researchers, and a rule to require it would not be particularly good science policy.

Replicators have no obligation to routinely involve original authors because those authors are not the owners of their methods or results. By publishing their results, original authors state that they have sufficient confidence in them that they should be included in the scientific record. That record belongs to everyone. Anyone should be free to run any experiment, regardless of who ran it first, and to publish the results, whatever they are. [...]

Some critics of replication drives have been too quick to suggest that replicators lack the subtle expertise to reproduce the original experiments. One prominent social psychologist has even argued that tacit methodological skill is such a large factor in getting experiments to work that failed replications have no value at all (since one can never know if the replicators really knew what they were doing, or knew all the tricks of the trade that the original researchers did), a surprising claim that drew sarcastic responses. [See LW discussion.] [...]

Psychology has long been a punching bag for critics of “soft science,” but the field is actually leading the way in tackling a problem that is endemic throughout science. The replication issue of Social Psychology is just one example. The Association for Psychological Science is pushing for better reporting standards and more study of research practices, and at its annual meeting in May in San Francisco, several sessions on replication were filled to overflowing. International collaborations of psychologists working on replications, such as the Reproducibility Project and the Many Labs Replication Project (which was responsible for 13 of the 27 replications published in the special issue of Social Psychology), are springing up.

Even the most tradition-bound journals are starting to change. The Journal of Personality and Social Psychology—the same journal that, in 2011, refused to even consider replication studies—recently announced that although replications are “not a central part of its mission,” it’s reversing this policy. We wish that JPSP would see replications as part of its central mission and not relegate them, as it has, to an online-only ghetto, but this is a remarkably nimble change for a 50-year-old publication. Other top journals, most notable among them Perspectives on Psychological Science, are devoting space to systematic replications and other confirmatory research. The leading journal in behavior genetics, a field that has been plagued by unreplicable claims that particular genes are associated with particular behaviors, has gone even further: It now refuses to publish original findings that do not include evidence of replication.

A final salutary change is an overdue shift of emphasis among psychologists toward establishing the size of effects, as opposed to disputing whether or not they exist. The very notion of “failure” and “success” in empirical research is urgently in need of refinement. When applied thoughtfully, this dichotomy can be useful shorthand (and we’ve used it here). But there are degrees of replication between success and failure, and these degrees matter.

For example, suppose an initial study of an experimental drug for cardiovascular disease suggests that it reduces the risk of heart attack by 50 percent compared to a placebo pill. The most meaningful question for follow-up studies is not the binary one of whether the drug’s effect is 50 percent or not (did the first study replicate?), but the continuous one of precisely how much the drug reduces heart attack risk. In larger subsequent studies, this number will almost inevitably drop below 50 percent, but if it remains above 0 percent for study after study, then the best message should be that the drug is in fact effective, not that the initial results “failed to replicate.”
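
A toy simulation makes the point (the baseline risk, true effect, and trial sizes below are our own illustrative assumptions, not figures from the article): if the drug's true effect is a 20 percent relative risk reduction, a small initial trial can easily show 50 percent, or even a harm, while larger follow-ups converge on the smaller but real effect, and it is that number, not a pass/fail replication verdict, that matters.

```python
# Toy trial simulation: the true relative risk reduction is 20%, but small
# trials scatter widely around it. All numbers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
base_risk = 0.10   # assumed heart-attack risk on placebo
true_rr = 0.80     # drug's TRUE relative risk (a 20% reduction)

def observed_reduction(n_per_arm):
    """Observed relative risk reduction in one simulated two-arm trial."""
    placebo_events = rng.binomial(n_per_arm, base_risk)
    drug_events = rng.binomial(n_per_arm, base_risk * true_rr)
    return 1 - drug_events / max(placebo_events, 1)

for n in (100, 1000, 10000):
    print(f"n = {n:>5} per arm: observed risk reduction {observed_reduction(n):+.0%}")
# Small trials bounce around (sometimes near 50%, sometimes negative);
# the large ones settle near the true 20% reduction.
```
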
7 comments:

Both there and in her blog post, Schnall said that her work had been “defamed,” endangering both her reputation and her ability to win grants.

I'm kinda surprised by how shameless some people got.

[anonymous]:

"It is difficult to get a man to understand something, when his salary depends upon his not understanding it."

[This comment is no longer endorsed by its author]
Shmi:

As mentioned occasionally on this forum, Feynman famously complained about it some 40 years ago in his Cargo Cult Science speech. Here is a relevant discussion on Stack Exchange.

For reference, both the comparison of a scientist whose paper wasn't replicated to Rosa Parks and the "replication police" remarks came from this guy: http://danielgilbert.com/ (Professor of Psychology, Harvard University).

By the sound of it, the replication effort did all the best-practice stuff like pre-registering trials and deciding in advance how the data is to be analysed (to rule out the possibility of p-hacking). This is a very, very good thing to do. Otherwise people can just keep looking for "better" ways to analyse the data, or keep finding "flaws" in the ways they've already tried, until they get a significant result.

Reading her blog post, it sounds like she approved the methods that were to be used, but after getting access to the data she decided that the analysis methods she'd signed off on weren't good enough, and wanted to change them after the fact to make the result line up better with hers.

Which is p-hacking in a nutshell.

"Authors were asked to review the replication proposal (and this was called “pre-data peer review”), but were not allowed to review the full manuscripts with findings and conclusions."

It seems like some people are trying to use standard SJW tactics in science. Portray the person on your side as a Victim™, portray the other side as Oppressors™/Bullies™/Privileged™, and go from there.

I wish more studies were run like these replications with pre-registration and review of methods before the first iota of data is collected. It improves the trustworthiness of the results massively and allows us to avoid a lot of problems with publication biases.

Portray the person on your side as a Victim™, portray the other side as Oppressors™/Bullies™/Privileged™, and go from there.

Not really privileged. In another case they suggested that the people who do replications are juniors who don't know how research is done and fail to get replications because of their lack of research skill.

A fuller quote:

The issue is the process, which resembles a witch hunt that is entirely in the hands of a bunch of self-righteous, self-appointed sherrifs, and that is clearly not designed to find truth. Simone Schnall is Rosa Parks — a powerless woman who has decided to risk everything to call out the bullies.

So, it's a "witch hunt", an attempt to prime the reader with the idea that it's part of a misogynistic attack by the powers that be. So we've got a partial on Victim, Oppressors and Bullies already.

By "sherrifs", note, casting them as authority figures with power rather than students, lower in the pecking order than the professor. A partial on Oppressors and Bullies.

it's "clearly not designed to find truth", implying it's just an attack for the sake of bullying.

Next, getting explicit: "Simone Schnall is Rosa Parks". A full-on attempt to leech off the image of a historically oppressed, poor, minority-figure Victim™ facing the Oppressors™/Bullies™/Privileged™.

Finishing by implying that, rather than just being a fairly senior academic whose work hasn't been replicated and who is behaving in a manner entirely consistent with pure self-interest, she's just "a powerless woman who has decided to risk everything to call out the bullies."

In another case they suggested that the people who do replications are juniors who don't know how research is done and fail to get replications because of their lack of research skill.

You assume that these two tactics are mutually exclusive. People love to try to cash in on both.

http://lesswrong.com/lw/9b/help_help_im_being_oppressed/

But one common thread in psychology is that the mind very frequently wants to have its cake and eat it too. Last week, we agreed that people like supporting the underdog, but we also agreed that there's a benefit to being on top; that when push comes to shove a lot of people are going to side with Zug instead of Urk. What would be really useful in winning converts would be to be a persecuted underdog who was also very powerful and certain to win out. But how would you do that?

Some Republicans have found a way. Whether they're in control of the government or not, the right-wing blogosphere invariably presents them as under siege, a rapidly dwindling holdout of Real American Values in a country utterly in the grip of liberalism.

But they don't say anything like "Everyone's liberal, things are hopeless, might as well stay home." They believe in a silent majority. Liberals control all sorts of nefarious institutions that are currently exercising a stranglehold on power and hiding the truth, but most Americans, once you pull the wool off their eyes, are conservatives at heart and just as angry about this whole thing as they are. Any day now, they're going to throw off the yoke of liberal tyranny and take back their own country.

This is a great system. Think about it. Not only should you support the Republicans for support-the-underdog and level-the-playing-field reasons, you should also support them for majoritarian reasons and because their side has the best chance of winning. It's the best possible world short of coming out and saying "Insofar as it makes you want to vote for us, we are in total control of the country, but insofar as that makes you not want to vote for us, we are a tiny persecuted minority who need your help".

They're trying to have their cake and eat it too.

"Insofar as it makes you want to support her, the replicators are junior, inept, inexperienced and have no authority, but insofar as that makes you not want to support her, she is a poor persecuted underdog who need your help and is being attacked by the powerful authority figures".

Other accounts of this affair that I've read made it sound like the "down with the replicators" people were fringe nuts who virtually nobody in the field took seriously. If the telling in this post is more accurate, I can only say... holy crap. The rot goes much deeper than I expected.