Using the same method as in Study 1, we asked 20 University of Pennsylvania undergraduates to listen to either “When I’m Sixty-Four” by The Beatles or “Kalimba.” Then, in an ostensibly unrelated task, they indicated their birth date (mm/dd/yyyy) and their father’s age. We used father’s age to control for variation in baseline age across participants. An ANCOVA revealed the predicted effect: According to their birth dates, people were nearly a year-and-a-half younger after listening to “When I’m Sixty-Four” (adjusted M = 20.1 years) rather than to “Kalimba” (adjusted M = 21.5 years), F(1, 17) = 4.92, p = .040.
This is by far the most awesome thing I've read in a while.
I'm sorry if I state the obvious, but you do realise that the paper is about the fact that this result does not hold, and is a result of the misuse of statistics?
No, I thought listening to songs could actually change your chronological age. (Or is that comment supposed to be some kind of joke that's too subtle for me to get?)
Actually, I didn't get your 'awesome'. Internet irony, etc. In the outside-LW world, I bet there would be plenty of people who'd actually believe the claim, so I thought some of that may have gone into this. Should have checked your other posts.
Great post, upvoted. (And the linked article is blowing my mind.) Just one nitpick:
writing by Bem advising young psychologists to take experiments that failed to show predicted effects and massage/torture them until some statistically significant effect could be produced
That's a somewhat harsher interpretation than is found in the original article.
Yes, they did not use such strong language, and the article was obviously intended to help advance the careers of young researchers in benevolent fashion, even if it was promoting a pernicious practice. I have edited that line.
The success is said to be by a researcher who has previously studied the effect of "geomagnetic pulsations" on ESP, but I could not locate it online.
Can we have a prejudicial summary of the previous studies of the 6 researchers who failed to replicate the effect too?
Follow-up to: Follow-up on ESP study: "We don't publish replications", Feed the Spinoff Heuristic!
Related to: Parapsychology: the control group for science, Dealing with the high quantity of scientific error in medicine
That's from "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant," which runs simulations of a version of Shalizi's "neutral model of inquiry," with random (null) experimental results, augmented with a handful of choices in the setup and analysis of an experiment. Even before accounting for publication bias, these few choices produced a desired result "significant at the 5% level" 60.7% of the time, and at the 1% level 21.5% of the time.
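To make the mechanism concrete, here is a minimal simulation sketch (my own illustration in Python, not the authors' actual code) of just two such choices applied to pure-null data: reporting whichever of two correlated dependent variables comes out significant, and adding more subjects when the first look at the data fails to reach significance.

```python
# Sketch: researcher degrees of freedom inflating the false-positive rate
# under a true null. Two of the paper's choices are modeled: picking the
# "better" of two correlated DVs, and optional stopping (adding subjects).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_experiment(n=20, extra=10, rho=0.5):
    """Run one null experiment: no real difference between conditions."""
    cov = [[1.0, rho], [rho, 1.0]]  # two DVs correlated at rho
    a = rng.multivariate_normal([0, 0], cov, n + extra)  # condition A
    b = rng.multivariate_normal([0, 0], cov, n + extra)  # condition B

    def any_significant(m):
        # Choice 1: report whichever of the two DVs "works".
        return any(stats.ttest_ind(a[:m, k], b[:m, k]).pvalue < 0.05
                   for k in (0, 1))

    # Choice 2: if nothing is significant at n subjects, add `extra` more.
    return any_significant(n) or any_significant(n + extra)

sims = 10_000
rate = sum(one_experiment() for _ in range(sims)) / sims
print(f"false-positive rate at nominal alpha = .05: {rate:.1%}")  # well above 5%
```

Stacking on the remaining choices the paper models (a covariate and its interaction, and extra conditions that can be dropped) is what pushes the combined rate toward the reported 60.7%.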
I found it via another paper claiming time-defying effects, during a search through all of the papers on Google Scholar citing Daryl Bem's precognition paper, which I discussed in a past post about the problems of publication bias and selection over the course of a study. In response to Bem, Richard Wiseman established a registry for replication attempts, so that the methods and tests of the registered studies could be set prior to seeing the data (in addition to avoiding the file drawer).
Now a number of purported replications have been completed, with several available as preprints online, including a large "straight replication" carefully following the methods in Bem's paper, with some interesting findings discussed below. The picture does not look good for psi, and is a good reminder of the sheer cumulative power of applying a biased filter to many small choices.
Background
When Bem's article was published, the skeptic James Alcock argued that Bem's experiments involved midstream changes of methods, choices in the transformation of data (raw data were not available), and other signs of modifying the experiment and analysis in response to the data. Wagenmakers et al. drew attention to writing by Bem advising young psychologists to take experiments that failed to show predicted effects and relentlessly explore the data in hopes of generating an attractive and significant effect. In my post, I emphasized the importance of "straight replications," with methodology, analytical tests, and intent to publish established in advance, as in Richard Wiseman's registry of studies.
An article by Gregory Francis applies a standard test for publication bias to Bem's article: comparing the number of findings reaching significance to the number predicted by each study's power to detect the claimed effect. 9 of the 10 experiments described in Bem's article¹ found statistically significant positive effects using Bem's measures and tests, despite the small size of those effects (and hence the modest power of each experiment to detect them). Francis calculates only a 5.8% probability that so many would reach significance (given the estimated powers and effect sizes).
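For intuition about that calculation, here is a toy version of the logic (my own sketch, with placeholder d and n values rather than Bem's actual numbers): estimate an experiment's power to detect its own reported effect, then ask how likely nine or more successes in ten independent experiments would be.

```python
# Toy version of an excess-significance test: given modest per-experiment
# power, 9+ significant results out of 10 is itself an improbable outcome.
import numpy as np
from scipy import stats

def posthoc_power(d, n, alpha=0.05):
    """One-tailed power of a one-sample t-test for effect size d, n subjects."""
    df = n - 1
    crit = stats.t.ppf(1 - alpha, df)       # critical t value
    nc = d * np.sqrt(n)                     # noncentrality parameter
    return 1 - stats.nct.cdf(crit, df, nc)  # P(t statistic exceeds crit)

power = posthoc_power(d=0.2, n=100)         # placeholder values: ~0.63
p_excess = stats.binom.sf(8, 10, power)     # P(at least 9 of 10 significant)
print(f"per-experiment power {power:.2f}; "
      f"P(>=9 of 10 significant) = {p_excess:.3f}")  # ~.07 in this toy setup
```

Even with per-experiment power around .6, nine-plus successes out of ten comes out near 7% in this toy setup, the same ballpark as Francis's 5.8%.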
Other complaints included declining effect size with increasing sample size (driven mostly by one larger experiment), the use of one-tailed tests (Bem justified this as following an early hypothesis, but claims of "psi-missing" due to boredom or repelling stimuli are found in the literature and could have been mustered), and the failure to directly replicate any single experiment or to concentrate subjects in fewer, more powerful experiments.
Subsequent replications
At the time of my first post, I was able to find several replication attempts already online. Richard Wiseman and his coauthors had not found psi, and were refused consideration for publication at the journal which had hosted the original article. Galak and Nelson had tried and failed to replicate experiment 8. A pro-psi researcher had pulled a different 2006 experiment from the file drawer and retitled it as a purported "replication" of the 2011 paper. Samuel Moulton, who previously worked with Bem, writes that he tried to replicate Bem's results with 200 subjects and found no effect (not merely a nonsignificant effect, but one significantly lower than Bem's), but that Bem would not mention this in the 2011 publication. Bem confirms this in a video of a Harvard debate.
Since then, there have been more replications. This New Scientist article claims to have found seven replications of Bem, with six failures and one success. The success is said to be by a researcher who has previously studied the effect of "geomagnetic pulsations" on ESP, but I could not locate it online.
Snodgrass (2011) failed to replicate Bem using a version of the Galak and Nelson experiment. Wagenmakers et al. posted their methods in advance, but have not yet posted their results, although news media have reported that they also obtained a negative result. Wiseman and his coauthors posted their abstract online, and claim to have performed a close replication of one of Bem's experiments with three times as many subjects, finding no effect (despite 99%+ power to detect Bem's claimed effect). Another paper, "Correcting the Past: Failures to Replicate Psi," by Galak, LeBoeuf, Nelson, and Simmons, combines 6 experiments by the researchers (who are at four separate universities) with 820 subjects and finds no effect in a very straight replication. More on it in a moment.
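A back-of-the-envelope check on that power figure (taking, for illustration, an effect of d = 0.25 at n = 100, roughly the scale Bem reported for his first experiment) shows how hard it is for a tripled sample to miss a real effect of that size:

```python
# Normal-approximation power for a one-tailed test at alpha = .05.
import numpy as np
from scipy import stats

def approx_power(d, n, alpha=0.05):
    z_crit = stats.norm.ppf(1 - alpha)            # one-tailed critical z
    return stats.norm.sf(z_crit - d * np.sqrt(n))

print(f"n = 100: power {approx_power(0.25, 100):.2f}")   # ~0.80
print(f"n = 300: power {approx_power(0.25, 300):.3f}")   # ~0.996
```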
I also found the abstracts of the 2011 Towards a Science of Consciousness conference. On page 166, Whitmarsh and Bierman claim to have conducted a replication of a Bem experiment involving meditators, but do not give their results, although it appears they may have looked for effects of meditation on the results. On page 176, there is an abstract from Franklin and Schooler claiming success in a new and different precognition experiment, as well as in predicting the outcome of a roulette wheel (n=204, hit rate 57%, p<.05). In the New Scientist article they claim to have replicated their experiment (with much reduced effect size, and only marginally significant at the 0.05 level), although past efforts to use psi in casino games have not been repeatable (nor have the experimenters become mysteriously wealthy, or easily able to fund their research, apparently). The move to a new and ill-described format prevents this from serving as a straight replication (in Shalizi's neutral model of inquiry, using only publication bias, it is the move to new effects that lets a field sustain itself in the absence of a subject matter); the study was not registered, and the paper itself is not available, so I will leave it be until publication.
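As a quick consistency check on the roulette numbers (assuming a binary prediction with a 50% chance rate, which the abstract does not spell out), a 57% hit rate over 204 trials does come out just under the .05 threshold:

```python
# One-tailed binomial check of the reported roulette hit rate against chance.
from scipy import stats

n, hit_rate = 204, 0.57
hits = round(n * hit_rate)            # about 116 of 204
p = stats.binom.sf(hits - 1, n, 0.5)  # P(X >= hits) under a 50% chance rate
print(f"{hits}/{n} hits, one-tailed p = {p:.3f}")  # ~.03
```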
Correcting the Past: Failures to Replicate Psi
Throughout this paper the researchers try to specify their procedures unambiguously, and to align them with Bem's as closely as they can, for instance in transforming the data² so as to avoid the sort of cherry-picking they describe:
This prevents them from choosing the more favorable (or less favorable) of several transformations, as they seem to suggest Bem did in the next quote, bumping a result to significance in the original paper. This is a recurrent problem across many fields, and a reason to seek out raw data whenever possible, or datasets collected by neutral parties (on your question of interest):
They mention others which they did not have data to test:
Other elements providing degrees of freedom were left out of the Bem paper. A published paper can only provide so much confidence that it actually describes the experiment as it happened (or didn't!):
The experiments, with several times the collective sample size of the Bem experiments (8 and 9) they replicate, look like chance:
Perhaps the reported positive replication will hold up to scrutiny (with respect to sample size, power, closeness of replication, data mining, etc), or some other straight replication will come out convincingly positive (in light of the aggregate evidence). I doubt it.
Psi and science
Beating up on parapsychology may be cheap and easy in the scientific, skeptical, and Less Wrong communities, since psi is a low-status outgroup belief. But the abuse of many degrees of freedom, and the shortage of close replication, are widespread in science, and particularly in psychology. The heuristics and biases literature, studies of cognitive enhancement, social psychology, and other areas often drawn on at Less Wrong are not so different. This suggests a candidate hack to fight confirmation bias in assessing the evidentiary value of experiments that confirm one's views: ask yourself how much evidentiary weight (in log odds) you would place on the same methods and results if they showed a novel psi effect.³
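A minimal sketch of that bookkeeping (the likelihood ratio here is an arbitrary example): treat a study as a Bayes factor, add its log to your prior log odds, and notice that the same study buys the same number of log-odds units whether the hypothesis is congenial or a psi claim.

```python
# Log-odds bookkeeping: the same evidence shifts every prior by log(BF).
import math

def update(prior_prob, bayes_factor):
    """Posterior probability after adding log(BF) to the prior log odds."""
    prior_log_odds = math.log(prior_prob / (1 - prior_prob))
    post_log_odds = prior_log_odds + math.log(bayes_factor)
    return 1 / (1 + math.exp(-post_log_odds))

# A 3:1 likelihood ratio barely dents strong skepticism about psi; the
# consistency test is whether you'd let it do more for a congenial claim.
for prior in (1e-6, 0.5):
    print(f"prior {prior:g} -> posterior {update(prior, 3.0):.6f}")
```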
Notes
¹ In addition to the nine numbered experiments, a footnote in Bem (2011) refers to a small early tenth study which did not find an effect.
² One of the bigger differences is that some of the experiments were online rather than in the lab, but this didn't seem to matter much. They also switched from blind human coding of misspelled words to computerized coding.
³ This heuristic has not been tested, beyond the general (psychology!) results suggesting that arguing for a position opposite your own can help you see otherwise selectively missed considerations.
ETA: This blog post also discusses the signs of optional stopping, multiple hypothesis testing, use of one-tailed tests where a negative result could also have been reported as due to psi, etc.
ETA2: A post at the Bare Normality blog tracks down an earlier presentation, back in 2003, of some of the experiments going into Bem (2011), and notes that the data seem to be selectively ported to the 2011 paper and described quite differently; it also discusses other signs of unreported experiments. The post further expresses concern about reconciling these data with Bem's explicit denial of optional stopping, selective reporting, and similar practices.
ETA3: Bem's paper cites an experiment by Savva as evidence for precognition (by arachnophobes), but leaves out the fact that Savva's follow-up experiments failed to replicate the effect. Links and references are provided in a post at the James Randi forums. Savva also says that Bem had "extracted" several supposedly significant precognition correlations from Savva's data, and that upon checking Savva found they were generated by calculation errors. Bem is also said to have claimed that Savva's first result had passed the 0.05 significance test, when it was actually just short of doing so (0.051, not a substantial difference, and perhaps defensible, but another sign of bias).