Follow-up on ESP study: "We don't publish replications"
Related to: Parapsychology: the control group for science, Dealing with the high quantity of scientific error in medicine
Some of you may remember past Less Wrong discussion of the Daryl Bem study, which claimed to show precognition, and was published with much controversy in a top psychology journal, JPSP. The editors and reviewers explained their decision by saying that the paper was clearly written and used standard experimental and statistical methods so that their disbelief in it (driven by physics, the failure to show psi in the past, etc) was not appropriate grounds for rejection.
Because of all the attention received by the paper (unlike similar claims published in parapsychology journals) it elicited a fair amount of both critical review and attempted replication. Critics pointed out that the hypotheses were selected and switched around 'on the fly' during Bem's experiments, with the effect sizes declining with sample size (a strong signal of data mining). More importantly, Richard Wiseman established a registry for advance announcement of new Bem replication attempts.
A replication registry guards against publication bias, and at least 5 attempts were registered. As far as I can tell, at the time of this post the subsequent replications have, unsurprisingly, failed to replicate Bem's results.[1] However, JPSP and the other high-end psychology journals refused to publish the results, citing standing policies of not publishing straight replications.
From the journals' point of view, this (common) policy makes sense: bold new claims will tend to be cited more and raise journal status (which depends on citations per article), even though this means most of the 'discoveries' they publish will be false despite their p-values. However, this means that overall the journals are giving career incentives for scientists to massage and mine their data for bogus results, but not to challenge bogus results by others. Alas.
[1] A purported "successful replication" by a pro-psi researcher in Vienna turns out to be nothing of the kind. Rather, it is a study conducted in 2006 and retitled to take advantage of the attention on Bem's article, selectively pulled from the file drawer.
ETA: The Wikipedia article on Daryl Bem makes an unsourced claim that one of the registered studies has replicated Bem.
ETA2: Samuel Moulton, who formerly worked with Bem, mentions an unpublished (no further details) failed replication of Bem's results conducted before Bem submitted his article (the failed replication was not mentioned in the article).
ETA3: There is mention of a variety of attempted replications at this blog post, with 6 failed replications, and 1 successful replication from a pro-psi researcher (not available online). It is based on this ($) New Scientist article.
ETA4: This large study performs an almost straight replication of Bem (same methods, same statistical tests, etc) and finds the effect vanishes.
ETA5: Apparently, the mentioned replication was again submitted to the British Journal of Psychology:
When we submitted it to the British Journal of Psychology, it was finally sent for peer review. One referee was very positive about it but the second had reservations and the editor rejected the paper. We were pretty sure that the second referee was, in fact, none other than Daryl Bem himself, a suspicion that the good professor kindly confirmed for us. It struck us that he might possibly have a conflict of interest with respect to our submission. Furthermore, we did not agree with the criticisms and suggested that a third referee be brought in to adjudicate. The editor rejected our appeal.
Simpson's Paradox
This is my first attempt at an elementary statistics post, which I hope is suitable for Less Wrong. I am going to present a discussion of a statistical phenomenon known as Simpson's Paradox. This isn't a paradox, and it wasn't actually discovered by Simpson, but that's the name everybody uses for it, so it's the name I'm going to stick with. Along the way, we'll get some very basic practice at calculating conditional probabilities.
A worked example
The example I've chosen is an exercise from a university statistics course that I have taught on for the past few years. It is by far the most interesting exercise in the entire course, and it goes as follows:
You are a doctor in charge of a large hospital, and you have to decide which treatment should be used for a particular disease. You have the following data from last month: there were 390 patients with the disease. Treatment A was given to 160 patients of whom 100 were men and 60 were women; 20 of the men and 40 of the women recovered. Treatment B was given to 230 patients of whom 210 were men and 20 were women; 50 of the men and 15 of the women recovered. Which treatment would you recommend we use for people with the disease in future?
The simplest way to represent this sort of data is to draw a table; we can then pick the relevant numbers out of the table to calculate the required conditional probabilities.
Overall
|       | A   | B   |
|-------|-----|-----|
| lived | 60  | 65  |
| died  | 100 | 165 |
The probability that a randomly chosen person survived if they were given treatment A is 60/160 = 0.375
The probability that a randomly chosen person survived if they were given treatment B is 65/230 = 0.283
So a randomly chosen person given treatment A was more likely to survive than a randomly chosen person given treatment B. Looks like we'd better give people treatment A.
However, since we're given a breakdown of the data by gender, let's look and see whether treatment A is better for both genders, or whether it gets all of its advantage from one or the other.
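The by-gender comparison can be made explicit with a short Python sketch. All the figures below are exactly those given in the exercise; only the helper function is mine:

```python
# Recovery counts from the exercise: (recovered, treated) per group.
data = {
    ("A", "men"):   (20, 100),
    ("A", "women"): (40, 60),
    ("B", "men"):   (50, 210),
    ("B", "women"): (15, 20),
}

def recovery_rate(treatment, gender=None):
    """P(survived | treatment), or P(survived | treatment, gender)."""
    pairs = [v for (t, g), v in data.items()
             if t == treatment and (gender is None or g == gender)]
    recovered = sum(r for r, _ in pairs)
    treated = sum(n for _, n in pairs)
    return recovered / treated

print(recovery_rate("A"))           # 60/160 = 0.375
print(recovery_rate("B"))           # 65/230 ≈ 0.283
print(recovery_rate("A", "men"))    # 20/100 = 0.2
print(recovery_rate("B", "men"))    # 50/210 ≈ 0.238
print(recovery_rate("A", "women"))  # 40/60  ≈ 0.667
print(recovery_rate("B", "women"))  # 15/20  = 0.75
```

Treatment A wins overall, yet treatment B wins within each gender separately. That reversal when the data are aggregated is exactly Simpson's paradox.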
Human performance, psychometry, and baseball statistics
I. Performance levels and age
Human ambition for achievement, in modest measure, gives meaning to our lives, unless one is an existentialist pessimist like Schopenhauer, who taught that life with all its suffering and cruelty simply should not be. Psychologists study our achievements under a number of different descriptions--testing for IQ, motivation, creativity, and others. As part of my current career transition, I have been examining my own goals closely, and have recently read a fair amount on these topics, which vary in the quality of their evidence.
A useful collection of numerical data on the subject of human performance is the collection of Major League Baseball player performance statistics--the batting averages, number of home runs, runs batted in, slugging percentages--of the many thousands of participants in the hundred years since detailed statistical records have been kept and studied by the players, journalists, and fans of the sport. The advantage of examining issues like these from the angle of Major League Baseball player performance statistics is the enormous sample size of accurately measured and archived data.
The current senior authority in this field is Bill James, who now works for the Boston Red Sox; for the first twenty-five years of his activity as a baseball statistician James was not employed by any of the teams. It took him a long time to find a hearing for his views on the inside of the industry, although the fans started buying his books as soon as he began writing them.
In one of the early editions of his Baseball Abstract, James discussed the biggest fallacies that managers and executives held regarding the achievements of baseball players. He was adamant about the most obvious misunderstood fact of player performance: it is sharply peaked at age 27 and decreases rapidly, so rapidly that only the very best players were still useful at the age of 35. He was able to observe only one executive who seemed to intuit this--a man whose sole management strategy was to trade everybody over the age of 30 for the best available player under the age of 30 he could acquire.
Frugality and working from finite data
The scientific method is wonderfully simple, intuitive, and above all effective. Based on the available evidence, you formulate several hypotheses and assign prior probabilities to each one. Then, you devise an experiment which will produce new evidence to distinguish between the hypotheses. Finally, you perform the experiment, and adjust your probabilities accordingly.
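The update step described above can be sketched in a few lines of Python. The hypotheses, prior probabilities, and likelihoods here are made up purely for illustration:

```python
# Two made-up hypotheses about a coin, with prior probabilities.
priors = {"fair": 0.8, "biased": 0.2}
# Likelihood of observing heads under each hypothesis.
likelihood_heads = {"fair": 0.5, "biased": 0.9}

def update(priors, likelihoods):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    unnorm = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnorm.values())
    return {h: p / total for h, p in unnorm.items()}

# Experiment: we flip the coin and it lands heads.
posteriors = update(priors, likelihood_heads)
print(posteriors)  # fair drops from 0.8 to about 0.69
```

Observing heads shifts probability toward the hypothesis that predicted heads more strongly; repeating the experiment just repeats the update.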
So far, so good. But what do you do when you cannot perform any new experiments?
This may seem like a strange question, one that leans dangerously close to unprovable philosophical statements that don't have any real-world consequences. But it is in fact a serious problem facing the field of cosmology. We must learn that when there is no new evidence that will cause you to change your beliefs (or even when there is), the best thing to do is to rationally re-examine the evidence you already have.
Bayesian Collaborative Filtering
I present an algorithm I designed to predict which position a person would report for an issue on TakeOnIt, through Bayesian updates on the evidence of other people's positions on that issue. Additionally, I will point out some potential areas of improvement, in the hopes of inspiring others here to expand on this method.
For those not familiar with TakeOnIt, the basic idea is that there are issues, represented by yes/no questions, on which people can take the positions Agree (A), Mostly Agree (MA), Neutral (N), Mostly Disagree (MD), or Disagree (D). (There are two types of people tracked by TakeOnIt: users who register their own opinions, and Experts/Influencers whose opinions are derived from public quotations.)
The goal is to predict what position a person S would take on an issue, based on the positions registered by other people on that issue. To do this, we will use Bayes' Theorem to update the probability that person S takes position X on issue I, given that person T has taken position Y on issue I:

P(S = X | T = Y) = P(T = Y | S = X) · P(S = X) / P(T = Y)
Really, we will be updating on several people Tj taking positions Yj on I. Treating those positions as conditionally independent given S's position, the updates compose into a product:

P(S = X | T1 = Y1, …, Tn = Yn) ∝ P(S = X) · ∏j P(Tj = Yj | S = X)
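A minimal sketch of this kind of update in Python, assuming the observers' positions are conditionally independent given S's position. The likelihood table below is entirely made up; a real system would estimate it from TakeOnIt's co-occurring positions:

```python
POSITIONS = ["A", "MA", "N", "MD", "D"]

# Made-up table: likelihood[x][y] = P(an observer reports y | S would report x).
likelihood = {
    "A":  {"A": 0.50, "MA": 0.30, "N": 0.10, "MD": 0.07, "D": 0.03},
    "MA": {"A": 0.25, "MA": 0.40, "N": 0.20, "MD": 0.10, "D": 0.05},
    "N":  {"A": 0.10, "MA": 0.20, "N": 0.40, "MD": 0.20, "D": 0.10},
    "MD": {"A": 0.05, "MA": 0.10, "N": 0.20, "MD": 0.40, "D": 0.25},
    "D":  {"A": 0.03, "MA": 0.07, "N": 0.10, "MD": 0.30, "D": 0.50},
}

def predict(prior, observed):
    """P(S takes each position | each observer's position), one Bayes
    update per observer, renormalizing as we go."""
    post = dict(prior)
    for y in observed:
        post = {x: post[x] * likelihood[x][y] for x in POSITIONS}
        total = sum(post.values())
        post = {x: p / total for x, p in post.items()}
    return post

uniform = {x: 1 / len(POSITIONS) for x in POSITIONS}
post = predict(uniform, observed=["A", "MA", "A"])
print(max(post, key=post.get))  # 'A'
```

Because the updates just multiply likelihoods, the order in which observers are processed does not matter.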
Case study: abuse of frequentist statistics
Recently, a colleague was reviewing an article whose key justification rested on some statistics that seemed dodgy to him, so he came to me for advice. (I guess my boss, the resident statistician, was out of his office.) Now, I'm no expert in frequentist statistics. My formal schooling in frequentist statistics comes from my undergraduate chemical engineering curriculum -- I wouldn't rely on it for consulting. But I've been working for someone who is essentially a frequentist for a year and a half, so I've had some hands-on experience. My boss hired me on the strength of my experience with Bayesian statistics, which I taught myself in grad school, and one thing reading the Bayesian literature voraciously will equip you for is critiquing frequentist statistics. So I felt competent enough to take a look.1
Practical rationality in surveys
"Statistically significant results" mean that there's a 5% chance that results are wrong in addition to chance that the wrong thing was measures, chance that sample was biased, chance that measurement instruments were biased, chance that mistakes were made during analysis, chance that publication bias skewed results, chance that results were entirely made up and so on.
"Not statistically significant results" mean all those, except chance of randomly mistaken results even if everything was ran correct is not 5%, but something else, unknown, and dependent of strength of the effect measured (if the effect is weak, you can have study where chance of false negative is over 99%).
So results being statistically significant or not, is really not that useful.
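To see how large the false-negative rate can get, here is a rough power calculation for comparing two group means, using a normal approximation. The effect size and sample size are made-up inputs, not from any particular study:

```python
from statistics import NormalDist

def false_negative_rate(effect_sd, n_per_group, alpha=0.05):
    """Approximate P(non-significant result | the effect is real) for a
    two-sided z-test on two group means; effect is in standard deviations."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)     # about 1.96 for alpha=0.05
    shift = effect_sd * (n_per_group / 2) ** 0.5     # mean of the z-statistic
    power = (1 - NormalDist(mu=shift).cdf(z_crit)) + NormalDist(mu=shift).cdf(-z_crit)
    return 1 - power

print(round(false_negative_rate(0.1, 20), 2))   # weak effect, small n: ~0.94
print(round(false_negative_rate(0.8, 100), 2))  # strong effect, large n: ~0.0
```

With a weak effect and a small sample, a real effect is missed the vast majority of the time, which is why "not significant" on its own tells you very little.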
For example, here's a survey of civic knowledge. Plus or minus 3% measurement error? Not this time, they just completely made up the results.
Take-home exercise: what do you estimate the Bayesian chance of published results being wrong to be?
Joint Distributions and the Slow Spread of Good Ideas
A few years ago a well-known economist named David Romer published a paper in a top economics journal* arguing that professional football teams don't "go for it" nearly often enough on fourth down. The question, of course, is how this can persist in equilibrium. If Romer is correct, wouldn't teams have a strong incentive to change their strategies? Of course it's possible that he is correct, but that no one ever knew it before the paper was published. But then would the fact that the recommendation has not been widely adopted** constitute strong evidence that he is not correct? The paper points out two possible reasons why not. First, the objective function of the decision-makers may not be to maximize the probability of winning the game. Second and more relevant for our purposes, there may be some biases at work. The key point is this quote from the article (page 362):
"Many skills are more important to running a football team than a command of mathematical and statistical tools. And it would hardly be obvious to someone without knowledge of those tools that they could have any significant value in football."
Causality does not imply correlation
It is a commonplace that correlation does not imply causality, however eyebrow-wagglingly suggestive it may be of causal hypotheses. It is less commonly noted that causality does not imply correlation either. It is quite possible for two variables to have zero correlation, and yet for one of them to be completely determined by the other.
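The standard example is Y = X² with X distributed symmetrically about zero: Y is a deterministic function of X, yet their covariance (and hence Pearson correlation) is exactly zero. A quick check in Python:

```python
# X symmetric around 0; Y = X**2 is completely determined by X.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]

def covariance(u, v):
    """Population covariance of two equal-length samples."""
    mu = sum(u) / len(u)
    mv = sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

print(covariance(xs, ys))  # 0.0: zero covariance despite full dependence
```

The positive and negative halves of X cancel exactly, so linear correlation sees nothing even though knowing X pins down Y completely.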