"Two medical researchers use the same treatment independently, in different hospitals. Neither would stoop to falsifying the data, but one had decided beforehand that because of finite resources he would stop after treating N=100 patients, however many cures were observed by then. The other had staked his reputation on the efficacy of the treatment, and decided he would not stop until he had data indicating a rate of cures definitely greater than 60%, however many patients that might require. But in fact, both stopped with exactly the same data: n = 100 [patients], r = 70 [cures]. Should we then draw different conclusions from their experiments?" (Presumably the two control groups also had equal results.)
It both annoys and amuses me greatly that neither in the original post nor in the ensuing discussion was there a hint of a suggestion to actually perform the experiment by generating and analyzing multiple runs of data over this rather small parameter space. How long can it take to write a simulation in the language of your choice and let it run for a bit?
No logic, not even Bayesian logic, is a substitute for experimental verification.
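For concreteness, here is a minimal sketch in Python of the kind of simulation being asked for. Reading "a rate of cures definitely greater than 60%" as a one-sided test at roughly the 95% level is an assumption made just for this sketch, as are all the parameter choices; the idea is simply to estimate, for a few true cure rates, how often each stopping rule ends with exactly n = 100, r = 70.

import random

def run_fixed_n(theta, n=100):
    """Researcher A: treat exactly n patients, report (patients, cures)."""
    cures = sum(random.random() < theta for _ in range(n))
    return n, cures

def run_until_confident(theta, threshold=0.6, z=1.645, min_n=20, max_n=500):
    """Researcher B: keep treating until the observed cure rate is
    'definitely' above threshold -- read here, as one possible
    operationalization, as a one-sided test at roughly the 95% level --
    or give up at max_n patients."""
    cures = 0
    for n in range(1, max_n + 1):
        cures += random.random() < theta
        if n >= min_n:
            se = (threshold * (1 - threshold) / n) ** 0.5
            if cures / n > threshold + z * se:
                return n, cures
    return max_n, cures

def frequency_of_outcome(runner, theta, target=(100, 70), trials=20_000):
    """Estimate how often a stopping rule ends with exactly `target`."""
    return sum(runner(theta) == target for _ in range(trials)) / trials

if __name__ == "__main__":
    random.seed(0)
    for theta in (0.65, 0.70, 0.75):
        fa = frequency_of_outcome(run_fixed_n, theta)
        fb = frequency_of_outcome(run_until_confident, theta)
        print(f"true rate {theta:.2f}: "
              f"fixed-n hits (100, 70) with freq {fa:.5f}, "
              f"optional stopping with freq {fb:.5f}")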
Well, in this case you shouldn't need the experiment if you can just calculate the results mathematically. It's like demanding that even though we have a proof of Fermat's Last Theorem we should still pick random large integers and check it. Doing maths by the scientific method is somewhat ridiculous.
"Well, in this case you shouldn't need the experiment if you can just calculate the results mathematically."
How do you know you did not forget some salient features of the real-life problem when constructing your proof? There is a good chance that modeling it would expose a previously missed angle of the problem.
"How do you know you did not forget some salient features of the real-life problem when constructing your proof?"
I fully agree that in the case of applying a mathematical model to the real world, it is worthwhile to test the predictions in case there is a false hidden assumption. However, what we are talking about here is applying a mathematical model to a computer simulation. Any false assumptions will have crept in at the step from real world to computer simulation, not at the step from computer simulation to mathematical analysis; the only assumptions made in the analysis are that your computer works and your code is not buggy.
"There is a good chance that modeling it would expose a previously missed angle of the problem."
This is false: the mathematical analysis is a complete solution of the problem as stated. Arguing that a salient feature may have been missed is like arguing the same for any mathematical proof, i.e. silly.
I never got this. Surely the two researchers' data, while ostensibly the same, is in fact drawn from two different distributions?

Let's make the example a bit more brutal. Two researchers are given, in turn, the same coin. The first one, by coincidence, gets 100 heads. The second one has staked his career on the coin being weighted and silently discards tails results, of which he gets plenty. The two report the same evidence - but surely, once we learn about the two scientists and their predilections, we would evaluate their evidence differently? I mean, the second scientist's results are most certainly not evidence for or against the coin being weighted, right?

Similarly, if a scientist runs 500 different trials and only reports those that are statistically significant and in favor of his point, we have a higher expectation of finding that his results support his point, independent of whether his point is actually valid, no? How is the one-retrofitted-trial version any different?
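For illustration, a tiny sketch of the likelihoods in this coin example (the 0.9 heads-weighting for the alternative is an arbitrary choice): the honest reporter's "100 heads in 100 flips" is astronomically more probable under a heads-weighted coin than under a fair one, while the selective reporter's "100 heads" has probability 1 whatever the coin is, so it is no evidence either way.

# Likelihood of the reported "100 heads" under each reporting procedure,
# for a fair coin vs. a hypothetical heads-weighted coin (p_heads = 0.9;
# the weighting is an arbitrary choice for illustration).

def honest_likelihood(p_heads, n_flips=100):
    # Flips the coin exactly n_flips times and reports every outcome;
    # probability that the report is all heads.
    return p_heads ** n_flips

def selective_likelihood(p_heads, n_heads=100):
    # Flips until n_heads heads have appeared, silently discarding tails;
    # the report "100 heads" is then guaranteed whatever the coin is.
    return 1.0

for p in (0.5, 0.9):
    print(f"p_heads={p}: honest {honest_likelihood(p):.3e}, "
          f"selective {selective_likelihood(p):.1f}")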
The difference is that in your example we got different sets of data, and simply discarded some of the data from one of them to make them look the same, whereas in the original we got the same set of data by the same method: everything that happened in the real world was the same; the only difference was in counterfactual scenarios.
Yeah but our perception is the same, no? Besides, in a sense, the original researcher also discards bits of data - he discards all possible stopping points that do not confirm his hypothesis, and all those after his hypothesis has been "confirmed".
He does not discard anything that actually happened.
This is the key difference. We are evaluating the effectiveness of the drug by looking at what the drug actually did, not what it could have done.
I can give a much more precise mathematical proof if you want.
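Presumably the proof on offer is the standard likelihood argument: both researchers' observed data have probability proportional to theta^70 * (1 - theta)^30, with the stopping rule contributing only a constant factor that does not involve theta, so any Bayesian posterior over the cure rate comes out identical. A minimal numerical sketch (a flat prior is assumed purely for illustration):

import numpy as np

# Grid of possible cure rates.
theta = np.linspace(0.001, 0.999, 999)

# Both researchers observed 70 cures in 100 patients.  Their likelihoods
# differ only by a constant factor that does not involve theta:
#   fixed-n rule:       C(100, 70) * theta**70 * (1 - theta)**30
#   optional stopping:  c          * theta**70 * (1 - theta)**30
# where c counts the outcome sequences compatible with that stopping rule.
# Any constant cancels on normalization, so the posteriors coincide.

def posterior(constant):
    unnorm = constant * theta**70 * (1 - theta)**30  # flat prior, for illustration
    return unnorm / unnorm.sum()

print(np.allclose(posterior(1.0), posterior(123456.0)))  # prints True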
Let's imagine a scientist did 500 tests. Then he started discarding tests, from the end, until the remaining data supported some hypothesis (or he ran out of tests). Is this to be treated as evidence of the same strength as it would be if he had precommitted to only doing that many tests?
I may be wrong here because I'm tired, but I think the way the maths comes out is that this would be just as strong if he only removed tests from the end, whereas if he removed them from anywhere he chose, depending on how they came out, it would not be as strong.
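In the spirit of the simulation suggestion upthread, here is a rough sketch one could run to probe this empirically: under a success rate at which the hypothesis is false, estimate how often each procedure ends up with a report that "supports" it. The operationalization of "supports the hypothesis" and every parameter here are arbitrary choices for illustration, and this frequency comparison does not by itself settle the question about evidential strength.

import random

def supports(successes, n, threshold=0.6, min_n=10):
    """An arbitrary stand-in for 'the data supports the hypothesis':
    at least min_n results with a success rate above threshold."""
    return n >= min_n and successes > threshold * n

def precommitted(theta, k=100):
    """Precommits to exactly k tests and reports them all."""
    return supports(sum(random.random() < theta for _ in range(k)), k)

def discard_from_end(theta, total=500):
    """Runs `total` tests, then drops results from the end until the
    remaining prefix supports the hypothesis (False if none does)."""
    results = [random.random() < theta for _ in range(total)]
    successes = sum(results)
    for n in range(total, 0, -1):
        if supports(successes, n):
            return True
        successes -= results[n - 1]
    return False

def discard_anywhere(theta, total=500, min_n=10):
    """Runs `total` tests, keeps every success and as few failures as
    possible -- the most favourable cherry-pick available."""
    successes = sum(random.random() < theta for _ in range(total))
    return supports(successes, max(successes, min_n))

if __name__ == "__main__":
    random.seed(2)
    theta, trials = 0.5, 20_000   # the hypothesis 'rate > 0.6' is false here
    for procedure in (precommitted, discard_from_end, discard_anywhere):
        freq = sum(procedure(theta) for _ in range(trials)) / trials
        print(f"{procedure.__name__}: reports support with freq {freq:.3f}")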
Today's post, Beautiful Probability, was originally published on 14 January 2008. A summary (taken from the LW wiki):
Discuss the post here (rather than in the comments to the original post).
This post is part of the Rerunning the Sequences series, where we'll be going through Eliezer Yudkowsky's old posts in order so that people who are interested can (re-)read and discuss them. The previous post was Is Reality Ugly?, and you can use the sequence_reruns tag or RSS feed to follow the rest of the series.
Sequence reruns are a community-driven effort. You can participate by re-reading the sequence post, discussing it here, posting the next day's sequence reruns post, or summarizing forthcoming articles on the wiki. Go here for more details, or to have meta discussions about the Rerunning the Sequences series.