"This doesn't mean that probability theory has ceased to apply, any more than your inability to calculate the aerodynamics of a 747 on an atom-by-atom basis implies that the 747 is made out of atoms" should read "... is not made out of atoms."
Eliezer,
I like your essays, but I feel that you are really beating a naive and unsophisticated frequentist straw man (straw person, politically correctly speaking). I think that the answer to the question "Should we draw different conclusions?" depends on some further assumptions about the process and about the type of conclusions we want to make. What kind of frequentist would think that his research is free of subjective assumptions? A naive one.
I admit that I am out of my depth and I would like to know more about Jaynes's example.
To answer your story about data:
One person decides on a conclusion and then tries to write the most persuasive argument for that conclusion.
Another person begins to write an argument by considering evidence, analyzing it, and then comes to a conclusion based on the analysis.
Both of those people type up their arguments and put them in your mailbox. As it happens, both arguments happen to be identical.
Are you telling me the first person's argument carries the exact same weight as the second?
In other words, yes, the researcher's private thoughts do matter, because P(observation|researcher 1) != P(observation|researcher 2) even though the observations are the same.
95% confidence means that if you repeat the experiment you get the right answer 95% of the time.
That depends on your thoughts because what counts as a success comes up in the repeats.
The experiment itself does not tell you what would have counted as a success. It simply is. No confidence concept applies.
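To make the point about repeats concrete, here is a minimal Python sketch. It assumes a null hypothesis of a 60% cure rate and, as a stand-in for the second researcher's rule (which the thread does not pin down precisely), a hypothetical rule of "stop at the 30th non-cure". The function names are made up for illustration. The same data, n = 100 and r = 70, then gets two different one-sided p-values, because the set of "results at least as extreme" depends on which experiment would have been repeated.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def p_value_fixed_n(r, n, p0):
    """Design A (fixed n): chance of r or more cures in n patients under p0."""
    return sum(binom_pmf(k, n, p0) for k in range(r, n + 1))

def p_value_stop_at_failures(n, f, p0):
    """Hypothetical design B (stop at the f-th non-cure): chance that n or more
    patients are needed, i.e. at most f - 1 non-cures among the first n - 1."""
    return sum(binom_pmf(k, n - 1, 1 - p0) for k in range(f))

p_a = p_value_fixed_n(70, 100, 0.6)
p_b = p_value_stop_at_failures(100, 30, 0.6)
print(f"fixed n = 100:          p = {p_a:.4f}")
print(f"stop at 30th non-cure:  p = {p_b:.4f}")
```

The two numbers differ by exactly 0.6 times the chance of 30 non-cures among the first 99 patients, which is the one outcome that counts as "as extreme" under the fixed-n design but not under the stand-in stopping design.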
Emil, thanks, fixed.
Doug, your analogy is not valid because a biased reporting method has a different likelihood function to the possible prior states, compared to an unbiased one. In this case the single, fixed dataset that we see, has a different likelihood to the possible prior states, depending on the reporting method.
If a researcher who happens to be thinking biased thoughts carries out a fixed sequence of experimental actions, the resulting dataset we see does not have a different likelihood function to the possible prior states. All that a Bayesian needs to know is the experimental actions that were actually carried out and the data that was actually observed - not what the researcher was thinking at the time, or what other actions the researcher might have performed if things had gone differently, or what other dataset might then have been observed. We need only consider the actual experimental results.
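A rough Python sketch of that claim, under assumptions that are not in the post: the second researcher's rule is replaced by a simple hypothetical one ("stop at the 30th non-cure"), and the two hypotheses compared are cure rates of 0.7 and 0.6. The likelihoods under the two designs differ only by a constant factor that does not involve the cure rate, so the likelihood ratio between any two hypotheses comes out the same.

```python
from math import comb, isclose

N, R = 100, 70  # observed data from the thread: 100 patients, 70 cures

def lik_fixed_n(p):
    """Fixed-n design: binomial probability of R cures in N patients."""
    return comb(N, R) * p**R * (1 - p)**(N - R)

def lik_stop_at_failures(p):
    """Hypothetical stopping design: the 30th non-cure arrives on patient N,
    so the first N - 1 patients contain exactly R cures."""
    return comb(N - 1, R) * p**R * (1 - p)**(N - R)

ratio_a = lik_fixed_n(0.7) / lik_fixed_n(0.6)
ratio_b = lik_stop_at_failures(0.7) / lik_stop_at_failures(0.6)
print(ratio_a, ratio_b)
print("same likelihood ratio:", isclose(ratio_a, ratio_b, rel_tol=1e-9))
```

Only the combinatorial prefactor changes between the designs, and it cancels in the ratio. That cancellation is what "the same likelihood function" amounts to for a Bayesian update.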
Londenio, see Ron's comment - it's not a strawperson.
Something popped into my mind while I was reading about the example in the very beginning. What about research that goes out to prove one thing, but discovers something else?
Group of scientists want to see if there's a link between the consumption of Coca-Cola and stomach cancer. They put together a huge questionnaire full of dozens of questions and have 1000 people fill it out. Looking at the data they discover that there is no correlation between Coca-Cola drinking and stomach cancer, but there is a correlation between excessive sneezing and having large...
That's a useful heuristic to combat our tendency to see patterns that aren't there. It's not strictly necessary.
Another way to solve the same problem is to look at the first 500 questionnaires first. The scientists then notice that there is a correlation between excessive sneezing and large ears. Now the scientists look at the last 500 questionnaires -- an independent experiment. If these questionnaires also show correlation, that is also evidence for the hypothesis, although it's necessarily weaker than if another 1000-person poll were conducted.
So this shows that a second experiment isn't necessary if we think ahead. Now the question is, if we've already foolishly looked at all 1000 results, is there any way to recover?
It turns out that what can save us is math. There's a bunch of standard tests for significance when lots of variables are compared. But the basic idea is the following: we can test if the correlation between sneezing and ears is high, by computing our prior for what sort of correlation the two most closely correlated variables would show.
Note that although our prior for two arbitrary variables might be centered at 0 correlation, our prior for two variables that ar...
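The idea of a prior for the two most closely correlated variables can be illustrated with a small simulation. This is only a sketch under assumed conditions, not one of the standard significance tests mentioned above: 1000 respondents, 20 questions, and every answer drawn as independent Gaussian noise, so that any correlation found is spurious. All names are illustrative.

```python
import random
from itertools import combinations

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def max_abs_correlation(n_people, n_questions, rng):
    """Largest |r| over all question pairs when every answer is pure noise."""
    answers = [[rng.gauss(0, 1) for _ in range(n_people)]
               for _ in range(n_questions)]
    return max(abs(pearson(a, b)) for a, b in combinations(answers, 2))

rng = random.Random(0)  # fixed seed so the run is repeatable
single = [abs(pearson([rng.gauss(0, 1) for _ in range(1000)],
                      [rng.gauss(0, 1) for _ in range(1000)]))
          for _ in range(10)]
best = [max_abs_correlation(1000, 20, rng) for _ in range(10)]
print("mean |r| for one pre-chosen pair:", sum(single) / len(single))
print("mean |r| for the best of 190 pairs:", sum(best) / len(best))
```

The second number should come out noticeably larger than the first, which is why a correlation picked out after looking at every pair has to be judged against the distribution of the maximum rather than against the distribution for a single pre-chosen pair.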
Doug S., I agree in principle, but disagree with your particular example because it is not statistical in nature. Should we not be hugging the query "Is the argument sound?" If a random monkey typed up a third identical argument and put it in the envelope, it's just as true. The difference between this and a medical trial is that we have an independent means to verify the truth. Argument screens off Methodology...
If evidence is collected in violation of the Fourth Amendment rights of the accused, it's inadmissible in court, yes, but that doesn'...
Whoops, looks like I may have shot myself in the foot. The same way argument screens off authority, the actual experiment that was run screens off the intentions of the researcher.
Efficacy of the drug -> Results of the experiment <- Bias of the researcher
Efficacy, Bias -> Results of the experiment -> Our analysis of the efficacy of the drug
Leo: "...the actual experiment that was run screens off the intentions of the researcher."
As long as the validity and reliability of the experiment itself aren't affected by the bias, then the findings are your territory. Analysis is the creation of the map, where all sorts of things can go awry.
Data may be confusing, or even misleading, but this is a fact about us, not the data. Data acquired from valid experiment does not lie, whatever your motives. It might just be telling you something you're not listening to.
"The most exciting phrase to...
I am sorry that I am too lazy to read this thoroughly, but to me the original problem seems a mere illusion and a strawman. A priori, the two experiments are different, but who cares? The experiment with its stopping condition yields a distribution of results only if you have some assumed a priori distribution over the patient population. If you change the stopping condition without changing this distribution, you change the experiment and you get a different distribution for the result. This has nothing to do with evidential impact. Frequentists don't, as far as I can tell, claim anything like that.
Are P(r>70|effective) and P(r>70|~effective) really the same in those two experiments? Trivially, at least, in the second one P(r<60)=0, unlike in the first, so the distribution of r over successive runs must be different. The sequences of experimental outcomes happened to be the same in this case, but not in the counterfactual case where fewer than 60 of the first 100 patients were cured, and it seems that in fact that would affect the likelihood ratio. (I may run a simulation when I have the time.)
Oh, wait: assuming the second researcher stops as soon as (r >= 60) AND (N >= 100) (the latter expression to explain that they kept going until r=70), then the distribution above 60 will actually not be any different (all the probability mass that was in r100, well, only the second experimenter could possibly have generated that result.
Are you telling me the first person's argument carries the exact same weight as the second?
Yes. It's the arguments that matter.
Now, if we know that one person was trying to support a thesis and the other presenting the data and drawing a conclusion, we can weight them differently, if we only have access to one. The first case might leave out contrary data and alternative hypotheses in an attempt to make the thesis look better. We expect the second case to mention all relevant data and the obvious alternatives, if only briefly, so the absence of contra...
Eliezer_Yudkowsky: As you described the scenario at the beginning ... you're right. But realistically? You need to think about P(~2nd researcher tainted the experiment|2nd researcher has enormous stake in the result going a certain way). :-P
It is not completely unreasonable to believe that the big problem in medical research is not a lack of data or a lack of efficient statistical procedures to translate the data into conclusions, but rather the domination of the process by clever arguers. The old-fashioned procedure seems to penalize clever arguers. Although it is of course regrettable that the penalization is an accidental side effect of a misunderstanding of math, the old-fashioned procedure might in fact work better than a procedure whose sole objective is to squeeze as much knowledge from the data as possible.
A good post on a profoundly beautiful subject, and a nice bit of jujutsu the way it works against the backdrop of the meta-commentary.
A minor quibble: Have you considered that use of "Law", like "The Way", while perhaps appropriately elevating, might work against your message by obscuring appreciation of increasingly general "principles"?
For some good mockery of orthodox statistical concepts I recommend the writings of Guernsey McPearson.
Eliezer says:
"We aren't enchanted by Bayesian methods merely because they're beautiful. The beauty is a side effect. Bayesian theorems are elegant, coherent, optimal, and provably unique because they are laws."
This seems deeply mistaken. Why should we believe that bayesian formulations are any more inherently "lawlike" than frequentist formulations? Both derive their theorems from within strict formal systems which begin with unchanging first principles. The fundamental difference between Bayesians and Frequentists seems to stem fro...
Eliezer,
I'm afraid that I too was seduced by Doug's analogy, and for some reason am a little too slow to follow your response. Any chance you could try again to explain why the analogy doesn't work?
I am by no means an expert in statistics, but I do appreciate Eliezer Yudkowsky's essay, and think I get his point that, given only experiment A and experiment B, as reported, there may be no reason to treat them differently IF WE DON'T KNOW of the difference in protocol (if those thoughts are truly private). But it does seem rather obvious that, if there were a number of independent experiments with protocol A and B, and we were attempting to do a meta-analysis to combine the results of all such experiments, there would be quite a number of experiments wh...
I second conchis's request. Shouldn't the second method cut against assumption of a randomized sample?
I'm also thinking of an analogy to the problem of only reporting studies that demonstrate the effectiveness of a drug, even if each of those studies on its own is fair. It seems to me as if stopping when and only when one gets the results one wants is similarly problematic, once again even if everything else about the experiment is strictly ok; outcomes that show 60%+ effectiveness are favored under that method, so P(real effectiveness!=60%|experimental ...
Conchis and Benquo: Eliezer's response to Doug was that the probability of a favorable argument is greater, given a clever arguer, than the prior probability of a favorable argument. But the probability of a 60% effectiveness given 100 trials, given an experimenter who intended to keep going until he had a 60% effectiveness, is no greater than the prior probability of a 60% effectiveness given 100 trials. This should be obvious, and does distinguish the case of the biased intentions from the case of the clever arguer.
To make that claim more obvious: suppose I am involved in the argument between the Greens and the Blues, and after seeing the blue sky, I intend to keep looking up at the sky until it looks green. This won't make it any more probable that when I look at the sky tomorrow, it will look green. This probability is determined by objective facts, not by my intentions, and likewise with the probability of getting a 60% effectiveness from 100 trials.
I just saw an incredibly beautiful sunset. I also see the beauty in some of EY's stuff. Does that mean the sunset was Bayesian, or indeed subject to underlying lawfulness ? No, it only means my enhanced primate brain has a tendency to see beauty in certain things. Not that there is any more epistemic significance in a sunset than there is in a theorem.
I admit that I am still not quite sure what a "Bayesian" is as opposed to an "Old style" statistician (though I am very familiar with Bayes' theorem, prior probabilities, likelihood ratios, etc).
That being said, the example at the beginning of the post is a great example of "after the fact" reasoning. If researcher #2 had required 1,000 trials, then you could say that our interpretation of his results is the same as for, say, "researcher #3" who set out to have 1,000 trials no matter how many cures were observed....
Had to actually think about it a bit, and I think it comes down to this:
The thing that determines the strength of evidence in favor of some hypothesis vs another is "what's the likelihood we would have seen E if H were true vs what's the likelihood we would have seen E if H were false"
Now, experimenter B is not at all filtering based on H being true or false, but merely on the properties of E.
So the fact of the experimenter presenting the evidence E to us can only (directly) potentially give us additional information on the properties of the total e...
There are some rather baroque kinds of prior information which would require a Bayesian to try to model the researcher's thought processes. They pretty much rely on the researcher having more information about the treatment effectiveness than is available to the Bayesian, and that the stopping rule depends on that extra information. This idea could probably be expressed more elegantly as a presence or absence of an edge in a Bayesian network, twiddling the d-separation of the stop-decision node with the treatment effectiveness node.
"So now we have a group of scientists who set out to test correlation A, but found correlation B in the data instead. Should they publish a paper about correlation B?"
Since you are testing multiple hypotheses simultaneously, it is not comparable to Eliezer's example. Still, it is an interesting question...
Sure. The more papers you publish the better. If you are lucky the correlation may hold in other test populations and you've staked your claim on the discovery. Success is largely based on who gets credit.
Should a magazine publish papers reporting c...
Unknown, I still find it difficult to accept that there should be literally zero modification. It's important not just that n=100, but that n=100 random trials. Suppose both researchers reported 100% effectiveness with the same n, but researcher 2 threw out all the data points that suggested ineffectiveness? You still have an n=100 and a 100% effectiveness among that set, but any probability judgment that doesn't account for the selective method of picking the population is inadequate. I would suggest that either less or a different kind of information...
Suppose both researchers reported 100% effectiveness with the same n, but researcher 2 threw out all the data points that suggested ineffectiveness?
Yes, but that is not the case that was described to us.
The mindset of the researchers doesn't matter - only the procedure they follow does. And as unlikely as it may be, in the examples we're provided, the second researcher does not violate the proper procedure.
I have to say, the reason the example is convincing is because of its artificiality. I don't know many old-school frequentists (though I suppose I'm a frequentist myself, at least so far as I'm still really nervous about the whole priors business -- but not quite so hard as all that), but I doubt that, presented with a stark case like the one above, they'd say the results would come out differently. For one thing, how would the math change?
But the case would never come up -- that's the thing. It's empty counterfactual analysis. Nobody who is following ...
Uh, strike the "how would the math change?" question -- I just read the relevant portion of Jaynes's paper, which gives a plausible answer to that. Still, I deny that an actual practicing frequentist would follow his logic and treat n as the random variable.
(ALSO: another dose of unreality in the scenario: what experimenter who decided to play it like that would ever reveal the quirky methodology?)
Maybe in private notes?
And as far as thinking that N was the random variable in the second case, I had. I'd thought it through and basically concluded that since no data at all was being hidden from us by experimenter B, and since A and B followed the same procedure, the probability that that specific outcome would be published by B was the same as for A.
Now, there is a partial caveat. One might say, "But... what if B's rule was to only publish the moment he had at least 70%?"
So one might think there's more possible ways it could have come out like...
Caledonian,
I agree that the two cases are not precisely the same. I also agree that they are not, as a matter of degree, very close. But it seems to me that stopping at a desired result is implicitly the same as "throwing out" other possible results, if the desired result is one of the several results possible in the range of all feasible "n"s. In other words, what I meant by my "more concrete" example is that researcher 2's experiment is properly a member of the set of all possible type-2 experiments (all of which will pro...
But it seems to me that stopping at a desired result is implicitly the same as "throwing out" other possible results
You did not speak about throwing out possible results. You spoke of throwing out data that went against the desired conclusion.
These are very, very different actions, with different implications.
Cyan, that source is slightly more convincing.
Although I'm a little concerned that it, too, is attacking another strawman. At the beginning of chapter 37, it seems that the author just doesn't understand what good researchers do. In the medical example given at the start of the chapter (458-462ish), many good researchers would use a one-sided hypothesis rather than a two-sided hypothesis (I would), which would better catch the weak relationship. One can also avoid false negatives by measuring the power of one's test. MacKay also claims that "this a...
It's worth noting that hypothesis testing as it's normally taught is a messy, confused hybrid of two approaches (Fisher and Neyman/Pearson), each of which is individually somewhat more elegant (but still doesn't make philosophical sense):
http://ftp.isds.duke.edu/WorkingPapers/03-26.pdf
http://marketing.wharton.upenn.edu/ideas/pdf/Armstrong/StatisticalSignificance.pdf
Caledonian,
I guess "the same" was a bit of an unintentional exaggeration. I will try to be precise. What I meant was, same in an important way -- that is, likewise excluding only a kind of counterexample.
Paul Gowder,
I agree with you that MacKay's Chi-squared example fails to criticize frequentist best practice. That said, all of the improvements you suggest seem to me to highlight the problem -- you have lots of tools in the toolbox, but only training and subjective experience can tell you which ones are most appropriate. On the question of "which approach is more subjective?", the frequentist advantage is illusory. (On the question of "which approach has the best philosophical grounding?" I go with the Cox theorems.)
Cyan, I've been mulling this over for the last 23 hours or so -- and I think you've convinced me that the frequentist approach has worrisome elements of subjectivity too. Huh. Which doesn't mean I'm comfortable with the whole priors business either. I'll think about this some more. Thanks.
As a full-blown Bayesian, I feel that the Bayesian approach is almost perfect. It was a revelation when I first realized that instead of having this big frequentist toolbox of heuristics, one can simply assume that every involved entity is a random variable. Then everything is solved! But then pretty quickly I came to the catch, namely that to be able to do anything, the probability distributions must be parameterized. And then you start to wonder what the pdf's of the parameters should be, and off we go into infinite regress.
But the biggest catch is of co...
..."Two medical researchers use the same treatment independently [...] one had decided beforehand [...] he would stop after treating N=100 patients, [...]. The other [...] decided he would not stop until he had data indicating a rate of cures definitely greater than 60%, [...]. But in fact, both stopped with exactly the same data: n = 100 [patients], r = 70 [cures]. Should we then draw different conclusions from their experiments?"
[...]
If Nature is one way, the likelihood of the data coming out the way we have seen will be one thing. If
"If anyone should ever succeed in deriving a real contradiction from Bayesian probability theory [...] then the whole edifice goes up in smoke. Along with set theory, 'cause I'm pretty sure ZF provides a model for probability theory."
If you think of probability theory as a form of logic, as Jaynes advocates, then the laws and theorems of probability theory are the proof theory for this logic, and measure theory is the logic's model theory, with measure-theoretic probability spaces (which can be defined entirely with ZF, as you suggest) being the models.
We have additional information about researcher 2's experiment. If researcher 2 didn't look at the data before that point, then the procedures were the same, so the data should be treated the same.
If researcher 2 did check the data along the way--a reasonable enough assumption, given researcher 2's goal--then there were other tests which all came out below 60%. There was an upswing in successes at the end, and we know it. The other experiment may well have experienced the same thing, but in experiment 2, I don't have to look; I see it. Was there an importa...
Eliezer, I accept your point about the underlying laws of probability. However, your example is extremely flawed.
Of course what the researcher operates by should affect our interpretation of the evidence; it is, in itself, another piece of evidence! Specifically in this case, publishing your research only when you reach a certain conclusion implies that any similar researches that did not reach this threshold did not get published, and are thus not available to our evidence pool. This is filtered evidence.
So without knowing how many similar researches were conducted, the conclusion from the one research that did get published can't be seen as very strong. Do I need to draw the Bayesian analysis that shows why?
Should we then draw different conclusions from their experiments?
I assume you mean if you only saw one of them (knowing the researcher's intentions in either case)? In that case, I would say yes. For the first, the N is random, while for the second N is the smallest N where r>=60. In the second case, the question is: what is the probability that the cure rate will ever reach 60%? The first case answers the question: what is the cure rate probability according to a sample of N=100?
Yes, I would say, draw very different conclusions, since you are answering very different questions!
If there is a difference, it is not because the experiments went differently, it is because the experiments could have gone differently, and so the likelihoods of them happening the way they did happen are different.
The Monty Hall problem was mentioned above. I pick a door, Monty opens a door to reveal a goat, I can stick or switch (but can't take the goat). Whether Monty is picking a random door or picking a door he knows hides a goat, the evidence is the same - Monty opened a door and revealed a goat. But if Monty what matters is what might ...
The difference is that depending on Monty's algorithm, there is a different probability of getting the exact result we saw, namely seeing a goat. The exact event we actually saw happens with different probability depending on Monty's rule, so Monty's rule changes the meaning of that result.
The researchers don't get a given exact sequence of 100 results with different probability depending on their state of mind - their state of mind is not part of the state of the world that the result sequence tells us about, the way Monty's state of mind is part of the world that generates the exact goat.
To look at it another way, a spy watching Monty open doors and get goats would determine that Monty was deliberately avoiding the prize. Watching a researcher stop at 100 results doesn't tell you anything about whether the researcher planned to stop at 100 or after getting a certain number of successes. So, just like that result doesn't tell you anything about the researcher's state of mind, knowing about the researcher's state of mind doesn't tell you anything about the result.
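The Monty Hall half of this can be checked with exact arithmetic. The sketch below is illustrative only (the function name and the door numbering are assumptions): it computes the probability that the car is behind my original door, given that Monty opened another door and showed a goat, under the two rules for how Monty chooses.

```python
from fractions import Fraction

def posterior_car_behind_pick(monty_knows):
    """P(car is behind my door | Monty opened another door and showed a goat).

    I pick door 0. If monty_knows is True, Monty always opens a goat door.
    Otherwise he opens one of the other two doors at random and merely
    happens to show a goat.
    """
    prior = Fraction(1, 3)
    goat_seen = {}  # car position -> P(goat revealed | car position)
    for car in range(3):
        if monty_knows:
            goat_seen[car] = Fraction(1)
        else:
            # a random unpicked door hides the car half the time,
            # unless the car is behind my own door
            goat_seen[car] = Fraction(1) if car == 0 else Fraction(1, 2)
    evidence = sum(prior * goat_seen[c] for c in range(3))
    return prior * goat_seen[0] / evidence

print("Monty knows:    ", posterior_car_behind_pick(True))
print("Monty is random:", posterior_car_behind_pick(False))
```

The two calls evaluate to 1/3 and 1/2. The observed event has a different probability under the two rules, so the rule matters. For the trial data, a given exact sequence of 100 outcomes has the same probability whatever the researcher intended, so no corresponding term changes.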
I was confused by this post for some time, and I feel I have an analogous but clearer example: Suppose scientist A says "I believe in proposition A, and will test it at the 95% confidence level", and scientist B says "I believe in proposition B, and will test it at the 99% confidence level". They go away and do their tests, and each comes back from their experiment with a p-value of 0.03. Do we now believe proposition A more or less than proposition B? The traditional scientific method, with its emphasis on testability, prefers A to B; ...
Tentatively, I think we must treat the two differently, in some respect somewhere, or we are vulnerable to manipulation. Where does the flaw lie in the following?
If the second researcher had instead said, "I am going to run 1000 experiments with 100 people each, and publish only those whose cure rate exceeds 60%", there is a huge selection bias in the data we see and our update should be tiny if we can't get access to the discarded data.
If the researcher decided instead "I am going to run 1000 experiments in parallel, adding one person at a...
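The size of the selection effect in the first scenario (100 people per experiment, publish only when more than 60 are cured) can be computed exactly. This is a minimal sketch with illustrative names. The true cure rates tried are assumptions, and nothing here models the truncated second scenario.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k cures among n patients with cure rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def published_summary(p, n=100, cutoff=60):
    """Return (P(publish), mean cure rate among published studies) when only
    studies with more than `cutoff` cures out of n are published."""
    pub = {k: binom_pmf(k, n, p) for k in range(cutoff + 1, n + 1)}
    p_publish = sum(pub.values())
    mean_rate = sum(k * w for k, w in pub.items()) / (n * p_publish)
    return p_publish, mean_rate

for true_p in (0.5, 0.6, 0.7):
    p_pub, rate = published_summary(true_p)
    print(f"true rate {true_p:.1f}: published {p_pub:.3f} of studies, "
          f"mean published rate {rate:.3f}")
```

Even with a true cure rate of 0.5, the few studies that clear the bar all report more than 60%, so the published record alone overstates the effect. The correct likelihood for "a study was published" is P(publish | p), which is the first number returned.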
The argument that confused me at first was: "Wouldn't the second researcher always be able to produce a >60% result given enough time and resources, no matter what the actual efficacy of the treatment is?"
But this is not true. If the true efficacy is < 60%, then the probability of observing a ">60%" result at least once in a sequence of N experiments does not tend to 1 as N goes to infinity.
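A small exact calculation supports this. The sketch below makes assumptions the comment does not state: the researcher checks the running cure rate after every patient from n_min onward, stops the first time it is strictly above 60%, and gives up after n_max patients. The function name and defaults are hypothetical.

```python
from fractions import Fraction
from math import floor

def prob_ever_exceeds(p, threshold=Fraction(3, 5), n_min=1, n_max=500):
    """Exact chance that the running cure rate r/n exceeds `threshold` at some
    n in [n_min, n_max] when each patient is cured with probability p."""
    alive = {0: 1.0}   # cures so far -> probability mass that has not stopped
    stopped = 0.0
    for n in range(1, n_max + 1):
        nxt = {}
        for r, mass in alive.items():
            nxt[r + 1] = nxt.get(r + 1, 0.0) + mass * p
            nxt[r] = nxt.get(r, 0.0) + mass * (1 - p)
        need = floor(threshold * n) + 1  # smallest r with r/n > threshold
        alive = {}
        for r, mass in nxt.items():
            if n >= n_min and r >= need:
                stopped += mass
            else:
                alive[r] = mass
    return stopped

for p in (0.5, 0.55, 0.7):
    print(p, round(prob_ever_exceeds(p, n_min=20), 4))
```

For a true rate below 60% the result stays well short of 1 however large n_max is made, because the running rate drifts away from the threshold as n grows. For a true rate above 60% it approaches 1. Setting n_min=1 lets a single cured first patient count as "100%", which is one reason a real stopping rule would need a minimum sample size.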
In the real world, Eliezer's example simply doesn't work.
In the real world you only hear about the results when they are published. The prior probability of the biased researcher publishing a positive result is higher than the prior probability of the unbiased researcher publishing a positive result.
The example only works if you are an omniscient spy who spies on absolutely all treatments. It's true that an omniscient spy should just collate all the data regardless of the motivations of the researcher spied upon. However unless you are an omniscient spy yo...
Eliezer:
The results of the two experimenters in the example are different: to begin with, the 2nd experimenter's first result is a non-cure (otherwise he would have stopped there with a 100% success); one of the three following results is also a non-cure (otherwise he would have stopped with a 75%); etc. Also, his last result is a cure (otherwise he would have stopped one patient earlier).
The first experimenter certainly got different results -- or you may as well win the lottery: the odds that a Bernoulli trial produces a sequence x1..x100 in which no pre...
I don't feel sufficiently comfortable with statistics to tear apart the given example. I do have a different example with which to refute the point that the evidential impact of a fixed set of data should be independent of the researcher's prior private thoughts.
Suppose I have two researchers, both looking at the correlations between acne and colored jelly beans. Alfred does twenty tests each with X subjects. Each test will feed subjects jelly beans of a single color for a week and then look at incidences of acne. Boris theorizes that green jelly beans...
I think you're misunderstanding probability theory a little. Probability theory is the chance that x result means that the universe operates on principle y. The fact that the second researcher had no reason to stop until he tested 100 people means that the % of patients cured did not exceed 60% until his last batch of test subjects. Which significantly alters the chance that the universe operates on principle y. The first researcher could have had a % over 60 at any time during the experiment. Which is a physical difference. The probability used is not based on the researcher's private thoughts, it is based on their experimental procedure, which is different, regardless of the fact they ended with the same results.
Old School statisticians thought in terms of tools, tricks to throw at particular problems.
This reminds me of a joke posted on a bulletin board in the stats department at UC Riverside. It was part of a list of humorous definitions of statistical terms. For "confidence interval," it said that the phrase uses a particular, euphemistic meaning of the word "interval;" that meaning could be used to construct similar phrases such as "hat interval," "card interval," or "interval or treat."
And yet... should rationality be math? It is by no means a foregone conclusion that probability should be pretty. The real world is messy - so shouldn't you need messy reasoning to handle it?
And in a way, you do, even doing Bayesian statistics. The messiness is just in the actual numerical calculations, not in the definitions of the rules.
Suppose you're trying to find a good model for some part of the real world, and you've got your set of models you're considering. When you see data, and you use Bayes' Theorem to find the posterior probabilities, the...
"Bayesianism's coherence and uniqueness proofs cut both ways. Just as any calculation that obeys Cox's coherency axioms (or any of the many reformulations and generalizations) must map onto probabilities, so too, anything that is not Bayesian must fail one of the coherency tests. This, in turn, opens you to punishments like Dutch-booking (accepting combinations of bets that are sure losses, or rejecting combinations of bets that are sure gains)."
I've never understood why I should be concerned about dynamic Dutch books (which are the justificati...
Incidentally, Eliezer, I don't think you're right about the example at the beginning of the post. The two frequentist tests are asking distinct questions of the data, and there is not necessarily any inconsistency when we ask two different questions of the same data and get two different answers.
Suppose A and B are tossing coins. A and B both get the same string of results -- a whole bunch of heads (let's say 9999) followed by a single tail. But A got this by just deciding to flip a coin 10000 times, while B got it by flipping a coin until the first tai...
I believe the example in this post is fundamentally flawed. Some of the other commenters have hinted at the reasons, but I want to add my own thoughts on this.
Before we go into the difference between the frequentist and the Bayesian approach to the problem, we first have to be clear about whether the investigators acknowledge publicly that they use different stopping rules. I am going to cover both cases.
If the stopping rule is not publicly acknowledged, the frequentist data analyst cannot take it into account. He will therefore have to use the same t...
You know what really helps me accept a counterintuitive conclusion? Doing the math. I spent an hour reading and rereading this post and the arguments without being fully convinced of Eliezer's position, and then I spent 15 minutes doing the math (R code attached at the end). And once the math came out in favor of Eliezer, the conclusion suddenly doesn't seem so counterintuitive :)
Here we go, I'm dividing all the numbers by five to make the code work but it's pretty convincing either way.
This example needs to be refined. The two experiments do not produce the same data: the second one has a different probability space and an additional data point, the stopping position N, and when computing probabilities you should also condition on that stopping point. The fact that N is screened off by the other data is nontrivial, and waiving it just on an assumption of beauty could lead to a mistake.
It turned out that in this case that is the correct move, but it could quite easily have been a mistake.
Fixing my predictions now, before going to investigate this issue further (I have MacKay's book within hand's reach and would also like to run some Monte Carlo simulations to check the results; going to post the resolution later):
a) It seems that we ought to treat the results differently, because the second researcher in effect admits to p-hacking his results. b) But on the other hand, what if we modify the scenario slightly: suppose we get the results from both researchers 1 patient at a time. Surely we ought to update the priors by the same amo...
Interesting article. The Carnot engine, real car and the second law simile is unfortunate. The second law of thermodynamics applies to every physical system (a real car included). The Carnot engine is an idealization of an engine that obeys the laws of thermodynamics. A real engine DOES obey the second law of thermodynamics. It is just not as efficient as the idealized Carnot engine.
The second law gives an upper bound on the efficiency of physical processes. It does not say that this bound is the expected efficiency of a real system.
As long as the reputation doctor had committed to publishing the results regardless of what he found, then, yes, the data has equal evidential weight.
However, the story seems to imply he would have continued testing indefinitely until he got it right, and if he didn't, he would have faded into obscurity.
The issue here is that we must SEE the data in the possible world where he has a 58% cure rate with N=1000 (kept trying, kept trying, kept trying, eventually published), if we are to accept his 70/100 results in this world.
If, on the other hand, we would on...
RE "Should we then draw different conclusions from their experiments?"
I think, depending on the study's hypothesis and random situational factors, a study like the first can be in the garden of forking paths. A study which stops at n=100 when it reaches a predefined statistical threshold isn't guaranteed to have also reached that statistical threshold if it had kept running until n=900.
Suppose a community of researchers is split in half (this is intended to match the example in this article but increase the imagined sample size of studies to more than 1 st...
That's the fundamental difference in mindset. Old School statisticians thought in terms of tools, tricks to throw at particular problems. Bayesians - at least this Bayesian, though I don't think I'm speaking only for myself - we think in terms of laws.
I never realized that until re-reading this many years later. Very cool.
Should we expect rationality to be, on some level, simple? Should we search and hope for underlying beauty in the arts of belief and choice?
Let me introduce this issue by borrowing a complaint of the late great Bayesian Master, E. T. Jaynes (1990):
According to old-fashioned statistical procedure - which I believe is still being taught today - the two researchers have performed different experiments with different stopping conditions. The two experiments could have terminated with different data, and therefore represent different tests of the hypothesis, requiring different statistical analyses. It's quite possible that the first experiment will be "statistically significant", the second not.
Whether or not you are disturbed by this says a good deal about your attitude toward probability theory, and indeed, rationality itself.
Non-Bayesian statisticians might shrug, saying, "Well, not all statistical tools have the same strengths and weaknesses, y'know - a hammer isn't like a screwdriver - and if you apply different statistical tools you may get different results, just like using the same data to compute a linear regression or train a regularized neural network. You've got to use the right tool for the occasion. Life is messy -"
And then there's the Bayesian reply: "Excuse you? The evidential impact of a fixed experimental method, producing the same data, depends on the researcher's private thoughts? And you have the nerve to accuse us of being 'too subjective'?"
If Nature is one way, the likelihood of the data coming out the way we have seen will be one thing. If Nature is another way, the likelihood of the data coming out that way will be something else. But the likelihood of a given state of Nature producing the data we have seen, has nothing to do with the researcher's private intentions. So whatever our hypotheses about Nature, the likelihood ratio is the same, and the evidential impact is the same, and the posterior belief should be the same, between the two experiments. At least one of the two Old Style methods must discard relevant information - or simply do the wrong calculation - for the two methods to arrive at different answers.
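The claim that the likelihood ratio is the same under both stopping rules can be checked numerically. The following is a minimal Python sketch, not something from the original post. It uses the 70 cures in 100 patients mentioned in the discussion. Because the second researcher's exact stopping rule is not spelled out, the sketch swaps in a simpler optional-stopping design as an assumed stand-in: "keep treating patients until the 30th failure", which is a negative binomial experiment. The function names and the two cure rates being compared (0.7 and 0.5) are illustrative choices.

```python
from math import comb, isclose


def binomial_likelihood(theta, n, r):
    """P(r cures | theta) when the number of patients n is fixed in advance."""
    return comb(n, r) * theta**r * (1 - theta) ** (n - r)


def negative_binomial_likelihood(theta, n, r):
    """P(stopping at patient n | theta) under 'stop at the (n - r)-th failure'.

    The last patient must be a failure; the other failures can fall anywhere
    among the first n - 1 patients.
    """
    failures = n - r
    return comb(n - 1, failures - 1) * theta**r * (1 - theta) ** failures


def likelihood_ratio(likelihood, theta_a, theta_b, n=100, r=70):
    """How much more strongly the data favor theta_a than theta_b."""
    return likelihood(theta_a, n, r) / likelihood(theta_b, n, r)


lr_fixed = likelihood_ratio(binomial_likelihood, 0.7, 0.5)
lr_stop = likelihood_ratio(negative_binomial_likelihood, 0.7, 0.5)

print(lr_fixed, lr_stop)
print(isclose(lr_fixed, lr_stop, rel_tol=1e-9))
```

The two designs assign different absolute probabilities to the data, but they differ only by a combinatorial constant that does not depend on the cure rate. That constant cancels in any likelihood ratio, so a Bayesian update gives the same posterior under either design.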
The ancient war between the Bayesians and the accursèd frequentists stretches back through decades, and I'm not going to try to recount that elder history in this blog post.
But one of the central conflicts is that Bayesians expect probability theory to be... what's the word I'm looking for? "Neat?" "Clean?" "Self-consistent?"
As Jaynes says, the theorems of Bayesian probability are just that, theorems in a coherent proof system. No matter what derivations you use, in what order, the results of Bayesian probability theory should always be consistent - every theorem compatible with every other theorem.
If you want to know the sum of 10 + 10, you can redefine it as (2 * 5) + (7 + 3) or as (2 * (4 + 6)) or use whatever other legal tricks you like, but the result always has to come out to be the same, in this case, 20. If it comes out as 20 one way and 19 the other way, then you may conclude you did something illegal on at least one of the two occasions. (In arithmetic, the illegal operation is usually division by zero; in probability theory, it is usually an infinity that was not taken as the limit of a finite process.)
If you get the result 19 = 20, look hard for that error you just made, because it's unlikely that you've sent arithmetic itself up in smoke. If anyone should ever succeed in deriving a real contradiction from Bayesian probability theory - like, say, two different evidential impacts from the same experimental method yielding the same results - then the whole edifice goes up in smoke. Along with set theory, 'cause I'm pretty sure ZF provides a model for probability theory.
Math! That's the word I was looking for. Bayesians expect probability theory to be math. That's why we're interested in Cox's Theorem and its many extensions, showing that any representation of uncertainty which obeys certain constraints has to map onto probability theory. Coherent math is great, but unique math is even better.
And yet... should rationality be math? It is by no means a foregone conclusion that probability should be pretty. The real world is messy - so shouldn't you need messy reasoning to handle it? Maybe the non-Bayesian statisticians, with their vast collection of ad-hoc methods and ad-hoc justifications, are strictly more competent because they have a strictly larger toolbox. It's nice when problems are clean, but they usually aren't, and you have to live with that.
After all, it's a well-known fact that you can't use Bayesian methods on many problems because the Bayesian calculation is computationally intractable. So why not let many flowers bloom? Why not have more than one tool in your toolbox?
That's the fundamental difference in mindset. Old School statisticians thought in terms of tools, tricks to throw at particular problems. Bayesians - at least this Bayesian, though I don't think I'm speaking only for myself - we think in terms of laws.
Looking for laws isn't the same as looking for especially neat and pretty tools. The second law of thermodynamics isn't an especially neat and pretty refrigerator.
The Carnot cycle is an ideal engine - in fact, the ideal engine. No engine powered by two heat reservoirs can be more efficient than a Carnot engine. As a corollary, all thermodynamically reversible engines operating between the same heat reservoirs are equally efficient.
But, of course, you can't use a Carnot engine to power a real car. A real car's engine bears the same resemblance to a Carnot engine that the car's tires bear to perfect rolling cylinders.
Clearly, then, a Carnot engine is a useless tool for building a real-world car. The second law of thermodynamics, obviously, is not applicable here. It's too hard to make an engine that obeys it, in the real world. Just ignore thermodynamics - use whatever works.
This is the sort of confusion that I think reigns over they who still cling to the Old Ways.
No, you can't always do the exact Bayesian calculation for a problem. Sometimes you must seek an approximation; often, indeed. This doesn't mean that probability theory has ceased to apply, any more than your inability to calculate the aerodynamics of a 747 on an atom-by-atom basis implies that the 747 is not made out of atoms. Whatever approximation you use, it works to the extent that it approximates the ideal Bayesian calculation - and fails to the extent that it departs.
Bayesianism's coherence and uniqueness proofs cut both ways. Just as any calculation that obeys Cox's coherency axioms (or any of the many reformulations and generalizations) must map onto probabilities, so too, anything that is not Bayesian must fail one of the coherency tests. This, in turn, opens you to punishments like Dutch-booking (accepting combinations of bets that are sure losses, or rejecting combinations of bets that are sure gains).
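A toy illustration of the simplest kind of Dutch book may help here. This Python sketch is not from the original post; the event names, the credences, and the function `net_payoffs` are all hypothetical. It assumes an agent who treats its stated credence in an event as a fair price for a ticket paying $1 if that event occurs, and who will therefore buy one such ticket on each event.

```python
def net_payoffs(credences):
    """Net result, per outcome, for an agent who buys a $1 ticket on every event.

    `credences` maps mutually exclusive, exhaustive events to the price the
    agent considers fair for a ticket paying $1 if that event occurs.
    """
    total_cost = sum(credences.values())
    # Exactly one event occurs, so exactly one ticket pays out $1.
    return {outcome: 1.0 - total_cost for outcome in credences}


incoherent = {"cured": 0.6, "not cured": 0.6}  # credences sum to 1.2
coherent = {"cured": 0.6, "not cured": 0.4}    # credences sum to 1.0

print(net_payoffs(incoherent))  # a loss whichever outcome occurs
print(net_payoffs(coherent))    # breaks even whichever outcome occurs
```

With credences that sum to more than 1, the agent pays more for the pair of tickets than the pair can ever return, so it loses in every possible world. Credences that obey the probability axioms leave no such combination of bets available.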
You may not be able to compute the optimal answer. But whatever approximation you use, both its failures and successes will be explainable in terms of Bayesian probability theory. You may not know the explanation; that does not mean no explanation exists.
So you want to use a linear regression, instead of doing Bayesian updates? But look to the underlying structure of the linear regression, and you see that it corresponds to picking the best point estimate given a Gaussian likelihood function and a uniform prior over the parameters.
You want to use a regularized linear regression, because that works better in practice? Well, that corresponds (says the Bayesian) to having a Gaussian prior over the weights.
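The correspondence between regression and Bayesian point estimates can be verified on a small case. The sketch below is an illustration under stated assumptions rather than anything from the original post: the data are made up, there is a single weight and no intercept, the noise variance is assumed known, and the regularizer is the squared (ridge) penalty. It checks that the least-squares weight sits at the mode of the posterior under a flat prior, and that the ridge weight sits at the mode of the posterior under a zero-mean Gaussian prior, with penalty equal to noise variance times prior precision.

```python
# Hypothetical one-feature data set (no intercept), for illustration only.
xs = [0.5, 1.0, 1.5, 2.0, 2.5]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]
NOISE_VAR = 1.0  # assumed known variance of the Gaussian noise


def neg_log_posterior(w, prior_precision):
    """Negative log posterior of weight w, up to an additive constant.

    Gaussian likelihood with variance NOISE_VAR, and a zero-mean Gaussian
    prior on w with precision prior_precision (0.0 gives a flat prior).
    """
    data_term = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * NOISE_VAR)
    prior_term = 0.5 * prior_precision * w**2
    return data_term + prior_term


def ridge_weight(lam):
    """Closed-form minimizer of sum((y - w*x)^2) + lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def slope(f, w, h=1e-5):
    """Central-difference estimate of df/dw at w."""
    return (f(w + h) - f(w - h)) / (2 * h)


prior_precision = 0.25             # assumed: prior variance of 4.0
lam = NOISE_VAR * prior_precision  # the matching ridge penalty

w_ols = ridge_weight(0.0)          # ordinary least squares
w_ridge = ridge_weight(lam)        # regularized least squares

# Both slopes should be numerically zero: each estimate sits at the
# posterior mode under the corresponding prior.
print(slope(lambda w: neg_log_posterior(w, 0.0), w_ols))
print(slope(lambda w: neg_log_posterior(w, prior_precision), w_ridge))
```

The Gaussian prior pulls the estimate toward zero, which is the shrinkage that makes regularized regression behave better on noisy data.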
Sometimes you can't use Bayesian methods literally; often, indeed. But when you can use the exact Bayesian calculation that uses every scrap of available knowledge, you are done. You will never find a statistical method that yields a better answer. You may find a cheap approximation that works excellently nearly all the time, and it will be cheaper, but it will not be more accurate. Not unless the other method uses knowledge, perhaps in the form of disguised prior information, that you are not allowing into the Bayesian calculation; and then when you feed the prior information into the Bayesian calculation, the Bayesian calculation will again be equal or superior.
When you use an Old Style ad-hoc statistical tool with an ad-hoc (but often quite interesting) justification, you never know if someone else will come up with an even more clever tool tomorrow. But when you can directly use a calculation that mirrors the Bayesian law, you're done - like managing to put a Carnot heat engine into your car. It is, as the saying goes, "Bayes-optimal".
It seems to me that the toolboxers are looking at the sequence of cubes {1, 8, 27, 64, 125, ...} and pointing to the first differences {7, 19, 37, 61, ...} and saying "Look, life isn't always so neat - you've got to adapt to circumstances." And the Bayesians are pointing to the third differences, the underlying stable level {6, 6, 6, 6, 6, ...}. And the critics are saying, "What the heck are you talking about? It's 7, 19, 37 not 6, 6, 6. You are oversimplifying this messy problem; you are too attached to simplicity."
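The arithmetic behind the cubes example is easy to reproduce. A minimal Python sketch (the helper name `differences` is my own):

```python
def differences(seq):
    """Return the successive differences of a sequence."""
    return [b - a for a, b in zip(seq, seq[1:])]


cubes = [n**3 for n in range(1, 9)]  # 1, 8, 27, 64, 125, 216, 343, 512
first = differences(cubes)
second = differences(first)
third = differences(second)

print(first)   # [7, 19, 37, 61, 91, 127, 169]
print(second)  # [12, 18, 24, 30, 36, 42]
print(third)   # [6, 6, 6, 6, 6]
```

The surface level keeps changing, while the third level down is constant.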
It's not necessarily simple on a surface level. You have to dive deeper than that to find stability.
Think laws, not tools. Needing to calculate approximations to a law doesn't change the law. Planes are still made of atoms; they aren't governed by special exceptions in Nature for aerodynamic calculations. The approximation exists in the map, not in the territory. You can know the second law of thermodynamics, and yet apply yourself as an engineer to build an imperfect car engine. The second law does not cease to be applicable; your knowledge of that law, and of Carnot cycles, helps you get as close to the ideal efficiency as you can.
We aren't enchanted by Bayesian methods merely because they're beautiful. The beauty is a side effect. Bayesian theorems are elegant, coherent, optimal, and provably unique because they are laws.
Addendum: Cyan directs us to chapter 37 of MacKay's excellent statistics book, free online, for a more thorough explanation of the opening problem.
Jaynes, E. T. (1990). Probability Theory as Logic. In P. F. Fougere (Ed.), Maximum Entropy and Bayesian Methods. Kluwer Academic Publishers.
MacKay, D. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press.