Should we expect rationality to be, on some level, simple? Should we search and hope for underlying beauty in the arts of belief and choice?
Let me introduce this issue by borrowing a complaint of the late great Bayesian Master, E. T. Jaynes (1990):
"Two medical researchers use the same treatment independently, in different hospitals. Neither would stoop to falsifying the data, but one had decided beforehand that because of finite resources he would stop after treating N=100 patients, however many cures were observed by then. The other had staked his reputation on the efficacy of the treatment, and decided he would not stop until he had data indicating a rate of cures definitely greater than 60%, however many patients that might require. But in fact, both stopped with exactly the same data: n = 100 [patients], r = 70 [cures]. Should we then draw different conclusions from their experiments?" (Presumably the two control groups also had equal results.)
According to old-fashioned statistical procedure - which I believe is still being taught today - the two researchers have performed different experiments with different stopping conditions. The two experiments could have terminated with different data, and therefore represent different tests of the hypothesis, requiring different statistical analyses. It's quite possible that the first experiment will be "statistically significant", the second not.
Whether or not you are disturbed by this says a good deal about your attitude toward probability theory, and indeed, rationality itself.
Non-Bayesian statisticians might shrug, saying, "Well, not all statistical tools have the same strengths and weaknesses, y'know - a hammer isn't like a screwdriver - and if you apply different statistical tools you may get different results, just like using the same data to compute a linear regression or train a regularized neural network. You've got to use the right tool for the occasion. Life is messy -"
And then there's the Bayesian reply: "Excuse you? The evidential impact of a fixed experimental method, producing the same data, depends on the researcher's private thoughts? And you have the nerve to accuse us of being 'too subjective'?"
If Nature is one way, the likelihood of the data coming out the way we have seen will be one thing. If Nature is another way, the likelihood of the data coming out that way will be something else. But the likelihood of a given state of Nature producing the data we have seen, has nothing to do with the researcher's private intentions. So whatever our hypotheses about Nature, the likelihood ratio is the same, and the evidential impact is the same, and the posterior belief should be the same, between the two experiments. At least one of the two Old Style methods must discard relevant information - or simply do the wrong calculation - for the two methods to arrive at different answers.
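To see the arithmetic behind this, here is a minimal sketch in R (my own illustration; the 60% and 70% cure rates are just two candidate states of Nature, not numbers from Jaynes). For any stopping rule that depends only on the data observed so far, the probability of one particular admissible sequence of 70 cures and 30 failures is p^70 (1-p)^30 under cure rate p; the stopping rule only determines how many such sequences count as "the experiment ended this way," and that count is the same whatever p is, so it cancels out of the likelihood ratio:

p0 <- 0.6; p1 <- 0.7                                      # two candidate cure rates (illustrative)
per_sequence_LR <- (p1^70 * (1 - p1)^30) / (p0^70 * (1 - p0)^30)
fixed_N_LR <- dbinom(70, 100, p1) / dbinom(70, 100, p0)   # researcher A's stop-at-100 version
c(per_sequence_LR, fixed_N_LR)                            # both come out to about 8.7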
The ancient war between the Bayesians and the accursèd frequentists stretches back through decades, and I'm not going to try to recount that elder history in this blog post.
But one of the central conflicts is that Bayesians expect probability theory to be... what's the word I'm looking for? "Neat?" "Clean?" "Self-consistent?"
As Jaynes says, the theorems of Bayesian probability are just that, theorems in a coherent proof system. No matter what derivations you use, in what order, the results of Bayesian probability theory should always be consistent - every theorem compatible with every other theorem.
If you want to know the sum of 10 + 10, you can redefine it as (2 * 5) + (7 + 3) or as (2 * (4 + 6)) or use whatever other legal tricks you like, but the result always has to come out to be the same, in this case, 20. If it comes out as 20 one way and 19 the other way, then you may conclude you did something illegal on at least one of the two occasions. (In arithmetic, the illegal operation is usually division by zero; in probability theory, it is usually an infinity that was not taken as the limit of a finite process.)
If you get the result 19 = 20, look hard for that error you just made, because it's unlikely that you've sent arithmetic itself up in smoke. If anyone should ever succeed in deriving a real contradiction from Bayesian probability theory - like, say, two different evidential impacts from the same experimental method yielding the same results - then the whole edifice goes up in smoke. Along with set theory, 'cause I'm pretty sure ZF provides a model for probability theory.
Math! That's the word I was looking for. Bayesians expect probability theory to be math. That's why we're interested in Cox's Theorem and its many extensions, showing that any representation of uncertainty which obeys certain constraints has to map onto probability theory. Coherent math is great, but unique math is even better.
And yet... should rationality be math? It is by no means a foregone conclusion that probability should be pretty. The real world is messy - so shouldn't you need messy reasoning to handle it? Maybe the non-Bayesian statisticians, with their vast collection of ad-hoc methods and ad-hoc justifications, are strictly more competent because they have a strictly larger toolbox. It's nice when problems are clean, but they usually aren't, and you have to live with that.
After all, it's a well-known fact that you can't use Bayesian methods on many problems because the Bayesian calculation is computationally intractable. So why not let many flowers bloom? Why not have more than one tool in your toolbox?
That's the fundamental difference in mindset. Old School statisticians thought in terms of tools, tricks to throw at particular problems. Bayesians - at least this Bayesian, though I don't think I'm speaking only for myself - we think in terms of laws.
Looking for laws isn't the same as looking for especially neat and pretty tools. The second law of thermodynamics isn't an especially neat and pretty refrigerator.
The Carnot cycle is an ideal engine - in fact, the ideal engine. No engine powered by two heat reservoirs can be more efficient than a Carnot engine. As a corollary, all thermodynamically reversible engines operating between the same heat reservoirs are equally efficient.
But, of course, you can't use a Carnot engine to power a real car. A real car's engine bears the same resemblance to a Carnot engine that the car's tires bear to perfect rolling cylinders.
Clearly, then, a Carnot engine is a useless tool for building a real-world car. The second law of thermodynamics, obviously, is not applicable here. It's too hard to make an engine that obeys it, in the real world. Just ignore thermodynamics - use whatever works.
This is the sort of confusion that I think reigns over those who still cling to the Old Ways.
No, you can't always do the exact Bayesian calculation for a problem. Sometimes you must seek an approximation; often, indeed. This doesn't mean that probability theory has ceased to apply, any more than your inability to calculate the aerodynamics of a 747 on an atom-by-atom basis implies that the 747 is not made out of atoms. Whatever approximation you use, it works to the extent that it approximates the ideal Bayesian calculation - and fails to the extent that it departs.
Bayesianism's coherence and uniqueness proofs cut both ways. Just as any calculation that obeys Cox's coherency axioms (or any of the many reformulations and generalizations) must map onto probabilities, so too, anything that is not Bayesian must fail one of the coherency tests. This, in turn, opens you to punishments like Dutch-booking (accepting combinations of bets that are sure losses, or rejecting combinations of bets that are sure gains).
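A minimal illustration of the sure-loss half of that punishment (my own numbers, not from the post): an agent who prices a bet paying 1 if it rains at 0.60, and a bet paying 1 if it does not rain at 0.50, has implied "probabilities" summing to 1.1. A bookie who sells the agent both bets collects 1.10 and pays back exactly 1 whatever the weather does:

price_rain <- 0.60      # agent's price for "pays 1 if rain"
price_no_rain <- 0.50   # agent's price for "pays 1 if no rain" - incoherent, since 0.60 + 0.50 > 1
for (rains in c(TRUE, FALSE)) {
  payout <- ifelse(rains, 1, 0) + ifelse(!rains, 1, 0)    # exactly one of the two bets pays off
  cat("rains =", rains, "agent's profit =", payout - (price_rain + price_no_rain), "\n")
}
# Prints -0.1 in both cases: a guaranteed loss, extracted purely from the incoherence.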
You may not be able to compute the optimal answer. But whatever approximation you use, both its failures and successes will be explainable in terms of Bayesian probability theory. You may not know the explanation; that does not mean no explanation exists.
So you want to use a linear regression, instead of doing Bayesian updates? But look to the underlying structure of the linear regression, and you see that it corresponds to picking the best point estimate given a Gaussian likelihood function and a uniform prior over the parameters.
You want to use a regularized linear regression, because that works better in practice? Well, that corresponds (says the Bayesian) to having a Gaussian prior over the weights.
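To make the correspondence concrete, here is a minimal sketch in R (my own toy data; sigma2 and tau2 are assumed noise and prior variances, with the equivalence holding at ridge penalty lambda = sigma2 / tau2): minimizing squared error plus an L2 penalty on the weights lands on the same answer as maximizing the posterior under a Gaussian likelihood and a Gaussian prior over the weights.

set.seed(1)
n <- 50; d <- 3
X <- matrix(rnorm(n * d), n, d)
y <- X %*% c(1.5, -2.0, 0.5) + rnorm(n)   # toy data: a known linear rule plus noise

sigma2 <- 1.0   # assumed noise variance (Gaussian likelihood)
tau2 <- 0.5     # assumed prior variance on each weight (Gaussian prior)

# Negative log-posterior, up to constants: squared-error term plus Gaussian-prior term
neg_log_post <- function(w) sum((y - X %*% w)^2) / (2 * sigma2) + sum(w^2) / (2 * tau2)
w_map <- optim(rep(0, d), neg_log_post, method = "BFGS")$par

# Ridge (L2-regularized) least squares in closed form, with penalty lambda = sigma2 / tau2
lambda <- sigma2 / tau2
w_ridge <- solve(t(X) %*% X + lambda * diag(d), t(X) %*% y)

cbind(MAP = w_map, ridge = as.vector(w_ridge))   # the two columns agree (up to optimizer tolerance)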
Sometimes you can't use Bayesian methods literally; often, indeed. But when you can use the exact Bayesian calculation that uses every scrap of available knowledge, you are done. You will never find a statistical method that yields a better answer. You may find a cheap approximation that works excellently nearly all the time, and it will be cheaper, but it will not be more accurate. Not unless the other method uses knowledge, perhaps in the form of disguised prior information, that you are not allowing into the Bayesian calculation; and then when you feed the prior information into the Bayesian calculation, the Bayesian calculation will again be equal or superior.
When you use an Old Style ad-hoc statistical tool with an ad-hoc (but often quite interesting) justification, you never know if someone else will come up with an even more clever tool tomorrow. But when you can directly use a calculation that mirrors the Bayesian law, you're done - like managing to put a Carnot heat engine into your car. It is, as the saying goes, "Bayes-optimal".
It seems to me that the toolboxers are looking at the sequence of cubes {1, 8, 27, 64, 125, ...} and pointing to the first differences {7, 19, 37, 61, ...} and saying "Look, life isn't always so neat - you've got to adapt to circumstances." And the Bayesians are pointing to the third differences, the underlying stable level {6, 6, 6, 6, 6, ...}. And the critics are saying, "What the heck are you talking about? It's 7, 19, 37 not 6, 6, 6. You are oversimplifying this messy problem; you are too attached to simplicity."
It's not necessarily simple on a surface level. You have to dive deeper than that to find stability.
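For the curious, the analogy's arithmetic in one throwaway R snippet:

cubes <- (1:8)^3                 # 1 8 27 64 125 216 343 512
diff(cubes)                      # first differences: 7 19 37 61 91 127 169
diff(cubes, differences = 3)     # third differences: 6 6 6 6 6 - the stable level underneath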
Think laws, not tools. Needing to calculate approximations to a law doesn't change the law. Planes are still made of atoms; they aren't governed by special exceptions in Nature for aerodynamic calculations. The approximation exists in the map, not in the territory. You can know the second law of thermodynamics, and yet apply yourself as an engineer to build an imperfect car engine. The second law does not cease to be applicable; your knowledge of that law, and of Carnot cycles, helps you get as close to the ideal efficiency as you can.
We aren't enchanted by Bayesian methods merely because they're beautiful. The beauty is a side effect. Bayesian theorems are elegant, coherent, optimal, and provably unique because they are laws.
Addendum: Cyan directs us to chapter 37 of MacKay's excellent statistics book, free online, for a more thorough explanation of the opening problem.
Jaynes, E. T. (1990). Probability Theory as Logic. In P. F. Fougere (Ed.), Maximum Entropy and Bayesian Methods. Kluwer Academic Publishers.
MacKay, D. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press.
You know what really helps me accept a counterintuitive conclusion? Doing the math. I spent an hour reading and rereading this post and the arguments without being fully convinced of Eliezer's position, and then I spent 15 minutes doing the math (R code attached at the end). And once the math came out in favor of Eliezer, the conclusion suddenly doesn't seem so counterintuitive :)
Here we go. I'm dividing all the numbers by five (n = 20 patients, r = 14 cures) to make the code work, but it's pretty convincing either way.
In this setup, it's clear that Pa (researcher A's sampling distribution) and Pb (researcher B's) aren't equal for everything you want to measure. For example, for any evidence E that doesn't contain exactly 20 observations, Pa(E) = 0. However, Reverend Bayes reminds us that the strength of our EVIDENCE depends on the odds ratio, not on the individual probabilities:
P(H1|E) / P(H0|E) = [P(H1) / P(H0)] * [P(E|H1) / P(E|H0)], aka posterior odds = prior odds * odds ratio of the evidence. Assuming that the prior odds are the same, let's calculate the odds ratio for both Pa and Pb and see if they are different.
Pa(E|H0) = 12.4%, as a simple binomial distribution: dbinom(14,20,0.6) (H0 here is a 60% cure rate, H1 a 70% cure rate). Pa(E|H1) = 19.1%. The odds ratio: Pa(E|H1)/Pa(E|H0) = 1.54. That's the only measure of how much our posterior should change. If originally we gave each hypothesis an equal chance (1:1), we now favor H1 at a ratio of 1.54:1. In terms of probability, we changed our credence in H1 from 50% to 60.6%.
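For completeness, researcher A's side of the calculation spelled out in R (these are the calls behind the numbers above):

pa_h0 <- dbinom(14, 20, 0.6)   # Pa(E|H0), about 0.124
pa_h1 <- dbinom(14, 20, 0.7)   # Pa(E|H1), about 0.191
odds <- pa_h1 / pa_h0          # odds ratio of the evidence, about 1.54
odds / (1 + odds)              # posterior P(H1) starting from 1:1 odds, about 0.606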
What about researcher B? I simulated researcher B a million times in each possible world, the H0 world and the H1 world. In the H0 world, evidence E occurred only 5974 times out of a million, for Pb(E|H0) = 0.597%, which is very far from 12.4%. It makes sense: researcher B usually stops after the first trial, and occasionally goes on for zillions! What about the H1 world? Pb(E|H1) = 0.919%. The odds ratio: Pb(E|H1) / Pb(E|H0) = (wait for it) 1.537. The same, up to simulation noise!
I think all the other posts explain quite well why this was obviously the case, but if you like to see the numbers back up one side of an argument, you got 'em. I personally am now converted, amen.
R code for simulating a single researcher B:
resb <- function(p = 0.6) {
  cures <- 0
  tries <- 0
  while (tries < 21) { # Since we only care whether B stops after exactly 20 trials, we don't need to simulate past trial 21.
    tries <- tries + 1
    cures <- cures + rbinom(1, 1, p)  # treat one more patient; cured with probability p
    # Assumed stopping rule: B stops as soon as the observed cure rate reaches 70%
    if (cures / tries >= 0.7) break
  }
  tries
}
R code for simulating a million researchers B in H1 world:
x<-sapply(1:1000000,function(i) {resb(0.7)})
length(x[x==20])