There once lived a great man named E.T. Jaynes. He knew that Bayesian inference is the only way to do statistics logically and consistently, standing on the shoulders of misunderstood giants Laplace and Gibbs. On numerous occasions he vanquished traditional "frequentist" statisticians with his superior math, demonstrating to anyone with half a brain how the Bayesian way gives faster and more correct results in each example. The weight of evidence falls so heavily on one side that it makes no sense to argue anymore. The fight is over. Bayes wins. The universe runs on Bayes-structure.
Or at least that's what you believe if you learned this stuff from Overcoming Bias.
Like I was until two days ago, when Cyan hit me over the head with something utterly incomprehensible. I suddenly had to go out and understand this stuff, not just believe it. (The original intention, if I remember it correctly, was to impress you all by pulling a Jaynes.) Now I've come back and intend to provoke a full-on flame war on the topic. Because if we can have thoughtful flame wars about gender but not math, we're a bad community. Bad, bad community.
If you're like me two days ago, you kinda "understand" what Bayesians do: assume a prior probability distribution over hypotheses, use evidence to morph it into a posterior distribution over same, and bless the resulting numbers as your "degrees of belief". But chances are that you have a very vague idea of what frequentists do, apart from deriving half-assed results with their ad hoc tools.
Well, here's the ultra-short version: frequentist statistics is the art of drawing true conclusions about the real world instead of assuming prior degrees of belief and coherently adjusting them to avoid Dutch books.
And here's an ultra-short example of what frequentists can do: estimate 100 independent unknown parameters from 100 different sample data sets and have 90 of the estimates turn out to be true to fact afterward. Like, fo'real. Always 90% in the long run, truly, irrevocably and forever. No Bayesian method known today can reliably do the same: the outcome will depend on the priors you assume for each parameter. I don't believe you're going to get lucky with all 100. And even if I believed you a priori (ahem) that don't make it true.
(That's what Jaynes did to achieve his awesome victories: use trained intuition to pick good priors by hand on a per-sample basis. Maybe you can learn this skill somewhere, but not from the Intuitive Explanation.)
How in the world do you do inference without a prior? Well, the characterization of frequentist statistics as "trickery" is totally justified: it has no single coherent approach and the tricks often give conflicting results. Most everybody agrees that you can't do better than Bayes if you have a clear-cut prior; but if you don't, no one is going to kick you out. We sympathize with your predicament and will gladly sell you some twisted technology!
Confidence intervals: imagine you somehow process some sample data to get an interval. Further imagine that hypothetically, for any given hidden parameter value, this calculation algorithm applied to data sampled under that parameter value yields an interval that covers it with probability 90%. Believe it or not, this perverse trick works 90% of the time without requiring any prior distribution on parameter values.
Unbiased estimators: you process the sample data to get a number whose expectation magically coincides with the true parameter value.
Hypothesis testing: I give you a black-box random distribution and claim it obeys a specified formula. You sample some data from the box and inspect it. Frequentism allows you to call me a liar and be wrong no more than 10% of the time reject truthful claims no more than 10% of the time, guaranteed, no prior in sight. (Thanks Eliezer for calling out the mistake, and conchis for the correction!)
But this is getting too academic. I ought to throw you dry wood, good flame material. This hilarious PDF from Andrew Gelman should do the trick. Choice quote:
Well, let me tell you something. The 50 states aren't exchangeable. I've lived in a few of them and visited nearly all the others, and calling them exchangeable is just silly. Calling it a hierarchical or multilevel model doesn't change things - it's an additional level of modeling that I'd rather not do. Call me old-fashioned, but I'd rather let the data speak without applying a probability distribution to something like the 50 states which are neither random nor a sample.
As a bonus, the bibliography to that article contains such marvelous titles as "Why Isn't Everyone a Bayesian?" And Larry Wasserman's followup is also quite disturbing.
Another stick for the fire is provided by Shalizi, who (among other things) makes the correct point that a good Bayesian must never be uncertain about the probability of any future event. That's why he calls Bayesians "Often Wrong, Never In Doubt":
The Bayesian, by definition, believes in a joint distribution of the random sequence X and of the hypothesis M. (Otherwise, Bayes's rule makes no sense.) This means that by integrating over M, we get an unconditional, marginal probability for f.
For my final quote it seems only fair to add one more polemical summary of Cyan's point that made me sit up and look around in a bewildered manner. Credit to Wasserman again:
Pennypacker: You see, physics has really advanced. All those quantities I estimated have now been measured to great precision. Of those thousands of 95 percent intervals, only 3 percent contained the true values! They concluded I was a fraud.
van Nostrand: Pennypacker you fool. I never said those intervals would contain the truth 95 percent of the time. I guaranteed coherence not coverage!
Pennypacker: A lot of good that did me. I should have gone to that objective Bayesian statistician. At least he cares about the frequentist properties of his procedures.
van Nostrand: Well I'm sorry you feel that way Pennypacker. But I can't be responsible for your incoherent colleagues. I've had enough now. Be on your way.
There's often good reason to advocate a correct theory over a wrong one. But all this evidence (ahem) shows that switching to Guardian of Truth mode was, at the very least, premature for me. Bayes isn't the correct theory to make conclusions about the world. As of today, we have no coherent theory for making conclusions about the world. Both perspectives have serious problems. So do yourself a favor and switch to truth-seeker mode.
I've had some training in Bayesian and Frequentist statistics and I think I know enough to say that it would be difficult to give a "simple" and satisfying example. The reason is that if one is dealing with finite dimensional statistical models (this is where the parameter space of the model is finite) and one has chosen a prior for those parameters such that there is non-zero weight on the true values then the Bernstein-von Mises theorem guarantees that the Bayesian posterior distribution and the maximum likelihood estimate converge to the same probability distribution (although you may need to use improper priors). The covers cases where we consider finite outcomes such as a toss of a coin or rolling a die.
I apologize if that's too much jargon, but for really simple models that are easy to specify you tend to get the same answer. Bayesian stats starts to behave different than frequentist statistics in noticeable ways when you consider infinite outcome spaces. An example here might be where you are considering probability distributions over curves (this arises in my research on speech recognition). In this case even if you have a seemingly sensible prior you can end up in the case where, in the limit of infinite data, you will end up with a posterior distribution that is different from the true distribution.
In practice if I am learning a Gaussian Mixture Model for speech curves and I don't have much data then Bayesian procedures tend to be a bit more robust and frequentist procedures end up over-fitting (or being somewhat random). When I start getting more data using frequentist methods tend to be algorithmically more tractable and get better results. So I'll end with faster computation time and say on the task of phoneme recognition I'll make fewer errors.
I'm sorry if I haven't explained it well, the difference in performance wasn't really evident to me until I spent some time actually using them in machine learning. Unfortunately, most of the disadvantage of Bayesian approaches aren't evident for simple statistical problems, but they become all too evident in the case of complex statistical models.
Thanks much!
What do "non-zero weight" and "improper priors" mean?
EDIT: Improper priors mean priors that don't sum to one. I would guess "non-zero weight" means "non-zero probability". But then I wo... (read more)