Okay, Cyan, I have parsed your posts. I don't know any statistics whatsoever except what I've learned over the last ten hours, but pretty much everything you say seems to be correct, except maybe the last paragraph of this post, which still looks foggy to me. The Jean Perrin example in the other comments section was especially illuminating. Let me rephrase it here for the benefit of future readers:
Suppose you're Jean Perrin trying to determine the value of the Avogadro number. This means you have a family of probability distributions depending on a single parameter, and some numbers that you know were sampled from the distribution with the true parameter value. Now estimate it.
If you're a frequentist, you calculate a 90% confidence interval for the parameter. Briefly, this means you calculate a couple of numbers ("statistics") from the data - like, y'know, average them and stuff - in such a way that, for any given value of the parameter, if you imagined recalculating those statistics from random values sampled under that parameter, they'd have a 90% chance of landing on opposite sides of it. If a billion statisticians do the same, about 90% of them will be right - not much more and not much less. This is, presumably, good calibration.
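To make the "billion statisticians" picture concrete, here's a tiny Python sketch - my own made-up normal-measurement model, nothing to do with Perrin's actual experiment - where each simulated statistician reports the textbook 90% interval and we just count how many catch the true value:

```python
# Toy coverage check (my own numbers, not Perrin's setup): lots of
# "statisticians" each observe n normal samples with unknown mean mu and
# known sigma, and each reports the textbook 90% interval
# xbar +/- 1.645 * sigma / sqrt(n).
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma, n = 6.022, 0.5, 10        # arbitrary "true" values for the toy
n_statisticians = 100_000
z = 1.6449                                # 95th percentile of the standard normal

samples = rng.normal(mu_true, sigma, size=(n_statisticians, n))
xbar = samples.mean(axis=1)
half_width = z * sigma / np.sqrt(n)
covered = (xbar - half_width <= mu_true) & (mu_true <= xbar + half_width)

print("fraction of intervals that catch the true value:", covered.mean())  # ~0.90
```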
On the other hand, if you're a Bayesian, you pick an uninformative prior, then use your samples to morph it into a posterior and get a credible interval. Different priors lead to different intervals, and God only knows what proportion of a billion people like you will actually catch the true Avogadro number with their intervals, even though all of you used the same 90% credence. This is, presumably, poor calibration.
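And here's the contrast in the same toy model - the informative prior below is my own arbitrary choice, just to show that the coverage of a 90% credible interval can drift away from 90% depending on where the true value happens to sit:

```python
# Sketch (my own toy choice): same data model as above, but now each
# "Bayesian" uses an informative prior mu ~ Normal(0, 1) and reports a 90%
# credible interval from the conjugate normal posterior. Its frequentist
# coverage now depends on where the true mu actually is.
import numpy as np

rng = np.random.default_rng(1)
sigma, n, n_reps = 0.5, 10, 100_000
z = 1.6449
m0, tau = 0.0, 1.0                         # assumed prior: mu ~ Normal(m0, tau^2)

for mu_true in [0.0, 3.0, 6.0]:            # a few possible "true" values
    samples = rng.normal(mu_true, sigma, size=(n_reps, n))
    xbar = samples.mean(axis=1)
    post_var = 1.0 / (1.0 / tau**2 + n / sigma**2)              # conjugate update
    post_mean = post_var * (m0 / tau**2 + n * xbar / sigma**2)
    half_width = z * np.sqrt(post_var)
    covered = np.abs(post_mean - mu_true) <= half_width
    print(f"true mu = {mu_true}: coverage of the 90% credible interval = {covered.mean():.3f}")
    # ~0.90 when the true mu sits where the prior expects it,
    # noticeably lower when it doesn't.
```

(With a flat prior the credible interval here would coincide with the confidence interval and coverage would be exactly 90%; the drift comes entirely from the informative prior.)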
This sounds like an opportune moment to pull a Jaynes and demonstrate conclusively why one side is utterly dumb and the other is forever right, but I don't yet feel the power. Will someone else do that, please? (Eliezer, are you listening?)
The classic answer is that your confidence intervals are liable to occasionally tell you that mass is a negative number, when a large error occurs. Is this interval, which allows only negative masses, 90% likely to be correct? No, even if you used an experimental method that a priori was 90% likely to yield an interval covering the correct answer. In other words, using the confidence interval as the posterior probability and plugging it into the expected-utility decision function doesn't make sense. Frequentists think that ignoring this problem means it goes away.
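A toy illustration of that pathology, with numbers I made up (a noisy measurement of a small positive mass): the procedure has 90% coverage overall, but on the occasions when it hands you an all-negative interval, it is right 0% of the time, not 90%.

```python
# Sketch of the "negative mass" pathology: a toy measurement of a small
# positive mass (my numbers) with a noisy instrument. The standard 90%
# confidence interval sometimes lies entirely below zero.
import numpy as np

rng = np.random.default_rng(2)
mass_true, sigma, n, n_reps = 0.1, 1.0, 4, 100_000
z = 1.6449

xbar = rng.normal(mass_true, sigma, size=(n_reps, n)).mean(axis=1)
lo = xbar - z * sigma / np.sqrt(n)
hi = xbar + z * sigma / np.sqrt(n)

covered = (lo <= mass_true) & (mass_true <= hi)
all_negative = hi < 0

print("overall coverage:", covered.mean())                       # ~0.90
print("fraction of intervals entirely below zero:", all_negative.mean())
print("coverage among those intervals:", covered[all_negative].mean())  # 0.0
```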
I already gave Cyan that classic answer, complete with a link to Jaynes, in this very comment thread. :-) But it doesn't settle the problem completely for me. It feels like finger-pointing. Yes, frequentists give lower-quality answers; but why isn't the average calibration of a billion Bayesians in any way related to that 90% number that they all use?
I pulled a little switcheroo in the Avogadro's number example: calibration is a property of one agent considering multiple estimation problems, not multiple agents considering one estimation problem. But I think the argument still goes through, i.e., your summary above could be rewritten to take this into account just by changing a few words.
Hmm. I hadn't noticed that; stupidity strikes again. But regardless of the semantics of the word "calibration", the property outlined in my summary seems like a nice property to have, and I feel kinda left out for not possessing it.
The absence of comments here doesn't reflect well on us, but this is a tricky topic. I'm honestly trying to get to the bottom of this and the bottom ain't in sight yet.
EDIT: I'm not sure a prior that matched confidence intervals would be a good thing. See point III.b "Truncated exponential distribution" in this pdf for an example where a 90% confidence interval gives a result that's actually logically ruled out by the sample. (Cyan, am I restating obvious stuff? Too stupid to say for sure yet.)
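Here's a rough numeric sketch of that example as I understand it (Jaynes's data {12, 14, 16}; needs scipy). The model is p(x|theta) = exp(-(x - theta)) for x > theta, so every observation exceeds theta and the sample logically forces theta < 12, yet the shortest 90% confidence interval built from the sample mean comes out entirely above 12:

```python
# Truncated exponential example: the shortest 90% confidence interval for
# theta, based on the sampling distribution of the sample mean, can exclude
# the entire logically possible region theta < min(data).
import numpy as np
from scipy import stats

data = np.array([12.0, 14.0, 16.0])
n, xbar = len(data), data.mean()

# xbar - theta is the mean of n iid Exp(1) variables, i.e. Gamma(n, scale=1/n).
samp = stats.gamma(a=n, scale=1.0 / n)

def shortest_interval(dist, mass=0.90, grid_size=2000):
    """Approximate shortest interval carrying `mass` probability under dist."""
    p_lo = np.linspace(1e-6, 1.0 - mass - 1e-6, grid_size)
    lengths = dist.ppf(p_lo + mass) - dist.ppf(p_lo)
    best = p_lo[np.argmin(lengths)]
    return dist.ppf(best), dist.ppf(best + mass)

u_lo, u_hi = shortest_interval(samp)         # interval for xbar - theta
ci_lo, ci_hi = xbar - u_hi, xbar - u_lo      # implied interval for theta

print(f"shortest 90% confidence interval for theta: ({ci_lo:.2f}, {ci_hi:.2f})")
print("but every observation exceeds theta, so logically theta <", data.min())
# For this data the interval comes out around (12.15, 13.85), entirely above 12.
```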
To be honest, I'm not shocked that most people aren't equipped to or interested in grappling with this stuff. If I weren't a Bayesian working for a frequentist I wouldn't be thinking so much about why frequentists do what they do. I was hoping that the more mathematically inclined folks would find this argument startling enough to try to knock it down -- I'd be happy to be wrong.
It isn't so much that we want posterior intervals to match some crappy-arsed confidence interval. We just want them to be calibrated, and as near as I can tell, calibration is equivalent to valid confidence coverage. We know from results in the literature on matching priors that posterior intervals aren't calibrated in general (provided I've got the equivalence of calibration and valid confidence coverage right). So we can have calibration, or we can have rational decisions, but not both (?).
I suspect it's an issue of jargon or technical difficulty. Like I mentioned, my math background is at least decent, and I have some serious trouble wrapping my mind around what issue is being debated here. I strongly suspect that has more to do with how the issue is presented and explained than with the issue itself, though I could be quite wrong.
There's got to be a way to express this in plain English (or even plain math); how, for example, do a frequentist and a Bayesian see the same problem differently, and why should we care?
If you pick out some of the specific jargon words that are opaque to you, I can taboo them or provide links, and we'll see if I can revise the posts into comprehensibility.
I've posted the link before, but it's most appropriate to this thread:
Bayesian or Frequentist, Which Are You? Video lecture by Michael Jordan covering a lot of the same points.
Another intriguing point for the discussion.
Jaynes cites Zellner and Thornber's experiments comparing the performance of Bayesian vs frequentist methods. Bayes won in both cases, I presume on coverage too. The reason for that was pretty funny: quote, "By the time all necessary provisions for a 'fair' contest have been incorporated into the experiment, all the ingredients of the Bayesian theory (prior distribution, loss function, etc.) will necessarily be present... The simulation can only demonstrate the mathematical theorem." In other words, frequentist confidence coverage might sometimes win on real-world examples like the Avogadro number, but Bayes will win any arranged contests precisely because they're arranged. :-)
To those who feel anti-Bayesian today I recommend Shalizi's blog, and also the following joke I found on the net:
Prior to the birth of Thomas Bayes, the proud parents, Mr. Joshua Bayes and Mrs. Ann Carpenter Bayes, had 11 daughters (Anne, Rebecca, Mary, etc.). While Mrs. Bayes was pregnant with Thomas, she REALLY, REALLY wanted a son. So they went to the local seer, who placed her hands on Mrs. Bayes' stomach and pronounced that without a doubt, the next baby would be a boy. Well, Mrs. Bayes really, really believed that this next baby would be a boy. So when the baby actually arrived, the actual physical evidence that the baby was a girl was not strong enough to overcome her prior (ahem) belief that the baby would be a boy, and so Joshua and Anne named their new baby daughter Thomas and raised her to be the son they had always wanted.
"Taken together, the calibration results of the previous post and the complete class theorem suggest (to me, anyway) that irrespective of one's philosophical views on frequentism versus Bayesianism, perfect calibration is not possible in full generality for a rational decision-making agent."
Huh? I feel like there's a giant chunk missing just before this paragraph, which seems to have nothing to do with anything you said prior to it.
In my previous post, I alluded to a result that could potentially convince a frequentist to favor Bayesian posterior distributions over confidence intervals. It’s called the complete class theorem, due to a statistician named Abraham Wald. Wald developed the structure of frequentist decision theory and characterized the class of decision rules that have a certain optimality property.
Frequentist decision theory reduces the decision process to its basic constituents, i.e., data, actions, true states, and incurred losses. It connects them using mathematical functions that characterize their dependencies, i.e., the true state determines the probability distribution of the data, the decision rule maps data to a particular action, and the chosen action and the true state together determine the incurred loss. To evaluate potential decision rules, frequentist decision theory uses the risk function, which is defined as the expected loss of a decision rule with respect to the data distribution. The risk function therefore maps (decision rule, true state)-pairs to the average loss under a hypothetical infinite replication of the decision problem.
Since the true state is not known, decision rules must be evaluated over all possible true states. A decision rule is said to be “dominated” if there is another decision rule whose risk is never worse for any possible true state and is better for at least one true state. A decision rule which is not dominated is deemed “admissible”. (This is the optimality property alluded to above.) The punch line is that under some weak conditions, the complete class of admissible decision rules is precisely the class of rules which minimize a Bayesian posterior expected loss.
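For concreteness, here's a little sketch of that bookkeeping on a toy problem of my own choosing (normal data with known variance, squared-error loss): estimate the risk of a few candidate rules over a grid of true states and check which rules are dominated.

```python
# Toy risk-function bookkeeping (my own example, not from the post): normal
# data with known sigma, squared-error loss. Estimate R(rule, theta) by Monte
# Carlo on a grid of true states, then check for dominance.
import numpy as np

sigma, n = 1.0, 5
thetas = np.linspace(-3, 3, 61)            # grid of possible true states

def risk(rule, theta, n_reps=100_000, seed=0):
    """Monte Carlo estimate of E[(rule(data) - theta)^2] when theta is true."""
    rng = np.random.default_rng(seed)
    data = rng.normal(theta, sigma, size=(n_reps, n))
    return np.mean((rule(data) - theta) ** 2)

rules = {
    "sample mean":        lambda d: d.mean(axis=1),
    "sample mean + 1":    lambda d: d.mean(axis=1) + 1.0,   # an obviously bad rule
    "shrink toward zero": lambda d: 0.8 * d.mean(axis=1),   # a Bayes-flavored rule
}

risks = {name: np.array([risk(r, t) for t in thetas]) for name, r in rules.items()}

for a in rules:
    for b in rules:
        if a != b and np.all(risks[a] <= risks[b]) and np.any(risks[a] < risks[b]):
            print(f"'{b}' is dominated by '{a}' on this grid")
# Expected: 'sample mean + 1' is dominated (inadmissible); 'sample mean' and
# 'shrink toward zero' each do better for some thetas, so neither dominates
# the other, and both survive the admissibility screen.
```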
(This result sparked interest in the Bayesian approach among statisticians in the 1950s. This interest eventually led to the axiomatic decision theory that characterizes rational agents as obeying certain fundamental constraints and proves that they act as if they had a prior distribution and a loss function.)
Taken together, the calibration results of the previous post and the complete class theorem suggest (to me, anyway) that irrespective of one's philosophical views on frequentism versus Bayesianism, perfect calibration is not possible in full generality for a rational decision-making agent.