In my previous post, I alluded to a result that could potentially convince a frequentist to favor Bayesian posterior distributions over confidence intervals. It’s called the complete class theorem, due to a statistician named Abraham Wald. Wald developed the structure of frequentist decision theory and characterized the class of decision rules that have a certain optimality property.
Frequentist decision theory reduces the decision process to its basic constituents: data, actions, true states, and incurred losses. It connects them with mathematical functions that characterize their dependencies: the true state determines the probability distribution of the data, the decision rule maps data to a particular action, and the chosen action and the true state together determine the incurred loss. To evaluate potential decision rules, frequentist decision theory uses the risk function, defined as the expected loss of a decision rule with respect to the data distribution. The risk function therefore maps (decision rule, true state) pairs to the average loss under a hypothetical infinite replication of the decision problem.
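In the usual textbook notation (my own gloss, not part of the original argument): data X are drawn from P_θ under true state θ, the rule δ maps data to actions, and L is the loss function. The risk is then the expected loss under repeated sampling:

```latex
R(\theta, \delta) \;=\; \mathbb{E}_{X \sim P_\theta}\bigl[\, L(\theta, \delta(X)) \,\bigr]
               \;=\; \int L(\theta, \delta(x)) \, dP_\theta(x).
```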
Since the true state is not known, decision rules must be evaluated over all possible true states. A decision rule is said to be “dominated” if there is another decision rule whose risk is never worse for any possible true state and is better for at least one true state. A decision rule which is not dominated is deemed “admissible”. (This is the optimality property alluded to above.) The punch line is that under some weak conditions, the complete class of admissible decision rules is precisely the class of rules which minimize a Bayesian posterior expected loss.
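In the same notation (again a standard gloss, not specific to this post): δ' dominates δ when it is never riskier and is strictly better for at least one true state, and a Bayes rule for a prior π is one that minimizes posterior expected loss:

```latex
\text{dominance:}\quad R(\theta, \delta') \le R(\theta, \delta)\ \ \forall\, \theta,
\quad\text{and}\quad R(\theta_0, \delta') < R(\theta_0, \delta)\ \text{for some } \theta_0;

\text{Bayes rule:}\quad \delta_\pi(x) \;=\; \operatorname*{arg\,min}_{a} \int L(\theta, a)\, p(\theta \mid x)\, d\theta.
```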
(This result sparked interest in the Bayesian approach among statisticians in the 1950s. This interest eventually led to the axiomatic decision theory that characterizes rational agents as obeying certain fundamental constraints and proves that they act as if they had a prior distribution and a loss function.)
Taken together, the calibration results of the previous post and the complete class theorem suggest (to me, anyway) that irrespective of one's philosophical views on frequentism versus Bayesianism, perfect calibration is not possible in full generality for a rational decision-making agent.
Okay, Cyan, I have parsed your posts. I don't know any statistics whatsoever except what I've learned over the last ten hours, but pretty much everything you say seems to be correct, except maybe the last paragraph of this post, which still looks foggy to me. The Jean Perrin example in the other comments section was especially illuminating. Let me rephrase it here for the benefit of future readers:
Suppose you're Jean Perrin trying to determine the value of the Avogadro number. This means you have a family of probability distributions depending on a single parameter, and some numbers that you know were sampled from the distribution with the true parameter value. Now estimate it.
If you're a frequentist, you calculate a 90% confidence interval for the parameter. Briefly, this means you calculate a couple of numbers ("statistics") from the data - like, y'know, average them and stuff - in such a way that, for any given value of the parameter, if you'd imagined calculating those statistics from random values sampled under that parameter, they'd have a 90% chance of lying on opposite sides of it. If a billion statisticians do the same, about 90% of them will be right - not much more and not much less. This is, presumably, good calibration.
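A quick simulation makes the coverage claim concrete. This is a minimal sketch of my own, using a made-up Gaussian measurement model (and a scaled-down crowd of statisticians) rather than anything about Perrin's actual experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 6.02           # stand-in for the unknown parameter (units arbitrary)
sigma = 0.5                 # assumed known measurement noise
n = 25                      # measurements per statistician
n_statisticians = 100_000   # "a billion", scaled down for practicality
z90 = 1.6449                # two-sided 90% normal quantile

# Each row is one statistician's data set.
data = rng.normal(true_value, sigma, size=(n_statisticians, n))
means = data.mean(axis=1)
half_width = z90 * sigma / np.sqrt(n)

# How many of the 90% intervals actually catch the true value?
covered = (means - half_width <= true_value) & (true_value <= means + half_width)
print(f"Empirical coverage: {covered.mean():.3f}")   # close to 0.90 by construction
```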
On the other hand, if you're a Bayesian, you pick an uninformative prior, then use your samples to morph it into a posterior and get a credible interval. Different priors lead to different intervals, and God only knows what proportion of a billion people like you will actually catch the true Avogadro number with their intervals, even though all of you used the credence value of 90%. This is, presumably, poor calibration.
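For contrast, here is the same toy Gaussian setup with a conjugate normal prior (again my illustration, with hypothetical priors, not anything from the original exchange): the 90% credible interval is easy to compute, but its endpoints shift with the prior you choose, and so does its long-run coverage at any fixed true value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

true_value = 6.02
sigma = 0.5
n = 25
data = rng.normal(true_value, sigma, size=n)
xbar = data.mean()

def credible_interval_90(prior_mean, prior_sd):
    """90% central credible interval for a normal likelihood with known sigma
    and a conjugate Normal(prior_mean, prior_sd**2) prior."""
    post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)
    post_mean = post_var * (prior_mean / prior_sd**2 + n * xbar / sigma**2)
    return stats.norm.interval(0.90, loc=post_mean, scale=np.sqrt(post_var))

# Two different (hypothetical) priors give two different 90% intervals
# from the very same data.
print(credible_interval_90(prior_mean=6.0, prior_sd=10.0))  # nearly "uninformative"
print(credible_interval_90(prior_mean=5.0, prior_sd=0.1))   # strongly opinionated
```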
This sounds like an opportune moment to pull a Jaynes and demonstrate conclusively why one side is utterly dumb and the other is forever right, but I don't yet feel the power. Could someone else do that, please? (Eliezer, are you listening?)
The classic answer is that your confidence intervals are liable to occasionally tell you that mass is a negative number, when a large error occurs. Is this interval, which allows only negative masses, 90% likely to be correct? No, even if you used an experimental method that a priori was 90% likely to yield an interval covering the correct answer. In other words, using the confidence interval as the posterior probability and plugging it into the expected-utility decision function doesn't make sense. Frequentists think that ignoring this problem means it goes away.
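A toy version of the negative-mass complaint (my own illustration, not the original author's example): measure a small, necessarily nonnegative mass with large Gaussian errors, and the textbook 90% interval can land entirely below zero while still being a perfectly valid 90% procedure in the coverage sense.

```python
import numpy as np

rng = np.random.default_rng(7)

true_mass = 0.1      # a small, necessarily nonnegative quantity
sigma = 1.0          # measurement noise that is large relative to the mass
n = 4
z90 = 1.6449

# Keep drawing data sets until one yields an interval of purely negative masses.
for trial in range(100_000):
    xbar = rng.normal(true_mass, sigma, size=n).mean()
    lo, hi = xbar - z90 * sigma / np.sqrt(n), xbar + z90 * sigma / np.sqrt(n)
    if hi < 0:
        print(f"trial {trial}: 90% CI = ({lo:.2f}, {hi:.2f}) -- only negative masses")
        break
```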