This is a question I asked on Physics Stack Exchange a while back, and I thought it would be interesting to hear people's thoughts on it here. You can find the original question here.
What do we mean when we say that we have a probabilistic theory of some phenomenon?
Of course, we know from experience that probabilistic theories "work", in the sense that they can (somehow) be used to make predictions about the world, they can be considered refuted under appropriate circumstances, and they generally appear to be subject to the same kinds of principles that govern other kinds of explanations of the world. The Ising model predicts the ferromagnetic phase transition, scattering amplitude computations in quantum field theory predict the rates of transition between different quantum states, and I can make impressively sharp predictions about the ensemble properties of a long sequence of coin tosses by using results such as the central limit theorem. Regardless, there seems to be a foundational problem at the center of the whole enterprise of probabilistic theorizing - the construction of what is sometimes called "an interpretation of the probability calculus" in the philosophical literature - which to me looks insurmountable.
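To make the coin-toss example concrete, here is a minimal simulation sketch (the code and numbers are mine, purely to illustrate the kind of "impressively sharp" ensemble prediction I mean):

```python
import math
import random

def heads_count(n_tosses: int) -> int:
    """Simulate n fair coin tosses and count the heads."""
    return sum(random.random() < 0.5 for _ in range(n_tosses))

n = 100_000
mean = n / 2               # expected number of heads for a fair coin
sigma = math.sqrt(n) / 2   # standard deviation of the head count

observed = heads_count(n)
print(f"observed heads:  {observed}")
print(f"CLT prediction:  {mean:.0f} +/- {2 * sigma:.0f}  (covers roughly 95% of runs)")
```

The prediction is sharp, yet even here it is only a statement about what happens with high probability - which is exactly the issue below.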
A probabilistic theory comes equipped with an event space and a probability measure attached to it, both of which are fixed by the theory in some manner. However, the probability measure occupies a strictly epiphenomenal position relative to what actually happens. Deterministic theories have the feature that they forbid some class of events from happening - for instance, the second law of thermodynamics forbids the flow of heat from a cold object to a hot object in an isolated system. The probabilistic component of a theory has no such character, even in principle. Even if we observed an event of zero probability, formally this would not be enough to reject the theory, since a set of zero probability measure need not be empty. (This raises the question of, for instance, whether a pure quantum state in some energy eigenstate could ever be measured to be outside of that eigenstate - is this merely an event of probability $0$, or is it in fact forbidden?)
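For concreteness, the textbook example of a nonempty event of probability zero (nothing quantum about it) is a single point under the uniform distribution on $[0,1]$: for every $x \in (0,1)$,

$$P(\{x\}) = \lim_{\varepsilon \to 0} P\big((x-\varepsilon, x+\varepsilon)\big) \le \lim_{\varepsilon \to 0} 2\varepsilon = 0,$$

yet every draw realizes some particular $x$, so an event of probability zero occurs on every trial. "Probability zero" therefore cannot be read as "forbidden" without some further principle.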
The legitimacy of using probabilistic theories then rests on the implicit assumption that events of zero (or sufficiently small) probability are in some sense negligible. However, it's not clear why we should believe this as a prior axiom. There are certainly other types of sets we might consider to be "negligible" - for instance, if we are doing probability theory on a Polish space, the collection of meager sets and the collection of null measure sets are both in some sense "negligible", but these notions are in fact perpendicular to each other: the whole space can be written as the union of a meager set and a set of null measure. This result forces us to make a choice as to which class of sets we will neglect, or otherwise we will end up neglecting the whole space!
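The decomposition is completely explicit. For the real line with Lebesgue measure, a sketch of the classical construction: let $(q_n)_{n \ge 1}$ enumerate the rationals and set

$$G_m = \bigcup_{n \ge 1} \left(q_n - 2^{-(n+m)},\; q_n + 2^{-(n+m)}\right), \qquad N = \bigcap_{m \ge 1} G_m.$$

Each $G_m$ is open and dense with measure at most $\sum_{n \ge 1} 2^{1-(n+m)} = 2^{1-m}$, so $N$ is a null set, while its complement $\mathbb{R} \setminus N = \bigcup_m (\mathbb{R} \setminus G_m)$ is a countable union of closed nowhere dense sets, hence meager. Thus $\mathbb{R} = N \cup (\mathbb{R} \setminus N)$ splits into a null set and a meager set, and neglecting both classes of "negligible" sets would mean neglecting everything.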
Moreover, ergodic theorems (such as the law of large numbers) which link spatial averages to temporal averages don't help us here, even if we use versions of them with explicit error estimates (like the central limit theorem), because these estimates only hold with probability $1 - \varepsilon$ for some small $\varepsilon > 0$, and even in the infinite limit they only hold with probability $1$, and we're back to the problems I discussed above. So while these theorems allow one to use some hypothesis test to reject the theory as per the frequentist approach, for the theory to have any predictive power at all this hypothesis test has to be put inside the theory.
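To see the regress with a concrete bound in hand (Hoeffding's inequality for a fair coin, used here just as a stand-in for any CLT-style estimate): if $S_n$ is the number of heads in $n$ tosses, then

$$P\!\left(\left|\frac{S_n}{n} - \frac{1}{2}\right| \ge t\right) \le 2 e^{-2 n t^2},$$

so the guarantee that the empirical frequency is close to $1/2$ is itself only a statement holding with probability at least $1 - 2e^{-2nt^2}$; the estimate never steps outside the probabilistic idiom it was supposed to ground.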
The alternative is to adopt a Bayesian approach, in which case the function of a probabilistic theory becomes purely normative - it informs us about how an agent with a given utility function should act. I certainly don't conceive of the theory of quantum mechanics as fundamentally being a prescription for how humans should act, so this approach seems to simply define the problem out of existence, and is wholly unsatisfying. Why should we even accept this view of decision theory when we have given no fundamental justification for the use of probabilities to start with?
I think it is not circular, though I can imagine why it seems so. Let me try to elaborate the order of operations as I see it.
I think the key confusion here is that it may seem like one needs the decision theory set up already in order to justify the scoring rule (to establish that it incentivizes honest revelation), but the decision theory also depends on the scoring rule. I claim that the scoring rule can be justified on grounds other than honest revelation. If you don't buy the argument of invariance under observation orderings, I can probably come up with other justifications, e.g. from coding theory. Closing the decision-theoretic loop also does provide some justificatory force, even if it is circular, since being able to set up a revelation theorem is certainly a nice feature of this $\log P(A)$ norm.
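One way to cash out the coding-theory justification (just a sketch): for a true distribution $p$ and a reported distribution $q$ over a countable outcome space, Gibbs' inequality gives

$$\sum_i p_i \log q_i \le \sum_i p_i \log p_i,$$

with equality exactly when $q = p$; equivalently, the expected length of a code built from $q$ is minimized by reporting $q = p$. That is propriety of the $\log P(A)$ rule, derived without any mention of an agent's utilities.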
But fundamentally, whether in this system or Aristotle's, one doesn't identify the epistemic norms by trying to incentivize honest reporting of beliefs, but rather by trying to validate reports that align with reality. The $\log P(A)$ rule stands as a way of extending the desire for reports that align with reality to the non-Boolean logic of probability, so that we can talk rationally about sea-battles and other uncertain events, without having to think about the order in which we find things out.
I haven't studied this difference, but I want to register my initial intuition that to the extent other proper scoring rules give different value-of-information incentives than the log scoring rule, they are worse and the incentives from the log rule are better. In particular, I expect the incentives of the log rule to be more invariant to different ways of asking multiple questions that basically add up to one composite problem domain, and that being sensitive to that would be a misfeature.
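As a toy check of that invariance intuition (the numbers and the decomposition are made up for illustration): score an honest forecaster on a three-outcome question asked two ways - directly, or as "A vs. not-A" followed by "B vs. C". The log scores agree outcome by outcome; the Brier totals do not.

```python
import math

# Hypothetical composite question with three exclusive outcomes and an honest
# forecaster whose distribution is p.  Toy numbers, chosen only for this check.
p = {"A": 0.5, "B": 0.3, "C": 0.2}

def log_score(prob_of_outcome: float) -> float:
    """Log score: log of the probability assigned to what actually happened."""
    return math.log(prob_of_outcome)

def brier_score(report: dict, outcome: str) -> float:
    """Quadratic (Brier) penalty of a reported distribution against the realized outcome."""
    return sum((q - (1.0 if o == outcome else 0.0)) ** 2 for o, q in report.items())

for outcome in p:
    # Way 1: ask the composite question directly.
    log_direct = log_score(p[outcome])
    brier_direct = brier_score(p, outcome)

    # Way 2: ask "A or not-A?" first, then "B or C?" conditionally.
    p_not_a = p["B"] + p["C"]
    first_report = {"A": p["A"], "not A": p_not_a}
    first_answer = "A" if outcome == "A" else "not A"
    log_seq = log_score(first_report[first_answer])
    brier_seq = brier_score(first_report, first_answer)
    if outcome != "A":
        cond_report = {"B": p["B"] / p_not_a, "C": p["C"] / p_not_a}
        log_seq += log_score(cond_report[outcome])
        brier_seq += brier_score(cond_report, outcome)

    print(f"{outcome}: log direct={log_direct:.4f} seq={log_seq:.4f} | "
          f"Brier direct={brier_direct:.4f} seq={brier_seq:.4f}")
```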
Even if a question never resolves fully enough to make all observables either True or False (i.e., if the possibility space is Hausdorff, never resolves to a Dirac delta), but just resolves incrementally to more and more precise observations $A_0 \supset A_1 \supset \cdots \supset A_k \supset \cdots$, the log scoring rule remains proper, since
$$\log P(A_k) + \log P(A_{k+1} \mid A_k) = \log P(A_k) + \log \frac{P(A_{k+1} \cap A_k)}{P(A_k)} = \log P(A_k) + \log \frac{P(A_{k+1})}{P(A_k)} = \log P(A_k) + \log P(A_{k+1}) - \log P(A_k) = \log P(A_{k+1}).$$

I don't think the same can be said for the Brier scoring rule; it doesn't even seem to have a well-defined generalization to this case.
There are a couple fiddly assumptions here I should bring out explicitly: