Great summary!
P7 looks self-evident to me. I'm less comfortable with the P6. Unbounded utility depends on P6 requiring a partition into an arbitrarily large number of parts - is this used in the proof of bounded utility? In general, I don't think Archimedian axioms are safe given our current level of understanding of Pascal's mugger-like problems.
EDIT: P7 doesn't look as self-evident anymore. Consider the St. Petersburg lottery. Any particular payout from buying a ticket for $2 is worse than the expected value of buying a ticket for $1, but obviously it is...
Copy of the Savage book: https://www.gwern.net/docs/statistics/1972-savage-foundationsofstatistics.djvu
Peter Wakker apparently thinks he found a way to have unbounded utilities and obey most of Savage's axioms. See Unbounded utility for Savage's "Foundations of Statistics," and other models. I'll say more if and when I understand that paper.
This is very nice.
One point I find less than perfectly convincing: the motivation of the "total" part of P1 by saying that if our preorder were partial then we'd have two different kinds of indifference.
First off, I don't see anything bad about that in terms of mathematical elegance. Consider, e.g., Conway's beautiful theory of numbers and (two player perfect-information) games, in which the former turn out to be a special case of the latter. When you extend the <= relation on numbers to games, you get a partial preorder with, yes, two kinds o...
Excellent summary. Savage's founding of statistics is nice because it only assumes that agents have to make choices between actions, making no assumptions about whether they have to have beliefs or goals. This is important because agents in general don't have to use beliefs or goals, but they do all have to chose actions.
Thanks for the info about boundedness, I didn't notice that on my quick skim through the book.
Here is a small counterexample to P2. States = { Red, Green, Blue }. Outcomes = { Win, Lose }. Since there are only two outcomes, we can write actions as the subset of states that Win. My preferences are: {} < { Green } = { Blue } < { Red } < { Red,Green } = { Red,Blue } < { Green,Blue } < { Red,Green,Blue }
This contradicts P2 because { Green } < { Red } but { Red,Blue } < { Green,Blue }.
Here is a situation where this may apply: There is an urn with 300 balls. 100 of them are red. The rest are either green or blue. You draw a ball f...
I don't think P2 can be justified by Dutch Book type arguments. I don't think it can be justified, as a rational requirement of choice, at all. My reservations are similar to Edward McClennen's in "Sure Thing Doubts".
Here's a quick issue I only just noticed but which fortunately is easily fixed:
Above I mentioned you probably want to restrict to a sigma-algebra of events and only allow measurable functions as actions. But, what does measurable mean here? Fortunately, the ordering on outcomes (even without utility) makes measurability meaningful. Except this puts a circularity in the setup, because the ordering on outcomes is induced from the ordering on actions.
Fortunately this is easily patched. You can start with the assumption of a total preorder on outcomes (con
...I don't think transitivity is a reasonable assumption.
Suppose an agent is composed of simpler submodules--this, to a very rough approximation, is how actual brains seem to function--and its expressed preferences (i.e. actions) are assembled by polling its submodules.
Bam, voting paradox. Transitivity is out.
Here is where I would like to suggest a small modification to this setup
I do not like this modification. His way is more elegant because it starts with less information. It is less robust because it does not depend on an initial concept of "knowledge". "knowledge" does not make sense in all instances.
P1 is Dutch Book justifiable, I think. For instance x has to not be preferred to x, or else trading x for x would be a benefit.
This thought isn't original to me, but it's probably worth making. It feels like there are two sorts of axioms. I am following tradition in describing them as "rationality axioms" and "structure axioms". The rationality axioms (like the transitivity of the order among acts) are norms on action. The structure axioms (like P6) aren't normative at all. (It's about structure on the world, how bizarre is it to say "The world ought to be such that P6 holds of it"?)
Given this, and given the necessity of the structure axioms for the proof, it feels like Savage's theorem can't serve as a justification of Bayesian epistemolgy as a norm of rational behaviour.
Here is a small counterexample to P2. States = { Red, Green, Blue }. Outcomes = { Win, Lose }. Since there are only two outcomes, we can write actions as the subset of states that Win. My preferences are: {} < { Green } = { Blue } < { Red } < { Red,Green } = { Red,Blue } < { Green,Blue } < { Red,Green,Blue }
Here is a situation where this may apply: There is an urn with 300 balls. 100 of them are red. The rest are either green or blue. You draw a ball from this urn.
So Red represents definite probability 1/3, while Green and Blue are unknow...
the first 6 axioms are enough to handle finite gambles, 7 is only needed for more general situations
You mean P7 is implied already by P1-6 for finite B, I assume.
I'm confused about P2b.
Suppose C = {s1, s2}, B = {s2},
f(s1) = a, g(s1) = g(s2) = b, f(s2) = c,
with a < b < c.
It seems reasonable that there could be a situation where f≤g given C, but g<f given B. Am I missing something?
Edit: I think the P2c I wrote originally may have been a bit too weak; fixed that. Nevermind, rechecking, that wasn't needed.
More edits (now consolidated): Edited nontriviality note. Edited totality note. Added in the definition of numerical probability in terms of qualitative probability (though not the proof that it works). Also slight clarifications on implications of P6' and P6''' on partitions into equivalent and almost-equivalent parts, respectively.
One very late edit, June 2: Even though we don't get countable additivity, we still want a σ-algebra rather than just an algebra (this is needed for some of the proofs in the "partition conditions" section that I don't go into here). Also noted nonemptiness of gambles.
The idea that rational agents act in a manner isomorphic to expected-utility maximizers is often used here, typically justified with the Von Neumann-Morgenstern theorem. (The last of Von Neumann and Morgenstern's axioms, the independence axiom, can be grounded in a Dutch book argument.) But the Von Neumann-Morgenstern theorem assumes that the agent already measures its beliefs with (finitely additive) probabilities. This in turn is often justified with Cox's theorem (valid so long as we assume a "large world", which is implied by e.g. the existence of a fair coin). But Cox's theorem assumes as an axiom that the plausibility of a statement is taken to be a real number, a very large assumption! I have also seen this justified here with Dutch book arguments, but these all seem to assume that we are already using some notion of expected utility maximization (which is not only somewhat circular, but also a considerably stronger assumption than that plausibilities are measured with real numbers).
There is a way of grounding both (finitely additive) probability and utility simultaneously, however, as detailed by Leonard Savage in his Foundations of Statistics (1954). In this article I will state the axioms and definitions he gives, give a summary of their logical structure, and suggest a slight modification (which is equivalent mathematically but slightly more philosophically satisfying). I would also like to ask the question: To what extent can these axioms be grounded in Dutch book arguments or other more basic principles? I warn the reader that I have not worked through all the proofs myself and I suggest simply finding a copy of the book if you want more detail.
Peter Fishburn later showed in Utility Theory for Decision Making (1970) that the axioms set forth here actually imply that utility is bounded.
(Note: The versions of the axioms and definitions in the end papers are formulated slightly differently from the ones in the text of the book, and in the 1954 version have an error. I'll be using the ones from the text, though in some cases I'll reformulate them slightly.)
Primitive notions; preference given a set of states
We will use the following primitive notions. Firstly, there is a set S of "states of the world"; the exact current state of the world is unknown to the agent. Secondly, there is a set F of "consequences" - things that can happen as a result of the agent's actions. Actions or acts will be interpreted as functions f:S→F, as two actions which have the same consequences regardless of the state of the world are indistinguishable and hence considered equal. While the agent may be uncertain as to the exact results of its actions, this can be folded into his uncertainty about the state of the world. Finally, we introduces as primitive a relation ≤ on the set of actions, interpreted as "is not preferred to". I.e., f≤g means that given a choice between actions f and g, the agent will either prefer g or be indifferent. As usual, sets of states will be referred to as "events", and for the usual reasons we may want to restrict the set of admissible events to a boolean σ-subalgebra of ℘(S), though I don't know if that's really necessary here (Savage doesn't seem to do so, though he does discuss it some). [Edit nine years later: This actually introduces a slight issue I didn't realize before, but fortunately it's easily fixed.]
In any case, we then have the following axiom:
P1. The relation ≤ is a total preorder.
The intuition here for transitivity is pretty clear. For totality, if the agent is presented with a choice of two acts, it must choose one of them! Or be indifferent. Perhaps we could instead use a partial preorder (or order?), though this would give us two different indistinguishable flavors of indifference, which seems problematic. But this could be useful if we wanted intransitive indifference. So long as indifference is transitive, though, we can collapse this into a total preorder.
As usual we can then define f≥g, f<g (meaning "it is false that g≤f"), and g>f. I will use f≡g to mean "f≤g and g≤f", i.e., the agent is indifferent between f and g. (Savage uses an equals sign with a dot over it.)
Note that though ≤ is defined in terms of how the agent chooses when presented with two options, Savage later notes that there is a construction of W. Allen Wallis that allows one to adduce the agent's preference ordering among a finite set of more than two options (modulo indifference): Simply tell the agent to rank the options given, and that afterward, two of them will be chosen uniformly at random, and it will get whichever one it ranked higher.
The second axiom states that if two actions have the same consequences in some situation, just what that equal consequence is does not affect their relative ordering:
P2. Suppose f≤g, and B is a set of states such f and g agree on B. If f' and g' are another pair of acts which, outside of B, agree with f and g respectively, and on B, agree with each other, then f'≤g'.
In other words, to decide between two actions, only the cases where they actually have different consequences matter.
With this axiom, we can now define:
D1. We say "f≤g given B" to mean that if f' and g' are actions such that f' agrees with f on B, g' agrees with g on B, and f' and g' agree with each other outside of B, then f'≤g'.
Due to axiom P2, this is well-defined.
Here is where I would like to suggest a small modification to this setup. The notion of "f≤g given B" is implicitly taken to be how the agent makes decisions if it knows that B obtains. However it seems to me that we should actually take "f≤g given B", rather than f≤g, to be the primitive notion, explicitly interpeted as "the agent does not prefer f to g if it knows that B obtains". The agent always has some state of prior knowledge and this way we have explicitly specified decisions under a given state of knowledge - the acts we are concerned with - as the basis of our theory. Rather than defining f≤g given B in terms of ≤, we can define f≤g to mean "f≤g given S" and then add additional axioms governing the relation between "≤ given B" for varying B, which in Savage's setup are theorems or part of the definition D1.
(Specifically, I would modify P1 and P2 to talk about "≤ given B" rather than ≤, and add the following theorems as axioms:
P2a. If f and g agree on B, then f≡g given B.
P2b. If B⊆C, f≤g given C, and f and g agree outside B, then f≤g given B.
P2c. If B and C are disjoint, and f≤g given B and given C, then f≤g given B∪C.
This is a little unwieldy and perhaps there is an easier way - these might not be minimal. But they do seem to be sufficient.)
In any case, regardless which way we do it, we've now established the notion of preference given that a set of states obtains, as well as preference without additional knowledge, so henceforth I'll freely use both as Savage does without worrying about which makes a better foundation, since they are equivalent.
Ordering on preferences
The next definition is simply to note that we can sensibly talk about f≤b, b≤f, b≤c where here b and c are consequences rather than actions, simply by interpreting consequences as constant functions. (So the agent does have a preference ordering on consequences, it's just induced from its ordering on actions. We do it this way since it's its choices between actions we can actually see.)
However, the third axiom reifies this induced ordering somewhat, by demanding that it be invariant under gaining new information.
P3'. If b and c are consequences and b≤c, then b≤c given any B.
Thus the fact that the agent may change preferences given new information, just reflects its uncertainty about the results of their actions, rather than actually preferring different consequences in different states (any such preferences can be done away with by simply expanding the set of consequences).
Really this is not strong enough, but to state the actual P3 we will first need a definition:
D3. An event B is said to be null if f≤g given B for any actions f and g.
Null sets will correspond to sets of probability 0, once numerical probability is introduced. Probability here is to be adduced from the agent's preferences, so we cannot distinguish between "the agent is certain that B will not happen" and "if B obtains, the agent doesn't care what happens".
Now we can state the actual P3:
P3. If b and c are consequences and B is not null, then b≤c given B if and only if b≤c.
P3', by contrast, allowed some collapsing of preference on gaining new information; here we have disallowed that except in the case where the new information is enough to collapse all preferences entirely (a sort of "end of the world" or "fatal error" scenario).
Qualitative probability
We've introduced above the idea of "probability 0" (and hence implicitly probability 1; observe that "¬B is null" is equivalent to "for any f and g, f≤g given B if and only if f≤g"). Now we want to expand this to probability more generally. But we will not initially get numbers out of it; rather we will first just get another total preordering, A≤B, "A is at most as probable as B".
How can we determine which of two events the agent thinks is more probable? Have it bet on them, of course! First, we need a nontriviality axiom so it has some things to bet on.
P5. There exist consequences b and c such that b>c.
(I don't know what the results would be if instead we used the weaker nontriviality axiom "there exist actions f and g such that f<g", i.e., "S is not null". That we eventually get that expected utility for comparing all acts suggests that this should work, but I haven't checked.)
So let us now consider a class of actions which I will call "wagers". (Savage doesn't have any special term for these.) Define "the wager on A for b over c" to mean the action that, on A, returns b, and otherwise, returns c. Denote this by wA,b,c. Then we postulate:
P4. Let b>b' be a pair of consequences, and c>c' another such pair. Then for any events A and B, wA,b,b′≤wB,b,b′ if and only if wA,c,c′≤wB,c,c′.
That is to say, if the agent is given the choice between betting on event A and betting on event B, and the prize and booby prize are the same regardless of which it bets on, then it shouldn't just matter just what the prize and booby prize are - it should just bet on whichever it thinks is more probable. Hence we can define:
D4. For events A and B, we say "A is at most as probable as B", denoted A≤B, if wA,b,b′≤wB,b,b′, where b>b' is a pair of consequences.
By P4, this is well-defined. We can then show that the relation on events ≤ is a total preorder, so we can use the usual notation when talking about it (again, ≡ will denote equivalence).
In fact, ≤ is not only a total preorder, but a qualitative probability:
(There is no condition corresponding to countable additivity; as mentioned above, we simply won't get countable additivity out of this.) Note also that under this, A≡∅ if and only if A is null in the earlier sense. Also, we can define "A≤B given C" by comparing the wagers given C; this is equivalent to the condition that A∩C≤B∩C. This relation is too a qualitative probability.
Partition conditions and numerical probability
In order to get real numbers to appear, we are of course going to have to make some sort of Archimedean assumption. In this section I discuss what some of these look like and then ultimately state P6, the one Savage goes with.
First, definitions. We will be considering finitely-additive probability measures on the set of states, i.e. a function P from the set of events to the interval [0,1] such that P(S)=1, and for disjoint B and C, P(B∪C)=P(B)+P(C). We will say "P agrees with ≤" if for every A and B, A≤B if and only if P(A)≤P(B); and we will say "P almost agrees with ≤" if for every A and B, A≤B implies P(A)≤P(B). (I.e., in the latter case, numerical probability is allowed to collapse some distinctions between events that the agent might not actually be indifferent between.)
We'll be considering here partitions of the set of states S. We'll say a partition of S is "uniform" if the parts are all equivalent. More generally we'll say it is "almost uniform" if, for any r, the union of any r parts is at most as probable as the union of any r+1 parts. (This is using ≤, remember; we don't have numerical probabilities yet!) (Note that any uniform partition is almost uniform.) Then it turns out that the following are equivalent:
(Definitely not going into the proof of this here. However, the actual definition of the numerical probability P(A) is not so complicated: Let k(A,n) denote the largest r such that there exists an almost-uniform partition of S into n parts, for which there is some union of r parts, C, such that C≤A. Then the sequence k(A,n)/n always converges, and we can define P(A) to be its limit.)
So we could use this as our 6th axiom:
P6'''. For any B>∅, there exists a partition of S with each part less probable than B.
Savage notes that other authors have assumed the stronger
P6''. There exist uniform partitions of S into arbitrarily large numbers of parts.
since there's an obvious justification for this: the existence of a fair coin! If a fair coin exists, then we can generate a uniform partition of S into 2n parts simply by flipping it n times and considering the result. We'll actually end up assuming something even stronger than this.
So P6''' does get us numerical probabilities, but they don't necessarily reflect all of the qualitative probability; P6''' is only strong enough to force almost agreement. Though it is stronger than that when ∅ is involved - it does turn out that P(B)=0 if and only if B≡∅. (And hence also P(B)=1 if and only if B≡S.) But more generally it turns out that P(B)=P(C) if and only if B and C are "almost equivalent", which I will denote B≈C (Savage uses a symbol I haven't seen elsewhere), which is defined to mean that for any E>∅ disjoint from B, B∪E≥C, and for any E>∅ disjoint from C, C∪E≥B.
(It's not obvious to me that ≈ is in general an equivalence relation, but it certainly is in the presence of P6'''; Savage seems to use this implicitly. Note also that another consequence of P6''' is that for any n there exists a partition of S into n almost-equivalent parts; such a partition is necessarily almost-uniform.)
However the following stronger version of P6''' gets rid of this distinction:
P6'. For any B>C, there exists a partition of S, each part D of which satisfies C∪D<B.
(Observe that P6''' is just P6' for C=∅.) Under P6', almost equivalence is equivalence, and so numerical probability agrees with qualitative probability, and we finally have what we wanted. (So by earlier, P6' implies P6'', not just P6'''. Indeed by above it implies the existence of uniform partitions into n parts for any n, not just arbitrarily large n.)
In actuality, Savage assumes an even stronger axiom, which is needed to get utility and not just probability:
P6. For any acts g<h, and any consequence b, there is a partition of S such that if g is modified on any one part to be constantly b there, we would still have g<h; and if h is modified on any one part to be constantly b there, we would also still have g<h.
Applying P6 to wagers yields the weaker P6'.
We can now also get conditional probability - if P6' holds, it also holds for the preorderings "≤ given C" for non-null C, and hence we can define P(B|C) to be the probability of B under the quantitative probability we get corresponding to the qualitative probabilty "≤ given C". Using the uniqueness of agreeing probability measures, it's easy to check that indeed, P(B|C)=P(B∩C)/P(C).
Utility for finite gambles
Now that we have numerical probability, we can talk about finite gambles. If we have consequences b1, ..., bn, and probabilities λ1, ..., λn summing to 1, we can consider the gamble ∑λibi, represented by any action which yields b1 with probability λ1, b2 with probability λ2, etc. (And with probability 0 does anything; we don't care about events with probability 0.) Note that by above such an action necessarily exists. It can be proven that any two actions representing the same gamble are equivalent, and hence we can talk about comparing gambles. We can also sensibly talk about mixing gambles - taking ∑λifi where the fi are finite gambles, and the λi are probabilities summing to 1 - in the obvious fashion.
With these definitions, it turns out that Von Neumann and Morgenstern's independence condition holds, and, using axiom P6, Savage shows that the continuity (i.e. Archimedean) condition also holds, and hence there is indeed a utility function, a function U:F→R such that for any two finite gambles represented by f and g respectively, f≤g if and only if the expected utility of the first gamble is less than or equal to that of the second. Furthermore, any two such utility functions are related via an increasing affine transformation.
We can also take expected value knowing that a given event C obtains, since we have numerical probability; and indeed this agrees with the preference ordering on gambles given C.
Expected utility in general and boundedness of utility
Finally, Savage shows that if we assume one more axiom, P7, then we have that for any essentially bounded actions f and g, we have f≤g if and only if the expected utility of f is at most that of g. (It is possible to define integration with respect to a finitely additive measure similarly to how one does with respect to a countably additive measure; the result is linear and monotonic but doesn't satisfy convergence properties.) Similarly with respect to a given event C.
The axiom P7 is:
P7. If f and g are acts and B is an event such that f≤g(s) given B for every s∈B, then f≤g given B. Similarly, if f(s)≤g given B for every s in B, then f≤g given B.
So this is just another variant on the "sure-thing principle" that I earlier labeled P2c.
Now in fact it turns out as mentioned above that P7, when taken together with the rest, implies that utility is bounded, and hence that we do indeed have that for any f and g, f≤g if and only if the expected utility of f is at most that of g! This is due to Peter Fishburn and postdates the first edition of Foundations of Statistics, so in there Savage simply notes that it would be nice if this worked for f and g not necessarily essentially bounded (so long as their expected values exist, and allowing them to be ±∞), but that he can't prove this, and then adds a footnote giving a reference for bounded utility. (Though he does prove using P7 that if you have two acts f and g such that f,g≤b for all consequences b, then f≡g; similarly if f,g≥b for all b. Actually, this is a key lemma in proving that utility is bounded; Fishburn's proof works by showing that if utility were unbounded, you could construct two actions that contradict this.)
Of course, if you really don't like the conclusion that utility is bounded, you could throw out axiom 7! It's pretty intuitive, but it's not clear that ignoring it could actually get you Dutch-booked. After all, the first 6 axioms are enough to handle finite gambles, 7 is only needed for more general situations. So long as your Dutch bookie is limited to finite gambles, you don't need this.
Questions on further justification
So now that I've laid all this out, here's the question I originally meant to ask: To what extent can these axioms be grounded in more basic principles, e.g. Dutch book arguments? It seems to me that most of these are too basic for that to apply - Dutch book arguments need more working in the background. Still, it seems to me axioms P2, P3, and P4 might plausibly be grounded this way, though I have not yet attempted to figure out how. P7 presumably can't, for the reasons noted in the previous section. P1 I assume is too basic. P5 obviously can't (if the agent doesn't care about anything, that's its own problem).
P6 is an Archimedean condition. Typically I've seen those (specifically Von Neumann and Morgenstern's continuity condition) justified on this site with the idea that infinitesimals will never be relevant in any practical situation - if c has only infinitesimally more utility than b, the only case when the distinction would be relevant is if the probabilities of accomplishing them were exactly equal, which is not realistic. I'm guessing infinitesimal probabilities can probably be done away with in a similar manner?
Or are these not good axioms in the first place? You all are more familiar with these sorts of things than me. Ideas?