Math appendix for: "Why you must maximize expected utility"
This is a mathematical appendix to my post "Why you must maximize expected utility", giving precise statements and proofs of some results about von Neumann-Morgenstern utility theory without the Axiom of Continuity. I wish I had the time to make this post more easily readable, giving more intuition; the ideas are rather straightforward, and I hope they won't get lost in the line noise!
The work here is my own (though closely based on the standard proof of the VNM theorem), but I don't expect the results to be new.
*
I represent preference relations as total preorders $\preceq$ on a simplex $\Delta^n := \{x \in \mathbb{R}^n : x_i \ge 0,\ \sum_i x_i = 1\}$; define $\succeq$, $\prec$, $\succ$ and $\sim$ in the obvious ways (e.g., $x \sim y$ iff both $x \preceq y$ and $y \preceq x$, and $x \prec y$ iff $x \preceq y$ but not $y \preceq x$). Write $e_i$ for the $i$'th unit vector in $\mathbb{R}^n$.
In the following, I will always assume that $\preceq$ satisfies the independence axiom: that is, for all $x, y, z \in \Delta^n$ and $p \in (0,1]$, we have $x \prec y$ if and only if $px + (1-p)z \prec py + (1-p)z$. Note that the analogous statement with weak preferences follows from this: $x \preceq y$ holds iff not $y \prec x$, which by independence is equivalent to not $py + (1-p)z \prec px + (1-p)z$, which is just $px + (1-p)z \preceq py + (1-p)z$.
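To make the axiom concrete, here is a small numeric sanity check (my own toy setup, not part of the original argument): for a preference given by an expected-utility function, strict preference between two lotteries is preserved when both are mixed with an arbitrary third lottery. The utility vector and the random sampling are illustrative choices of mine.

```python
# Toy check of the independence axiom for an expected-utility preference.
import numpy as np

rng = np.random.default_rng(0)
u = np.array([0.0, 0.3, 1.0])        # arbitrary toy utilities for 3 outcomes

def less(x, y):                      # x strictly worse than y
    return u @ x < u @ y

for _ in range(1000):
    x, y, z = rng.dirichlet(np.ones(3), size=3)   # three random lotteries
    p = rng.uniform(0.01, 1.0)
    # x < y iff p*x + (1-p)*z < p*y + (1-p)*z (exact ties have probability ~0):
    assert less(x, y) == less(p*x + (1-p)*z, p*y + (1-p)*z)
```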
Lemma 1 (more of a good thing is always better). If $x \prec y$ and $0 \le p < q \le 1$, then $(1-p)x + py \prec (1-q)x + qy$.
Proof. If $p = 0$ and $q = 1$, the claim is just the assumption, so suppose $q - p < 1$ and let $z := \frac{(1-q)x + py}{1 - (q-p)}$, which is again an element of $\Delta^n$. Then, $(1-p)x + py = (q-p)\,x + (1-(q-p))\,z$ and $(1-q)x + qy = (q-p)\,y + (1-(q-p))\,z$. Thus, the result follows from independence applied to $x$, $y$, $z$, and $p' := q - p$.
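The only computation in this proof is the mixture algebra. The following snippet (my own check, with arbitrary numbers) verifies the two identities and that $z$ is again a lottery:

```python
# Numeric check of the mixture identities in the proof of Lemma 1.
import numpy as np

x = np.array([0.5, 0.3, 0.2])
y = np.array([0.1, 0.1, 0.8])
p, q = 0.2, 0.7                                   # 0 <= p < q <= 1, q - p < 1

z = ((1 - q) * x + p * y) / (1 - (q - p))         # the z from the proof
assert np.all(z >= 0) and np.isclose(z.sum(), 1)  # z lies in the simplex
assert np.allclose((1 - p) * x + p * y, (q - p) * x + (1 - (q - p)) * z)
assert np.allclose((1 - q) * x + q * y, (q - p) * y + (1 - (q - p)) * z)
```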
Lemma 2. If $x \prec y$ and $x \preceq z \preceq y$, then there is a unique $p_0 \in [0,1]$ such that $(1-p)x + py \prec z$ for $p < p_0$ and $z \prec (1-p)x + py$ for $p > p_0$.
Proof. Let $p_0$ be the supremum of all $p \in [0,1]$ such that $(1-p)x + py \preceq z$ (note that by assumption, this condition holds for $p = 0$). Suppose that $p < p_0$. Then there is a $q$ with $p < q \le p_0$ such that $(1-q)x + qy \preceq z$. By Lemma 1, we have $(1-p)x + py \prec (1-q)x + qy \preceq z$, and the first assertion follows.
Suppose now that $p > p_0$. Then by definition of $p_0$, we do not have $(1-p)x + py \preceq z$, which means that we have $z \prec (1-p)x + py$, which was the second assertion.
Finally, uniqueness is obvious, because if both $p_0$ and $p_1 > p_0$ satisfied the condition, then for any $p$ with $p_0 < p < p_1$ we would have $z \prec (1-p)x + py \prec z$.
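Since the whole point of dropping continuity is that the threshold $p_0$ need not be an indifference point, here is a toy illustration (my own, not from the post): a lexicographic preference on three outcomes satisfies independence but not continuity, and bisecting for $p_0$ finds a threshold at which strict preference still holds.

```python
# Lemma 2's threshold for a lexicographic preference (independence holds,
# continuity fails). A lottery w is ranked by the pair (w[2], w[1]) compared
# lexicographically, so e1 < e2 < e3.
import numpy as np

e1, e2, e3 = np.eye(3)

def weakly_below(a, b):                 # a <= b in the lexicographic order
    return (a[2], a[1]) <= (b[2], b[1])

def m(p):                               # the mixture (1-p)*e1 + p*e3
    return (1 - p) * e1 + p * e3

lo, hi = 0.0, 1.0                       # bisect for p0 = sup{p : m(p) <= e2}
for _ in range(50):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if weakly_below(m(mid), e2) else (lo, mid)

print(lo)   # 0.0: for every p > 0 we already have e2 < m(p), yet m(0) < e2,
            # so no p makes m(p) ~ e2 -- continuity fails, Lemma 2 still holds.
```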
Definition 3. $x$ is much better than $y$, notation $x \gg y$ or $y \ll x$, if there are neighbourhoods $U$ of $x$ and $V$ of $y$ (in the relative topology of $\Delta^n$) such that we have $x' \succ y'$ for all $x' \in U$ and $y' \in V$. (In other words, the graph of $\gg$ is the interior of the graph of $\succ$.) Write $x \lesssim y$ or $y \gtrsim x$ when $x \not\gg y$ ($x$ is not much better than $y$), and $x \approx y$ ($x$ is about as good as $y$) when both $x \lesssim y$ and $y \lesssim x$.
Theorem 4 (existence of a utility function). There is a $u \in \mathbb{R}^n$ such that for all $x, y \in \Delta^n$,
$$x \gg y \;\text{ if and only if }\; \langle u, x \rangle > \langle u, y \rangle.$$
Unless $x \sim y$ for all $x$ and $y$, there are $x, y \in \Delta^n$ such that $x \gg y$.
Proof. Let $e_\bot$ be a worst and $e_\top$ a best outcome, i.e. let $\bot, \top \in \{1, \dots, n\}$ be such that $e_\bot \preceq e_i \preceq e_\top$ for all $i$. If $e_\bot \sim e_\top$, then $e_i \sim e_j$ for all $i, j$, and by repeated applications of independence we get $x \sim e_\top$ for all $x \in \Delta^n$, and therefore $x \sim y$ again for all $x, y \in \Delta^n$; in this case neither $x \succ y$ nor (a fortiori) $x \gg y$ ever holds, and we can simply choose $u = 0$.
Thus, suppose that $e_\bot \prec e_\top$, and abbreviate $m(p) := (1-p)\,e_\bot + p\,e_\top$. In this case, let $u \in \mathbb{R}^n$ be such that for every $i$, $u_i$ equals the unique $p_0$ provided by Lemma 2 applied to $e_\bot \prec e_\top$ and $z = e_i$. Because of Lemma 1, $u_\bot = 0$ and $u_\top = 1$.
We first show that $\langle u, x \rangle < \langle u, y \rangle$ implies $x \prec y$. For every $i$, we either have $u_i < 1$, in which case by Lemma 2 we have $e_i \prec m(u_i + \varepsilon_i)$ for arbitrarily small $\varepsilon_i > 0$, or we have $u_i = 1$, in which case we set $\varepsilon_i := 0$ and find $e_i \preceq e_\top = m(u_i + \varepsilon_i)$. Set $\varepsilon := \sum_i x_i \varepsilon_i$. Now, by independence applied at most $n$ times, we have $x = \sum_i x_i e_i \preceq \sum_i x_i\, m(u_i + \varepsilon_i) = m(\langle u, x \rangle + \varepsilon)$; analogously, we obtain $m(\langle u, y \rangle - \delta) \preceq y$ for arbitrarily small $\delta \ge 0$. Choosing $\varepsilon$ and $\delta$ small enough that $\langle u, x \rangle + \varepsilon < \langle u, y \rangle - \delta$, it follows using Lemma 1 that $x \preceq m(\langle u, x \rangle + \varepsilon) \prec m(\langle u, y \rangle - \delta) \preceq y$, and therefore $x \prec y$ as claimed. Now note that if $\langle u, x \rangle < \langle u, y \rangle$, then this continues to hold for $x'$ and $y'$ in a sufficiently small neighbourhood of $x$ and $y$, and therefore we have $y \gg x$. In particular, $e_\top \gg e_\bot$, which proves the second assertion of the theorem.
Now suppose that $\langle u, x \rangle \ge \langle u, y \rangle$. Since we have $\langle u, x \rangle \le 1$ and $\langle u, y \rangle \ge 0$, we can find points $x'$ and $y'$ arbitrarily close to $x$ and $y$ such that $\langle u, x' \rangle > \langle u, y' \rangle$ (either the left-hand side is smaller than one and we can increase it, or the right-hand side is greater than zero and we can decrease it, or else the inequality is already strict). Then, $y' \prec x'$ by the preceding paragraph. Since such points exist in every neighbourhood of $x$ and $y$, this implies that we do not have $y \gg x$, which completes the proof.
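Continuing the lexicographic toy example from Lemma 2 (again my own illustration, not from the post): computing the thresholds $u_i$ for all three outcomes shows how the tie-breaking part of the preference is invisible to $u$, so that $e_1 \approx e_2$ even though $e_1 \prec e_2$.

```python
# Building Theorem 4's utility vector for the lexicographic preference
# (lottery w ranked by the pair (w[2], w[1])): u_i is the Lemma 2 threshold
# for e_i between the worst outcome e1 and the best outcome e3.
import numpy as np

e = np.eye(3)

def weakly_below(a, b):                   # a <= b, lexicographically
    return (a[2], a[1]) <= (b[2], b[1])

def threshold(z, steps=60):               # p0 from Lemma 2 for e1, e3, and z
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        mixture = (1 - mid) * e[0] + mid * e[2]
        lo, hi = (mid, hi) if weakly_below(mixture, z) else (lo, mid)
    return lo

u = np.array([threshold(z) for z in e])
print(u)    # [0. 0. 1.]: u cannot see e2's tie-breaking priority, so e1 ~~ e2
            # ("about as good as"), even though e1 < e2 strictly.
```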
Corollary 5. $\gtrsim$ is a preference relation (i.e., a total preorder) that satisfies independence and the von Neumann-Morgenstern continuity axiom.
Proof. By the theorem, $x \gtrsim y$ holds if and only if $\langle u, x \rangle \ge \langle u, y \rangle$, and it is well-known (and straightforward to check) that a relation of this form has all of these properties.
Corollary 6. $u$ is unique up to positive affine transformations, i.e., up to $u \mapsto au + b\mathbf{1}$ with $a > 0$.
Proof. Since $u$ is a VNM utility function for $\gtrsim$ (by Corollary 5), this follows from the analogous result for that case.
Corollary 7. Unless $x \sim y$ for all $x, y \in \Delta^n$, for every $x \in \Delta^n$ the set $\{y \in \Delta^n : y \approx x\}$ has lower dimension than $\Delta^n$ (i.e., it is the intersection of $\Delta^n$ with a lower-dimensional affine subspace of $\mathbb{R}^n$).
Proof. First, note that the assumption implies that $e_\bot \prec e_\top$. Let $H_x$ be given by $H_x := \{y \in \mathbb{R}^n : \langle u, y \rangle = \langle u, x \rangle\}$, so that by the theorem $\{y \in \Delta^n : y \approx x\} = \Delta^n \cap H_x$, and note that $\Delta^n$ is the intersection of the hyperplane $E := \{y \in \mathbb{R}^n : \sum_i y_i = 1\}$ with the closed positive orthant $\{y \in \mathbb{R}^n : y_i \ge 0\}$. By the theorem's construction, $u$ is not parallel to $\mathbf{1} := (1, \dots, 1)$ (we have $u_\bot = 0 \ne 1 = u_\top$), so the hyperplane $H_x$ is not parallel to $E$. It follows that $H_x \cap E$ has dimension $n - 2$, one less than the dimension of $\Delta^n$, and therefore $\Delta^n \cap H_x$ can have at most this dimension. (It can have smaller dimension or be the empty set if $H_x \cap E$ only touches or lies entirely outside the positive orthant.)
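For the toy $u$ from the example above, the $\approx$-classes can be written down directly (my illustration): inside the two-dimensional simplex $\Delta^3$, each class is the one-dimensional segment on which the third coordinate is constant.

```python
# The "about as good as" classes for u = (0, 0, 1) in Corollary 7: the set
# {y in simplex : <u, y> = <u, x>} is the segment y = (t, 1 - x[2] - t, x[2]).
import numpy as np

u = np.array([0.0, 0.0, 1.0])
x = np.array([0.2, 0.3, 0.5])

for t in np.linspace(0.0, 1.0 - x[2], num=5):     # one free parameter: dim 1
    y = np.array([t, 1.0 - x[2] - t, x[2]])
    assert np.isclose(u @ y, u @ x) and np.isclose(y.sum(), 1.0)
```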
Why you must maximize expected utility
This post explains von Neumann-Morgenstern (VNM) axioms for decision theory, and what follows from them: that if you have a consistent direction in which you are trying to steer the future, you must be an expected utility maximizer. I'm writing this post in preparation for a sequence on updateless anthropics, but I'm hoping that it will also be independently useful.
The theorems of decision theory say that if you follow certain axioms, then your behavior is described by a utility function. (If you don't know what that means, I'll explain below.) So you should have a utility function! Except, why should you want to follow these axioms in the first place?
A couple of years ago, Eliezer explained how violating one of them can turn you into a money pump — how, at time 11:59, you will want to pay a penny to get option B instead of option A, and then at 12:01, you will want to pay a penny to switch back. Either that, or the game will have ended and the option won't have made a difference.
When I read that post, I was suitably impressed, but not completely convinced: I would certainly not want to behave one way if behaving differently always gave better results. But couldn't you avoid the problem by violating the axiom only in situations where it doesn't give anyone an opportunity to money-pump you? I'm not saying that would be elegant, but is there a reason it would be irrational?
It took me a while, but I have since come around to the view that you really must have a utility function, and really must behave in a way that maximizes the expectation of this function, on pain of stupidity (or at least that there are strong arguments in this direction). But I don't know any source that comes close to explaining the reason, the way I see it; hence, this post.
I'll use the von Neumann-Morgenstern axioms, which assume probability theory as a foundation (unlike the Savage axioms, which actually imply that anyone following them has not only a utility function but also a probability distribution). I will assume that you already accept Bayesianism.
*
Epistemic rationality is about figuring out what's true; instrumental rationality is about steering the future where you want it to go. The way I see it, the axioms of decision theory tell you how to have a consistent direction in which you are trying to steer the future. If my choice at 12:01 depends on whether at 11:59 I had a chance to decide differently, then perhaps I won't ever be money-pumped; but if I want to save as many human lives as possible, and I must decide between different plans that have different probabilities of saving different numbers of people, then it starts to at least seem doubtful that which plan is better at 12:01 could genuinely depend on my opportunity to choose at 11:59.
So how do we formalize the notion of a coherent direction in which you can steer the future?
Pascal's Mugging for bounded utility functions
Followup to: Pascal's Mugging: Tiny probabilities of vast utilities; The Lifespan Dilemma
This is Pascal's Mugging: Someone comes to you and says, "Give me five dollars, and I'll use my powers from outside the matrix to grant you 4^^^^4 years of fun." And they're lying, of course, but under a Solomonoff prior, the probability that they're not, though surely very small, isn't going to be less than one in 3^^^3; and so if you shut up and multiply, it's clear that the expected utility of paying up outweighs the expected utility of anything sensible you might be doing with those five dollars, and therefore—
Well, fortunately, if you're afraid that your utility-maximizing AI will end up paying all its money to the first clever mugger to come along and ask: not to worry! It will do so only if it can't think of anything better to do with five dollars, after all. So to avoid being mugged, all it has to do is think of a harebrained scheme for spending $5 that has more than a one-in-4^^^^4 chance of providing 5^^^^5 years of fun. Problem solved.
If, however, you would like there to be a chance greater than one-in-hell that your AI ends up doing something actually useful, you'll need to do something else. And the simplest answer is to adopt a bounded utility function: any positive singularity gives at least 50 utils, a billion years gives 80 utils, a googol years gives 99 utils, a googolplex years gives 99.9 utils, and 4^^^^4 years of fun give 100 utils (minus epsilon).
This will, indeed, solve the problem. Probability of getting mugged: used to be one (minus epsilon, of course); has now been brought down to zero. That's right: zero.
(Plus epsilon.)
But let's suppose that the impossible happens, and the universe turns out to be able to support TREE(100) years of fun, and we've already lived out 4^^^^4 of them, and the AI has long since folded up operations and faded out of existence because humanity has become sufficiently sane that we no longer need it—
And lo, someone comes to you and says, "Alas, you're not really experiencing 4^^^^4 years of fun here; you're really a mere billion-year-old living in a very convincing simulation. Give me five dollars, and I'll use my powers from outside the matrix to extend your lifespan to a googol years."
And they're lying, of course — but it has been a long time indeed since you last faced a choice that could make a difference of nineteen whole utils...
*
If you truly have a bounded utility function, you must agree that in this situation, paying up is exactly what you'd want to do. Even though it means that you will not experience 4^^^^4 years of fun, even conditional on the universe being capable of supporting TREE(100) of them.
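To spell out the arithmetic (with my own stand-in numbers for the tiny quantities involved): under the bounded utility function above, the nineteen-util gap between a billion years (80 utils) and a googol years (99 utils) dominates the almost-nonexistent cost of five dollars near the utility bound, so paying has higher expected utility.

```python
# Toy expected-utility comparison for the second mugging. The utilities are
# the post's (billion years = 80, googol years = 99); the credence p and the
# cost of $5 near the bound are illustrative stand-ins of mine.
p = 1e-20      # credence that you really are a simulated billion-year-old
cost = 1e-30   # utils lost by paying $5 when you already have ~100 utils

gain_if_pay = p * (99 - 80)    # nineteen utils, weighted by the tiny credence
loss_if_pay = (1 - p) * cost   # what paying costs when the mugger is lying
print(gain_if_pay > loss_if_pay)   # True: you pay up
```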
[ETA: To clarify, by "4^^^^4", I really mean any number so large that your utility function assigns (100 - epsilon) utils to it. It's possible to have a utility function where this is only true for infinite numbers which are so incredibly infinite that, given a particular formal language, their definition is so long and complicated that no mere human-sized mind could comprehend it. See this comment thread for discussion of bounded utility functions that assign significant weight to very large lifetimes.]
A model of UDT with a concrete prior over logical statements
I've been having difficulties with constructing a toy scenario for AI self-modification more interesting than Quirrell's game, because you really want to do expected utility maximization of some sort, but currently our best-specified decision theories search through the theorems of one particular proof system and "break down and cry" if they can't find one that tells them what their utility will be if they choose a particular option. This is fine if the problems are simple enough that we always find the theorems we need, but the AI rewrite problem is precisely about skirting that edge. It seems natural to want to choose some probability distribution over the possibilities that you can't rule out, and then do expected utility maximization (because if you don't maximize EU over some prior, it seems likely that someone could Dutch-book you); indeed, Wei Dai's original UDT has a "mathematical intuition module" black box which this would be an implementation of. But how do you assign probabilities to logical statements? What consistency conditions do you ask for? What are the "impossible possible worlds" that make up your probability space?
Recently, Wei Dai suggested that logical uncertainty might help avoid the Löbian problems with AI self-modification, and although I'm sceptical about this idea, the discussion pushed me into trying to confront the logical uncertainty problem head-on; then, reading Haim Gaifman's paper "Reasoning with limited resources and assigning probabilities to logical statements" (which Luke linked from So you want to save the world) made something click. I want to present a simple suggestion for a concrete definition of "impossible possible world", for a prior over them, and for an UDT algorithm based on that. I'm not sure whether the concrete prior is useful—the main point in giving it is to have a concrete example we can try to prove things about—but the definition of logical possible worlds looks like a promising theoretical tool to me.
How to cheat Löb's Theorem: my second try
In his open problems talk, Eliezer explains how Löb's theorem prevents you from having a consistent proof system P with an axiom schema that anything P proves is actually true, and asks how we can then "build an AI that could completely rewrite itself, without decreasing the amount of trust it had in math every time it executed that self-rewrite" (18:46).
Recently, I posted about an attempt to apply a general trick for avoiding diagonalization problems to a minimal toy version of this problem. Since then, Wei Dai has posted an interesting quining approach to the same toy problem, and Giles had a promising idea for doing something similar in a different way and will hopefully do a write-up filling in the details. Unfortunately my own "proof" turned out to be broken.
I think I've fixed the problem and made the proof more comprehensible and intuitive in the process. (To avoid confusion, note that what I'm proving is slightly different from, though related to, what I did in the previous post.) However, getting the details right seems to be far from trivial, so I would very much appreciate if people checked my new argument, and told me that it looks okay / where it goes wrong / where they get lost. Thanks in advance!
I'll be more explicit about quoting/unquoting than before, which means I'll need to introduce some notation. However, to sustain you through the schlep of preliminaries, I thought I'd start with an informal summary.
An angle of attack on Open Problem #1
There is a problem with the proof here and I have to think about whether I can fix it. Thanks to vi21maobk9vp for pointing it out! ETA: I have posted a new and hopefully correct proof attempt. Thanks again to vi21maobk9vp!
In his talk on open problems in Friendly AI, Eliezer's first question is how, given Löb's theorem, an AI can replace itself with a better expected utility maximizer that believes in as much mathematics as the original AI. I know exactly one trick for that sort of problem, so I decided to try that on a toy variant. To my surprise, it more or less just worked. Therefore:
Professor Quirrell proposes a game. You start with a score of one. Professor Quirrell moves first, by choosing a computer program and showing you its source code. You then have three options: Take your winnings; double down; or self-destruct.
If you take your winnings, the game ends, and your score is converted to Quirrell points.
If you self-destruct, the game ends, your score is lost, you'll be sent to bed without dinner, you'll lose 150 House points, Rita Skeeter will write a feature alleging that you're a Death Eater, and Professor Quirrell will publicly critique your performance. You are advised not to pick this option.
If you double down, your score doubles, and you advance to the next round. Professor Quirrell again moves first by choosing a computer program. Then, it's your turn—except that this time, you don't get to choose your move yourself: instead, it'll be chosen by Professor Quirrell's program from the previous round.
Professor Quirrell will endeavor to present an educational sequence of programs.
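For concreteness, here is one way the game's structure could be rendered in code. This is a sketch under my own conventions (each "program" is modeled as a function from the currently shown program and the current score to a move), not anything specified in the post.

```python
# A minimal sketch of Professor Quirrell's game. Round 1 is decided by you;
# after each double-down, the program shown that round decides the next one.
TAKE, DOUBLE, DESTRUCT = "take", "double down", "self-destruct"

def play(quirrell_programs, my_first_move):
    score, decider = 1, my_first_move          # you move first
    for program in quirrell_programs:          # Quirrell shows a program...
        move = decider(program, score)         # ...then the decider moves
        if move == TAKE:
            return score                       # converted to Quirrell points
        if move == DESTRUCT:
            return None                        # score lost (and much worse)
        score *= 2                             # double down: score doubles,
        decider = program                      # the shown program moves next
    return score

# Example: you double down once; the first shown program then takes the winnings.
always_take = lambda program, score: TAKE
print(play([always_take, always_take], lambda program, score: DOUBLE))  # 2
```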
Self-modification is the correct justification for updateless decision theory
Reply to: Late great filter is not bad news
Suppose that you build an AI, and Omega appears to it and says:
Here's a button. A million years ago I calculated the umpteenth digit of pi. If it is even, I calculated whether you would press this button (in such a way that your human creator was never simulated as a conscious being). If I predicted that you wouldn't press the button, I destroyed Earth right then and there.* If it is odd, I created a doomsday device that will destroy the solar system if you press this button.
[* ETA: Assume that if the digit is even and the AI is predicted to press the button, Omega does not destroy Earth, but does turn Alpha Centauri purple (say). The point is for this to be a scenario that you, the AI creator, know not to have come to pass.]
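Here is the naive updateless accounting that favors pressing, with a 50/50 "prior" over the digit's parity; the utility numbers are my illustrative stand-ins. Note that the entire advantage of pressing comes from avoiding an outcome that you, the AI creator, already know not to have come to pass.

```python
# Naive updateless expected utility for the button, with a 50/50 prior over
# the parity of the digit of pi. Utilities are illustrative stand-ins.
U = {
    "earth_destroyed_long_ago": -100.0,  # even digit, AI predicted not to press
    "purple_alpha_centauri":       0.0,  # even digit, AI predicted to press
    "solar_system_destroyed":    -90.0,  # odd digit, AI presses
    "status_quo":                  0.0,  # odd digit, AI doesn't press
}

eu_press  = 0.5 * U["purple_alpha_centauri"]    + 0.5 * U["solar_system_destroyed"]
eu_refuse = 0.5 * U["earth_destroyed_long_ago"] + 0.5 * U["status_quo"]
print(eu_press, eu_refuse, eu_press > eu_refuse)   # -45.0 -50.0 True: it presses
```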
Suppose you're the kind of AI creator whose AI is time consistent in a certain sense from the beginning of time and presses the button. Then you have an AI that satisfies a certain kind of philosopher, wins big in a certain logically impossible world, and destroys humanity.
Suppose, on the other hand, that you're a very similar kind of AI creator, only you program your AI not to take into account impossible possible worlds that had already turned out to be impossible (when you created the AI | when you first became convinced that timeless decision theory is right). Then you've got an AI that most of the time acts the same way, but does worse in worlds we know to be logically impossible, and destroys humanity less often in worlds we do not know to be logically impossible.
Wei Dai's great filter post seems to suggest that under UDT, you should be the first kind of AI creator. I don't think that's true, actually; I think that in UDT, you should probably not start with a "prior" probability distribution that gives significant weight to logical propositions you know to be false: do you think the AI should press the button if it was the first digit of pi that Omega calculated?
But obviously, you don't want tomorrow's you to pick the prior that way just after Omega has appeared to it in a counterfactual mugging (because according to your best reasoning today, there's a 50% chance this loses you a million dollars).
The most convincing argument I know for timeless flavors of decision theory is that if you could modify your own source code, the course of action that maximizes your expected utility is to modify into a timeless decider. So yes, you should do that. Any AI you build should be timeless from the start; and it's reasonable to make yourself into the kind of person that will decide timelessly with your probability distribution today (if you can do that).
But I don't think you should decide that updateless decision theory is therefore so pure and reflectively consistent that you should go and optimize your payoff even in worlds whose logical impossibility was clear before you first decided to be a timeless decider (say). Perhaps it's less elegant to justify UDT through self-modification at some arbitrary point in time than through reflective consistency all the way from the big bang on; but in the worlds we can't rule out yet, it's more likely to win.