Or: The Problem with Naive Decision Theory
Previously: Decision Theories: A Less Wrong Primer
Summary of Sequence: In the context of a tournament for computer programs, I give almost-explicit versions of causal, timeless, ambient, updateless, and several other decision theories. I explain the mathematical considerations that make decision theories tricky in general, and end with a bunch of links to the relevant recent research. This sequence is heavier on the math than the primer was, but is meant to be accessible to a fairly general audience. Understanding the basics of game theory (and Nash equilibria) will be essential. Knowing about things like Gödel numbering, quining and Löb's Theorem will help, but won't be required.
Summary of Post: I introduce a context in which we can avoid most of the usual tricky philosophical problems and formalize the decision theories of interest. Then I show the chief issue with what might be called "naive decision theory": the problem of spurious counterfactual reasoning. In future posts, we'll see how other decision theories get around that problem.
In my Decision Theory Primer, I gave an intuitive explanation of decision theories; now I'd like to give a technical explanation. The main difficulty is that in the real world, there are all sorts of complications that are extraneous to the core of decision theory. (I'll mention more of these in the last post, but an obvious one is that we can't be sure that our perception and memory match reality.)
In order to avoid such difficulties, I'll need to demonstrate decision theory in a completely artificial setting: a tournament among computer programs.
You're a computer programmer entering a tournament for spectacular stakes. But you won't be competing in person: instead, you'll be submitting code for a program to represent you, and that program will be competing one-on-one with other programs in a series of games.
You don't know what the specific games are, but the contest has specified some ground rules:
- In each game, your program and its opponent will have to choose separately between several options. (There are independent sources of randomness available to each program, so mixed strategies are legal.)
- The games are pure strategic conflicts: the expected payouts to each programmer depend on the outputs of both programs, and can be calculated simply if you know those outputs. (In particular, you can represent each game as a payoff matrix in terms of the programs' outputs.) For example, your program might play the Prisoner's Dilemma against another program.
- Both programs have access to the source code of each other and of the game they're playing.
- The programs don't get to carry any memories from round to round, or modify their source code at any stage. (This makes the analysis much simpler, and makes it even more impressive if we find unexploitable programs which are capable of mutual cooperation.)
- While many of the programs were written by other programmers trying to win prizes for themselves, there are also some special algorithms included to test your reactions; for example, there could be programs that always cooperate in the Prisoner's Dilemma, or an instance of Newcomb's Predictor.
- The tournament may also include some exact copies of your own program, but you won't get any of the prizes won by these extra copies, only the prizes won by your actual entry. (There's no way for your program to distinguish whether it's the original or a copy.)
- There are more than enough prizes to go around, so you don't have to make anyone else lose, you just want your program to win as much as it can.
- Also, there will be enough games played that most of the variance should wash out, so you should aim for the highest expected value per round rather than worry about the concave utility of more prizes.
So, what kind of program should you write?
In the next few posts, we'll examine several ideas, problems and decision theories, increasing our sophistication as we go. We'll use X to denote your program, and x1, . . . , xn to denote its possible outputs; likewise, your opponent in the current round is Y with possible outputs y1, . . . , ym. We'll let U(xi,yj) denote the resulting payout to you if X outputs xi and Y outputs yj.
Idea 1: Play defense with a Nash equilibrium
In our example, we know what utility function the opponent should have: its own expected payout.0 Any such game has at least one Nash equilibrium, a pair of strategies (which may be mixed) such that if X and Y both adopted them, then neither would be better off unilaterally switching strategies. In that sense, at least, if X plays a Nash equilibrium, it can be sure of not being exploited by Y. (In particular, X will never end up tricked into cooperating in the Prisoner's Dilemma while Y defects.)
Of course, there may be more than one Nash equilibrium in a game, and these may be of unequal value if the game is non-zero-sum. (Coordination problems are tricky in general.) So this is underspecified; still, choosing an arbitrary Nash equilibrium is a decent backup strategy. But we can often do better:
Idea 2: Inference
The most basic intuition a human being has in this situation is to start trying to deduce things about Y's output from its source code, or even deduce things directly about U. This idea is best illustrated by playing Rock, Paper, Scissors against a player who always throws Rock: if you figure this out, then you should of course play Paper rather than the Nash equilibrium of 1/3 Rock, 1/3 Paper, 1/3 Scissors. (And in a coordination game, you'd prefer to settle on the same Nash equilibrium that Y outputs.)
Automating inference is an exceedingly difficult problem in general, though researchers have made substantial progress. All the decision theories we'll talk about will include some sort of "inference module", which can be applied to the source code of Y to deduce its output, applied to the code of the full round (including X, Y, and the payoff matrix) to deduce the value of U, etc.
Problem: You can't deduce everything
Gödel's First Incompleteness Theorem and the Halting Problem both imply that it's impossible to write a program that correctly deduces in general1 the output of arbitrary other programs. So we have to be prepared for our inference module to fail sometimes.
A well-written inference module will either return a correct answer for a question or return "Unknown"; a sloppily-written module can get stuck in an infinite process, and a badly-written one will return an incorrect answer sometimes. It should be clear that we'll want our inference module to be of the first sort.
It seems we have enough already to define our first candidate decision theory:
Naive Decision Theory
Let's first consider the approach that seems most obvious. Since we know the source code of the entire round (including X and Y), we could implement the following program:
- For each xi, assume the output of X is xi, and try to deduce the expected value of U. (That is, try and deduce statements of the form "if (output X)=xi then U=ui" for some ui).
- If this succeeds for each xi, output the xi for which ui is the largest.
- If this does not succeed for some xi, output a Nash equilibrium strategy.
This "naive decision theory" certainly qualifies for our tournament; it may be a bit trickier to write an inference module that does an open-ended search for the value of U, but it's not impossible (since human mathematicians solve open-ended deduction problems all the time). And it looks like the worst-case scenario is a Nash equilibrium, not total exploitation. What could possibly go wrong?
Problem: Beware self-fulfilling prophecies!
There's a reason that we don't normally ask an automated theorem prover to consider questions about its own mathematical structure: if we ask the question in a certain way, any choice becomes a self-fulfilling prophecy.
If X deduces its own output by a valid process, then it's created a self-fulfilling prophecy for its output, and the problem with that is that a bad self-fulfilling prophecy is just as consistent as a good one. If we want to use statements like "if (output X)=xi then U=ui" to make our final choice, then we have to beware the other half of logical implication, that "not P" implies "if P then Q". This allows for what we might call spurious counterfactuals, which can throw off the actual decision in a perfectly self-consistent way.
Consider, for example, the one-player game where X gets $1 for outputting a, or $10 for outputting b. We want X to do the following:
- Prove "if (output X)=a then U=1"
- Prove "if (output X)=b then U=10"
- Output b.
But it's just as consistent for X to do the following:
- Prove "(output X)=a"
- Prove "if (output X)=a then U=1"
- Prove "if (output X)=b then U=0"
- Output a.
How could that possibly work? Since (output X)=a, the third line is a true logical statement! It's like the fact that you can prove anything if you assume a falsehood (though rather than unconditionally accepting a false premise, X is using a false premise as the antecedent of a material conditional). In this example, the very order in which X looks for proofs (which is part of the definition of X) affects which counterfactuals X can and cannot prove. (One important thing to note is that X cannot prove a spurious counterfactual about the action that it does output, only about the ones that it doesn't!)
I don't need to tell you that the second chain of proofs is not what we want X to do. Worse, if this is a real bug, then it could also be an exploitable vulnerability: if your source code for X were released in advance of the tournament, then other programmers might write programs that cause X to generate spurious counterfactuals for all but the moves that are most favorable to Y.
Can NDT be salvaged?
Let's consider some quick fixes before we give up on Naive Decision Theory.
Can we simply prohibit X from ever deducing "(output X)=xi" as a step? This doesn't work because of the possibility of indirect self-reference; X could end up deducing some seemingly innocuous statements which happen to correspond to its own Gödel numbering, and the spurious counterfactuals would follow from those- without X ever having noticed that it had done anything of the sort. And it's provably impossible for X to recognize every possible Gödel numbering for its own inference module!
Next, it might seem like an inference module should stumble on the "genuine" counterfactuals before running into spurious ones, since the "genuine" ones seem necessarily simpler. However, it turns out (as proved by cousin_it) that one can write a valid but malicious inference module which returns and acts on a spurious proof, and (as proved by gRR) that a game with malicious code can similarly dupe a NDT agent with a good inference module!
Lastly, it seems safer to deduce counterfactuals "if (output X)=xi then (output Y)=yj" and apply the U(xi,yj) afterwards. And indeed, I can't see how to make Y exploit X in a straight-up Prisoner's Dilemma if that algorithm is used. There are still two problems, though. First, this algorithm now depends on the values U(xi,yj) being given to it by authority- it can't safely deduce them from the source code for the game. And secondly, it could two-box on Newcomb's Problem and defect against itself in the Prisoner's Dilemma if it finds spurious counterfactuals there.
Thus it seems we'll need to do something substantially different.
Well, now what?
There are several ways we could write a decision theory to do inference without risking spurious counterfactuals, and indeed the decision theories we discuss on Less Wrong correspond to different valid approaches. The differences in their decisions come not from better-written inference modules, but from more effective strategies for using their inference module. In the posts to come, I'll show you how they work in detail.
Next: Part II: Causal Decision Theory and Substitution
Notes:
0. Things get wacky if we don't know the utility function of the opponent; fortunately, even the special cases like Newcomb's predictor can be expressed as expected utility maximizers for some payoff matrix (in this case, one where the predictor gets rewarded when it matches your decision exactly).
1. That "in general" is important: it's quite possible to write programs that deduce the outputs of plenty of other programs. It just so happens that there's always some program that your inference module will fail on. The classic way to generate a failure case is to run the inference module on a modified version of itself, such that returning a correct answer would induce a contradiction. This isn't just a hypothetical disability: if X is trying to deduce the output of Y in this round, and Y is trying to deduce the output of X in this round, then we might have exactly this problem!
That is what I meant. My problem is that I don't see how to express that strategy correctly with the vocabulary in this post, and I don't understand the notation in those posts well enough to tell if they express that idea correctly.
For example, I'm not comfortable with how Eliezer framed it. If my strategy is "cooperate iff they play cooperate iff I play cooperate," then won't I never cooperate with myself, because I don't have the strategy "cooperate iff they play cooperate"?
(I do think that general idea might be formalizable, but you need models of outcome dependencies that can fit inside themselves. Which maybe the notation I don't understand is powerful enough to do.)
"Cooperate iff they play cooperate iff I play cooperate" is imprecise. Your complaint seems to stem from interpreting it to pay too much attention to the details of the algorithms.* One reason to use counterfactuals is to avoid this. One interpretation of the slogan is "A()=C iff A()=B()." This has symmetry, so the agent cooperates with itself. However, it this is vulnerable to self-fulfilling prophecy against the cooperation rock (ie, it can choose either option) and its actions against the defection rock depend on implementation detai... (read more)