Markup note: Footnote links should be made relative so they don't force a reload of the page due to its URL having been changed.
if your source code for X were released in advance
But everyone's source code is communicated at each round to everyone. What do you mean, "in advance"? What for?
...The problem is that we have an exploitable vulnerability here: if your source code for X were released in advance, then other programmers could write a Gödel-numbering for X's inference module into their own programs in such a way that X has to take a circuitous route to the "genuine" counterfactuals (or so that the genuine ones come back "Unknown"), and in the me
- Actually, this is an open problem so far as I know: show that if X is a Naive Decision Theory agent as above, with some analyzable inference module like a halting oracle, then there exists an agent Y written so that X cooperates against Y in a Prisoner's Dilemma while Y defects.
Let me just spell out to myself what would have to happen in this instance. For definiteness, let's take the payoffs in the Prisoner's Dilemma to be $0 (CD), $1 (DD), $10 (CC) and $11 (DC).
Now, if X is going to cooperate and Y is going to defect, then X is going to prove "If I...
Well, this article puts my confusion from the primer into context, and I think I understand the issue now.
The "problem" (and it's ultimately not a problem, but I'll get to that) I have is that the game these programs are playing is ill-posed, in that it's not a closed system. It might appear as if they're playing, for example, Prisoners' Dilemna, but with access to each other's source code they're really playing some sort of Prisoners' Meta-Dilemna, a fundamentally different game for all that it might seem superficially similar. Now I'm not enoug...
I suspect the difficulty you've hit here is that Eliezer's theory (TDT and variants) involves consistent reasoning about counterpossible statements, not just counterfactual statements ("If my algorithm were to pick a in this situation, then my expected utility would be -10000000, so I won't do that then"). I can see what he's trying to achieve, but unfortunately I don't know any way in standard Mathematics to reason consistently about inconsistencies, or (in Modal Logic) to analyze impossible possible worlds. So this looks like a major problem.
I'...
Your self-fulfilling prophecy example works for the iteration of the for loop (described in "For each xi, assume the output of X is xi, and try to deduce the expected value of U.") in which the output is assumed to be a, but for the iteration in which the output is assumed to be b, proving that the output is a would be to prove a contradiction. "if (output X)=b then U=0" is one possible outcome, but U could also equal anything else.
I don't see how the NDT algorithm as given allows "(output X)=a" to be proved outside of the fo...
The programs don't get to carry any memories from round to round, or modify their source code at any stage.
No memory is a big hole to leave in your explanation. It also destroys your ability to do inference, i.e., Idea 2.
Overall, it looks like for NDT you don't want to have access to your opponent's source code, and you do want to have memory. You don't need to infer if you can read!
This branch of math is outside my training. I'm stumbling over the self-fulfilling prophecies section.
How can these two statements
if (output X)=b then U=10
if (output X)=b then U=0
both be true?
For footnote #1, it may be useful to give a couple of examples. When both programs are making decisions based on the other's source code, it is pretty close to the undecidable situation, but it depends on the game.
When the game is Rock-Paper-Scissors (or parity), the programs have contradictory goals, and the situation is pretty much the textbook example of noncomputability.
But if the game is the Prisoner's Dilemma (or the Coordination Game), they are trying to cooperate to land on (C,C), so they want to be decidable.
I plan to put all the acknowledgments and references in the last part; but I should mention now that I first read about the self-fulfilling prophecy problem on the decision-theory mailing list, in a post by Vladimir Slepnev. (EDIT: He passes on the credit to Benja Fallenstein in this LW comment.)
Other people have used "naive decision theory" before, some of them on Less Wrong, but none of the usages seemed to stick. So I called this one (which was an early candidate for UDT before the problem was noticed) Naive Decision Theory. I can change the name if people prefer.
Gödel's First Incompleteness Theorem and the Halting Problem both imply that it's impossible to write a program that correctly deduces in general[1] the output of arbitrary other programs. So we have to be prepared for our inference module to fail sometimes.
And here you run afoul of the very basic thing, the domain of applicability: limited computational power. Unlimited computational power introduces nasty infinities into things. "Naive decision theory" is not interested in games where you confuse yourself using infinities. The contest organizers have ...
Or: The Problem with Naive Decision Theory
Previously: Decision Theories: A Less Wrong Primer
Summary of Sequence: In the context of a tournament for computer programs, I give almost-explicit versions of causal, timeless, ambient, updateless, and several other decision theories. I explain the mathematical considerations that make decision theories tricky in general, and end with a bunch of links to the relevant recent research. This sequence is heavier on the math than the primer was, but is meant to be accessible to a fairly general audience. Understanding the basics of game theory (and Nash equilibria) will be essential. Knowing about things like Gödel numbering, quining and Löb's Theorem will help, but won't be required.
Summary of Post: I introduce a context in which we can avoid most of the usual tricky philosophical problems and formalize the decision theories of interest. Then I show the chief issue with what might be called "naive decision theory": the problem of spurious counterfactual reasoning. In future posts, we'll see how other decision theories get around that problem.
In my Decision Theory Primer, I gave an intuitive explanation of decision theories; now I'd like to give a technical explanation. The main difficulty is that in the real world, there are all sorts of complications that are extraneous to the core of decision theory. (I'll mention more of these in the last post, but an obvious one is that we can't be sure that our perception and memory match reality.)
In order to avoid such difficulties, I'll need to demonstrate decision theory in a completely artificial setting: a tournament among computer programs.
You're a computer programmer entering a tournament for spectacular stakes. But you won't be competing in person: instead, you'll be submitting code for a program to represent you, and that program will be competing one-on-one with other programs in a series of games.
You don't know what the specific games are, but the contest has specified some ground rules:
- In each round, your program and your opponent's program are given each other's source code, along with the source code of the game itself (which determines the possible moves and the payoff matrix); each program must then output one of its moves, and payouts are awarded accordingly.
- The programs don't get to carry any memories from round to round, or modify their source code at any stage.
So, what kind of program should you write?
In the next few posts, we'll examine several ideas, problems and decision theories, increasing our sophistication as we go. We'll use X to denote your program, and x1, . . . , xn to denote its possible outputs; likewise, your opponent in the current round is Y with possible outputs y1, . . . , ym. We'll let U(xi,yj) denote the resulting payout to you if X outputs xi and Y outputs yj.
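(For concreteness, here's one way a single round might be wired up, sketched in Python. The function names, the use of Python functions as "programs", and the trick of passing source code around via `inspect.getsource` are illustrative assumptions of mine, not part of the contest rules.)

```python
import inspect

# A sketch of one round: each "program" is a Python function that receives the
# opponent's source code and a payoff dictionary, and returns one of its moves.
def run_round(player_x, player_y, payoffs):
    """Return the pair of payouts (to X, to Y) for a single round."""
    x_source = inspect.getsource(player_x)
    y_source = inspect.getsource(player_y)
    # payoffs maps (X's move, Y's move) to (payout to X, payout to Y);
    # Y sees the same matrix from its own perspective, so flip keys and values.
    flipped = {(y, x): (uy, ux) for (x, y), (ux, uy) in payoffs.items()}
    x_move = player_x(y_source, payoffs)
    y_move = player_y(x_source, flipped)
    return payoffs[(x_move, y_move)]

# Example: a Prisoner's Dilemma payoff matrix and an opponent that always defects.
PD = {("C", "C"): (10, 10), ("C", "D"): (0, 11),
      ("D", "C"): (11, 0), ("D", "D"): (1, 1)}

def always_defect(opponent_source, payoffs):
    return "D"

print(run_round(always_defect, always_defect, PD))   # -> (1, 1)
```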
Idea 1: Play defense with a Nash equilibrium
In our example, we know what utility function the opponent should have: its own expected payout.[0] Any such game has at least one Nash equilibrium, a pair of strategies (which may be mixed) such that if X and Y both adopted them, then neither would be better off unilaterally switching strategies. In that sense, at least, if X plays a Nash equilibrium, it can be sure of not being exploited by Y. (In particular, X will never end up tricked into cooperating in the Prisoner's Dilemma while Y defects.)
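(Here's a minimal sketch of this "play defense" idea, using Rock-Paper-Scissors, whose unique Nash equilibrium is the uniform mixed strategy; the interface is the same illustrative one as in the sketch above.)

```python
import random

# Playing defense: ignore the opponent's code entirely and sample a move from a
# Nash equilibrium chosen in advance. For zero-sum Rock-Paper-Scissors the
# unique equilibrium is the uniform mix of 1/3 rock, 1/3 paper, 1/3 scissors.
RPS_EQUILIBRIUM = {"rock": 1/3, "paper": 1/3, "scissors": 1/3}

def nash_backup(opponent_source, payoffs, equilibrium=RPS_EQUILIBRIUM):
    moves = list(equilibrium)
    weights = list(equilibrium.values())
    return random.choices(moves, weights=weights)[0]

print(nash_backup(opponent_source="...", payoffs=None))   # e.g. "scissors"
```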
Of course, there may be more than one Nash equilibrium in a game, and these may be of unequal value if the game is non-zero-sum. (Coordination problems are tricky in general.) So this is underspecified; still, choosing an arbitrary Nash equilibrium is a decent backup strategy. But we can often do better:
Idea 2: Inference
The most basic intuition a human being has in this situation is to start trying to deduce things about Y's output from its source code, or even deduce things directly about U. This idea is best illustrated by playing Rock, Paper, Scissors against a player who always throws Rock: if you figure this out, then you should of course play Paper rather than the Nash equilibrium of 1/3 Rock, 1/3 Paper, 1/3 Scissors. (And in a coordination game, you'd prefer to settle on the same Nash equilibrium that Y outputs.)
Automating inference is an exceedingly difficult problem in general, though researchers have made substantial progress. All the decision theories we'll talk about will include some sort of "inference module", which can be applied to the source code of Y to deduce its output, applied to the code of the full round (including X, Y, and the payoff matrix) to deduce the value of U, etc.
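(As a deliberately toy illustration of what an inference module's interface might look like: the module below only recognizes opponents whose entire body is a single constant return statement, and answers "Unknown" about everything else, whereas a real module would run an open-ended proof search. All of the names here are my own placeholders.)

```python
import random
import re

UNKNOWN = "Unknown"

def infer_output(opponent_source):
    """Toy 'inference': recognize opponents whose body is a single
    `return "<word>"` statement and nothing else; answer "Unknown" otherwise."""
    lines = [ln.strip() for ln in opponent_source.strip().splitlines()]
    body = [ln for ln in lines[1:] if ln and not ln.startswith("#")]
    if len(body) == 1:
        match = re.fullmatch(r'return\s+"(\w+)"', body[0])
        if match:
            return match.group(1)   # provably constant, so its output is known
    return UNKNOWN                  # give up rather than guess

# The Rock-Paper-Scissors intuition from the paragraph above: if Y is provably
# the always-rock player, play paper; otherwise fall back to the mixed equilibrium.
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

def rps_player(opponent_source, payoffs):
    deduced = infer_output(opponent_source)
    if deduced in BEATS:
        return BEATS[deduced]
    return random.choice(["rock", "paper", "scissors"])

# Example opponent and usage:
always_rock_source = 'def always_rock(opponent_source, payoffs):\n    return "rock"\n'
print(rps_player(always_rock_source, payoffs=None))   # -> "paper"
```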
Problem: You can't deduce everything
Gödel's First Incompleteness Theorem and the Halting Problem both imply that it's impossible to write a program that correctly deduces in general[1] the output of arbitrary other programs. So we have to be prepared for our inference module to fail sometimes.
A well-written inference module will either return a correct answer for a question or return "Unknown"; a sloppily-written module can get stuck in an infinite process, and a badly-written one will return an incorrect answer sometimes. It should be clear that we'll want our inference module to be of the first sort.
It seems we have enough already to define our first candidate decision theory:
Naive Decision Theory
Let's first consider the approach that seems most obvious. Since we know the source code of the entire round (including X and Y), we could implement the following program:
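(Here's the idea in sketch form, with `try_to_deduce` standing in for whatever inference module we end up writing; the stub below always answers "Unknown", and the exact interface is illustrative rather than fixed by the contest rules.)

```python
import random

def try_to_deduce(round_source, assumption):
    """Stand-in for the inference module: search for a proof that `assumption`
    implies "U = u" for some number u, and return u, or None for "Unknown".
    A real module would do an open-ended proof search; this stub always gives up."""
    return None

def naive_decision_theory(round_source, my_outputs, equilibrium_move):
    best_output, best_u = None, None
    # For each x_i, assume the output of X is x_i, and try to deduce
    # the expected value of U.
    for x_i in my_outputs:
        u = try_to_deduce(round_source, assumption=f"(output X)={x_i}")
        if u is not None and (best_u is None or u > best_u):
            best_output, best_u = x_i, u
    # Output the x_i with the highest deduced utility; if nothing could be
    # deduced, play defense with a Nash equilibrium (Idea 1).
    return best_output if best_output is not None else equilibrium_move()

# With the stub above, this falls back to the equilibrium move:
print(naive_decision_theory("<source of round>", ["a", "b"],
                            lambda: random.choice(["a", "b"])))
```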
This "naive decision theory" certainly qualifies for our tournament; it may be a bit trickier to write an inference module that does an open-ended search for the value of U, but it's not impossible (since human mathematicians solve open-ended deduction problems all the time). And it looks like the worst-case scenario is a Nash equilibrium, not total exploitation. What could possibly go wrong?
Problem: Beware self-fulfilling prophecies!
There's a reason that we don't normally ask an automated theorem prover to consider questions about its own mathematical structure: if we ask the question in a certain way, any choice becomes a self-fulfilling prophecy.
If X deduces its own output by a valid process, then it's created a self-fulfilling prophecy for its output, and the problem with that is that a bad self-fulfilling prophecy is just as consistent as a good one. If we want to use statements like "if (output X)=xi then U=ui" to make our final choice, then we have to beware the other half of logical implication, that "not P" implies "if P then Q". This allows for what we might call spurious counterfactuals, which can throw off the actual decision in a perfectly self-consistent way.
Consider, for example, the one-player game where X gets $1 for outputting a, or $10 for outputting b. We want X to do the following:
- Prove "if (output X)=a then U=1".
- Prove "if (output X)=b then U=10".
- Output b.
But it's just as consistent for X to do the following:
- Prove "(output X)=a".
- Prove "if (output X)=a then U=1".
- Prove "if (output X)=b then U=0".
- Output a (since a deduced payout of $1 beats a deduced payout of $0).
How could that possibly work? Since (output X)=a, the third line is a true logical statement! It's like the fact that you can prove anything if you assume a falsehood (though rather than unconditionally accepting a false premise, X is using a false premise as the antecedent of a material conditional). In this example, the very order in which X looks for proofs (which is part of the definition of X) affects which counterfactuals X can and cannot prove. (One important thing to note is that X cannot prove a spurious counterfactual about the action that it does output, only about the ones that it doesn't!)
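(If it helps, here's the material-conditional point checked concretely in code; encoding the statements as Python booleans is, of course, just an illustration.)

```python
def implies(p, q):
    """Material conditional: "if P then Q" is false only when P is true and Q is false."""
    return (not p) or q

# In the spurious chain of proofs, the actual output is a and the actual payout is $1.
output_X, U = "a", 1

print(implies(output_X == "a", U == 1))   # True: the genuine counterfactual
print(implies(output_X == "b", U == 0))   # True, but vacuously: the antecedent is false
print(implies(output_X == "b", U == 10))  # also vacuously True, for the same reason
```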
I don't need to tell you that the second chain of proofs is not what we want X to do. Worse, if this is a real bug, then it could also be an exploitable vulnerability: if your source code for X were released in advance of the tournament, then other programmers might write programs that cause X to generate spurious counterfactuals for all but the moves that are most favorable to Y.
Can NDT be salvaged?
Let's consider some quick fixes before we give up on Naive Decision Theory.
Can we simply prohibit X from ever deducing "(output X)=xi" as a step? This doesn't work because of the possibility of indirect self-reference; X could end up deducing some seemingly innocuous statements which happen to correspond to its own Gödel numbering, and the spurious counterfactuals would follow from those, without X ever having noticed that it had done anything of the sort. And it's provably impossible for X to recognize every possible Gödel numbering for its own inference module!
Next, it might seem like an inference module should stumble on the "genuine" counterfactuals before running into spurious ones, since the "genuine" ones seem necessarily simpler. However, it turns out (as proved by cousin_it) that one can write a valid but malicious inference module which returns and acts on a spurious proof, and (as proved by gRR) that a game with malicious code can similarly dupe an NDT agent with a good inference module!
Lastly, it seems safer to deduce counterfactuals "if (output X)=xi then (output Y)=yj" and apply the payoffs U(xi,yj) afterwards. And indeed, I can't see how to make Y exploit X in a straight-up Prisoner's Dilemma if that algorithm is used. There are still two problems, though. First, this algorithm now depends on the values U(xi,yj) being given to it by authority; it can't safely deduce them from the source code for the game. Second, it could two-box on Newcomb's Problem and defect against itself in the Prisoner's Dilemma if it finds spurious counterfactuals there.
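(Here's a sketch of that last variant, using the same illustrative interface as before: the inference module is passed in as `try_to_deduce`, and the payoff table U comes from the tournament authority rather than from deduction.)

```python
def ndt_with_given_payoffs(round_source, my_outputs, their_outputs, U,
                           equilibrium_move, try_to_deduce):
    best_output, best_u = None, None
    for x_i in my_outputs:
        # Deduce only "if (output X)=x_i then (output Y)=y_j" ...
        y_j = try_to_deduce(round_source, assumption=f"(output X)={x_i}")
        if y_j in their_outputs:
            # ... and apply the given payoff U(x_i, y_j) afterwards.
            u = U[(x_i, y_j)]
            if best_u is None or u > best_u:
                best_output, best_u = x_i, u
    return best_output if best_output is not None else equilibrium_move()
```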
Thus it seems we'll need to do something substantially different.
Well, now what?
There are several ways we could write a decision theory to do inference without risking spurious counterfactuals, and indeed the decision theories we discuss on Less Wrong correspond to different valid approaches. The differences in their decisions come not from better-written inference modules, but from more effective strategies for using their inference module. In the posts to come, I'll show you how they work in detail.
Next: Part II: Causal Decision Theory and Substitution
Notes:
0. Things get wacky if we don't know the utility function of the opponent; fortunately, even the special cases like Newcomb's predictor can be expressed as expected utility maximizers for some payoff matrix (in this case, one where the predictor gets rewarded when it matches your decision exactly).
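(One illustrative way to write down such a payoff matrix, using the standard Newcomb dollar amounts: the predictor's own payout is 1 exactly when its prediction matches your choice.)

```python
# Newcomb's Problem as a two-player payoff matrix (illustrative): entries map
# (your move, predictor's move) to (your payout, predictor's payout). The
# predictor's payout is 1 exactly when its prediction matches your choice,
# so "predicting accurately" is just expected-utility maximization for it.
NEWCOMB = {
    ("one-box", "predict one-box"): (1_000_000, 1),
    ("one-box", "predict two-box"): (0, 0),
    ("two-box", "predict one-box"): (1_001_000, 0),
    ("two-box", "predict two-box"): (1_000, 1),
}
```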
1. That "in general" is important: it's quite possible to write programs that deduce the outputs of plenty of other programs. It just so happens that there's always some program that your inference module will fail on. The classic way to generate a failure case is to run the inference module on a modified version of itself, such that returning a correct answer would induce a contradiction. This isn't just a hypothetical disability: if X is trying to deduce the output of Y in this round, and Y is trying to deduce the output of X in this round, then we might have exactly this problem!