# Decision Theories: A Semi-Formal Analysis, Part III

### Or: Formalizing Timeless Decision Theory

**Previously:**

0. Decision Theories: A Less Wrong Primer

1. The Problem with Naive Decision Theory

2. Causal Decision Theory and Substitution

**WARNING: The main result of this post, as it's written here, is flawed. I at first thought it was a fatal flaw, but later found a fix. I'm going to try and repair this post, either by including the tricky bits, or by handwaving and pointing you to the actual proofs if you're curious. Carry on!**

**Summary of Post: ***Have you ever wanted to know how (and whether) Timeless Decision Theory works? Using the framework from the last two posts, this post shows you explicitly how TDT can be implemented in the context of our tournament, what it does, how it strictly beats CDT on fair problems, and a bit about why this is a Big Deal. But you're seriously going to want to read the previous posts in the sequence before this one.*

We've reached the frontier of decision theories, and we're ready at last to write algorithms that achieve mutual cooperation in Prisoner's Dilemma (without risk of being defected on, and without giving up the ability to defect against players who always cooperate)! After two substantial preparatory posts, it feels like it's been a long time, hasn't it?

But look at me, here, talking when there's Science to do...

Once again, we're programming an agent **X** to face another agent **Y** in a game **G**, where **X** and **Y** have access to each others' source codes and the source code of **G**. (In this post, we'll sweep the "player 1, player 2" bit back into the definition of **G** for simplicity in our code.) We may assume that at least one of the *x _{i}* is a Nash equilibrium, since

**X**and

**Y**can treat a mixed equilibrium like an extra option and define its expected payoffs as usual.

### Not Quite Timeless Decision Theory

Here's our first shot at TDT, based on the idea of substituting different things for the source code of **X** in our prediction of **Y**...

def **X**(code **G**, code **Y**):

- For each
*x*:_{i}- Write a very simple agent
**A**that always outputs_{i}*x*in game_{i}**G**. - Try to deduce output
**Y**(code**G**, code**A**)._{i} - If this succeeds, calculate
**U**(*x*, output_{i}**Y**(code**G**, code**A**)); call this_{i}*u*._{i}

- Write a very simple agent
- If
*u*has been deduced for all_{i}*x*, return the_{i}*x*for which_{i}*u*is largest._{i} - Otherwise, return the default Nash equilibrium.

To see what this algorithm does, let's try it on Newcomb's Problem as we formalized it last time; we'll let **A _{1}** denote the agent that always one-boxes and

**A**denote the agent that always two-boxes. If

_{2}**P**has a valid and sufficiently powerful inference module, the output of

**P**(code

**N**, code

**A**) is 1 and the output of

_{1}**P**(code

**N**, code

**A**) is 2. Why?

_{2}**A**and

_{1}**A**are very simple bots, so

_{2}**P**will be able to deduce their outputs, and we know what it does in that case.

And then, if **X** also has a valid and sufficiently powerful inference module, **X** will deduce those outputs for **P**. Thus **X** will deduce that *u _{1}*=100 and

*u*=1, so

_{2}**X**will indeed one-box; and if

**P**has a sufficiently powerful inference module, it will therefore predict that

**X**one-boxes, and fill the box, so that

**X**gets payoff 100!

Again, what we're doing here is not counterfactual reasoning, but instead reasoning about what your opponent would do against some *other* simpler agent; therefore we're again evading the problem of spurious counterfactuals. (On a philosophical level, this decision theory finds the most successful constant strategy **A _{i}** and mimics its output.)

Alright, I said there were going to be two problems with this attempt, and indeed I called it Not Quite Timeless Decision Theory. So what goes wrong?

### Problem: NQTDT won't cooperate against itself!

If you face two NQTDT agents against each other in the Prisoner's Dilemma, what happens? First, note that NQTDT will defect against either a CooperateBot (**A _{C}**) or a DefectBot (

**A**) in the Prisoner's Dilemma. If

_{D}**X**and

**Y**are both NQTDT, then

**X**deduces that

**Y**defects against both

**A**and

_{C}**A**, therefore

_{D}**X**calculates

*u*=0 and

_{C}*u*=1, so

_{D}**X**defects. (And of course

**Y**does the same.) Drat!

And the second problem?

### Problem: NQTDT might one-box and *not* be rewarded!

In Newcomb's Problem, what if **P** has a powerful enough inference module to show that **A _{1}** one-boxes and

**A**two-boxes, but not one powerful enough to figure out what

_{2}**X**is doing? (After all,

**X**is a bit more complicated than the

**A**are.) By the reasoning above,

_{i}**X**will one-box, but

**P**won't fill the box—the worst of all possible outcomes!

So NQTDT is broken. What can we do?

### Idea 5: Sanity Check

Well, instead of assuming that our opponent **Y** will behave in the real game as it did against the optimal **A _{i}**, we could try to deduce

**Y**'s actual behavior against

**X**as a "sanity check" before we make our final decision. That is,

def **X**(code **G**, code **Y**):

- For each
*x*:_{i}- Write a very simple agent
**A**that always outputs_{i}*x*._{i} - Try to deduce output
**Y**(code**G**, code**A**)._{i} - If this succeeds, calculate
**U**(*x*, output_{i}**Y**(code**G**, code**A**)); call this_{i}*u*._{i}

- Write a very simple agent
- If
*u*has been deduced for all_{i}*x*, take the_{i}*x*for which_{i}*u*is largest, and try to deduce "_{i}**U**(*x*, output_{i}**Y**(code**G**, code**X**)) ≥*u*"._{i} - If
*that*succeeds,*then*return*x*._{i} - Else, return the default Nash equilibrium.

Note that the new line checks the behavior of **Y** against **X** itself, not the **A _{i}**. This "sanity check" ensures that, when

**X**imitates the most successful of the

**A**, it achieves at least the payoff that

_{i}**A**does. Indeed, if our program does this then it's again un-exploitable in the same sense as CDT, and it still one-boxes on Newcomb's Problem (if and only if it deduces that

_{i}**P**is sufficiently powerful to deduce this fact), so it's no longer broken in the above sense.

But by the same argument as before, it still defects on the Prisoner's Dilemma against itself, so this is still not TDT.

What else could we do better? Well, this agent "mimics" only the very simplest of strategies- the strategies that correspond to one-player games for its opponent. What if we instead let it mimic more complicated (but still basic) strategies, like versions of Newcomb's Predictor?

### Idea 6: Mimic something a bit more clever

This is going to be a bit more complicated than the last one; we first need to define the agents that we'll examine and mimic. Consider each function *f* from the set {*y _{1}*, . . . ,

*y*} to the set {

_{m}*x*, . . . ,

_{1}*x*} (there are

_{n}*n*such functions), and define

^{m}**A**as follows:

_{f}def **A_{f}**(code

**G**, code

**Y**):

- Try and deduce the output of
**Y**(code**G**, code**A**)._{f} - If this succeeds, return
*f*(output**Y**). Else, return the Nash equilibrium.

Note that the constant strategies **A _{i}** are special cases of the

**A**, corresponding to the constant functions; note also that Newcomb's Predictor is an example of an

_{f}**A**that's not simply a constant strategy. Now we'll write a decision theory that looks for the most successful of the

_{f}**A**, applies a sanity check, and mimics it. This results in an instance of...

_{f}### Timeless Decision Theory

def **X**(code **G**, code **Y**):

- For each
*f*:- Write the agent
**A**._{f} - Try to deduce output
**Y**(code**G**, code**A**)._{f} - If this succeeds, calculate
**U**(output**A**, output_{f}**Y**(code**G**, code**A**)); call this_{f}*u*._{f}

- Write the agent
- If
*u*has been deduced for all_{f}*f*, take the*f*for which*u*is largest, and try to deduce "_{f}**U**(output**A**, output_{f}**Y**(code**G**, code**X**)) ≥*u*"._{f} - If that succeeds, then return (output
**A**)._{f} - Otherwise, return the default Nash equilibrium.

Let's try and parse what this algorithm does. It considers all possible Newcomblike agents (the **A_{f}**) and tests out how they fare against

**Y**. Then it takes the best result and does a sanity check: will

**Y**actually respond against

**X**as it does against this particular

**A**? And if that checks out, then

_{f}**X**goes ahead and outputs what

**A**does.

_{f}^{0}

We can sketch out some results that this decision theory gets (when equipped with a sufficiently powerful inference module). First, it correctly solves one-player games (since one of the constant strategies **A _{i}** is the optimum) and zero-sum two-player games (for the same reasons that CDT does). It one-boxes on Newcomb whenever

**P**is a powerful enough predictor (since the best of the

**A**are all one-boxers). And finally, it cooperates with itself on a simple Prisoner's Dilemma!

_{f}Why? If **X** and **Y** are both running this decision theory, consider the function *f* with *f*(C)=C and *f*(D)=D, it's clear that **Y** cooperates with this **A_{f}** (by the same logic that makes it one-box on Newcomb's Problem) and that no

**A**achieves a better result against

_{f}**Y**than mutual cooperation. Thus,

**X**will run the sanity check for this

**A**, which is equivalent to trying to deduce that

_{f}**Y**cooperates against

**X**, while

**Y**attempts to do the same. Here comes the delicate argument from mathematical logic; we'll phrase things in terms of logical provability, but the analogous arguments hold for other (valid and sufficiently powerful) kinds of inference modules:

Löb's Theorem states that, within a formal system, proving "if there's a proof of *S*, then *S* actually holds" leads to a proof of *S* itself. In our current context, let *S* be the statement "**U**(output **A_{f}**, output

**Y**(code

**G**, code

**X**)) ≥

*u*" that appears in

_{f}**X**'s sanity test, and

*T*be the corresponding statement "

**V**(output

**A**, output

_{f}**X**(code

**G**, code

**Y**)) ≥

*u*" that appears in

_{f}**Y**'s sanity-test. If there were a proof of

*S*, then cooperation passes

**X**'s sanity test

^{1}, so that output

**X**(code

**G**, code

**Y**))=C. But this directly implies the statement

*T*, so cooperation passes

**Y**'s sanity test, so that output

**Y**(code

**G**, code

**X**))=C, which implies

*S*itself. Thus, by Löb's Theorem, there must be a proof of

*S*, and therefore

**X**and

**Y**do in fact cooperate! (I do realize how confusing this process of logical bootstrapping is—it's helpful to think of it as a useful kind of self-fulfilling prophecy.)

This algorithm is a general decision theory rather than something with ad-hoc patches, so it satisfies my criteria for a better decision theory. It also behaves like Timeless Decision Theory is supposed to on the classical problems. Is it therefore justified to call this TDT? Well, let's again think about it on the philosophical level.

Each Newcomblike agent **A_{f}** represents a particular deal that it tries to strike with

**Y**(if I deduce you're doing

*this*, then I'll do

*that*).

**X**looks to see which deals

**Y**will accept, and offers

**Y**the most favorable (to

**X**) of those deals, and if

**Y**"accepts" (passes the sanity check) then

**X**fulfills its end of the bargain. Since some deals are better than the best Nash equilibrium, this gets better results than CDT in many games, while still being un-exploitable. The novelty (compared to what traditional game theorists already know) is the fact that the communication of the deal and its acceptance all happen via prediction of each other rather than via exchange of messages, which allows for deals to happen even when the agents are unable to explicitly bargain or guarantee their fulfilment of their end by external precommitments

That's a pretty good account of what TDT does, no? Also, I realized that I had the right description when I recalled Eliezer's description of when TDT cooperates on the Prisoner's Dilemma: if and only if the opponent would one-box with the TDT agent as Omega. This is *exactly* what this algorithm ends up doing.

It's fair to say that Yudkowsky's invention of Timeless Decision Theory was a triumph. In fact, I'll make a note here...

# TDT: HUGE SUCCESS

Before we move on, we should take a little while to remark on how awesome this is.

Since the Prisoner's Dilemma was first formulated, it's been held up as a dark and dismal portent for the prospects of cooperation among rational people. Douglas Hofstadter felt, on a philosophical level, that there *must* be a way to see cooperation as the rational choice, and called it superrationality; but his experiments with actual people were discouraging, and he could find no solid theoretical underpinning for that intuition, so game theorists continued to ignore it.

The insight of TDT (as well as the other decision theories we might term "reflexive") is that better results can be achieved when you have more information than the original setup allows; in particular, if it's common knowledge that agents have some ability to predict each others' thought process, then mutual cooperation really can arise out of rational self-interest.

Although it's difficult to make the right conditions obtain between individual humans (as in Hofstadter's experiments), it may be possible for a TDT-friendly setup to exist between different organizations, or different countries, if they have means of keeping tabs on each others' strategic planning. This may in fact have saved the world from nuclear destruction...

John von Neumann, the founder of game theory, advocated a pre-emptive nuclear strike on the Soviet Union throughout the 1950s; he was convinced that there was no other possibility of avoiding nuclear destruction ourselves. But here we have a good analogue to our setup; rather than exchanging source code, both governments had spies placed in the other, and so were likely to know what the other was thinking (albeit with some delay). This might have led to unconscious use of TDT thinking and thus spared us from nuclear war, despite the fact that the professional rationalists of the day hadn't conceived of such means to mutual cooperation!^{2}

But this sequence isn't done yet: TDT may surpass CDT, but it's still under-defined when it comes to bargaining, and there's one notable problem in which TDT falls short of common sense...

TO BE CONTINUED (WITH CAKE) IN PART IV

### Notes

**0.** As with CDT, there are multiple other variants to consider; this one insists on the best possible deal that **Y** would accept against an **A_{f}**, or else no deal at all. The downside is that two such agents will fail to reach a deal if they have different "favorite" deals, even if each would be willing to compromise against an

**A**. We'll say more about this in the context of bargaining games later.

_{f}Also, it's very important that we have the sanity check here, since otherwise **Y** could defect against **X** and have **X** cooperate (as long as Y gets Newcomb's Problem right). Without the sanity check, other agents take **X**'s lunch money, and rightly so!

**1.** Existence of a proof of S is not enough to conclude that X's sanity test succeeds, since provability is not decidable and so the condition for failing the check doesn't imply absence of the proof. One way out is to have a provability oracle perform the sanity check, which makes it impossible for players to be algorithms. Another is to prove some bounds on the length of the shortest proof of S, and write the algorithm in a way that guarantees the sanity check to succeed within these bounds. Yet another way is to never give up, work on the sanity check forever, which makes the agent completely unable to cope with many situations. (Thanks to Vladimir Nesov for this footnote.)

**2.** Thomas Schelling, among others, worked out a plausibly stable logic of mutually assured destruction, but it still invited all sorts of instability when limited conflict was involved. He also praised the idea of encouraging spies in certain departments rather than prosecuting them, in order to keep the Soviet Union from worrying that we were planning a first strike. If you've never read The Strategy of Conflict, I highly recommend it; it is simultaneously one of the most enlightening and most engaging books I've read in recent years.

## Comments (53)

BestThis formulation or variant of TDT (I'm not sure if Eliezer's TDT does or not) requires that before a decision problem is handed to it, the world is divided into the agent itself (

X), other agents (Y), and "dumb matter" (G). I think this is misguided, since the world doesn't really divide cleanly into these 3 parts. But if this is the approach a decision theory researcher wants to take, they should at least acknowledge the "recognize myself and other agents in the world" problem as a problem, and say something about how this might be done in principle.Given that this has no response to it, I'm curious as to whether Orthonormal has responded to you regarding this either off list or elsewhere?

We discussed it by email a bit more, but I don't think he came up with a very good answer. I'll forward you the exchange if you PM me your email address.

*6 points [-]Considering how much trouble it was to formulate the setups where this actually rigorously follows, I feel that this deserves a footnote, something like this:

Existence of a proof of S is not enough to conclude that X's sanity check succeeds, since provability is not decidable and so the condition for failing the check doesn't imply absence of a proof. One way out is to have a provability oracle perform the sanity check, which makes it impossible for players to be algorithms. Another is to prove some bounds on the length of the shortest proof of S, and write the algorithm in a way that guarantees the sanity check to succeed within these bounds. Yet another way is to never give up, work on the sanity check forever, which makes the agent completely unable to cope with many situations.

Good footnote— I'll add it.

There's a gap in the proof that X and Y cooperate. You may know how to close it, but if it's possible it's not obvious enough so the extra steps should be added to the article. More importantly, if it can't be closed the theorem might not be true.

The gap: We hypothesize that statement S is provable in (system of X). Therefore X will Cooperate. This guarantees that T is true, by definition, but not that Y will prove that T is true. Presumable Y can recreate the proof of S being true, but it cannot conclude that X will cooperate unless it also can prove that X will prove it.

I cannot see how to resolve this without stepping out of a specific formal system, which would make Lob's Theorem unusable.

Am I missing something?

It does work out, but you're right to raise that question; Vladimir Nesov's footnote unpacks those issues.

In general, a caught spy can be a useful resource. If you've identified a spy but the spy doesn't know he's been found out, you get to control what the spy's reports are.

*2 points [-][edit]: This kind of decision theory requires players to have a specific kind of knowledge about each other, thus it seems to me like "telepathic" decision theory more precisely describes it than "timeless" decision theory.

*2 points [-](If you remove the snark from this comment, it won't be the downvoted-to-hidden parent of most of the post's discussion.)

Please suggest a replacement? I don't see any snark, but I also don't have a negative view of the word "telepathic."

*5 points [-]Unpacking a bit (to more reliably communicate that you don't mean any snark if that's indeed the case) and avoiding the form of a rhetorical question (which is what suggested the snarkiness) could work: "This kind of decision theory assumes the players to have so much knowledge about each other that one wonders whether it should be called "telepathic" instead of "timeless"..."

Thanks for the suggestion; I went with a variation of that.

Telepathic is even worse.

Timeless because it allows that causality can go backwards in the presence of good simulation. I think.

What's worse about it? (I have a defense of the name I'll hold off on sharing until I hear your opnion.)

If you bend what you mean by 'causality,' sure, but I don't think there's value in bending the word that way.

Telepathic implies some kind of supernatural communication. TDT sortof behaves like that except that there is no actual communication going on. Timeless isn't a very good name, but telepathic has too much wierd baggage, I think. Curious to see your arguments.

Well it lets you vastly simplify the causality structures in your predictions, for one.

Huh? How could X possibly run TDT without having Y's source code communicated to it?

My model of TDT is that, rather than looking at action-action-outcome triplets, it looks at strategy-strategy-outcome triplets. The rest of game theory remains the same, and so once you have a strategy-strategy-outcome table, you find the Nash equilibrium and you're done. (If constructing that table fails, you revert to the Nash equilibrium from the action-action-outcome table.)

The material difference between this and regular game theory is that we now have access to strategies- i.e., we can read our opponent's mind, i.e. telepathy. (Maybe mind-reading decision theory is a better term, but it doesn't abbreviate to TDT like telepathic decision theory does.)

You can still run TDT in situations with time (like an iterated prisoner's dilemma), but you

can'trun TDT in situations where you don't have your opponent's source code. So calling it "timeless" when it can involve time seems odd, as is not referring to the necessity of source code.*3 points [-]This characterization/analogy doesn't fit, and doesn't seem to help with making the necessary-knowledge-about-opponent distinction. Knowing that performing action A implies that your opponent performs action B is a

weakerstatement than unconditionally knowing that your opponent performs action B. In standard game theory, you don't know what action your opponent performs, and with TDT you don't know that as well. But not knowing something doesn't (automatically) make it not happen. So if there is indeed a dependence of your opponent's action on your own action, it's useful to know it.The difference between considering opponent's actions in standard game theory and considering opponent's "strategy" (dependence of opponent's action on your action) is that while the former is usually unknown (to both TDT and standard game theory), the latter can in principle be known, and making use of this additional potential knowledge is what distinguishes TDT. So the actions in game theory and "strategies" in TDT are not analogous.

Okay. The first looks like a strategy to me, and the second looks like an action. Right?

I agree, and that matches my characterization of TDT.

I'm not understanding this, though. Are you just saying that knowing about your opponent's strategy gives you useful information?

How do you learn it?

The analogy is that both of them get put into a table, and then you find the Nash equilibria by altering the favored rows and columns and columns of the table, and then you pick the best of the equilibria. (TDT has known bargaining problems, right? That looks like it maps onto disagreeing over which Nash equilibrium to pick.) Would it help if I made a walkthrough of my model with actual tables?

*1 point [-]Y doesn't act according to the rule "Let's see what X does. If it does A, I'm going to do B, etc.", and so it's misleading to call that "a strategy". This is only something like what X infers about Y, but this is not how Y reasons, because Y can't infer what X does, and so it can't respond depending on what X does.

The actual strategy is to figure out an action based on the other player's code, not to figure out an action based on the other player's action. This strategy, which doesn't involve responding to actions, can be characterized as establishing a dependence between players' actions, and this characterization is instrumental to the strategy itself, a part of what makes the characterization correct.

So "play whatever I think X will play" does count as a strategy, but "play whatever X plays" does not count as a strategy because Y can't actually implement it. Limiting X and Y to the first sort of strategies was meant to be part of my characterization, but I could have made that clearer.

*2 points [-]It can't implement "play whatever I think X will play" either, because it doesn't know what X will play.

In one statement, if we are talking about ADT-like PD (the model of TDT in this post appears to be more complicated), Y could be said to choose the action such that provability of Y choosing that action implies X's choosing a good matching action. So Y doesn't act depending on what X does or what Y thinks X does etc., Y acts depending on what X can be inferred to do if we additionally assume that Y is doing a certain thing, and the thing we additionally assume Y to be doing is a specific action, not a strategy of responding to X's source code, or a strategy of responding to X's action. If you describe X's algorithm the same way, you can see that the additional assumption of Y's action is not what X uses in making its decision, for it similarly makes an additional assumption of its own (X's) action and then looks what can be inferred about Y's action (and not Y's "strategy").

I usually understand "mind-reading" to encompass being aware of the current state of a system. Two systems that simply know one another's strategies can't predict one another's behaviors if their strategies include random coin flips, for example, or depend on information that one system can observe but the other cannot; whereas I would expect telepathic agents to be aware of the results of such observations as well.

If you did have two telepaths playing any game, and one of them decided a mixed strategy was optimal, they wouldn't want to know what action they were playing until it was played- because otherwise they might leak that knowledge to the other player. That is, in a competitive situation I don't think mind-reading would extend to coin-reading, but if your understanding is common then 'mind-reading' is a bad phrase to use. Is there a good word for "has access to its opponent's source code"? Bonus points if it starts with a T.

(Also, my understanding is that TDT will defect against any mixed strategy in the prisoner's dilemma.)

Not necessarily. It will play a mixed strategy against an identical mixed strategy if that if what it needs to do to get them to play mixed rather than D. It's the

other guybeing weird and arbitrary in that case, not the TDT.*1 point [-]Well, it certainly will defect against any mixed strategy that is hard coded into the opponent’s source code. On the other hand, if the mixed strategy the opponent plays is dependent on what it predicts the TDT agent will play, then the TDT agent will figure out which outcome has a higher expected utility:

(I defect, Opponent runs "defection predicted" mixed strategy)

(I cooperate, Opponent runs "cooperation detected" mixed strategy)

Of course, this is still simplifying things a bit, since it assumes that the opponent can perfectly predict one's strategy, and it also rules out the possibility of the TDT agent using a mixed strategy himself.

Thus the actual computation is more like

ArgMax(Sum(ExpectedUtility(S,T)*P(T|S)))

where the argmax is over S: all possible mixed strategies for the TDT agent

the sum is over T: all possible mixed strategies for the opponent

and P(T|S) is the probability that opponent will play T, given that we choose to play S. (so this is essentially an estimate of the opponent's predictive power.)

Won't that let the opponent steal utility from you? Consider the case where you're going up against another TDTer which is willing to consider both the strategy "if they cooperate only if I cooperate, then cooperate with 99% probability" and "if they cooperate only if I cooperate, then cooperate." You want your response to the first strategy to be defection and your response to the second strategy to be cooperation, so it's in their interests to play the second strategy.

*1 point [-]You're right, if the opponent is a TDT agent. I was assuming that the opponent was simply a prediction=>mixed strategy mapper. (In fact, I always thought that the strategy 51% one-box 49% two box would game the system, assuming that Omega just predicts the outcome which is most likely).

If the opponent is a TDT agent, then it becomes more complex, as in the OP. Just as above, you have to take the argmax over all possible y->x

mappings, instead of simply taking the argmax over all outputs.Putting it in that perspective, essentially in this case we are adding all possible mixed strategies to the space of possible outputs. Hmmm... That's somewhat a better way of putting it than everything else I said.

In any case, two TDT agents will both note that the program which only cooperates 100% iff the opponent cooperates 100% dominates all other mixed strategies against such an opponent.

So to answer the original question:

Yes, it will defect against blind mixed strategies.No, it will not necessarily defect against simple (prediction =>mixed strategy) mappers.N/Aagainst another TDT agent, as neither will ever play a mixed strategy, so to ask what whether it would cooperate with a mixed strategy TDT agent is counterfactual.EDIT: Thinking some more, I realize that TDT agents will consider the sort of 99% rigging against each other — and will find that it is better than the cooperate IFF strategy. However, this is where the "sanity check" become important. The TDT agent will realize that although such a pure agent would do better against a TDT opponent, the opponent knows that you are a TDT agent as well, and thus will not fall for the trap.

Out of this I've reached two conclusions:

The sanity check outlined above is

notbroad enough, as it only sanity checks thebestagents, whereas even if the best possible agent fails the sanity check, there still could be animprovementover the nash equilibrium which passes.Eliezer's previous claim that a TDT agent will never regret being a TDT agent given full information is

wrong(hey, I thought it was right too). Either it gives in to a pure 99% rigger or it does not. If it does, then it regrets not being able to 99% rig another TDT agent. If it does not, then it regrets not being a simple hard-coded cooperator against a 99% rigger. This probably could be formalized a bit more, but I'm wondering if Eliezer et. al. have considered this?EDIT2: I realize I was a bit confused before. Feeling a bit stupid. Eliezer never claimed that a TDT agent won't regret being a TDT agent (which is obviously possible, just consider a clique-bot opponent), but that a TDT agent will never regret being given information.

*1 point [-]You've made it into a bargaining game with that mixed strategy, and indeed the version of TDT we introduced will defect against an opponent that outputs the mixed strategy (if that opponent would output a pure cooperate if that were the only way to get its adversary to cooperate). But bargaining games are complicated, and I'm saving that for another post.

It's a better analogy to say that TDT is capable of offering deals (and self-enforcing them) even without traditional means of communication and precommitment (as long as it has some means for inferring its opponent's thinking), and so it has a wider range of options than CDT to optimize over.

Well, offering deals isn't enough, right? The self-enforcing part is really important, and that's where the Nash equilibrium idea comes in: it's self-enforced because no party can gain by unilaterally changing their strategy (which is a somewhat different restriction than no party gaining from unilaterally changing their action).

*0 points [-][

EDIT:I see now what Vaniver was saying, thanks to the diagram in xer reply.] Try writing it out for the Prisoner's Dilemma in such a way that a CDT agent would cooperate with itself in the strategy-strategy game (I mean, such that they would each output a strategy that makes them both cooperate in the original PD), and you'll see the problem. You need toeliminate logically impossible squares of the matrix, not just add new rows.So, let's consider this with three strategies: 1) CooperateBot, 2) DefectBot, 3) CliqueBot.

We now get a table that looks like this (notice the letters are outcomes, not actions):

X has control over which row they get, and Y has control over which column they get. The two Nash equilibria are (2,2) and (3,3). Conveniently, (2,2) and (2,3) have the same result, and so it looks like it'll be easy to figure out that (3,3) is the better Nash equilibrium.

*2 points [-]Aha, I see now what you mean. Good insight!

[EDIT: The following is false.]A clever CDT would be able to act like TDT if it considered, not the choice of whether to output C or D, but the choice of which mathematical object to output (because it could output a mathematical object that evaluates to C or D in a particular way depending on the code of Y—this gives it the option of genuinely acting like TDT would).This has the interesting conclusion that even without the benefit of self-modification, a CDT agent with a good model of the world ends up acting more like TDT than traditional game theorists would expect. (Another example of this is here.) The version of CDT in the last post, contrariwise, is equipped with a very narrow model of the world and of its options.

[End falsehood.]I think these things are fascinating, but I think it's important to show that you can get TDT behavior without incorporating anthropic reasoning, redefinition of its actions, or anything beyond a basic kind of framework that human beings know how to program.

(By the way, I wouldn't call option 3 CliqueBot, because CliqueBots as I defined them have problems mutually cooperating with anything whose outputs aren't identical to theirs. I think it's better for Option 3 to be the TDT algorithm defined in the post.)

After further review, I was wrong that CDT would be capable of making use of this to act like TDT. If CDT treats its output as separate from the rest of the causal graph (in the sense of the previous post), then it would still prefer to output an always-defect rather than a Löbian mathematical object. So it does take a different kind of agent to think of Nash equilibria among strategies.

Also, the combinatorics of enumerating strategies and looking for Nash equilibria are kind of awful: there are 16 different inputs that such a strategy has to deal with (i.e. what the opponent does against CooperateBot, DefectBot, NewcombBot and AntiNewcombBot), so there are 2^16 variant strategies in the same class. The one we call TDT is the Nash equilibrium, but it would take a

longwhile to establish that in a naive implementation.To be a trustworthy tool rather than a potential trap, the source code to Y has to be completely accurate and has to have final decision making authority. Y's programmer has to be able to accurately say "Here is enough information to predict any decision that my irrevokably delegated representative would ever possibly make in this interaction." Saying that this is "without traditional means of communication" is technically accurate but very deceptive. Saying that this is "no actual communication" is outright backwards; if anything it's much

morecommunication than is traditionally imagined.Unlimited communication, even. In the hypothetical "AGIs can inspect each others' source code" case, the AGIs could just as well run that source code and have two separate conversations, one between each original and its counterpart's copy. If the AGIs' senses of ethics were sufficiently situation-dependent, then to generate useful proofs they'd need to be able to inspect copies of each others' current state as well, in which case the two separate conversations might well be identical.

This is all true, but you can come up with situations where exchanging source code is more relevant. For instance, Robin Hanson has frequently argued that agents will diverge from each other as they explore the universe, and that

someonewill start burning the cosmic commons. This is a Prisoner's Dilemma without traditional communication (since signals are limited by lightspeed, it would be too late to stop someone distant from defecting once you see they've started). But the "exchange of source code" kind of coordination is more feasible.Also, I don't know whether anyone polled traditional game theorists, but I'd bet that some of them would have expected it to be impossible, even with exchange of source codes, to achieve anything better than what CliqueBot does.

You're right, and in my humble opinion 'Updateless' is only slightly better.

*2 points [-]"Updateless" does describe the weird attitude of that decision theory to observations. It really does not learn from evidence, and it is a problem, so the weird connotations correspond to actual weirdness of the algorithm. In that sense, "telepathic" also somewhat fits, in that the current models of this style of decision making do require the players to have unreasonable amount of knowledge about each other (although merely reading each other's thoughts

is not enough), but this seems more like a limitation of current models (i.e. concrete examples of algorithms) than the limitation of the overall decision-making style (it's not yet known to what extent it's so).Yes, just not as well as I would like.