- Actually, this is an open problem so far as I know: show that if X is a Naive Decision Theory agent as above, with some analyzable inference module like a halting oracle, then there exists an agent Y written so that X cooperates against Y in a Prisoner's Dilemma while Y defects.

Let me just spell out to myself what would have to happen in this instance. For definiteness, let's take the payoffs in prisoner's dilemma to be $0 (CD), $1 (DD), $10 (CC) and $11 (DC).

Now, if X is going to co-operate and Y is going to defect then X is going to prove "If I co-operate then I get $0". Therefore, in order to co-operate, X must also prove the spurious counterfactual "If I defect then I get $x" for some negative value of x.

But suppose I tweak the definition of the NDT agent so that whenever it can prove (1) "if output = a then utility >= u" and (2) "if output != a then utility <= u" it will immediately output a. (And if several statements of the forms (1) and (2) have been proved then the agent searches for them in the order that they were proved) Note that our agent will quickly prove "if output = 'defect' then utility >= $1". So if it ever managed to prove "if output = 'co-operate' then utility = $0" it would defect right away.

Since I have tweaked the definition, this doesn't address your 'open problem' (which I think is a very interesting one) but it does show that if we replace the NDT agent with something only slightly less naive, then the answer is that no such Y exists.

(We could replace Prisoner's Dilemma with an alternative game where each player has a third option called "nuclear holocaust", such that if either player opts for nuclear holocaust then both get (say) -$1, and ask the same question as in your note 2. Then even for the tweaked version of X it's not clear that no such Y exists.)

**ETA:** I'm afraid my idea doesn't work: The problem is that the agent will *also* quickly prove "if 'co-operate' then I receive at least $0." So if it can prove the spurious counterfactual "if 'defect' then receive -1" before proving the 'real' counterfactual "if 'co-operate' then receive 0" then it will co-operate.

We could patch this up with a rule that said "if we deduce a contradiction from the assumption 'output = a' then immediately output a" which, if I remember rightly, is Nesov's idea about "playing chicken with the inconsistency". Then on deducing the spurious counterfactual "if 'defect' then receive -1" the agent would immediately defect, which could only happen if the agent itself were inconsistent. So if the agent is consistent, it will never deduce this spurious counterfactual. But of course, this is getting even further away from the original "NDT".

Your intuition that it gets deduced before any of the spurious claims like "if output = 'defect' then utility <= -$1" is taking advantage of an authoritative payoff matrix that X can't safely calculate xerself. I'm not sure that this tweaked version is any safer from exploitation...

*0 points [-]Why not? Can't the payoff matrix be "read off" from the "world program" (assuming X isn't just 'given' the payoff matrix as an argument.)