Followup to: The True Prisoner's Dilemma
For everyone who thought that the rational choice in yesterday's True Prisoner's Dilemma was to defect, a follow-up dilemma:
Suppose that the dilemma was not one-shot, but was rather to be repeated exactly 100 times, where for each round, the payoff matrix looks like this:
|  | Humans: C | Humans: D |
| --- | --- | --- |
| Paperclipper: C | (+2 million human lives saved, +2 paperclips gained) | (+3 million lives, +0 paperclips) |
| Paperclipper: D | (+0 lives, +3 paperclips) | (+1 million lives, +1 paperclip) |
As most of you probably know, the king of the classical iterated Prisoner's Dilemma is Tit for Tat, which cooperates on the first round and on each succeeding round does whatever its opponent did the round before. But what most of you may not realize is that, if you know when the iteration will stop, Tit for Tat is - according to classical game theory - irrational.
Why? Consider the 100th round. On the 100th round, there will be no future iterations, no chance to retaliate against the other player for defection. Both of you know this, so the game reduces to the one-shot Prisoner's Dilemma. Since you are both classical game theorists, you both defect.
Now consider the 99th round. Both of you know that you will both defect in the 100th round, regardless of what either of you does in the 99th round. So you both know that your choice in the 99th round affects only your payoff in that round, not any future payoff. You are both classical game theorists. So you both defect.
Now consider the 98th round...
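To see the unraveling concretely, here is a minimal sketch of the backward-induction argument (an illustration, not part of the original post), using the human side's payoffs in millions of lives from the matrix above. Because the continuation value is the same whichever move you pick now, every round reduces to the one-shot game:

```python
# Backward induction over 100 rounds. Payoffs are the human side's, in
# millions of lives, keyed by (our_move, their_move).
PAYOFF = {
    ("C", "C"): 2, ("C", "D"): 0,
    ("D", "C"): 3, ("D", "D"): 1,
}

def best_reply(their_move, continuation):
    # The continuation value is added to both options, so it never
    # changes which move wins; each round reduces to the one-shot game.
    return max("CD", key=lambda us: PAYOFF[(us, their_move)] + continuation)

continuation = 0
for round_no in range(100, 0, -1):  # reason backward from the last round
    # Defection is the best reply whether the opponent plays C or D...
    assert all(best_reply(them, continuation) == "D" for them in "CD")
    # ...so two classical game theorists both defect, earning (D, D).
    continuation += PAYOFF[("D", "D")]

print("Classical play: defect every round; human total =", continuation)
# -> 100 (versus 200 for a hundred rounds of mutual cooperation)
```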
With humanity and the Paperclipper facing 100 rounds of the iterated Prisoner's Dilemma, do you really truly think that the rational thing for both parties to do, is steadily defect against each other for the next 100 rounds?
Hi. Found the site about a week ago. I read the TDT paper and was intrigued enough to start poring through Eliezer's old posts. I've been working my way through the sequences and following backlinks. The material on rationality has helped me reconstruct my brain after a Halt, Melt and Catch Fire event. Good stuff.
I observe that comments on old posts are welcome, and I notice no one has yet come back to this post with the full formal solution for this dilemma since the publication of TDT. So here it is.
Whatever our opponent's decision algorithm may be, it will either depend to some degree on a prediction of our behavior, or it will not. It can only rationally base its decision on a prediction of our behavior to the extent that it believes a) we will attempt to predict its own behavior; and b) we will only cooperate to the extent that we believe it will cooperate. It will thus be incentivized to cooperate to the extent that it believes we can and will successfully condition our behavior on its own. To the extent that it chooses independently of any prediction of our behavior, its only rational choice is to defect. Any other choices it could make will do worse than the above decisions in all cases, and the following strategy will gain extra utility against any such suboptimal choices, as will become clear.
There are thus two unknown probabilities for us to condition on: The probability that the opponent will choose to cooperate iff it believes we will cooperate, which I'll call P(c), and the probability that the opponent will be able to successfully predict our action, which I'll call P(p).
We want to calculate the utility of cooperating, u(C), and the utility of defecting, u(D), for each relevant case, measuring utility in millions of human lives saved. So we shut up and multiply.
If the opponent is uncooperative (~c), it always defects. Thus u(C|~c) = 0 and u(D|~c) = 1.
In cases where a potentially cooperative opponent successfully predicts our action, we have u(C|c,p) = 2 and u(D|c,p) = 1. When such an opponent guesses our action incorrectly, we have u(C|c,~p) = 0 and u(D|c,~p) = 3.
Thus we have:
u(C) = 2 P(c) P(p)
u(D) = P(~c) + P(c) P(p) + 3 P(c) P(~p) = 1 - P(c) + P(c) P(p) + 3 P(c) (1 - P(p)) = 1 + 2 P(c) - 2 P(c) P(p)
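As a quick sanity check on the algebra (a sketch of my own, using only the definitions above), the case-by-case form of u(D) can be compared numerically against the simplified form:

```python
def u_C(pc, pp):
    # u(C) = 2 P(c) P(p): cooperating only pays off when a conditional
    # cooperator correctly predicts our cooperation.
    return 2 * pc * pp

def u_D(pc, pp):
    # Simplified form of u(D) = P(~c) + P(c) P(p) + 3 P(c) P(~p).
    return 1 + 2 * pc * (1 - pp)

# Compare the simplification against the unsimplified case analysis.
for pc in (0.0, 0.25, 0.5, 0.75, 1.0):
    for pp in (0.5, 0.75, 1.0):
        long_form = (1 - pc) * 1 + pc * pp * 1 + pc * (1 - pp) * 3
        assert abs(u_D(pc, pp) - long_form) < 1e-12
```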
We consider the one-shot dilemma first. An intelligent opponent can be assumed to predict our behavior at least better than chance (P(p) > 0.5), perhaps approaching perfection (P(p) ~ 1) if it is a superintelligence. In the worst case (P(p) ~ 0.5), u(C) ~ P(c) and u(D) ~ 1 + P(c), so we should certainly defect. In the best case (P(p) ~ 1), u(C) ~ 2 P(c) and u(D) ~ 1, so we should defect if P(c) < 0.5, that is, if we assess that our opponent is even slightly more likely to defect unconditionally than to consider cooperation. If we have optimistic priors for both probabilities, from applicable previous experience or immediate observational cues, we may choose to cooperate: we plug in our numbers, and shut up and multiply.
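Plugging in some illustrative numbers of my own for the two limiting cases:

```python
u_C = lambda pc, pp: 2 * pc * pp            # utility of cooperating
u_D = lambda pc, pp: 1 + 2 * pc * (1 - pp)  # utility of defecting

# Near-perfect predictor (P(p) ~ 1): cooperate iff P(c) > 0.5.
print(u_C(0.6, 0.99), u_D(0.6, 0.99))  # 1.188 vs 1.012 -> cooperate
print(u_C(0.4, 0.99), u_D(0.4, 0.99))  # 0.792 vs 1.008 -> defect

# Coin-flip predictor (P(p) = 0.5): defection wins for any P(c).
print(u_C(1.0, 0.5), u_D(1.0, 0.5))    # 1.0 vs 2.0 -> defect
```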
In the iterated case, we have the opportunity to observe our opponent's behavior and update our priors as we go. We are incentivized to cooperate when we believe the opponent will cooperate, and to defect when we believe it will defect or when we believe we can defect without being anticipated. Both players are incentivized to cooperate more often than they defect when each believes the other is good at predicting them. A player with a dominating edge in predictive capability can potentially attain a better result than pure mutual cooperation against a weaker opponent, through occasional strategic defections: the weaker player may find itself incentivized not to punish a defection if it realizes it cannot do so without being anticipated, losing just as many utilons as the superior player would lose from the punishment. To the extent that the superior predictor can ascertain that its opponent is savvy enough to know when it is dominated, and would choose not to lose further utilons through vindictive play, such a strategy may be profitable.
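As a toy illustration of that updating process (again my own construction, and deliberately simplified: it tracks only the opponent's observed cooperation rate, ignoring how the opponent is modeling us in return), one could maintain a Beta posterior over P(c) and act on the expected utilities derived above:

```python
def choose(pc_mean, pp):
    # Pick whichever action has higher expected utility under the current
    # estimate of P(c) and an assumed prediction accuracy P(p).
    u_c = 2 * pc_mean * pp
    u_d = 1 + 2 * pc_mean * (1 - pp)
    return "C" if u_c > u_d else "D"

def play(opponent_moves, pp=0.9, a=1.0, b=1.0):
    """Replay a fixed record of opponent moves, updating a Beta(a, b)
    posterior over P(c) after each round."""
    ours = []
    for them in opponent_moves:
        ours.append(choose(a / (a + b), pp))
        if them == "C":
            a += 1
        else:
            b += 1
    return "".join(ours)

# Against a consistently cooperative record, the posterior mean on P(c)
# rises past the threshold and we switch from defection to cooperation:
print(play("C" * 10))  # -> "DCCCCCCCCC"
```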
Thus the spoils go to the algorithm best able to predict its opponent. Skilled poker players or experts at "Rock-Paper-Scissors" could perform quite well in such contests against the average human. That could be fun to watch.
Nice analysis. One small tweak: I would precommit to being vindictive as hell if I believe I'm dominated by my opponent in modeling capability.