We need another tournament!
With humanity and the Paperclipper facing 100 rounds of the iterated Prisoner's Dilemma, do you really, truly think that the rational thing for both parties to do is to steadily defect against each other for the next 100 rounds?
No, because the rationalist thing to do is not to assume that the other player is a perfect rationalist; or, to be more precise, that the other player is a perfect rationalist who assumes that you're a perfect rationalist who assumes that the other player is a perfect rationalist who...
Defecting in the first round only makes sense if you assume that the other player also defects in that round, because the benefit from mutual cooperation is much larger than 1 million human lives (the per-round difference between CC and DD): DD will all but certainly lead to DD in the next round, while CC may lead to another CC, and another, and so on, even if not all the way to the end.
IPD is not about defeating the "opponent"; it is about maximizing points (or lives saved, or whatever). If your opponent defects every round and you cooperate the first round, you save one million fewer lives than you reasonably could have. If you always defect and your opponent plays TFT, you "save" 1 million additional lives in the first round, and lose 1 million for each round your opponent would have cooperated. If you expect your opponent either to always defect or to play some kind of TFT-nD (that is, tit for tat, but defect the last n rounds, with unknown n > 0), then by defecting the first round (and, once TFT retaliates, defecting thereafter) you save 100 million or 102 million lives respectively, while by not defecting first you save 99 million or 199-n million. That means that for you to rationally defect the first round, you must have very, very high confidence that your opponent is not playing some sort of TFT-nD with low n.
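Since the thread never writes the payoff matrix down, here is a minimal simulation sketch of the arithmetic above, assuming the matrix implied by the proof further down the thread (CC = 2/2, DD = 1/1, DC = 3/0, in millions of lives saved); the strategy implementations are illustrative, not anyone's canonical code:

```python
# Payoffs per round, in millions of lives: (our score, their score).
# This matrix is an assumption, chosen to match the proof later in the thread.
PAYOFF = {('C', 'C'): (2, 2), ('C', 'D'): (0, 3),
          ('D', 'C'): (3, 0), ('D', 'D'): (1, 1)}
ROUNDS = 100

def play(us, them):
    """Play 100 rounds; a strategy maps (round, opponent_history) to 'C' or 'D'."""
    ours, theirs, total = [], [], 0
    for r in range(ROUNDS):
        a, b = us(r, theirs), them(r, ours)
        total += PAYOFF[(a, b)][0]
        ours.append(a)
        theirs.append(b)
    return total

def always_defect(r, opp):
    return 'D'

def tft(r, opp):
    # Plain tit for tat: cooperate first, then mirror the opponent's last move.
    return opp[-1] if opp else 'C'

def tft_nd(n):
    # TFT-nD: tit for tat, but defect unconditionally in the last n rounds.
    def strategy(r, opp):
        if r >= ROUNDS - n:
            return 'D'
        return opp[-1] if opp else 'C'
    return strategy

n = 5  # arbitrary example value
print(play(always_defect, always_defect))  # 100: defect first vs AlwaysDefect
print(play(always_defect, tft_nd(n)))      # 102: defect first vs TFT-nD
print(play(tft, always_defect))            #  99: cooperate first vs AlwaysDefect
print(play(tft, tft_nd(n)))                # 194 = 199 - n: cooperate vs TFT-nD
```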
If I know this is your strategy, I don't have to be a perfect rationalist to know that I can defect on the first round.
I don't understand what you mean. Do you mean that if you know Clippy plays TFT-nD, you would defect the first round? Because that would be a very irrational thing to do, sacrificing slightly less than 100 million lives in the process.
Are there strategies that, if publicly announced, will let a more sophisticated player defect on the first round and get away with it? Sure. There are also slightly better strategies that can be publicly announced without allowing for useful first-round defection. Either way, though, the gains from even shaky cooperation in a 100-round game are on the order of 70 or 80 million lives -- letting those gains slip by because you're worried about losing 1 million lives on the first round is a mistake. There's a tendency to worry about losing face, or, as Andreas puts it, not being defeated. But with real stakes on the table, you should only worry about maximizing points. Your pride isn't worth millions of lives.
Any strategy that takes being publicly announced (and precommitted to) into account and still allows the opponent to get away with defecting the first round is a terribly horrible strategy.
Publicly announcing is not actually precommitting. If Clippy says it plays TFT-1D, for how long would you really cooperate?
It seems to me there is little benefit to actually playing the strategy you announced, or to announcing the strategy you actually intend to play.
I mean, "terribly horrible" on what scale? If the criterion is "can it be strictly dominated by another strategy in terms of results if we ignore the cost of making the strategy more complicated," then, sure, a strategy that reliably allows opponents to costlessly defect on the first of 100 rounds fails that criterion. I'd argue that a more interesting set of criteria are "is the expected utility close to the maximum expected utility generated by any strategy," "is the standard deviation in expected utility acceptably low," and "is the strategy simple enough that it can be taught, shared, and implemented with little or no error?" Don't let the perfect become the enemy of the good.
I mean, "terribly horrible" on what scale?
On a scale from 0 to "a million people died because someone was being irrational", it would be around "two million people died because someone was being irrational."
On an unrelated note, the idea of precommitting in a non-repeated IPD is silly: if both players precommit simultaneously (before learning of their opponent's precommitment), it's the same as no one precommitting, since neither can update their strategy with that knowledge; and otherwise it's an asymmetrical problem.
The solution to that asymmetrical problem, if you're the one who has to make the precommitment, is to precommit to simple TFT, I think. (~100% confidence)
I'm not sure what's silly about it. Just because there's only one game of IPD doesn't mean there can't be multiple rounds of communication before, during, and after each iteration.
As for the asymmetrical problem, if you're really close to 100% confident, would you like to bet $500 against my $20 that I can't find hard experimental evidence that there's a better solution than simple TFT, where "better" means that the alternative solution gets a higher score in an arena with a wide variety of strategies? If I do find an arena like that, and you later claim that my strategy only outperformed simple TFT because of something funky about the distribution of strategies, I'll let you bid double or nothing to see if changing the distribution in any plausible way you care to suggest changes the result.
I'm not sure what's silly about it. Just because there's only one game of IPD doesn't mean there can't be multiple rounds of communication before, during, and after each iteration.
I'm not talking about communication in general, only about one-time precommitting. If it's asymmetrical, I don't consider it real IPD anymore; and if it isn't, I fail to see how it would change anything, unless you can find some way for both players to precommit while taking their opponent's precommitment into consideration.
I didn't mean to offend you by saying that it was stupid; I just don't find it an interesting problem from a game theory perspective.
As for the bet, it is easy to set up an arena of strategies where any particular strategy would fare worse than a certain other strategy—in this case, a pool of RandomBot derivatives would be a fine arena for DefectBot to excel in. Also, "a wide variety" is far too arbitrary for any bets.
What I had in mind as the "best" strategy is the strategy that scores highest against the strategy that is designed to score highest against it. TFT scores 198 points (paperclips, millions of lives saved, whatever) against 99C1D, which is the best way to deal with TFT. However, I should probably just give you the $500, since it turns out 198 is way too low even for a purely deterministic strategy. I'm now ~100% certain (haha!) that the highest possible score is 248; and if you consider non-deterministic strategies, the score increases to 248.5 - e. And sadly, there's no way for you to score higher than 101 (or 100.5 + e, respectively) against this strategy.
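A quick check of the 198/201 numbers, under the same assumed payoff matrix as the sketch above (CC = 2/2, DD = 1/1, DC = 3/0):

```python
# TFT vs 99C1D under the assumed payoffs. TFT mirrors the previous move, so
# against 99C1D it cooperates all 100 rounds: the lone defection comes too
# late to be punished.
PAYOFF = {('C', 'C'): (2, 2), ('C', 'D'): (0, 3),
          ('D', 'C'): (3, 0), ('D', 'D'): (1, 1)}

def score(ours, theirs):
    return sum(PAYOFF[(a, b)][0] for a, b in zip(ours, theirs))

tft_moves = ['C'] * 100          # TFT's actual move sequence vs 99C1D
c99_1d    = ['C'] * 99 + ['D']   # cooperate 99 rounds, defect the last

print(score(tft_moves, c99_1d))  # 198: TFT against its best response
print(score(c99_1d, tft_moves))  # 201: what 99C1D extracts from TFT
```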
You seem to have left this as an exercise to the reader, so I've worked it out with proof :) If we can publicly precommit and the opponent can't, we should commit to the strategy: "Defect for the first fifty rounds. If the opponent fails to cooperate even once during that time, defect for the last fifty as well. Otherwise play TFT for the last fifty rounds."
Taking the other guy's side for a moment, if we know the opponent is using this strategy, how do we get the most points? If we bow to the implied threat and cooperate for the first fifty rounds, then cooperate for the last fifty as well, we get 100 points, all during the last fifty rounds. We can bump this up to 101 by defecting on the last round. Meanwhile the opponent wins 150 + 98 = 248 points. If we defect at least once during the first fifty, we might as well defect for all 100 rounds, since that is what the opponent will do, and this gets only 100 points, so this strategy is not as good. So, as you claimed, this blackmail strategy wins 248 points against the strategy that wins the most points against it.
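A sketch checking both lines of play, again under the assumed payoffs (CC = 2/2, DD = 1/1, DC = 3/0):

```python
# If the opponent complies, the blackmailer defects for 50 rounds and then,
# playing TFT, simply echoes the opponent's cooperation for the last 50.
PAYOFF = {('C', 'C'): (2, 2), ('C', 'D'): (0, 3),
          ('D', 'C'): (3, 0), ('D', 'D'): (1, 1)}

def score(ours, theirs):
    return sum(PAYOFF[(a, b)][0] for a, b in zip(ours, theirs))

blackmailer = ['D'] * 50 + ['C'] * 50
compliant   = ['C'] * 99 + ['D']    # submit, then grab one point on round 100

print(score(blackmailer, compliant))    # 248
print(score(compliant, blackmailer))    # 101

# The defiant alternative: refuse the threat and defect all 100 rounds.
print(score(['D'] * 100, ['D'] * 100))  # 100 < 101, so complying is better
```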
I think the following is a proof that 248 is optimal:
Suppose there are A cases of we-cooperate-opponent-defects, B cases of we-cooperate-opponent-cooperates, C cases of we-defect-opponent-defects, and 100-A-B-C cases of we-defect-opponent-cooperates. Then the opponent wins 3A + 2B + C points. This must be >= 100, because otherwise the opponent can win more by defecting all the time instead of doing what we want him to do. Meanwhile we win 2B + C + 3(100-A-B-C) = 300 - 3A - B - 2C points, and we want to maximize this number subject to the aforementioned constraint. Additional constraints are that A + B + C <= 100 and that A, B, C are all >= 0. This is a linear programming problem, for which the optimal solution always lies at one of the corners of the allowed domain. The corners of the allowed domain are at (A, B, C) = (0, 0, 100), (0, 100, 0), (0, 50, 0), (100, 0, 0), and (100/3, 0, 0), where our score comes out to 100, 200, 250, 0, and 200 respectively.
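A brute-force check over integer (A, B, C) confirms the optimum without relying on the corner enumeration:

```python
# Maximize our score 300 - 3A - B - 2C subject to 3A + 2B + C >= 100
# (the opponent must do at least as well as by always defecting),
# A + B + C <= 100, and A, B, C >= 0.
best = max(
    (300 - 3*A - B - 2*C, (A, B, C))
    for A in range(101)
    for B in range(101 - A)
    for C in range(101 - A - B)
    if 3*A + 2*B + C >= 100
)
print(best)  # (250, (0, 50, 0)): 50 mutual cooperations, 50 one-sided defections
```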
So the best we can hope for is 250 points. However, we can't really guarantee getting 250, because in the case where we get 250 points the opponent gets only 100, so he is indifferent between allowing us the 250 and defecting all the time, which allows us only 100. To ensure that the opponent follows our plan, we actually need to give him at least 101 points.
Suppose we defect 49 times while the opponent cooperates, followed by 51 mutual cooperations. This would give us 49*3 + 51*2 = 249 points, and the opponent 51*2 = 102 points. However, we can never force the opponent to cooperate on the last move, since we have no ability to punish him for defecting, so we can't actually make this happen. The opponent is going to defect on the last move no matter what our strategy is. So 249 points is out of reach as well.
We can win 248 points by force against a rational self-interested opponent using the strategy in the first paragraph, so that is indeed the maximum, as you claimed.
I haven't thought about nondeterministic strategies though.
This should have been called The Iterated True Prisoner's Dilemma. I thought the number of rounds was going to be randomized on a waiting-time distribution.
The point that I didn't see made yet is that it's only Iterated Prisoner's Dilemma for the first N-1 times. After that it's simply single PD.
Of course, N-1 becomes the new N, so it's just a combination of providing a signal to the other player about what kind of decisions you make while also interpreting the signals they are sending about the kind of decisions they make. Mutual cooperation is the best likely outcome over a number of iterations, so the goal of the signalling should be to convince the other player to cooperate. "Cooperate now and I will cooperate next round, defect now and I will defect next round" is a perfectly valid signal to try to send.
The point that I didn't see made yet is that it's only Iterated Prisoner's Dilemma for the first N-1 times. After that it's simply single PD.
It is not correct to say that the last round is simple PD, because each player has additional information at that point. Consequently, while the decision for round N is the same as in PD, N-1 does not actually become the new N.
Each player has additional information about how the other player has played in the past. I wasn't trying to say that iterated PD for N=100 rounds becomes N=0, I was saying it becomes N=99, followed by one straight game.
Also, new information about how a player has behaved in radically different circumstances is not the same as being able to rationally update on what they will do in the future. You have never before encountered this agent in circumstances where they are interacting with you for the last time, ever.
Each player has additional information about how the other player has played in the past. I wasn't trying to say that iterated PD for N=100 rounds becomes N=0, I was saying it becomes N=99, followed by one straight game.
And that is inaccurate, because your decision in round 99 may affect Clippy's decision in round 100. There's no rule anywhere that says Clippy isn't, for example, assuming that your decision-making processes are similar, and that if it decided to cooperate in the last round after 99 identical turns, there'd be a good chance that you'd cooperate as well because of that. Sure, that's not a very likely scenario, and obviously you should always defect the last round—but this shows why N=100 does not ever become N=99.
(I'm not very familiar with EDT, but it seems like a decision theory that would be prone to not defecting in the last round after 99 identical rounds.)
And my decision in round 99 is also part of Iterated Prisoner's Dilemma. My decision in round 100 is no longer iterated PD, but normal PD with additional information about how my partner played an IPD.
Key feature of PD as opposed to IPD: In Prisoner's Dilemma, you will never, ever, interact with the other player again. If that's a possibility, then you are playing a game with many similarities but a different premise.
As you said, a key feature of PD is that you're not ever going to interact with the other player again, so while the last round may perhaps be interpreted as PD, the second-to-last round may not.
Of course you could just as well argue that a key feature of PD is also that you have never interacted with the other player before. That's my point of view, but in the end this is an academic question.
It doesn't matter: 100-round IPD only becomes 99-round IPD if you have 100% confidence that Clippy's decision in round 100 is not in any way causally related to your actual decisions in rounds 1..99.
If I pick 100 people randomly off the street and let them play ordinary PD, how many do you think will cooperate, even though it may not make sense to you or me? And here you're playing with a paperclip maximizer you know nothing about.
I really don't think you should have that kind of confidence.
I don't think having no information about the other player is part of PD. If you do, then it's not academic at all; it's a definitional distinction that matters!
After those 99 rounds have been played, is the game PD or isn't it?
Oh, and if you pick me to participate in the closest approximation of PD that you can provide, I will cooperate, take my reward (if any), and then explain that the differences between the approximation and actual PD were key to my decision, because I prefer to live in a world where cooperation happens in pseudo-PD situations.
After those 99 rounds have been played, is the game PD or isn't it?
No, it isn't. But someone who defects in ordinary PD might defect the last round in IPD for the same reasons. I certainly would.
[...] because I prefer to live in a world where cooperation happens in pseudo-PD situations.
You are looking for excuses instead of considering the least convenient possible world. Would you cooperate in this problem?
I'm not in an abstract game. I only play games with some concrete aspect. If you asked me, on the street, to play a game with a stranger and I recognized the PD setup, I would participate, but I would not be playing Prisoner's Dilemma; I would be playing a meta-version which also has meta-payoffs.
Today's post, The Truly Iterated Prisoner's Dilemma, was originally published on 04 September 2008. A summary (taken from the LW wiki):
Discuss the post here (rather than in the comments to the original post).
This post is part of the Rerunning the Sequences series, where we'll be going through Eliezer Yudkowsky's old posts in order so that people who are interested can (re-)read and discuss them. The previous post was The True Prisoner's Dilemma, and you can use the sequence_reruns tag or rss feed to follow the rest of the series.
Sequence reruns are a community-driven effort. You can participate by re-reading the sequence post, discussing it here, posting the next day's sequence reruns post, or summarizing forthcoming articles on the wiki. Go here for more details, or to have meta discussions about the Rerunning the Sequences series.