Paper: Iterated Prisoner’s Dilemma contains strategies that dominate any evolutionary opponent

Bill "Numerical Recipes" Press and Freeman "Dyson sphere" Dyson have a new paper on iterated prisoner dilemas (IPD). Interestingly they found new surprising results:

It is generally assumed that there exists no simple ultimatum strategy whereby one player can enforce a unilateral claim to an unfair share of rewards. Here, we show that such strategies unexpectedly do exist. In particular, a player X who is witting of these strategies can (i) deterministically set her opponent Y’s score, independently of his strategy or response, or (ii) enforce an extortionate linear relation between her and his scores.

They discuss a special class of strategies - zero determinant (ZD) strategies of which tit-for-tat (TFT) is a special case:

The extortionate ZD strategies have the peculiar property of sharply distinguishing between “sentient” players, who have a theory of mind about their opponents, and “evolutionary” players, who may be arbitrarily good at exploring a ﬁtness landscape (either locally or globally), but who have no theory of mind.

The evolutionary player adjusts his strategy to maximize score, but doesn't take his opponent explicitly into account in another way (hence has "no theory of mind" of the opponent). Possible outcomes are:

If X alone is witting of ZD strategies, then IPD reduces to one of two cases, depending on whether Y has a theory of mind. If Y has a theory of mind, then IPD is simply an ultimatum game (15, 16), where X proposes an unfair division and Y can either accept or reject the proposal. If he does not (or if, equivalently, X has ﬁxed her strategy and then gone to lunch), then the game is dilemma-free for Y. He can maximize his own score only by giving X even more; there is no beneﬁt to him in defecting.

If X and Y are both witting of ZD, then they may choose to negotiate to each set the other’s score to the maximum cooperative value. Unlike naive PD, there is no advantage in defection, because neither can affect his or her own score and each can punish any irrational defection by the other. Nor is this equivalent to the classical TFT strategy (7), which produces indeterminate scores if played by both players.

This latter case sounds like a formalization of Hosfstadter's superrational agents. The cooperation enforcement via cross-setting the scores is very interesting.

Is this connection true or am I misinterpreting it? (This is not my field and I've only skimmed the paper up to now.) What are the implications for FAI? If we'd get into an IPD situation with an agent for which we simply can not put together a theory of mind, do we have to live with extortion? What would effectively mean to have a useful theory of mind in this case?

The paper ends in a grand style (spoiler alert):

It is worth contemplating that, though an evolutionary player Y is so easily beaten within the conﬁnes of the IPD game, it is exactly evolution, on the hugely larger canvas of DNA-based life, that ultimately has produced X, the player with the mind.

This is a rather interesting way to look at the IPD: it seems like there's value here.

If you're not familiar with Markov chains, read up on them first or this is not going to make much sense.

The basic framework is that the IPD can be played by only remembering 1) what happened last round and 2) what to play based on what happened last round. A player's strategy can be written as a vector of four probabilities- how likely they are to cooperate in each case (CC CD DC CC), which they call p for the first player and q for the second.

By combining both vectors, you can determine the transition matrix for the next round, which is a 4 by 4 matrix where all the rows sum to 1. If CooperateBot (1 1 1 1) is playing DefectBot (0 0 0 0), then regardless of what the last round was, the next (and all future) rounds will be CD. If CoinBot (.5 .5 .5 .5) plays CooperateBot, then the next round will be CC with probability .5 and DC with probability .5.

When things get interesting is when the players play strategies that actually make use of the memory. When CoinBot plays TitForTat (1 0 1 0), then if the last round was CC or CD, it will be CC with probability .5, and DC with probability .5. If it was DC or DD last round, then it will be CD with probability .5 and DD with probability .5.

You can then do Markov chain analysis on the transition matrix (and the payoffs) to figure out what the long-run average benefit to each player will be. (When the transition matrix is a closed communicating class, this is easy; when it's not communicating, like two TitForTat players playing each other, then it depends on the initial state. Notice that TitForTat's dynamics can be expressed in this form but not its rule of cooperating on round 1.)

This is where the evolution comes in: if players submitted p and q and then went to lunch, you could calculate the score and be done. But the players are allowed to vary p and q as the game progresses, and a natural choice is simple steepest descent. (This is actually a spot where the paper is a bit fuzzy, and could stand to be more explicit about what they mean by "theory of mind." Players will clearly need a memory longer than one round to run any sort of steepest descent algorithm, and it's much clearer to see how the gradients are calculated if both players are provided with p and q, not just a history of how the game has gone.)

And here's the fascinating result: if player X chooses a p and then does not change it, and player Y sets their q to extract the maximum possible score out of the situation, then X can do better than the cooperation value of 3, because Y does worse (by more than X gains), even though Y is maximizing their average score in the short run. So naive score maximizers can be taken advantage of. If X and Y instead use p and q to communicate, then it becomes a negotiation game like the ultimatum game: X proposes a split of the long run average winnings, and Y either takes it (by playing the strategy that maximizes Y's score) or refuses it (typically by turning into DefectBot). This means Y gives up short-term benefit to convince X that a more equitable division is necessary- which is only a good play when Y expects that X will actually change their behavior.

So even though it's not meaningful to make your probabilities for the next round depend on more than one round's history, it is very meaningful to make the trajectory of your probability vector depend on the history of the other player's probability vector (and your projection of where they'll go in the future). That's what they mean by theory of mind.

Thanks, this is a very useful explanation!

I didn't find this paper particularly interesting, mostly because it doesn't show the strength of extortionate strategies, but rather the limits of evolution in the way the paper defines it, and because these kind of "evolutionary" strategies have never been empirically shown to be particularly successful in IPD matches of infinite length, so their exploitation is not a "significant mathematical feature" as claimed.

To sum up the paper: In a non-zero-sum game of this kind, strategies that only care about gradually improving their own score cannot fully utilize defection as a punishment or deterrent, and thus are permanently exploited by strategies that can.

Note that the kind of evolution the paper talks about has little to do with actual evolution, and the last paragraph is nothing but an empty phrase.

I'm not really able to evaluate the claims in the paper myself, so thanks for the input. Having said that, do you think the paper specifies the strategies with enough detail for us to code them up and test their mettle in a Less Wrong IPD tournament?

Here's the first strategy that is explicitly stated (in the context of the 5/3/1/0 payouts):

If the last outcome was CC, cooperate with probability 11/13.
If the last outcome was CD, cooperate with probability 1/2.
If the last outcome was DC, cooperate with probability 7/26.
If the last outcome was DD, defect.

This supposedly "enforces an extortion factor of 3", whatever that means.

Sure, but this extortionate strategy wouldn't survive long in such a tournament because it would perform poorly against TFT variants as well as against itself.

I know that this isn't exactly what you're asking, but: Stewart and Plotkin tested two variants of ZD strategies in a variant of Axelrod's original tournament; one variant (ZDGTFT-2) had the highest total score, beating TFT.

Here's another strategy computed from the paper: cooperate with probabilities (0.9, 0.7, 0.2, 0.1) in the case (CC, CD, DC, DD). Supposedly this strategy is one that sets the opponent's score to an average of 2, regardless of his or her actions. You could come up with a similar strategy to force any outcome in the interval (1,3), excluding the endpoints.

I've yet to check how this works.

I am now reading the paper with intent to code them into the current simulator. Wish me luck; reading Dyson is a challenge.

Alright, thank you. As far as the last paragraph goes, I took it of course more on the "metaphorical" level. I agree their evolutionary agent might be too restricted to be fully interesting (though it is valuable if their inferiority is demonstrated analytically not only from simulations).

Since it seems you have lot's of experience with IPD, what do you think about the case B)? The paper makes the claim specifically for the ZD strategies, but do you think this "superrationally" result could generalize to any strategy which has also a theory of mind? On the other hand Hofstadter's idea was in the context of one-shot PD, so this might be not apply in general at all... I need to learn more about this subject...

Okay, so this is what happens with the PD strategy in this comment.

Let's try to get an optimal counter-strategy (CS) to the probabilistic strategy above (PS). We work backwards. Suppose we've worked out CS's behavior for the last N-1 turns. Then on the Nth turn, in each of the four possible situations, the probabilities above, and what we've found for CS's behavior, can be used to get us expected payouts for the remainder of the match if we cooperate and if we defect. We choose the action that yields the larger expected payout. This is the optimal strategy to use against this opponent if we want to get a high score.

Note that since PS is stupid and does the same thing on every turn, CS should just defect on the last turn.

However, after working out the math, it appears that CS is actually a very nice strategy. It defects on the last turn, and also on the next-to-last turn if it finds itself in a "CC" situation; in all other cases, it cooperates.

It's obvious that PS, which has some probability of defecting, will win the match against CS, because it's effectively playing against a cooperative rock. In other words, if you play against this strategy and try to maximize your own score, your opponent will have a higher score.

This isn't as ridiculous as it appears! CS isn't "losing" in any significant sense, because the goal we gave it wasn't to win the match; it was to get as many points as possible. In an infinite Prisoner's Dilemma (which is the situation considered in the paper), this is the only reasonable thing to ask, because there's no match to be won. So the "extortion" of PS is actually that if you try to maximize your points against it, it will get even more points than you will.

However, after working out the math, it appears that the optimal strategy against this one is actually a very nice one.

Of course, the same as in a game of chicken where your opponent precommits to defecting.

In infinite IPD:

There are lots of probabilistic strategies your opponent can precommit to that prevent you from averaging CC (in this case: 3).
If your opponent accepts any probabilistic precommitment from you without precommiting himself, you can maximise your score beyond CC.
If you model your opponent as a probabilistic strategy, you accept any probabilistic precommitment from your opponent.

Point 2 may not be obvious, but follows straight from the payoff matrix.

Well, yes; I'm assuming that I know the strategy my opponent is playing, which assumes a precommitment. I'm just trying to explain the reasoning in the paper, without going into determinants and Markov chains and so on.

Upvoted. This seems important/awesome if the paper's "story" is really supported by the math itself, but verifying details and saying something interesting will take time. Thanks much for the link and the explanation :-)

An investigation of evolutionary stability of the ZD strategies can be found in this preprint.

I find the article very interesting, but have trouble following the math. Maybe someone here better at math can help. I do have some understanding of linear algebra, and I've tried to check it with a spreadsheet:

At the very beginning, their closed form solution for V, the stationary vector, seems to allow V's that have negative numbers for the state probabilities. That can't be describing a real game. E.g. if you set p = (0.9, 0.7, 0.2, 0.1) and q = (0.5, 0.5, 0.5, 0.5), you get V = (0.08, -0.08, 0.1, -0.1). [p here is set to the Force-Opponent-Score-Equal-to-2 values, q is a random strategy, and V is calculated by 3x3 determinants of portions of M' as described in the paper]

I don't know how to convert that into a V with no negative numbers. Some of the co-efficients are positive and some negative, so you can't just scale it. Their formula for s_y correctly returns 2, but it's unclear if that corresponds to a real world equilibrium.

Their formula for payoffs sx and sy require division by D(p,q,1). D(p,q,1) can be 0, e.g. for the classic tit-for-tat strategies matched head to head, p = (1,0,1,0) and q = (1,1,0,0). I don't know if that ruins the conclusion or not. If you match their Extort-3 strategy against tit-for-tat, again you get a 0 denominator.

Are these fatal problems? Not sure yet. Their overall conclusion meets with my intuition. They're just saying that if one player only tries to maximize his own score, while the other player is strategic (in terms of denying the first player a higher score), then the second player is going to win in the long term. Except they call the first player "evolutionary," and the second player "sentient."

And two, there's no point being too "smart" (looking back too many moves) when your opponent is "dumb" (looking back only 1 move).

You could say both of these things about the current bargaining position of the US political parties right now.

v cannot have negative entries. It appears that are you are forgetting the signs in the formula for the adjugate.

v is guaranteed to exist and be a valid probability vector as long as M is an irreducible Markov matrix (that is, any state can eventually be reached from any other state). An equivalent and intuitively easier way to calculate v is by repeatedly squaring M: when you do this, all rows of M^k converge to v. This is a consequence of the fact that v is an equilibrium state, i.e., the probability distribution you end up with if you let the Markov chain run forever (from any starting state).

You're right snarles. Thanks for spotting my error. I forgot the signs in the formula for adjugate.

What about the problem of the zero determinant in the denominator? Is that fatal? What's the real world interpretation?

Here's the first strategy that is explicitly stated (in the context of the 5/3/1/0 payouts):

If the last outcome was CC, cooperate with probability 11/13.
If the last outcome was CD, cooperate with probability 1/2.
If the last outcome was DC, cooperate with probability 7/26.
If the last outcome was DD, defect.

This supposedly "enforces an extortion factor of 3", whatever that means.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

39

Paper: Iterated Prisoner’s Dilemma contains strategies that dominate any evolutionary opponent

39

39