This post is meant to be a reference for some basic properties of Nash equilibria in symmetric zero-sum games. I'll focus on two-player games for simplicity, though the content of the post straightforwardly generalizes to the case of N>2 players. I won't assume familiarity with game theory, though the post may still be somewhat technical, so consider yourself appropriately warned.
Preliminaries
Introducing the terminology
For the purposes of this post, an N-player game is characterized by a reward function r:∏iAi→∏iR which takes an N-tuple of player "actions" (a1,a2,…,aN) and maps this to an N-tuple (r1,r2,…,rN) of rewards that are received by each of the N players. We will interpret "action" liberally to mean "deterministic strategy", so an element of Ai is a deterministic strategy that encodes the moves that player i will make in every possible situation that could arise in the game. We will work explicitly with finite games: these are games for which the sets Ai are finite sets. The underlying game could itself involve randomness: in this case r simply represents the expected reward of playing the game.
We will also require a probabilistic notion of strategy. A strategy si for player i is simply a probability distribution over the finite set of actions Ai, so it's a function si:Ai→[0,1] whose values sum to 1. I will denote the set of all strategies of player i by Si, and I will also engage in some operator overloading to extend the definition of the reward function r to strategies in the obvious way by treating the decisions of distinct players as independent events and taking expectations:
r(s_1,\dots,s_N)=\sum_{a_1\in A_1,\,\dots,\,a_N\in A_N} r(a_1,\dots,a_N)\prod_i s_i(a_i)
A game is said to be zero-sum if the reward function r has the property that ∑iri=0 regardless of the actions chosen by the players, and symmetric if the action sets Ai are all identical for all of the players and the reward function is covariant under (or commutes with) permutations of its inputs: if we apply some permutation to which players are playing which actions, the associated reward vector must also get permuted in the same way. Intuitively, this means in a symmetric game there's really "no distinction" between the players of the game.
When the game is a two-player game, the zero-sum condition gets rid of one of the degrees of freedom in the definition of reward, so putting all of the above together, we can view a two-player, symmetric zero-sum game as represented by a function r:A2→R containing the reward from the point of view of the first player. Positive rewards from the first player's point of view are negative rewards from the second player's point of view and vice versa, so this function contains all of the relevant information we need to know about the game. The condition of symmetry just means that r must be skew-symmetric: we must have r(a,b)=−r(b,a) for all (a,b)∈A2.
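To make this concrete, here's a minimal Python sketch (the names R and reward are my own choices, not anything from the post or a standard library) of how such a game can be stored and how the reward function extends to mixed strategies, with rock-paper-scissors as the running example:

```python
import numpy as np

# Rock-paper-scissors from the first player's point of view: rows and columns
# are (rock, paper, scissors), and R[a, b] = r(a, b).
R = np.array([
    [ 0, -1,  1],   # rock: loses to paper, beats scissors
    [ 1,  0, -1],   # paper: beats rock, loses to scissors
    [-1,  1,  0],   # scissors: loses to rock, beats paper
])

# Symmetry of the game shows up as skew-symmetry of the matrix.
assert np.array_equal(R, -R.T)

def reward(x, s):
    """Expected reward r(x, s) of mixed strategy x against mixed strategy s."""
    return x @ R @ s

uniform = np.ones(3) / 3
rock = np.array([1.0, 0.0, 0.0])
print(reward(rock, uniform))   # 0.0: every strategy scores even against uniform play
```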
Why are these games interesting?
Zero-sum games are fairly common: most two-player games with a winner and a loser fall into this category. For instance, chess is straightforwardly interpreted as a zero-sum game by saying that a win is worth 1, a draw is worth 0 and a loss is worth −1, and similar interpretations are possible for other board games, card games, etc. as well. It turns out a game being zero-sum imposes some important constraints on how the "meta-game" can evolve, and this is part of what I want to talk about in this post.
The condition that the game should be symmetric might seem too strong at first, as most well-known games are not symmetric: for instance, any game with a first-mover advantage would not be symmetric. However, we can always make a non-symmetric N-player game into a symmetric game by simply randomizing who gets which player's position. For instance, white and black obviously have different action spaces in chess, but the slightly expanded game in which we first flip a coin to decide who gets white and then play chess according to the outcome of the toss is indeed a symmetric game. As matchups are indeed randomized in this way in most games, we can straightforwardly interpret them as symmetric games even though they don't look symmetric at first glance.
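For concreteness, here's a small sketch of this symmetrization trick for a two-player zero-sum game given by a payoff matrix; the function name and the encoding of action pairs are illustrative choices of my own:

```python
import numpy as np

def symmetrize(M):
    """Symmetrize a zero-sum game with row-player payoff matrix M (shape m x n)
    by flipping a fair coin for who gets which seat.

    An action in the symmetrized game is a pair (i, j): play row-action i if
    assigned the row seat, column-action j if assigned the column seat."""
    m, n = M.shape
    R = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            for k in range(m):
                for l in range(n):
                    # Half the time I sit in the row seat and play i against their l;
                    # half the time I sit in the column seat and play j against their k.
                    R[i * n + j, k * n + l] = 0.5 * (M[i, l] - M[k, j])
    return R

# Matching pennies is zero-sum but not symmetric as written; its symmetrization is.
M = np.array([[1, -1], [-1, 1]])
assert np.allclose(symmetrize(M), -symmetrize(M).T)   # skew-symmetric, as expected
```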
What is a Nash equilibrium?
A Nash equilibrium is intuitively defined to be a collection of strategies, one for each player, such that when these strategies are being played, no player can make themselves better off by switching to a different strategy while the strategies of all other players are held fixed. In our context of two-player symmetric zero-sum games, a Nash equilibrium is a pair of strategies (a,b)∈S2 such that r(a,b)=maxxr(x,b)=minyr(a,y). A Nash equilibrium is symmetric if it consists of two identical strategies, i.e. if it's of the form (s,s) for some s∈S.
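Using the matrix representation from the earlier sketch, this equilibrium condition is easy to check numerically; by linearity it's enough to compare against pure-action deviations. This is just an illustrative helper of my own:

```python
import numpy as np

def is_nash(a, b, R, tol=1e-9):
    """Check r(a, b) = max_x r(x, b) = min_y r(a, y) for mixed strategies a, b.

    By linearity, the max and min over mixed strategies are attained at pure
    actions, so it suffices to scan the rows and columns of R."""
    value = a @ R @ b
    return np.max(R @ b) <= value + tol and np.min(a @ R) >= value - tol

R = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])   # rock-paper-scissors
uniform = np.ones(3) / 3
print(is_nash(uniform, uniform, R))                  # True: the symmetric equilibrium
print(is_nash(np.array([1.0, 0, 0]), uniform, R))    # False: player 2 would deviate to paper
```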
It's a general fact about games with finite action spaces that they must have at least one Nash equilibrium. In our specific case, if (a,b) is a Nash equilibrium, then it must have a reward of zero: if it did not, the player getting the worse end of the deal could make themselves better off by simply copying their opponent's strategy, which scores zero by skew-symmetry. This leads us to the first somewhat nontrivial result:
Theorem 1. A two-player symmetric zero-sum game must have a symmetric Nash equilibrium.
Proof. We know that there must be some Nash equilibrium (a,b), and we know that we must have r(b,b)=r(a,b)=0 by the above arguments. If (b,b) were not a Nash equilibrium itself then there would be some strategy c such that r(c,b)>r(b,b)=0, but this contradicts the fact that (a,b) is a Nash equilibrium: a is a best response to b, so r(c,b)≤r(a,b)=0 for every strategy c. So in fact (b,b) is a Nash equilibrium as well and we're done.
While this proof is perfectly valid if we assume the well-known result that Nash equilibria exist in general for games with finite action spaces, it's not a very explicit proof. The first main task of the post will be to give a proof of this theorem that's more explicit.
I'll list some other miscellaneous claims without giving proofs, as the arguments are straightforward and similar to the above:
Claim 2. Any two strategies, each of which is played in some (not necessarily the same) Nash equilibrium of a two-player symmetric zero-sum game, must get zero reward when matched up against each other.
Claim 3. A strategy is played by either player in some Nash equilibrium of a two-player symmetric zero-sum game if and only if it gets zero reward in its worst possible matchup. Symmetric Nash equilibria are exactly the pairs in which both players play the same such strategy.
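Claim 3 also gives a simple computational test for whether a given strategy can appear in a Nash equilibrium. A small sketch, again with ad hoc helper names of my own:

```python
import numpy as np

def worst_matchup(s, R):
    """Reward of strategy s in its worst possible matchup; by linearity the worst
    case is attained by some pure action of the opponent."""
    return float(np.min(s @ R))

def is_nash_strategy(s, R, tol=1e-9):
    """Claim 3: s appears in some Nash equilibrium iff its worst matchup scores 0."""
    return abs(worst_matchup(s, R)) < tol

R = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])   # rock-paper-scissors
print(is_nash_strategy(np.ones(3) / 3, R))           # True: uniform play is unexploitable
print(is_nash_strategy(np.array([1.0, 0, 0]), R))    # False: pure rock loses to paper
```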
Existence of Nash equilibria
While I will go on to give a different and simpler proof of Theorem 1 for this particular case in a later section, I want to start with a method of establishing the existence of Nash equilibria that actually works in general: an explicit proof of Theorem 1 that assumes only the Brouwer fixed-point theorem. In case the theorem is unfamiliar, I quote a version of it that will be applicable to our context below:
Brouwer's fixed point theorem. Let D⊂Rn be compact and convex, and let f:D→D be a continuous function from D to itself. Then f has a fixed point.
This theorem will be relevant for our purposes because given that we're considering only finite games, S is a simplex in some high dimensional real vector space, and is therefore compact and convex. Therefore, any continuous function S→S we exhibit is going to have a fixed point.
This suggests a way to prove Theorem 1. Suppose that we define the function f as follows: send s∈S to the strategy that has the best performance against s. In this case, it looks like a Nash equilibrium would give a fixed point, so we should be able to do something here.
The problem is that f defined this way is not actually a function S→S, as for any given s∈S there will typically be many strategies in S that achieve the same reward when matched against it. So f actually defines a function S→P(S) sending s to the set of strategies that have the best possible performance against s, and there's no obvious way to apply Brouwer's fixed point theorem to such a function.
It turns out that another fixed point theorem called the Kakutani fixed point theorem can be used to make the above argument work, but I don't want to invoke this theorem as it's instructive to try to salvage the proof based solely on the Brouwer fixed point theorem.
Fixing the indeterminacy
The main problem with our strategy is the lack of uniqueness in the definition of f. How can we remedy this?
Explicitly, the definition we "want" to write down is
f(s)=\operatorname{argmax}_x\, r(x,s)
Our problem is that x→r(x,s) does not necessarily have a unique maximum. However, we can change this. Recall that we can expand r as a sum over actions as follows:
f(s)=\operatorname{argmax}_x \sum_{a_1,a_2\in A} r(a_1,a_2)\, x(a_1)\, s(a_2)
The right hand side is simply a linear function, so with the constraint that x must be a legitimate probability distribution, here we're just trying to maximize a linear function over a simplex. It's not surprising that the maximizer can fail to be unique, as linear functions are not strictly concave: whenever several vertices of the simplex tie, every point of the face they span is a maximizer.
The key insight here is that while linear functions are not strictly concave, they are indeed concave, so only a small perturbation is needed to make that concavity a strict property. We can achieve this by simply adding a strictly concave term to the objective we're trying to maximize, and a particularly natural choice is to pick this term so that it achieves its global minimum at s. While other distances could also work, we can get away with simply using the Euclidean distance, also known as the L2 distance.
Explicitly, I'm recommending that we define f instead as
f(s)=\operatorname{argmax}_x \left[\, r(x,s)-\eta\, D_{L^2}(s\,\|\,x) \,\right]
for some η>0, with D_{L^2} defined as
D_{L^2}(s_1\,\|\,s_2)=\sum_{a\in A}\bigl(s_1(a)-s_2(a)\bigr)^2
The key properties we need to check are the following:
1. The maximization problem on the right hand side admits a unique solution x.
2. The function f:S→S defined like this is continuous.
3. Any fixed point of f gives rise to a symmetric Nash equilibrium of the original game.
The first condition
Let's check these properties one by one. For the first condition, we note that the maximization problem is to maximize a strictly concave and continuous function over a convex and compact subset of Rn, so we know it has a unique solution: the maximum exists by continuity of the function and the compactness of the domain, and it's unique by strict concavity and the convexity of the domain. So this point is easy to deal with.
The second condition (technical)
The second condition requires some more thought. However, we can give a geometric interpretation as follows: because r(x,s) is linear in x and the penalty term is quadratic in x, we can complete the square in the objective. Up to an additive constant that does not depend on x, the objective equals −η times the squared Euclidean distance from x to the point c(s)=s+r(⋅,s)/(2η), where r(⋅,s) denotes the vector whose entries are the rewards of each action against s. The point c(s) might not lie in S itself, but it varies continuously with s, and f(s) is exactly the closest point of S to c(s).
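To make this picture concrete, here's a small sketch of f computed exactly this way: form the point c(s) and project it onto the simplex with a standard sorting-based projection routine. The helper names and the default η are my own choices:

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sorting-based)."""
    u = np.sort(v)[::-1]                   # entries in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0)

def f(s, R, eta=1.0):
    """Regularized best response: argmax_x [ r(x, s) - eta * ||x - s||^2 ].
    Completing the square turns this into the projection of c(s) = s + Rs/(2*eta)
    onto the simplex, where (Rs)[a] = r(a, s)."""
    return project_to_simplex(s + (R @ s) / (2 * eta))
```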
We therefore can reduce the continuity of f to the question "if we move a point p in Rn continuously and we have a convex compact subset D⊂Rn, does the closest point to p in D also vary continuously?" It turns out that this claim is true by a simple argument involving the parallelogram identity
|x+y|^2+|x-y|^2=2|x|^2+2|y|^2
for the Euclidean norm. If we let p1,p2 be two points and d1,d2 be the two closest points to them in D respectively, we can take x=p1−d1, y=p2−d2 to obtain
|p_1-d_1+p_2-d_2|^2+|(p_1-p_2)-(d_1-d_2)|^2=2|p_1-d_1|^2+2|p_2-d_2|^2
Now, suppose that there were some L>0 such that we could find p2 arbitrarily close to p1 with |d1−d2|>L. Then, we can ensure that
\left|\, p_1-\frac{d_1+d_2}{2} \,\right|^2 \le \frac{|p_1-d_1|^2+|p_2-d_2|^2}{2}-\frac{L^2}{4}+\varepsilon
where we can make the absolute value of ε as small as we please by moving p2 closer to p1. Furthermore, if p1 and p2 are close, then the minimum distance of p2 to D cannot be much different from that of p1 (note that we don't yet know this about d1,d2 themselves!), so in fact we can take the right hand side to be
\left|\, p_1-\frac{d_1+d_2}{2} \,\right|^2 \le |p_1-d_1|^2-\frac{L^2}{4}+\varepsilon
again up to some small error ε which we can make as small as we like in absolute value. This is now obviously a contradiction, because we had assumed d1 to be the closest point to p1 in D, but we have proven that the average of d1 and d2, which is in D by convexity, is even closer to p1. So our original assumption must have been wrong, and the closest point of D varies continuously with p. Since f(s) is the closest point of S to the continuously varying point c(s), this proves the continuity of f.
The third condition
Now, let's say that we have a fixed point of f; in other words, we have a strategy s satisfying
s=\operatorname{argmax}_x \left[\, r(x,s)-\eta\, D_{L^2}(s\,\|\,x) \,\right]
It turns out it's easy to prove that s must give rise to a symmetric Nash equilibrium. Indeed, suppose that it did not. In that case, there would be a strategy u that gets a strictly positive reward against s, and by linearity of r the derivative of the function t→r(s+t(u−s),s) would be strictly positive for all t∈(0,1). However, the derivative of t→D_{L^2}(s∥s+t(u−s)) is zero at t=0 and small in a neighborhood of t=0, which implies the whole expression
r(s+t(u-s),\,s)-\eta\, D_{L^2}(s\,\|\,s+t(u-s))
would have strictly positive derivative in some neighborhood of t=0. This would mean the function is increasing, contradicting the fact that the maximum occurs at x=s. Therefore our initial assumption must have been false, and s indeed gives rise to a symmetric Nash equilibrium.
Finishing the proof
We've checked all of the necessary conditions, so now we can finish the proof. f:S→S as we've defined it earlier is a continuous function from a convex compact set to itself, so by Brouwer's fixed point theorem it must have a fixed point s. By the third condition, this fixed point gives rise to a symmetric Nash equilibrium (s,s)∈S2 of the original game, and we're done!
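Brouwer's theorem only guarantees that a fixed point exists; it doesn't tell us how to find one, and nothing guarantees that naively iterating f converges. Still, for rock-paper-scissors we can at least verify the fixed-point property directly, reusing f and project_to_simplex from the sketch above (again, just an illustrative check):

```python
import numpy as np

R = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])   # rock-paper-scissors
uniform = np.ones(3) / 3

# R @ uniform is the zero vector, so f just projects uniform back onto S.
assert np.allclose(f(uniform, R), uniform)

# The fixed point is a symmetric Nash equilibrium: no action scores
# strictly positively against the uniform metagame.
assert (R @ uniform).max() <= 1e-12
```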
Convergence to symmetric Nash equilibria
One of the well-known facts about symmetric zero-sum games is that they tend to develop a "metagame", which under circumstances of unbounded rationality on the part of the players would have to be a symmetric Nash equilibrium. This is most apparent in complex card games such as Magic: The Gathering or Hearthstone, especially as these games involve a substantial element of randomness which allows for interesting exploration dynamics and rich structure in Nash equilibria. However, it can be seen even in simple games such as rock-paper-scissors, the prototypical example of a nontrivial symmetric zero-sum game.
One question we might ask is whether the intuitive process by which everyone continuously optimizes their strategy to become better against the current metagame would naturally lead to convergence to a Nash equilibrium. To see that this is not trivial, consider that if we operate in a discrete-time setup in rock-paper-scissors and the current metagame involves everyone playing rock, then in the next time step everyone will be playing paper, then scissors, and so on. So it's not obvious that the dynamics of everyone trying to beat the metagame don't lead to cyclic behavior of this kind.
The result we might hope for instead is convergence to a Nash equilibrium when updates are slow: this is reminiscent of the L2 error term we used in the earlier proof of Theorem 1, which has a similar effect of slowing updates down. Namely, instead of everyone suddenly updating to the strategy that worked best against the old metagame, what if we assume the updates made by players are in some sense bounded?
It turns out this is indeed the case under some mild assumptions:
Theorem 2. Suppose that a continuum of players indexed by [0,1] are playing a two-player symmetric zero-sum game in which players are matched randomly with each other. If players update their strategies continuously at a fixed rate η>0 per unit time, and every update is made to one of the deterministic strategies with the best matchup against the current metagame, chosen uniformly at random among them, then the metagame eventually converges to a Nash equilibrium.
Proof. The idea is to simply leverage the property of symmetric zero-sum games that distance to a Nash equilibrium can be measured by "exploitability", in other words, the worst possible matchup a strategy has. This is not always possible with general finite games but it's possible here.
Let the metagame be denoted by π∈S: it is the probability distribution over all deterministic strategies a player expects to face in a random matchup. At any given time, there is a set of actions E(π)⊂A that have the best possible performance against the current metagame π. Say that any strategy in E(π) scores eπ≥0 against π: this is the exploitability of the current metagame.
Given that the matchup of the metagame against itself has score 0 by definition, the average player is playing a strategy that scores even against the metagame. The assumption we made implies that a fraction ηdt of the players update their strategies in an infinitesimal time interval dt, and they update towards the best strategies.
As the game is finite, there is a nonzero gap between the best strategies and the second-best strategies in how well they perform against the current metagame, so a sufficiently small update leaves the set of best responses unchanged. After an infinitesimal update the metagame becomes (1−ηdt)π+ηdt·β, where β is the uniform mixture over E(π), so a strategy in E(π) now scores (1−ηdt)eπ plus ηdt times its reward against β. This means the exploitability of the current metagame follows the law
\frac{de_\pi}{dt}=-\eta\, e_\pi
whenever E(π) is left unchanged by the update. If some strategies in E(π) have strictly positive performance against E(π) as a whole, which is possible, then the update will make E(π) smaller, and this will continue until all strategies in E(π) have even matchups against a strategy chosen uniformly at random from E(π). When that happens, E(π) will remain fixed for some nonzero amount of time by the finiteness of the game. Since we're working in continuous time, we can safely ignore the time interval of zero measure in which this happens, and so eπ→0 as t→∞ with an exponential decay.
Importantly, note that the above proof implies the existence of a symmetric Nash equilibrium! Otherwise, it would be impossible for the exploitability to show such an exponential decay to zero. Indeed, we can show this formally by picking a sequence πk of metagames from the continuous trajectory and using the compactness of S to find a convergent subsequence. The limit π of this subsequence has zero exploitability, so (π,π) must be a symmetric Nash equilibrium of the original game!
It turns out the assumption of continuous time is crucial in this proof. If we assume that updates take place in discrete time, then the problem is that a sufficiently large update to the metagame π can actually change which strategies are good against it.
For instance, in rock-paper-scissors, when we start with a metagame consisting only of rock players, and everyone updates to playing paper, we actually skipped over the strategy of playing scissors: the exploitability of the new metagame is no longer its matchup against paper, but against scissors. This happens whenever the updates we make are too large in discrete time, but the issues go away if we work in continuous time instead.
Here is an example of what this process looks like in discrete time for rock-paper-scissors:
The smaller the step sizes, the closer we get to the Nash equilibrium, which is unique in this particular case but need not be in general. You can also see that we indeed get a nice exponential decay in exploitability, as the above result should lead us to expect, at least until the oscillations become too small for the step size to resolve.
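If you want to reproduce dynamics like these yourself, here's a minimal simulation sketch; the step size and iteration count are arbitrary illustrative choices of mine rather than the ones behind the plots above:

```python
import numpy as np

R = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])   # rock-paper-scissors

def best_response_step(pi, eta):
    """Move a fraction eta of the population onto the best responses to pi,
    spread uniformly over the set E(pi) of best responses."""
    scores = R @ pi                    # scores[a] = r(a, pi)
    best = scores == scores.max()      # indicator of E(pi)
    return (1 - eta) * pi + eta * best / best.sum()

pi = np.array([1.0, 0.0, 0.0])         # start from an all-rock metagame
for _ in range(200):
    pi = best_response_step(pi, eta=0.05)

print(pi)                # close to the uniform equilibrium (1/3, 1/3, 1/3)
print((R @ pi).max())    # exploitability: decays until limited by the step size
```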
In practice, with small step sizes in a discrete time environment, we can also prove a similar result that shows that we can get an approximate Nash equilibrium, as the above plots display for rock-paper-scissors. The details of working out how to do this are left to the reader for the time being, though I could come back to this post at some future date to flesh this out more.
The theoretical results are encouraging because they confirm the folk intuition about how metagames work: we can limit the extent to which cycling problems appear by assuming that people are slow to update their strategies, which seems fairly realistic for most games. However, unfortunately they don't generalize as readily to games that are not zero-sum, as in these games the analogue of exploitability need not decline when some players switch to more efficient strategies against the current metagame, even on the margin. This makes it trickier to establish the existence of Nash equilibria for such games by these elementary dynamics, and it's why more powerful methods such as the ones in my first proof of Theorem 1 are useful.