Acknowledgements:
This article is a writeup of a research project conducted through the SERI program under the mentorship of Alex Turner. I (Jacob Stavrianos) would like to thank Alex for turning a messy collection of ideas into legitimate research, as well as the wonderful researchers at SERI for guiding the project and putting me in touch with the broader X-risk community.
Motivation/Overview
In the single-agent setting, Seeking Power is Often Robustly Instrumental in MDPs showed that optimal policies tend to choose actions which pursue "power" (reasonably formalized). In the multi-agent setting, the Catastrophic Convergence Conjecture presented intuitions that "most agents" will "fight over resources" when they get "sufficiently advanced." However, it wasn't clear how to formalize that intuition.
This post synthesizes single-agent power dynamics (which we believe are now somewhat well-understood in the MDP setting) with the multi-agent setting. The multi-agent setting is important for AI alignment, since we want to reason clearly about when AI agents disempower humans. Assuming constant-sum games (i.e. maximal misalignment between agents), this post presents a result which echoes the intuitions in the Catastrophic Convergence Conjecture post: as agents become "more advanced", "power" becomes increasingly scarce & constant-sum.
An illustrative example
You're working on a project with a team of your peers. In particular, your actions affect the final deliverable, but so do those of your teammates. Say that each member of the team (including you) has some goal for the deliverable, which we can express as a reward function over the set of outcomes. How well (in terms of your reward function) can you expect to do?
It depends on your teammates' actions. Let's first ask "given my teammates' actions, what's the highest expected reward I can attain?"
Case 1: Everyone plays nice
We can start by imagining the case where everyone does exactly what you'd want them to do. Mathematically, this allows you to obtain the globally maximal reward, or "the best possible reward assuming you can choose everyone else's actions". Intuitively, this looks like your team sitting you down for a meeting, asking what you want them to do for the project, and carrying out orders without fail. As expected, this case is "the best you can hope for" in a formal sense.
Case 2: Everyone plays mean
Now, imagine the case where everyone does exactly what you don't want them to do. Mathematically, this is the worst possible case; every other choice of teammates' actions is at least as good as this one. Intuitively, this case is pretty terrible for you. Imagine the previous case, but instead of following orders your team actively sabotages them. Alternatively, imagine that your team spends the meeting breaking your knees and your laptop.
Case 3: Somewhere in between
However, scenarios where your team is perfectly aligned either with or against you are rare. More typically, we model people as maximizing their own reward, with imperfect correlation between reward functions. Interpreting our example as a multi-player game, we can consider the case where the players' strategies form a Nash equilibrium: every person's action is optimal for themselves given the actions of the rest of their team. This case is both relatively general and structured enough to make claims about; we will use it as a guiding example for the formalism below.
POWER, and why it matters
Many attempts have been made to classify the robustly instrumental goals of AI agents. The aims are to understand why such goals emerge from seemingly unrelated utilities and, ultimately, to counterbalance (either implicitly or explicitly) undesirable robustly instrumental subgoals. One promising such attempt is based on POWER (the technical term is all-caps to distinguish it from normal use of the word): consider an agent with some space of actions, which receives rewards depending on the chosen actions (formally, an agent in an MDP). Then, POWER is roughly "ability to achieve a wide variety of goals". It's been shown that POWER is robustly instrumental given certain conditions on the environment, but currently no formalism exists describing the power of different agents interacting with each other.
Since we'll be working with POWER for the rest of this post, we need a solid definition to build off of. We present a simplified version of the original definition:
Consider a scenario in which an agent has a set of actions $A$ and a distribution $\mathcal{D}$ of reward functions $R : A \to \mathbb{R}$. Then, we define the POWER of that agent as
$$\text{POWER}_{\mathcal{D}} := \mathbb{E}_{R \sim \mathcal{D}}\left[\max_{a \in A} R(a)\right].$$
As an example, we can rewrite the project example from earlier in terms of POWER. Let your goal for the project be chosen from some distribution (maybe you want it done nicely, or fast, or to feature some cool thing that you did, etc). Then, your POWER is the maximum extent to which you can accomplish that goal, in expectation.
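To make the single-agent definition concrete, here is a minimal sketch that treats POWER as the expected best-case reward over a finite distribution of goals. The action names, goal probabilities, and reward numbers are all hypothetical, chosen only to mirror the project example.

```python
from fractions import Fraction


def power(goal_distribution, actions):
    """Expected best-case reward: average, over goals, of the best achievable reward.

    goal_distribution: list of (probability, reward) pairs, where `reward`
    maps each action to a number.
    """
    return sum(
        prob * max(reward[a] for a in actions)
        for prob, reward in goal_distribution
    )


# Hypothetical project: three things you might do, three goals you might have.
actions = ["polish", "ship_fast", "showcase"]
goals = [
    (Fraction(1, 2), {"polish": 3, "ship_fast": 1, "showcase": 0}),  # "done nicely"
    (Fraction(1, 4), {"polish": 0, "ship_fast": 2, "showcase": 1}),  # "done fast"
    (Fraction(1, 4), {"polish": 1, "ship_fast": 0, "showcase": 4}),  # "feature my work"
]

project_power = power(goals, actions)


def fixed_action_value(action):
    """Expected reward if you had to commit to one action before learning your goal."""
    return sum(prob * reward[action] for prob, reward in goals)


best_fixed = max(fixed_action_value(a) for a in actions)
```

In this toy setup `project_power` exceeds `best_fixed`, because POWER lets you pick the best action after learning which goal you have.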
However, this model of power can't account for the actions of other agents in the environment (what about what your teammates do? Didn't we already show that it matters a lot?). To say more about the example, we'll need a generalization of POWER.
Multi-agent POWER
We now consider a more realistic scenario: not only are you an agent with a notion of reward and POWER, but so is everyone else, all playing the same multiplayer game. We can even revisit the project example and go through the cases for your teammates' actions in terms of POWER:
- In Case 1, your team works to maximize your reward in every case, which (with some assumptions) maximizes your POWER over the space of all choices of teammate actions.
- In Case 2, your team works to minimize your reward in every case, which analogously minimizes your POWER.
- In Case 3, we have a Nash equilibrium of the game used to define multi-agent POWER. In particular, each player's action is a best response to the actions of every other player. We'll see a parallel between this best-response property and the $\max$ term in the definition of POWER pop up in the discussion of constant-sum games.
Bayesian games
To extend our formal definition of power to the multi-agent case, we'll need to define a type of multiplayer normal-form game called a Bayesian game. We describe them below:
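- At the beginning of the game, each of $n$ players is assigned a type $t_i$ from a joint type distribution $T$ over type profiles $t = (t_1, \dots, t_n)$. The distribution is common knowledge.
- The players then (independently, not sequentially) choose actions $a_i \in A_i$, resulting in an action profile $a = (a_1, \dots, a_n)$.
- Player $i$ then receives reward $r_i(t_i, a)$ (crucially, a player's reward can depend on their type).
Strategies (technically, mixed strategies) in a Bayesian game are given by functions $\sigma_i : T_i \to \Delta(A_i)$, where $T_i$ is player $i$'s set of possible types and $\Delta(A_i)$ is the set of distributions over their actions. Thus, even given a fixed strategy profile $\sigma = (\sigma_1, \dots, \sigma_n)$, any notion of "expected reward of an action" will have to account for uncertainty in other players' types. We do so by defining interim expected utility for player $i$ as follows:
$$f_i(t_i, a_i, \sigma_{-i}) := \mathbb{E}\left[r_i(t_i, (a_i, a_{-i}))\right]$$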
where the expectation is taken over the following:
- the posterior distribution over opponents' types - in other words, what types you expect other players to have, given your type.
- random choice of opponents' actions - even if you know someone's type, they might implement a mixed strategy which stochastically selects actions.
Further, we can define a (Bayesian) Nash Equilibrium to be a strategy profile where each player's strategy is a best response to opponents' strategies in terms of interim expected utility.
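As a rough illustration of interim expected utility and best responses, the sketch below works through a two-player example in which your type is correlated with your teammate's. Every name and number here (the types, actions, rewards, and the teammate's strategy) is a hypothetical choice made for illustration, not part of the formalism above.

```python
from fractions import Fraction as F

# Hypothetical joint type distribution over (your_type, teammate_type).
JOINT_TYPES = {
    ("early", "friend"): F(3, 8), ("early", "foe"): F(1, 8),
    ("late", "friend"): F(1, 8), ("late", "foe"): F(3, 8),
}
YOUR_ACTIONS = ("work", "slack")

# Teammate's strategy: type -> distribution over actions (pure here, but
# written in mixed-strategy form).
TEAMMATE_STRATEGY = {
    "friend": {"help": F(1), "hinder": F(0)},
    "foe": {"help": F(0), "hinder": F(1)},
}

BASE_REWARD = {
    ("work", "help"): 4, ("work", "hinder"): 0,
    ("slack", "help"): 2, ("slack", "hinder"): 1,
}


def your_reward(your_type, your_action, teammate_action):
    """Your reward depends on your own type: working is costlier when 'late'."""
    penalty = 1 if (your_type == "late" and your_action == "work") else 0
    return BASE_REWARD[(your_action, teammate_action)] - penalty


def interim_utility(your_type, your_action):
    """Expected reward of an action, given your type and the teammate's strategy."""
    # Posterior over the teammate's type, conditioned on your own type.
    consistent = {t: p for t, p in JOINT_TYPES.items() if t[0] == your_type}
    evidence = sum(consistent.values())
    total = F(0)
    for (_, teammate_type), p in consistent.items():
        posterior = p / evidence
        # Average over the teammate's (possibly random) action.
        for teammate_action, q in TEAMMATE_STRATEGY[teammate_type].items():
            total += posterior * q * your_reward(your_type, your_action, teammate_action)
    return total


def best_response(your_type):
    """Action maximizing interim expected utility for the given type."""
    return max(YOUR_ACTIONS, key=lambda a: interim_utility(your_type, a))
```

In this example the best response changes with your type: being "early" makes a friendly teammate more likely, so working pays off, while being "late" makes slacking the better choice.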
Formal definition of multi-agent POWER
We can now define POWER in terms of a Bayesian game:
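Fix a strategy profile $\sigma$. We define player $i$'s POWER as
$$\text{POWER}(i, \sigma) := \mathbb{E}_{t_i}\left[\max_{a_i \in A_i} f_i(t_i, a_i, \sigma_{-i})\right],$$
where $f_i$ is the interim expected utility defined above.
Intuitively, POWER is the maximum (expected) reward given a distribution of possible goals. The difference from the single-agent case is that your reward is now influenced by other players' actions (by taking an expectation over opponents' strategies).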
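The following sketch computes multi-agent POWER as "expected reward of your best reply to a fixed opponent strategy, averaged over your own goals". It assumes two players with independent types (so the posterior over the opponent's type is just its prior); the types, actions, and reward table are hypothetical.

```python
from fractions import Fraction as F

# Hypothetical reward table for "you": (your_type, your_action, teammate_action) -> reward.
REWARD = {
    ("quality", "refine", "refine"): 4, ("quality", "refine", "rush"): 2,
    ("quality", "rush", "refine"): 1, ("quality", "rush", "rush"): 0,
    ("speed", "refine", "refine"): 0, ("speed", "refine", "rush"): 1,
    ("speed", "rush", "refine"): 3, ("speed", "rush", "rush"): 4,
}
YOUR_TYPES = {"quality": F(1, 2), "speed": F(1, 2)}
YOUR_ACTIONS = ("refine", "rush")
TEAMMATE_TYPES = {"teammate": F(1)}  # a single type, for simplicity


def interim_utility(your_type, your_action, teammate_strategy):
    """Expected reward of one action against the teammate's mixed strategy."""
    return sum(
        p_type * p_action * REWARD[(your_type, your_action, teammate_action)]
        for teammate_type, p_type in TEAMMATE_TYPES.items()
        for teammate_action, p_action in teammate_strategy[teammate_type].items()
    )


def power(teammate_strategy):
    """Average over your types of the best interim expected utility you can reach."""
    return sum(
        p * max(interim_utility(t, a, teammate_strategy) for a in YOUR_ACTIONS)
        for t, p in YOUR_TYPES.items()
    )


always_refine = {"teammate": {"refine": F(1), "rush": F(0)}}
always_rush = {"teammate": {"refine": F(0), "rush": F(1)}}
coin_flip = {"teammate": {"refine": F(1, 2), "rush": F(1, 2)}}

powers = {
    "always_refine": power(always_refine),
    "always_rush": power(always_rush),
    "coin_flip": power(coin_flip),
}
```

Note that your own strategy never enters `power`: only the teammate's strategy is held fixed, and you are assumed to reply optimally for each goal.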
Properties of constant-sum games
As both a preliminary result and a reference point for intuition, we consider the special case of zero-sum games:
A zero-sum game is a game in which, for every possible outcome of the game, the sum of the players' rewards is zero. For Bayesian games, this means that for all type profiles $t$ and action profiles $a$, we have $\sum_i r_i(t_i, a) = 0$. Similarly, a constant-sum game is a game satisfying $\sum_i r_i(t_i, a) = c$ for some constant $c$ and any choices of $t, a$.
As a simple example, consider chess, a two-player adversarial game. We let the reward profile be constant, given by "1 if you win, -1 if you lose" (assume black wins in a tie). This game is clearly zero-sum, since exactly one player wins and the other loses. We could ask the same "how well can you do?" question as before, but the upper bound of winning is trivial. Instead, we ask "how well can both players simultaneously do?"
Clearly, you can't both simultaneously win. However, we can imagine scenarios where both players have the power to win: in a chess game between two beginners, the optimal strategy for either player will easily win the game. As it turns out, this argument generalizes (we'll even prove it): in a constant-sum game, the sum of the players' POWER is at least $c$, with equality iff each player responds optimally for all their possible goals ("types"). This condition is equivalent to a Bayesian Nash Equilibrium of the game.
Importantly, this idea suggests a general principle of multi-agent POWER I'll call power-scarcity: in multi-agent games, gaining POWER tends to come at the expense of another player losing POWER. Future research will focus on understanding this phenomenon further and relating it to "how aligned the agents are" in terms of their reward functions.
Claim: Consider a Bayesian constant-sum game with some strategy profile $\sigma$. Then, $\sum_i \text{POWER}(i, \sigma) \geq c$, with equality iff $\sigma$ is a Nash Equilibrium.
Intuition: By definition, $\sigma$ isn't a Nash Equilibrium iff some player $i$'s strategy $\sigma_i$ isn't a best response. In this case, we see that player $i$ has the power to play optimally, but the other players also have the power to capitalize off of player $i$'s mistake (since the game is constant-sum). Thus, the lost reward is "double-counted" in terms of POWER; if no such double-counting exists, then the sum of POWER is just the expected sum of reward, which is $c$ by definition of a constant-sum game.
Rigorous proof:
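We prove the following for general strategy profiles $\sigma$:
$$\begin{aligned}
\sum_i \text{POWER}(i, \sigma) &= \sum_i \mathbb{E}_{t_i}\left[\max_{a_i} f_i(t_i, a_i, \sigma_{-i})\right] \\
&\geq \sum_i \mathbb{E}_{t_i}\left[\mathbb{E}_{a_i \sim \sigma_i(t_i)}\, f_i(t_i, a_i, \sigma_{-i})\right] \\
&= \mathbb{E}_{t \sim T,\ a \sim \sigma(t)}\left[\sum_i r_i(t_i, a)\right] \\
&= c.
\end{aligned}$$
Now, we claim that the inequality on line 2 is an equality iff $\sigma$ is a Nash Equilibrium. To see this, note that for each $i$, we have
$$\mathbb{E}_{t_i}\left[\max_{a_i} f_i(t_i, a_i, \sigma_{-i})\right] \geq \mathbb{E}_{t_i}\left[\mathbb{E}_{a_i \sim \sigma_i(t_i)}\, f_i(t_i, a_i, \sigma_{-i})\right]$$
with equality iff $\sigma_i$ is a best response to $\sigma_{-i}$. Thus, the sum of these inequalities for each player is an equality iff each $\sigma_i$ is a best response, which is the definition of a Nash Equilibrium.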
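As a sanity check on the claim, the sketch below evaluates it in matching pennies, a zero-sum game (so the constant is 0). This is a degenerate Bayesian game in which each player has a single type, an assumption made to keep the example small; the helper names are hypothetical.

```python
from fractions import Fraction as F

ACTIONS = ("heads", "tails")


def reward(player, a0, a1):
    """Matching pennies: player 0 wins on a match, player 1 on a mismatch."""
    r0 = 1 if a0 == a1 else -1
    return r0 if player == 0 else -r0


def expected_reward(player, own_action, opp_mix):
    """Expected reward of a pure action against the opponent's mixed strategy."""
    total = F(0)
    for opp_action, prob in opp_mix.items():
        a0, a1 = (own_action, opp_action) if player == 0 else (opp_action, own_action)
        total += prob * reward(player, a0, a1)
    return total


def power(player, sigma):
    """Best achievable expected reward, holding the opponent's strategy fixed."""
    return max(expected_reward(player, a, sigma[1 - player]) for a in ACTIONS)


def is_nash(sigma):
    """True iff each player's actual expected reward already equals their POWER."""
    for player in (0, 1):
        achieved = sum(
            p * expected_reward(player, a, sigma[1 - player])
            for a, p in sigma[player].items()
        )
        if achieved < power(player, sigma):
            return False
    return True


def mix(p_heads):
    return {"heads": p_heads, "tails": 1 - p_heads}


nash = (mix(F(1, 2)), mix(F(1, 2)))
# Player 0 over-plays heads and player 1 fails to exploit it.
sloppy = (mix(F(3, 4)), mix(F(1, 2)))

nash_total = power(0, nash) + power(1, nash)
sloppy_total = power(0, sloppy) + power(1, sloppy)
```

At the uniform equilibrium the POWER total equals the constant 0. In the `sloppy` profile, player 1 could gain by exploiting player 0's bias, so total POWER rises above 0 even though realized rewards still sum to 0.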
Final notes
To wrap up, I'll elaborate on the implications of this theorem, as well as some areas of further exploration on power-scarcity:
- It initially seems unintuitive that as players' strategies improve, their collective POWER tends to decrease. The proximate cause of this effect is something like "as your strategy improves, other players lose the power to capitalize off of your mistakes". More work is probably needed to get a clearer picture of this dynamic.
- We suspect that if all players have identical rewards, then the sum of POWER is equal to the sum of best-case POWER for each player. This gives the appearance of a spectrum with [aligned rewards (common payoff), maximal sum power] on one end and [anti-aligned rewards (constant-sum), constant sum power] on the other. Further research might look into an interpolation between these two extremes, possibly characterized by a correlation metric between reward functions.
- We also plan to generalize POWER to Bayesian stochastic games to account for sequential decision making. Thus, any such metric for comparing reward functions would have to be consistent with such a generalization.
- POWER-scarcity results in terms of Nash Equilibria suggest the following dynamic: as agents get smarter and take available opportunities, POWER becomes increasingly scarce. This matches the intuitions presented in the Catastrophic Convergence Conjecture, where agents don’t fight over resources until they get sufficiently “advanced.”
Thanks for the detailed reply!
I want to go a bit deeper into the fine points, but my general reaction is "I wanted that in the post". You make a pretty good case for a way to come around at this definition that makes it particularly exciting. On the other hand, I don't think that stating a definition and proving a single theorem that has the "obvious" quality (whether or not it is actually obvious, mind you) is that convincing.
The best way to describe my interpretation is that I feel that you two went for the "scientific paper" style, but the current state of the research, as well as the argument for its value, fits more the "here's-a-cool-formal-idea blogpost or workshop paper" style. And that's independent of the importance of the result. To say it again differently, I'm ready to accept the importance of a formalism without much explanation of why I should care if it shows a lot of cool results, but when the results are few, I need a more detailed story of why I should care.
About your specific story now:
Nothing to say here, except that you have the frustrating (for me) ability to make me want to read 5 of your posts in detail when explaining something completely different. I am also supposed to do my own research, you know? (Related: I'd be excited to review one of your posts with the review project we're doing with a bunch of other researchers. Not sure which post of yours would be most appropriate though. If you have some idea, you can post it here. ;) )
When phrased that way, I think my "issue" is that the subtlety you add is mostly hidden within the additional parameter of the strategy profile. That is, with the original intuition, you don't have to find out what the other players will actually do; here you kind of have to. It's a good thing as I agree with you that it makes the intuition subtler, but it also creates a whole new complex problem of inferring strategies.
At this point, I went to reread the last sections, and realized that you're partially dealing with my problem by linking power with well-known strategy profiles (the Nash equilibria).
This part pushed me to reread the statements in detail. If I get it correctly, you had the intuition that the power behaved like "will this player win", whereas it actually works as "keeping everything else fixed, how well can this player end up". The trick that makes the theorem true and the power bigger than the sum is that for a strategy profile that isn't a Nash equilibrium, multiple players might get a lot if they change their action in turn while keeping everything else fixed.
I'm a bit ashamed, because that's actually explained in the intuition of the proof, but I didn't get it on the first reading. I also see now that it was the point of the discussion before the theorem, but that part flew over my head. So my advice for this would be to explain in even more detail the initial intuition and why it is wrong, including where in the maths this happens (the fixing of $\sigma_{-i}$).
My updated take after getting this point is that I'm a bit more excited about your formalism.
I agree that this is exciting, but this is only mentioned in the last line of the post, as one perspective among others. Notably, it wasn't clear at all that this was the main application of this work.