In my experience, constant-sum games are considered to provide "maximally unaligned" incentives, and common-payoff games are considered to provide "maximally aligned" incentives. How do we quantitatively interpolate between these two extremes? That is, given an arbitrary payoff table representing a two-player normal-form game (like Prisoner's Dilemma), what extra information do we need in order to produce a real number quantifying agent alignment?
If this question is ill-posed, why is it ill-posed? And if it's not, we should probably understand how to quantify such a basic aspect of multi-agent interactions, if we want to reason about complicated multi-agent situations whose outcomes determine the value of humanity's future. (I started considering this question with Jacob Stavrianos over the last few months, while supervising his SERI project.)
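To make the setup concrete, here's a minimal sketch of the two extremes as checks on a pair of payoff matrices (the representation and the standard Prisoner's Dilemma numbers are just convenient choices, nothing canonical):

```python
import numpy as np

# A two-player normal-form game as a pair of payoff matrices indexed by
# (row player's action, column player's action). Prisoner's Dilemma with
# actions 0 = Cooperate, 1 = Defect.
payoff_A = np.array([[3, 0],
                     [5, 1]])   # row player's payoffs
payoff_B = np.array([[3, 5],
                     [0, 1]])   # column player's payoffs

def is_constant_sum(A, B):
    """True if every outcome yields the same total payoff ("maximally unaligned")."""
    totals = A + B
    return np.allclose(totals, totals.flat[0])

def is_common_payoff(A, B):
    """True if both players always receive identical payoffs ("maximally aligned")."""
    return np.allclose(A, B)

print(is_constant_sum(payoff_A, payoff_B))   # False
print(is_common_payoff(payoff_A, payoff_B))  # False -- PD sits somewhere in between
```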
Thoughts:
- Assume the alignment function has range [0, 1] or [−1, 1].
- Constant-sum games should have minimal alignment value, and common-payoff games should have maximal alignment value.
- The function probably has to consider a strategy profile (since different parts of a normal-form game can have different incentives; see e.g. equilibrium selection).
- The function should probably measure player A's alignment with player B specifically (a directed notion); for example, in a prisoner's dilemma, player A might always cooperate and player B might always defect. Then it seems reasonable to say that A is aligned with B (in some sense), while B is not aligned with A (B pursues their own payoff without regard for A's payoff).
- So the function need not be symmetric over players.
- The function should be invariant to applying a separate positive affine transformation to each player's payoffs; it shouldn't matter whether you add 3 to player 1's payoffs, or multiply them by one half (see the property-test sketch after this list).
- The function may or may not rely only on the players' orderings over outcome lotteries, ignoring the cardinal payoff values. I haven't thought much about this point, but it seems important. EDIT: I no longer think this point is important, but rather confused.
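For instance, the affine-invariance constraint can be phrased as a property test against any candidate alignment function; the signature `alignment_fn(A, B, profile) -> float` below is just a placeholder assumption, not a proposal:

```python
import numpy as np

def is_affine_invariant(alignment_fn, A, B, profile, trials=100, seed=0):
    """Property test: a candidate alignment function should return the same value
    when each player's payoffs are put through a separate positive affine
    transformation (a positive scale factor plus an arbitrary shift)."""
    rng = np.random.default_rng(seed)
    baseline = alignment_fn(A, B, profile)
    for _ in range(trials):
        a1, a2 = rng.uniform(0.1, 10.0, size=2)    # positive scale factors
        b1, b2 = rng.uniform(-10.0, 10.0, size=2)  # arbitrary shifts
        if not np.isclose(alignment_fn(a1 * A + b1, a2 * B + b2, profile), baseline):
            return False
    return True
```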
If I were interested in thinking about this more right now, I would:
- Do some thought experiments to pin down the intuitive concept. Consider simple games where my "alignment" concept returns a clear verdict, and use these to derive functional constraints (like symmetry in players, or the range of the function, or the extreme cases).
- See if I can get enough functional constraints to pin down a reasonable family of candidate solutions, or at least pin down the type signature.
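One guess at the type signature implied by the constraints so far (an assumption, not a settled proposal): alignment is directed from one player toward another, and depends on both payoff matrices plus a strategy profile.

```python
from typing import Callable, Tuple
import numpy as np

# One mixed strategy per player (a probability distribution over that player's actions).
Profile = Tuple[np.ndarray, np.ndarray]

# alignment(A, B, profile, of_player, with_player) -> real number in the chosen range.
AlignmentFn = Callable[[np.ndarray, np.ndarray, Profile, int, int], float]
```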
Quick sketch of an idea (written before deeply digesting others' proposals):
Intuition: Just like player 1 has a best response (starting from a strategy profile s, improve her own utility as much as possible), she also has an altruistic best response (which maximally improves the other player's utility).
Example: stag hunt. If we're at (rabbit, rabbit), then both players are perfectly aligned. Even if player 1 were infinitely altruistic, she couldn't unilaterally cause a better outcome for player 2.
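Here's a sketch of both notions on that stag hunt (the payoff numbers are one conventional choice; nothing hinges on them):

```python
import numpy as np

# Stag hunt, actions 0 = stag, 1 = rabbit.
A = np.array([[4, 0],
              [3, 3]])   # row player (player 1)
B = np.array([[4, 3],
              [0, 3]])   # column player (player 2)

def best_response(A, col_action):
    """Player 1's ordinary best response: maximize her own payoff."""
    return int(np.argmax(A[:, col_action]))

def altruistic_best_response(B, col_action):
    """Player 1's 'altruistic best response': the unilateral move that maximally
    improves player 2's payoff (a direct reading of the intuition above)."""
    return int(np.argmax(B[:, col_action]))

# When player 2 plays rabbit, player 2 gets 3 no matter what player 1 does,
# so even an infinitely altruistic player 1 can't unilaterally help her.
print(B[:, 1])                          # [3 3]
print(altruistic_best_response(B, 1))   # index 0, but both rows leave player 2 at 3
```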
Definition: given a strategy profile s, an a-altruistic better response is any strategy of one player that gives the other player at least a units of extra utility for each point of utility that this player sacrifices.
Definition: player 1 is a-aligned with player 2 if player 1 doesn't have an x-altruistic better response for any x>a.
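One direct reading of these definitions for the row player at a pure strategy profile (treat the sign conventions as provisional; the function name is just mine):

```python
import numpy as np

def has_a_altruistic_better_response(A, B, s, a):
    """Does the row player, at pure profile s = (i, j), have a deviation that gives
    the column player at least `a` extra utility per point of utility she sacrifices?
    (One reading of the definition above; the sign conventions are provisional.)"""
    i, j = s
    for d in range(A.shape[0]):
        if d == i:
            continue
        gain = B[d, j] - B[i, j]        # column player's change in utility
        sacrifice = A[i, j] - A[d, j]   # utility the row player gives up
        if gain > 0 and sacrifice <= 0:
            return True                 # "free" utility for the other player
        if sacrifice > 0 and gain >= a * sacrifice:
            return True
    return False

# Prisoner's Dilemma at (Defect, Defect): cooperating costs the row player 1 point
# and gains the column player 4, so a 4-altruistic better response exists but a
# 5-altruistic one does not.
A = np.array([[3, 0], [5, 1]])
B = np.array([[3, 5], [0, 1]])
print(has_a_altruistic_better_response(A, B, (1, 1), a=4.0))   # True
print(has_a_altruistic_better_response(A, B, (1, 1), a=5.0))   # False
```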
- 0-aligned: non-spiteful player. They'll give "free" utility to other players if possible, but they won't sacrifice any amount of their own utility for the sake of others.
- c-aligned for c∈(0,1): slightly altruistic. Your happiness matters a little bit to them, but not as much as their own.
- 1-aligned: positive-sum maximizer. They'll give up their own utility as long as the total sum of utility increases.
- c-aligned for c∈(1,∞): subservient player. They'll optimize your utility with higher priority than their own.
- ∞-aligned: slave. They maximize others' utility, completely disregarding their own.
Obvious extension from players to strategy profiles: How altruistic would a player need to be before they would switch strategies?
On re-reading this, I see I messed up something with the direction of the signs. I don't have time to fix it now, but the idea is hopefully clear.
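Sign issues aside, here's one way the "how altruistic before they'd switch" question might be operationalized: the smallest weight the row player would need to place on the column player's utility, relative to her own, before some unilateral deviation becomes worthwhile. This is a hypothetical formalization, not a final answer.

```python
import numpy as np

def required_altruism_to_switch(A, B, s):
    """Smallest weight w on the column player's utility at which some unilateral
    deviation by the row player from pure profile s = (i, j) becomes worthwhile,
    i.e. w * (other's gain) exceeds her own sacrifice. Returns inf if no deviation
    ever helps the column player."""
    i, j = s
    threshold = np.inf
    for d in range(A.shape[0]):
        if d == i:
            continue
        gain = B[d, j] - B[i, j]        # column player's gain from the deviation
        sacrifice = A[i, j] - A[d, j]   # row player's own loss (can be negative)
        if gain > 0:
            threshold = min(threshold, max(sacrifice / gain, 0.0))
    return threshold

# Prisoner's Dilemma at (Defect, Defect): switching to Cooperate costs the row player
# 1 and gains the column player 4, so she'd switch once she weights the other's
# utility at more than 1/4 of her own.
A = np.array([[3, 0], [5, 1]])
B = np.array([[3, 5], [0, 1]])
print(required_altruism_to_switch(A, B, (1, 1)))   # 0.25
```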