In my experience, constant-sum games are considered to provide "maximally unaligned" incentives, and common-payoff games are considered to provide "maximally aligned" incentives. How do we quantitatively interpolate between these two extremes? That is, given an arbitrary payoff table representing a two-player normal-form game (like Prisoner's Dilemma), what extra information do we need in order to produce a real number quantifying agent alignment?
If this question is ill-posed, why is it ill-posed? And if it's not, we should probably understand how to quantify such a basic aspect of multi-agent interactions, if we want to reason about complicated multi-agent situations whose outcomes determine the value of humanity's future. (I started considering this question with Jacob Stavrianos over the last few months, while supervising his SERI project.)
Thoughts:
- Assume the alignment function has range [0, 1] or [-1, 1].
- Constant-sum games should have minimal alignment value, and common-payoff games should have maximal alignment value.
- The function probably has to consider a strategy profile (since different parts of a normal-form game can have different incentives; see e.g. equilibrium selection).
- The function should probably measure player A's alignment with player B, rather than a single quantity for the pair; for example, in a Prisoner's Dilemma, player A might always cooperate while player B always defects. Then it seems reasonable to say that A is aligned with B (in some sense), while B is not aligned with A (B pursues their own payoff without regard for A's).
- So the function need not be symmetric over players.
- The function should be invariant to applying a separate positive affine transformation to each player's payoffs; it shouldn't matter whether you add 3 to player 1's payoffs, or multiply the payoffs by a half.
- The function may or may not rely only on the players' orderings over outcome lotteries, ignoring the cardinal payoff values. I haven't thought much about this point, but it seems important. EDIT: I no longer think this point is important; rather, it seems confused.
If I were interested in thinking about this more right now, I would:
- Do some thought experiments to pin down the intuitive concept. Consider simple games where my "alignment" concept returns a clear verdict, and use these to derive functional constraints (like symmetry or asymmetry across players, the range of the function, or its value in the extreme cases).
- See if I can get enough functional constraints to pin down a reasonable family of candidate solutions, or at least pin down the type signature.
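To make these constraints concrete, here is one toy candidate of my own (a sketch, not anything established): the correlation between the two players' payoffs under the joint outcome distribution induced by a strategy profile. Correlation has range [-1, 1], is invariant to a separate positive affine transformation of each player's payoffs, equals -1 for constant-sum games (whenever payoffs vary), and +1 for common-payoff games. Note that it is symmetric in the players, so it fails the asymmetry desideratum above and is at best a starting point. The names `alignment`, `payoffs_a`, `payoffs_b`, `p`, `q` are all my own:

```python
from math import sqrt

def alignment(payoffs_a, payoffs_b, p, q):
    """Pearson correlation of the two players' payoffs under the
    product distribution of mixed strategies p (row) and q (column).
    Returns a value in [-1, 1], or None in degenerate cases."""
    cells = [(p[i] * q[j], payoffs_a[i][j], payoffs_b[i][j])
             for i in range(len(p)) for j in range(len(q))]
    mean_a = sum(w * a for w, a, _ in cells)
    mean_b = sum(w * b for w, _, b in cells)
    cov   = sum(w * (a - mean_a) * (b - mean_b) for w, a, b in cells)
    var_a = sum(w * (a - mean_a) ** 2 for w, a, _ in cells)
    var_b = sum(w * (b - mean_b) ** 2 for w, _, b in cells)
    if var_a == 0 or var_b == 0:
        return None  # degenerate: some player's payoff doesn't vary
    return cov / sqrt(var_a * var_b)
```

Under full-support mixing this gives 1 for a common-payoff game (identical payoff matrices) and -1 for a zero-sum game (negated matrices), with intermediate games like Prisoner's Dilemma landing strictly in between.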
I think "X and Y are playing a game of stag hunt" has multiple meanings.
The meaning generally assumed in game theory when considering just a single game is that the outcomes in the game matrix are utilities. In that case, I completely agree with Dagon: if on some occasion you prefer to pick "hare" even though you know I will pick "stag", then we are not actually playing the stag hunt game. (Because part of what it means to be playing stag hunt rather than some other game is that we both consider (stag,stag) the best outcome.)
But there are some other situations that might be described by saying that X and Y are playing stag hunt.
Maybe we are playing an iterated stag hunt. Then (by definition) what I care about is still some sort of aggregation of per-round outcomes, and (by definition) each round's outcome still has (stag,stag) best for me, etc. -- but now I need to strategize over the whole course of the game, and e.g. maybe I think that on a particular occasion choosing "hare" when you chose "stag" will make you understand that you're being punished for a previous choice of "hare" and make you more likely to choose "stag" in future.
Or maybe we're playing an iterated iterated stag hunt. Now maybe I choose "hare" when you chose "stag", knowing that it will make things worse for me over subsequent rounds, but hoping that other people looking at our interactions will learn the rule Don't Fuck With Gareth and never, ever choose anything other than "stag" when playing with me.
Or maybe we're playing a game in which the stag hunt matrix describes some sort of payouts that are not exactly utilities. E.g., we're in a psychology experiment and the experimenter has shown us a 2x2 table telling us how many dollars we will get in various cases -- but maybe I'm a billionaire and literally don't care whether I get $1 or $10 and figure I might as well try to maximize your payout, or maybe you're a perfect altruist and (in the absence of any knowledge about our financial situations) you just want to maximize the total take, or maybe I'm actually evil and want you to do as badly as possible.
In the iterated cases, it seems to me that the payout matrix still determines alignment given the iteration context -- how many games, with what opponents, with what aggregation of per-round utilities to yield overall utility (in prospect or in retrospect; the former may involve temporal discounting too). If I don't consider a long string of (stag,stag) games optimal then, again, we are not really playing (iterated) stag hunt.
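The aggregation mentioned above can be sketched minimally. Assuming exponential temporal discounting and some illustrative per-round numbers (a steady 4 per round for mutual stag, versus a one-off 3 from defecting followed by punishment down to 2 per round), the discounted sums make the "long string of (stag,stag)" claim checkable:

```python
def discounted_value(per_round_payoffs, gamma=0.9):
    # Aggregate a stream of per-round payoffs into one number;
    # gamma in (0, 1) is an assumed exponential discount factor.
    return sum(r * gamma ** t for t, r in enumerate(per_round_payoffs))

# Hypothetical 20-round iterated stag hunt:
cooperative = discounted_value([4] * 20)        # (stag, stag) every round
defecting   = discounted_value([3] + [2] * 19)  # defect once, then punished
```

With these numbers `cooperative > defecting`, i.e. the iteration context makes the long cooperative run optimal; different payoffs, horizons, or discount factors could flip that, which is exactly the sense in which the context co-determines alignment.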
In the payouts-aren't-really-utilities case, I think it does make sense to ask about the players' alignment, in terms of how they translate payouts into utilities. But ... it feels to me as if this is now basically separate from the actual game itself: the thing we might want to map to a measure of alignedness is something like the function from (both players' payouts) to (both players' utilities). The choice of game may then affect how readily unaligned preferences translate into unaligned actions, though. (In a game with (cooperate,defect) options where "cooperate" is always much better for the player choosing it than "defect", the payouts->utilities function would need to be badly anti-aligned, with players actively preferring to harm one another, in order to get uncooperative actions; in a prisoners' dilemma, it suffices that it not be strongly aligned; each player can slightly prefer the other to do better but still choose defection.)
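The last parenthetical can be made quantitative with a toy payouts->utilities function. Suppose (my assumption, for illustration) each player's utility is their own dollar payout plus `alpha` times the other's, where `alpha` measures how much they care about the other player. In a standard Prisoner's Dilemma, defection remains dominant for small `alpha` and only stops dominating once `alpha` is fairly large:

```python
A = [[3, 0], [5, 1]]  # row player's dollar payouts (action 0 = cooperate, 1 = defect)
B = [[3, 5], [0, 1]]  # column player's dollar payouts

def row_utility(r, c, alpha):
    # Utility = own payout + alpha * opponent's payout.
    # alpha is a hypothetical "caring" weight (0 = selfish, 1 = full altruist).
    return A[r][c] + alpha * B[r][c]

def defect_dominates(alpha):
    # Defection strictly dominates for the row player iff it beats
    # cooperation against both opponent actions.
    return all(row_utility(1, c, alpha) > row_utility(0, c, alpha)
               for c in (0, 1))

defect_dominates(0.2)  # -> True: slight care for the other, still defects
defect_dominates(0.8)  # -> False: strong enough care removes dominance
```

With these payouts, defection dominates exactly when `alpha < 1/4`, so a player can indeed "slightly prefer the other to do better but still choose defection".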