Consider any finite two-player game in normal form (each player can have any finite number of strategies; we can also easily generalize to certain classes of infinite games). Let $S_A$ be the set of pure strategies of player $A$ and $S_B$ the set of pure strategies of player $B$. Let $u_A : S_A \times S_B \to \mathbb{R}$ be the utility function of player $A$. Let $\sigma \in \Delta(S_A \times S_B)$ be a particular (mixed) outcome, with marginals $\sigma_A$ and $\sigma_B$. Then the alignment of player $B$ with player $A$ in this outcome is defined to be:

$$a_{BA}(\sigma) = \frac{\mathbb{E}_{s \sim \sigma}[u_A(s)] - \min_{s_B \in S_B} \mathbb{E}_{s_A \sim \sigma_A}[u_A(s_A, s_B)]}{\max_{s_B \in S_B} \mathbb{E}_{s_A \sim \sigma_A}[u_A(s_A, s_B)] - \min_{s_B \in S_B} \mathbb{E}_{s_A \sim \sigma_A}[u_A(s_A, s_B)]}$$
Ofc so far it doesn't depend on $u_B$ at all. However, we can make it depend on $u_B$ if we use $u_B$ to impose assumptions on $\sigma_B$, such as requiring that $\sigma_B$ is a best response to $\sigma_A$, or that $(\sigma_A, \sigma_B)$ is a Nash equilibrium.
Caveat: If we go with the Nash equilibrium option, $a_{BA}$ can become "systematically" ill-defined (consider e.g. the Nash equilibrium of matching pennies). To avoid this, we can switch to the extensive-form game where $B$ chooses their strategy after seeing $A$'s strategy.
✅ Pending unforeseen complications, I consider this answer to solve the open problem. It essentially formalizes B's impact alignment with A, relative to the counterfactuals where B did the best or worst job possible.
There might still be other interesting notions of alignment, but I think this is at least an important notion in the normal-form setting (and perhaps beyond).
This also suggests that "selfless" perfect B/A alignment is possible in zero-sum games, with the "maximal misalignment" only occurring if we assume B plays a best response. I think this is conceptually correct, and not something I had realized pre-theoretically.
In a sense, your proposal quantifies the extent to which B selects a best response on behalf of A, given some mixed outcome. I like this. I also think that "it doesn't necessarily depend on $u_B$" is a feature, not a bug.
EDIT: To handle common constant-payoff games, we might want to define the alignment to equal 1 if the denominator is 0. In that case, the response of B can't affect A's expected utility, and so it's not possible for B to act against A's interests. So we might as well say that B is (trivially) aligned, given such a mixed outcome?
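For concreteness, here's a minimal sketch of how this coefficient could be computed under the reading above; the function name `alignment_B_with_A`, the denominator-zero convention from the EDIT, and the prisoner's dilemma payoffs at the end are my own illustrative choices.

```python
import numpy as np

def alignment_B_with_A(u_A, sigma, eps=1e-12):
    """Alignment of the column player (B) with the row player (A) at a mixed
    outcome sigma: where A's actual expected utility sits between the worst
    and best B could have achieved for A, holding A's marginal strategy fixed.

    u_A   : (m, n) array, A's payoff for each (row, column) pure profile.
    sigma : (m, n) array, joint distribution over pure profiles (sums to 1).
    """
    u_A, sigma = np.asarray(u_A, float), np.asarray(sigma, float)
    actual = (sigma * u_A).sum()            # A's expected utility at sigma
    sigma_A = sigma.sum(axis=1)             # A's marginal (mixed) strategy
    by_B = sigma_A @ u_A                    # A's expected utility for each pure choice of B
    worst, best = by_B.min(), by_B.max()
    if best - worst < eps:                  # B cannot affect A's payoff at all:
        return 1.0                          # the convention from the EDIT above
    return (actual - worst) / (best - worst)

# Prisoner's dilemma payoffs for A (rows: C, D; columns: C, D), assumed for illustration.
u_A = [[3, 0], [5, 1]]
print(alignment_B_with_A(u_A, [[0, 0], [0, 1]]))  # (D, D) outcome -> 0.0, worst job for A
print(alignment_B_with_A(u_A, [[1, 0], [0, 0]]))  # (C, C) outcome -> 1.0, best job for A
```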
So, something like "fraction of preferred states shared"? Describe preferred states for P1 as the cells in the payoff matrix that are best for P1 for each P2 action (and preferred states for P2 in a similar manner). The fraction of P1's preferred states that are also preferred by P2 is the measure of P1's alignment to P2. The fraction of shared states between the players to the total number of preferred states is the measure of the total alignment of the game.
For a 2x2 game, each player will have 2 preferred states (corresponding to the 2 possible actions of the opponent). If 1 of them is the same cell, that will mean that each player is 50% aligned to the other (1 of 2 shared) and the game in total is 33% aligned (1 of 3). This also generalizes easily to the NxN case and to >2 players.
And if there are K cells with the same payoff to choose from for some opponent action, we can give each of them weight 1/K instead of 1.
(it would be much easier to explain with a picture and/or table, but I'm pretty new here and wasn't able to find how to do them here yet)
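In lieu of a table, here's a small sketch of the proposal in code; the prisoner's dilemma payoffs are assumed for illustration, and the 1/K tie handling follows the comment above.

```python
import numpy as np

def preferred_weights(u, axis):
    """Weight of each cell as a 'preferred state': for each opponent action,
    the cell(s) best for this player, with K-way ties weighted 1/K."""
    u = np.asarray(u, float)
    mask = (u == u.max(axis=axis, keepdims=True))
    return mask / mask.sum(axis=axis, keepdims=True)

# Prisoner's dilemma payoffs, assumed for illustration (rows: P1's C/D, columns: P2's C/D).
u1 = np.array([[3, 0], [5, 1]])
u2 = np.array([[3, 5], [0, 1]])

w1 = preferred_weights(u1, axis=0)   # P1 picks the best row for each P2 column
w2 = preferred_weights(u2, axis=1)   # P2 picks the best column for each P1 row
shared = np.minimum(w1, w2).sum()    # preferred states shared by both players

print(shared / w1.sum())                        # P1's alignment to P2: 0.5
print(shared / w2.sum())                        # P2's alignment to P1: 0.5
print(shared / (w1.sum() + w2.sum() - shared))  # total alignment of the game: ~0.33
```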
Does agency matter? There are 21 x 21 x 4 possible payoff matrices for a 2x2 game if we use ordinal payoffs. For the vast majority of them (all but about 7 x 7 x 4 of them), one or both players can make a decision without knowing or caring what the other player's payoffs are, and get the best possible result. Of the remaining 182 arrangements, 55 have exactly one box where both players get their #1 payoff (and, therefore, will easily select that as the equilibrium).
All the interesting choices happen in the other 128ish arrangements, 6/7 of which have the ...
**1**/**1**   0/0
0/**0**   **0.8**/-1
I have put the preferred state for each player in bold. I think by your rule this works out to 50% aligned. However, the Nash equilibrium is both players choosing the 1/1 result, which seems perfectly aligned (intuitively).
**1**/**0.5**   0/0
0/0   **0.5**/**1**
In this game, all preferred states are shared, yet there is a Nash equilibrium where each player plays the move that can get them 1 point 2/3 of the time, and the other move 1/3 of the time. I think it would be incorrect to call this 100% aligned.
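For what it's worth, the mixed equilibrium claim for the second game can be checked directly; a small sketch, with the 2/3 and 1/3 probabilities coming from the usual indifference conditions:

```python
# Second game above (rows for one player, columns for the other; indices 0 and 1).
u_row = [[1.0, 0.0], [0.0, 0.5]]   # row player's payoffs
u_col = [[0.5, 0.0], [0.0, 1.0]]   # column player's payoffs

# Candidate mixed equilibrium: each player puts 2/3 on their own 1-point move.
p = 2 / 3   # probability the row player plays row 0 (worth 1 to them)
q = 1 / 3   # probability the column player plays column 0 (so 2/3 on column 1, worth 1 to them)

# At equilibrium each player must be indifferent between their two moves.
print(p * u_col[0][0] + (1 - p) * u_col[1][0],   # column player's payoff from column 0: 1/3
      p * u_col[0][1] + (1 - p) * u_col[1][1])   # ... and from column 1: 1/3
print(q * u_row[0][0] + (1 - q) * u_row[0][1],   # row player's payoff from row 0: 1/3
      q * u_row[1][0] + (1 - q) * u_row[1][1])   # ... and from row 1: 1/3
```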
(These examples were not obvious ...
I think this is backward. The game's payout matrix determines the alignment. Fixed-sum games imply (in the mathematical sense) unaligned players, and common-payoff games ARE the definition of alignment.
When you start looking at meta-games (where resource payoffs differ from utility payoffs, based on agent goals), then "alignment" starts to make sense as a distinct measurement - it's how much the players' utility functions transform the payoffs (in the sub-games of a series, and in the overall game) from fixed-sum to common-payoff.
I don't follow. How can fixed-sum games mathematically imply unaligned players, without a formal metric of alignment between the players?
Also, the payout matrix need not determine the alignment, since each player could have a different policy from strategy profiles to responses, which in principle doesn't have to select a best response. For example, imagine playing stag hunt with someone who responds 'hare' to stag/stag; this isn't a best response for them, but it minimizes your payoff. However, another partner could respond 'stag' to stag/stag, which (I think) makes them "less unaligned" with you than the partner who responds 'hare' to stag/stag.
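A quick numeric illustration of that stag hunt point; the specific payoff values are assumed for concreteness:

```python
# Stag hunt payoffs, assumed for illustration:
# (stag, stag) = (4, 4), (stag, hare) = (0, 3), (hare, stag) = (3, 0), (hare, hare) = (3, 3).
you = {("stag", "stag"): 4, ("stag", "hare"): 0,
       ("hare", "stag"): 3, ("hare", "hare"): 3}
partner = {("stag", "stag"): 4, ("stag", "hare"): 3,
           ("hare", "stag"): 0, ("hare", "hare"): 3}

# You play stag. Responding 'stag' is the partner's best response (4 > 3) and gives
# you 4; responding 'hare' costs the partner a point and gives you your minimum (0).
# Same payoff matrix, different policies, intuitively different alignment with you.
for reply in ("stag", "hare"):
    print(reply, "-> you get", you[("stag", reply)], ", partner gets", partner[("stag", reply)])
```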
Another point you could fix using intuition would be complete disinterest. It makes sense to put it at 0 on the [-1, 1] interval.
Assuming rational utility maximizers, a board that results in a disinterested agent would be:
1/0 1/1
0/0 0/1
Then each agent cannot influence the rewards of the other, so it makes sense to say that they are not aligned.
More generally, if arbitrary changes to one player's payoffs have no effect on the behaviour of the other player, then the other player is disinterested.
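A minimal structural check of the board above; note this tests the narrower condition that neither player's payoff depends on the other's action, not the full "behaviour is unaffected by arbitrary payoff changes" criterion:

```python
import numpy as np

def mutually_disinterested(u_row, u_col):
    """True if neither player's payoff depends on the other player's action:
    the row player's payoff is constant along each row (columns don't matter),
    and the column player's payoff is constant along each column."""
    u_row, u_col = np.asarray(u_row), np.asarray(u_col)
    row_unaffected = np.all(u_row == u_row[:, [0]])
    col_unaffected = np.all(u_col == u_col[[0], :])
    return bool(row_unaffected and col_unaffected)

# The board above, written as separate payoff matrices (player1 / player2 per cell).
u1 = [[1, 1], [0, 0]]
u2 = [[0, 1], [0, 1]]
print(mutually_disinterested(u1, u2))  # True
```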
Correlation between player payouts? In a zero sum game it is -1, when payouts are perfectly aligned it is +1, if payouts are independent it is 0.
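A sketch of one way to read this; here the correlation is taken uniformly across the cells of the payoff matrices, though a distribution over outcomes could be used as weights instead:

```python
import numpy as np

def payoff_correlation(u1, u2):
    """Pearson correlation between the two players' payoffs across cells."""
    return np.corrcoef(np.ravel(u1).astype(float), np.ravel(u2).astype(float))[0, 1]

u1 = np.array([[1, -1], [-1, 1]])
print(payoff_correlation(u1, -u1))  # zero-sum: -1.0
print(payoff_correlation(u1, u1))   # common payoff: +1.0
```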
I'll take a shot at this. Let $A$ and $B$ be the sets of actions of Alice and Bob. Let $n$ (where 'n' means 'nice') be the function that orders $B$ by how good the choices are for Alice, assuming that Alice gets to choose second. Similarly, let $s$ (where 's' means 'selfish') be the function that orders $B$ by how good the choices are for Bob, assuming that Alice gets to choose second. Choose some function $\mathrm{sim}$ measuring similarity between two orderings of a finite set (it should range over $[-1, 1]$); the alignment of Bob with Alice is then $\mathrm{sim}(n, s)$.
Example: in the prisoner's dilemma, $B = \{\text{cooperate}, \text{defect}\}$, and $n$ orders cooperate above defect whereas $s$ orders defect above cooperate. Hence $\mathrm{sim}(n, s)$ should be $-1$, i.e., Bob is maximally unaligned with Alice. Note that this makes it different from Mykhailo's answer, which gives alignment $1/2$, i.e., medium aligned rather than maximally unaligned.
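A sketch of this metric with Kendall's tau standing in for the similarity function (any ordering similarity ranging over [-1, 1] would do); the PD payoff values are assumed for illustration:

```python
import numpy as np
from scipy.stats import kendalltau

# Payoffs indexed by (Bob's action, Alice's action); standard PD values assumed.
# Row 0 = Bob cooperates, row 1 = Bob defects; likewise for Alice's columns.
u_alice = np.array([[3, 5], [0, 1]])
u_bob   = np.array([[3, 0], [5, 1]])

# Alice chooses second and best-responds for herself to each Bob action.
alice_reply = u_alice.argmax(axis=1)         # here: defect against either action
rows = np.arange(len(alice_reply))

n = u_alice[rows, alice_reply]   # how good each Bob action is for Alice ("nice" scores)
s = u_bob[rows, alice_reply]     # how good each Bob action is for Bob   ("selfish" scores)

tau, _ = kendalltau(n, s)        # similarity of the induced orderings, in [-1, 1]
print(tau)                       # -1.0: Bob is maximally unaligned with Alice
```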
This seems like an improvement over correlation since it's not symmetrical. In the game where Alice and Bob both get to choose numbers (say Alice's utility function outputs $a + b$ whereas Bob's outputs $b - a$), Bob would be perfectly aligned with Alice (his $n$ and $s$ both order $B$ by preferring larger numbers) but Alice perfectly unaligned with Bob (her $s$ orders $A$ by preferring larger numbers, but her $n$ orders $A$ by preferring smaller numbers).
I believe this metric meets criteria 1, 3, 4 you listed. It could be changed to be sensitive to players' decision theories by changing $s$ (for alignment from Bob to Alice) to be the ordering output by Bob's decision theory, but I think that would be a mistake. Suppose I build an AI that is more powerful than myself, and the game is such that we can both decide to steal some of the other's stuff. If the AI does this, it leads to -10 utils for me and +2 for it (otherwise 0/0); if I do it, it leads to -100 utils for me because the AI kills me in response (otherwise 0/0). This game is trivial: the AI will take my stuff and I'll do nothing. Also, the AI is maximally unaligned with me. Now suppose I become as powerful as the AI and my 'take AI's stuff' becomes -10 for the AI, +2 for me. This makes the game a prisoner's dilemma. If we both run UDT or FDT, we would now cooperate. If $s$ were the ordering of the AI's decision theory, this would mean the AI is now aligned with me, which is odd since the only thing that changed is me getting more powerful. With the original proposal, the AI is still maximally unaligned with me. More abstractly, game theory assumes your actions have influence on the other player's rewards (else the game is trivial), so if you cooperate for game-theoretical reasons, this doesn't seem to capture what we mean by alignment.
Alright, here comes a pretty detailed proposal! The idea is to find out if the sum of expected utility for both players is “small” or “large” using the appropriate normalizers.
First, let's define some quantities. (I'm not overly familiar with game theory, and my notation and terminology are probably non-standard. Please correct me if that's the case!)
Then, writing $x$ and $y$ for the players' mixed strategies and $A$ and $B$ for their payoff matrices, the expected payoff for player 1 is the bilinear form $x^\top A y$ and the expected payoff for player 2 is $x^\top B y$. The sum of payoffs is $x^\top (A + B) y$.

But we're not done defining stuff yet. I interpret alignment to be about welfare, or how large the sum of utilities is when compared to the best-case scenario and the worst-case scenario. To make an alignment coefficient out of this idea, we will need the best and worst attainable payoff sums, $v^+ = \max_{x, y} x^\top (A + B) y$ and $v^- = \min_{x, y} x^\top (A + B) y$.

Now define the alignment coefficient of the strategies $(x, y)$ in the game defined by the payoff matrices $A$ and $B$ as

$$a(x, y; A, B) = \frac{x^\top (A + B) y - v^-}{v^+ - v^-}.$$

The intuition is that the alignment coefficient quantifies how the expected payoff sum compares to the best possible payoff sum attainable in the game. If they are equal, we have perfect alignment ($a = 1$). On the other hand, if $x^\top (A + B) y = v^-$, the expected payoff sum is as bad as it could possibly be, and we have minimal alignment ($a = 0$).

The only problem is that $v^+ = v^-$ makes the denominator equal to 0; but in this case the numerator is 0 as well, which I believe means that defining $a = 1$ is correct. (It's also true that the payoff sum equals $v^+$ no matter what, but I don't think this matters too much. The players get the best possible outcome no matter how they play, which deserves $a = 1$.) This is an extreme edge case, as it only holds for the special payoff matrices where $A + B$ contains the same element in every cell.
Let's look at some properties:
Now let's take a look at a variant of the Prisoner's dilemma with joint payoff matrix
Then
The alignment coefficient at is
Assuming pure strategies, we find the following matrix of alignments, where entry $(i, j)$ is the alignment when player 1 plays $i$ with certainty and player 2 plays $j$ with certainty.
Since $(D, D)$ is the only Nash equilibrium, the "alignment at rationality" is 0. By taking convex combinations, the whole range of alignment coefficients between the smallest and largest entries of this matrix is attainable.
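A small sketch of the coefficient, reading the normalizers as the best and worst payoff sums over all strategy profiles; the standard PD payoffs below are assumed for illustration, since they reproduce the alignment of 0 at (D, D) described above:

```python
import numpy as np

def welfare_alignment(x, y, A, B, eps=1e-12):
    """Alignment coefficient a(x, y; A, B): where the expected payoff sum
    x^T (A + B) y sits between the worst and best payoff sums of the game.
    Returns 1 when every cell of A + B is identical (the edge case above)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    S = np.asarray(A, float) + np.asarray(B, float)
    lo, hi = S.min(), S.max()
    if hi - lo < eps:
        return 1.0
    return (x @ S @ y - lo) / (hi - lo)

# A standard Prisoner's dilemma, assumed for illustration (rows/columns: C, D).
A = np.array([[3, 0], [5, 1]])
B = A.T
print(welfare_alignment([1, 0], [1, 0], A, B))  # (C, C): 1.0
print(welfare_alignment([0, 1], [0, 1], A, B))  # (D, D): 0.0, the Nash equilibrium
print(welfare_alignment([1, 0], [0, 1], A, B))  # (C, D): 0.75
```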
Some further comments:
I like how this proposal makes explicit the player strategies, and how they are incorporated into the calculation. I also think that the edge case where the agents' actions have no effect on the result is handled sensibly.
I think that this proposal making alignment symmetric might be undesirable. Taking the prisoner's dilemma as an example, if s = always cooperate and r = always defect, then I would say s is perfectly aligned with r, and r is not at all aligned with s.
The result of 0 alignment for the Nash equilibrium of PD seems correct.
I think this should be the alignment mat...
Quick sketch of an idea (written before deeply digesting others' proposals):
Intuition: Just like player 1 has a best response (starting from a strategy profile, improve her own utility as much as possible), she also has an altruistic best response (which maximally improves the other player's utility).
Example: stag hunt. If we're at (rabbit, rabbit), then both players are perfectly aligned. Even if player 1 was infinitely altruistic, she can't unilaterally cause a better outcome for player 2.
Definition: given a strategy profile, an $\epsilon$-altruistic better response is any strategy of one player that gives the other player at least $\epsilon$ extra utility for each point of utility that this player sacrifices.
Definition: player 1 is $\epsilon$-aligned with player 2 if player 1 doesn't have an $\epsilon'$-altruistic better response for any $\epsilon' \geq \epsilon$.
0-aligned: non-spiteful player. They'll give "free" utility to other players if possible, but they won't sacrifice any amount of their own utility for the sake of others.
$\epsilon$-aligned for $0 < \epsilon < 1$: slightly altruistic. Your happiness matters a little bit to them, but not as much as their own.
1-aligned: positive-sum maximizer. They'll yield their own utility as long as the total sum of utility increases.
$\epsilon$-aligned for $\epsilon > 1$: subservient player. They'll optimize your utility with higher priority than their own.
$\infty$-aligned: slave. They maximize others' utility, completely disregarding their own.
Obvious extension from players to strategy profiles: How altruistic would a player need to be before they would switch strategies?
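A rough sketch of that extension: for each unilateral deviation from a profile, report how much the deviating player sacrifices and how much the other player gains (the ratio is the altruism level at which the switch becomes worthwhile). The stag hunt payoffs are assumed for illustration:

```python
import numpy as np

def deviation_report(u_self, u_other, profile):
    """For each alternative action of the row player at `profile`, report the
    utility they would sacrifice and the utility the column player would gain;
    gain / sacrifice is the altruism level at which switching starts to pay off."""
    i, j = profile
    report = []
    for k in range(u_self.shape[0]):
        if k != i:
            sacrifice = int(u_self[i, j] - u_self[k, j])   # positive = deviator loses utility
            gain = int(u_other[k, j] - u_other[i, j])      # positive = other player benefits
            report.append((k, sacrifice, gain))
    return report

# Stag hunt payoffs, assumed for illustration (index 0 = stag, 1 = rabbit).
u1 = np.array([[4, 0], [3, 3]])
u2 = np.array([[4, 3], [0, 3]])

# At (rabbit, rabbit), player 1's only deviation (to stag) costs her 3 and gains
# player 2 nothing, matching the stag hunt example above: no level of altruism
# makes switching a better response.
print(deviation_report(u1, u2, (1, 1)))  # [(0, 3, 0)]
```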
On re-reading this I messed up something with the direction of the signs. Don't have time to fix it now, but the idea is hopefully clear.
Are you sure zero-sum games are maximally misaligned? Consider the joint payoff matrix $P$ given by

$$\begin{pmatrix} (1/2,\ 1/2) & (1/2,\ 1/2) \\ (1/2,\ 1/2) & (1/2,\ 1/2) \end{pmatrix}$$
This matrix doesn't appear minimally aligned to me; instead, it seems maximally aligned. It might be a trivial case but has to be accounted for in the analysis, as it's simultaneously a constant sum game and a symmetric/common payoff game.
I suppose alignment should be understood in terms of payoff sums. Let $X$ be the (random!) strategy of player 1 and $Y$ be the strategy of player 2, and $A$ and $B$ be their individual payoff matrices. (So that the expected payoff of player 1 is $\mathbb{E}[X^\top A Y]$.) Then they are aligned at $(X, Y)$ if the sum of expected payoffs is "large" and misaligned if it is "small", where "large" and "small" need to be quantified, perhaps in relation to the maximal individual payoff, or perhaps something else.
For the matrix above (with $1/2$s), every strategy will yield the same large sum compared to the maximal individual payoff, and it appears to be maximally aligned. In the case of, say, matching pennies with payoffs $\pm 1$, any strategy will yield a sum (0) that is minimally small compared to the maximal individual payoff (1), and the game is minimally aligned.
(Comparing the sum of payoffs to the maximal individual payoff may be wrong though, as it's not invariant under affine transformations. For instance, the sum of payoffs in the $\{0, 1\}$ representation of matching pennies is $1$ in every cell, and the individual payoffs are $0$ and $1$ ...)
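A small numeric check of that caveat; the matching-pennies payoffs and the particular affine rescaling are assumed for illustration:

```python
import numpy as np

# Matching pennies in two representations that differ only by a positive affine
# transformation of each player's utilities, u -> (u + 1) / 2.
u1_pm = np.array([[1, -1], [-1, 1]])   # row player, +/-1 representation
u2_pm = -u1_pm                          # column player (zero-sum)
u1_01, u2_01 = (u1_pm + 1) / 2, (u2_pm + 1) / 2

for name, u1, u2 in [("+/-1", u1_pm, u2_pm), ("0/1", u1_01, u2_01)]:
    print(name, "payoff sum per cell:", np.unique(u1 + u2),
          "max individual payoff:", max(u1.max(), u2.max()))
# The payoff sum is 0 in one representation and 1 in the other even though the
# game is strategically identical, which is the non-invariance noted above.
```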
Hmm, a very interesting case! Intuitively, I would think the function would be undefined for P. Is it really a "game" at all, when neither player has a decision that has any effect on the game?
I could see "undefined" coming naturally from a division by 0 here, where the denominator has something to do with the difference in the payouts received in some way. Indeed, you probably need some sort of division like that, to make the answer invariant under affine transformation.
It's a game, just a trivial one. Snakes and Ladders is also a game, and its payoff matrix is similar to this one, just with a little bit of randomness involved.
My intuition says that this game not only has maximal alignment, but is the only game (up to equivalence) with maximal alignment for any set of strategies. No matter what player 1 and player 2 do, the world is as good as it could be.
The case can be compared to $R^2$ when the variance of the dependent variable is 0. How much of the variance in the dependent variable does the independent variable explain in this case? I'd say it's all of it.
I went back and re-read your https://www.lesswrong.com/posts/8LEPDY36jBYpijrSw/what-counts-as-defection post, and it's much clearer to me that you're NOT using standard game-theory payouts (utility) here. You're using some hybrid of utility and resource payouts, where you seem to normalize payout amounts, but then don't limit the decision to the payouts - players have a utility function which converts the payouts (for all players, not just themselves) into something they maximize in their decision. It's not clear whether they include any non-modeled information (how much they like the other player, whether they think there are future games or reputation effects, etc.) in their decision.
Based on this, I don't think the question is well-formed. A 2x2 normal-form game is self-contained and one-shot. There's no alignment to measure or consider - it's just ONE SELECTION, with one of two outcomes based on the other agent's selection.
It would be VERY INTERESTING to define a game nomenclature to specify the universe of considerations that two (or more) agents can have to make a decision, and then to define an "alignment" measure about when a player's utility function prefers similar result-boxes as the others' do. I'd be curious about even very simple properties, like "is it symmetrical" (I suspect no - A can be more aligned with B than B is with A, even for symmetrical-in-resource-outcome games).
it's much clearer to me that you're NOT using standard game-theory payouts (utility) here.
Thanks for taking the time to read further / understand what I'm trying to communicate. Can you point me to the perspective you consider standard, so I know what part of my communication was unclear / how to reply to the claim that I'm not using "standard" payouts/utility?
Sorry, I didn't mean to be accusatory in that, only descriptive in a way that I hope will let me understand what you're trying to model/measure as "alignment", with the prerequisite understanding of what the payout matrix indicates. http://cs.brown.edu/courses/cs1951k/lectures/2020/chapters1and2.pdf is one reference, but I'll admit it's baked in to my understanding to the point that I don't know where I first saw it. I can't find any references to the other interpretation (that the payouts are something other than a ranking of preferences by each player).
So the question is "what DO these payout numbers represent"? and "what other factors go into an agent's decision of which row/column to choose"?
Right, thanks!
I just don't want to assume the players are making decisions via best response to each strategy profile (which is just some joint strategy of all the game's players). Like, in rock-paper-scissors, if we consider the strategy profile (P1: rock, P2: scissors), I'm not assuming that P2 would respond to this by playing paper.
And when I talk about 'responses', I do mean 'response' in the 'best response' sense; the same way one can reason about Nash equilibria in non-iterated games, we can imagine asking "how would the player respond to this outcome?".
Another point for triangulating my thoughts here is Vanessa's answer, which I think resolves the open question.
I like Vanessa's answer for the fact that it's clearly NOT utility that is in the given payoff matrix. It's not specified what it actually is, but the inclusion of a utility function that transforms the given outcomes into desirability (utility) for the players separates the concept enough to make sense. And then defining alignment as how well player A's utility function supports player B's game outcome works. Not sure it's useful, but it's sensible.
How is it clearly not about utility being specified in the payoff matrix? Vanessa's definition itself relies on utility, and both of us interchanged 'payoff' and 'utility' in the ensuing comments.
I want to point out that this is a great example of a deconfusion open problem. There is a bunch of intuitions, some constraints, and then we want to clarify the confusion underlying it all. Not planning to work on it myself, but it sounds very interesting.
(Only caveat I have with the post itself is that it could be more explicit in the title that it is an open problem).
The function should probably give player A's alignment with player B as something distinct from player B's alignment with player A; for example, player A might always cooperate and player B might always defect. Then it seems reasonable to consider A as aligned with B (in some sense), while B is not aligned with A (they pursue their own payoff without regard for A's payoff).
That seems to be confused reasoning. "Cooperate" and "defect" are labels we apply to a 2x2 matrix sometimes, and applying those labels changes the payouts. If I get $1 or $5 for picking "A" and $0 or $3 for picking "B" depending on a coin flip, that leads me to a different choice than if A is labeled "defect" and B is labeled "cooperate" and the payout depends on another person. In the latter case I get psychic/reputational rewards for cooperating or defecting (which one is better depends on my peer group, but whichever it is, the story equity is worth much more than $5), so my choice is dominated by that, and the actual payout matrix is: pick S: 1000 util or 1001 util; pick T: 2 util or 2 util.
None of which negates the original question of mapping the 8! possible arrangements of relative payouts in a 2x2 matrix game to some sort of linear scale.
That seems to be confused reasoning. "Cooperate" and "defect" are labels we apply to a 2x2 matrix sometimes, and applying those labels changes the payouts.
Not sure I follow your main point, but I was talking about actual PD, which I've now clarified in the original post. See also my post on What counts as defection?.
In my experience, constant-sum games are considered to provide "maximally unaligned" incentives, and common-payoff games are considered to provide "maximally aligned" incentives. How do we quantitatively interpolate between these two extremes? That is, given an arbitrary 2×2 payoff table representing a two-player normal-form game (like Prisoner's Dilemma), what extra information do we need in order to produce a real number quantifying agent alignment?
If this question is ill-posed, why is it ill-posed? And if it's not, we should probably understand how to quantify such a basic aspect of multi-agent interactions, if we want to reason about complicated multi-agent situations whose outcomes determine the value of humanity's future. (I started considering this question with Jacob Stavrianos over the last few months, while supervising his SERI project.)
Thoughts:
The function may or may not rely only on the players' orderings over outcome lotteries, ignoring the cardinal payoff values. I haven't thought much about this point, but it seems important.

EDIT: I no longer think this point is important, but rather confused.

If I were interested in thinking about this more right now, I would: