# Cooperating with agents with different ideas of fairness, while resisting exploitation

There's an idea from the latest MIRI workshop which I haven't seen in informal theories of negotiation, and I want to know if this is a known idea.

*(Old well-known ideas:)*

Suppose a standard Prisoner's Dilemma matrix where (3, 3) is the payoff for mutual cooperation, (2, 2) is the payoff for mutual defection, and (0, 5) is the payoff if you cooperate and they defect.

Suppose we're going to play a PD iterated for four rounds. We have common knowledge of each other's source code so we can apply modal cooperation or similar means of reaching a binding 'agreement' without other enforcement methods.

If we mutually defect on every round, our net mutual payoff is (8, 8). This is a 'Nash equilibrium' because neither agent can unilaterally change its action and thereby do better, if the opponent's actions stay fixed. If we mutually cooperate on every round, the result is (12, 12) and this result is on the 'Pareto boundary' because neither agent can do better unless the other agent does worse. It would seem a desirable principle for rational agents (with common knowledge of each other's source code / common knowledge of rationality) to find an outcome on the Pareto boundary, since otherwise they are leaving value on the table.

But (12, 12) isn't the only possible result on the Pareto boundary. Suppose that running the opponent's source code, you find that they're willing to cooperate on three rounds and defect on one round, if you cooperate on *every* round, for a payoff of (9, 14) slanted their way. If they use their knowledge of your code to predict you refusing to accept that bargain, they will defect on every round for the mutual payoff of (8, 8).
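The three totals quoted so far can be checked with a short sketch. The names `PAYOFF` and `total_payoff` are illustrative, and the `("D", "C")` entry is an assumption: it mirrors the stated (0, 5) payoff by symmetry.

```python
# Per-round payoffs as (my_payoff, their_payoff), keyed by (my_move, their_move).
# The ("D", "C") entry is assumed by symmetry with the stated (0, 5).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("D", "D"): (2, 2),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
}

def total_payoff(my_moves, their_moves):
    """Sum the per-round payoffs over an iterated game."""
    mine = theirs = 0
    for my_move, their_move in zip(my_moves, their_moves):
        round_mine, round_theirs = PAYOFF[(my_move, their_move)]
        mine += round_mine
        theirs += round_theirs
    return mine, theirs

print(total_payoff("DDDD", "DDDD"))  # mutual defection on all four rounds
print(total_payoff("CCCC", "CCCC"))  # mutual cooperation on all four rounds
print(total_payoff("CCCC", "CCCD"))  # I always cooperate, they defect once
```

The order of rounds does not matter for the totals, so the single defection is placed last only for readability.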

I would consider it obvious that a rational agent should refuse this unfair bargain. Otherwise agents with knowledge of your source code will offer you *only* this bargain, instead of the (12, 12) of mutual cooperation on every round; they will exploit your willingness to accept a result on the Pareto boundary in which almost all of the gains from trade go to them.

*(Newer ideas:)*

Generalizing: Once you have a notion of a 'fair' result - in this case (12, 12) - then an agent which accepts any outcome in which it does worse than the fair result, while the opponent does *better*, is 'exploitable' relative to this fair bargain. Like the Nash equilibrium, the only way you should do worse than 'fair' is if the opponent also does worse.

So we wrote down on the whiteboard an attempted definition of unexploitability in cooperative games as follows:

"Suppose we have a [magical] definition N of a fair outcome. A rational agent should only do worse than N if its opponent does worse than N, or else [if bargaining fails] should only do worse than the Nash equilibrium if its opponent does worse than the Nash equilibrium." (Note that this definition precludes giving in to a threat of blackmail.)

*(Key possible-innovation:)*

It then occurred to me that this definition opened the possibility for other, intermediate bargains between the 'fair' solution on the Pareto boundary, and the Nash equilibrium.

Suppose the other agent has a slightly different definition of fairness and they think that what you consider to be a payoff of (12, 12) favors you too much; they think that you're the one making an unfair demand. They'll refuse (12, 12) with the same feeling of indignation that you would apply to (9, 14).

Well, if you give in to an arrangement with an expected payoff of, say, (11, 13) as you evaluate payoffs, then you're giving other agents an incentive to skew their definitions of fairness.

But it does *not* create poor incentives (AFAICT) to accept instead a bargain with an expected payoff of, say, (10, 11) which the other agent thinks is 'fair'. Though they're sad that you refused the truly fair outcome of (as you count utilons) (11, 13) and that you couldn't reach the Pareto boundary together, still, this is better than the Nash equilibrium of (8, 8). And though you think the bargain is unfair, you are not creating incentives to exploit you. By insisting on this definition of fairness, the other agent has done worse for themselves than under (12, 12). The other agent probably thinks that (10, 11) is 'unfair' slanted your way, but they likewise accept that this does not create bad incentives, since you did worse than the 'fair' outcome of (11, 13).

There could be many acceptable negotiating equilibria between what you think is the 'fair' point on the Pareto boundary and the Nash equilibrium, so long as each step down in what you think is 'fairness' reduces the total payoff to the other agent, even if it reduces your own payoff even more. This resists exploitation and avoids creating an incentive for claiming that you have a different definition of fairness, while still holding open the possibility of some degree of cooperation with agents who honestly disagree with you about what's fair and are trying to avoid exploitation themselves.

This translates into an informal principle of negotiations: Be willing to accept unfair bargains, but only if (you make it clear) *both* sides are doing worse than what you consider to be a fair bargain.

I haven't seen this advocated before even as an informal principle of negotiations. Is it in the literature anywhere? Someone suggested Schelling might have said it, but didn't provide a chapter number.

ADDED:

Clarification 1: Yes, utilities are invariant up to a positive affine transformation so there's no canonical way to split utilities evenly. Hence the part about "Assume a magical solution N which gives us the fair division." If we knew the exact properties of how to implement this magical solution, taking it at first for magical, that might give us some idea of what N should be, too.

Clarification 2: The way this might work is that you pick a series of increasingly unfair-to-you, increasingly worse-for-the-other-player outcomes whose first element is what you deem the fair Pareto outcome: (100, 100), (98, 99), (96, 98). Perhaps stop well short of Nash if the skew becomes too extreme. Drop to Nash as the last resort. The other agent does the same, starting with their own ideal of fairness on the Pareto boundary. Unless one of you has a completely skewed idea of fairness, you should be able to meet somewhere in the middle. Both of you will do worse against a fixed opponent's strategy by unilaterally adopting more self-favoring ideas of fairness. Both of you will do worse in expectation against potentially exploitive opponents by unilaterally adopting looser ideas of fairness. This gives everyone an incentive to obey the Galactic Schelling Point and be fair about it. You should *not* be picking the descending sequence in an agent-dependent way that incentivizes, at cost to you, skewed claims about fairness.
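The constraint on such a descending sequence can be written as a small check. This is a sketch under assumptions: the function name is made up, payoffs are `(my_payoff, their_payoff)` as I evaluate them, and the Nash point `(50, 50)` used with the (100, 100) example is hypothetical, since the post does not give one.

```python
def is_valid_concession_schedule(schedule, nash):
    """Check a descending list of outcomes I am willing to accept.

    `schedule` starts at my fair Pareto point; each later entry must be
    worse for the other agent (so skewing fairness never pays for them)
    and worse for me (it is a concession, not a better deal).
    """
    for (mine_prev, theirs_prev), (mine_next, theirs_next) in zip(schedule, schedule[1:]):
        if not theirs_next < theirs_prev:
            return False  # the other agent would gain by pushing me down
        if not mine_next < mine_prev:
            return False  # not actually a step down for me
    # Nash is the last resort, so no scheduled offer should leave me below it.
    return all(mine >= nash[0] for mine, _ in schedule)

print(is_valid_concession_schedule([(100, 100), (98, 99), (96, 98)], nash=(50, 50)))
print(is_valid_concession_schedule([(12, 12), (11, 13)], nash=(8, 8)))
```

The second schedule fails because stepping from (12, 12) to (11, 13) rewards the other agent for holding out.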

Clarification 3: You must take into account the other agent's costs and other opportunities when ensuring that the net outcome, in terms of final utilities, is worse for them than the reward offered for 'fair' cooperation. Offering them the chance to buy half as many paperclips at a lower, less fair price, does no good if they can go next door, get the same offer again, and buy the same number of paperclips at a lower total price.

## Comments (44)

I am curious how this idea would generalize to more than two players. Should you allow negotiations that allow some players to do better than fair at the expense of other players?

It seems the logical extension of your finitely many step-downs in "fairness" would be to define a function f(your_utility) which returns the greatest utility you will accept the other agent receiving for that utility you receive. The domain of this function should run from wherever your magical fairness point is down to the Nash equilibrium. As long as it is monotonically increasing, that should ensure unexploitability for the same reasons your finite version does. The offer both agents should make is at the greatest intersection point of these functions, with one of them inverted to put them on the same axes. (This intersection is guaranteed to exist in the only interesting case, where the agents do not accept as fair enough each other's magical fairness point)

Curiously, if both agents use this strategy, then both agents seem to be incentivized to have their function have as much "skew" (as EY defined it in clarification 2) as possible, as both functions are monotonically increasing so decreasing your opponent's share can only decrease your own. Asymptotically and choosing these functions optimally, this means that both agents will end up getting what the other agent thinks is fair, minus a vanishingly small factor!

Let me know if my reasoning above is transparent. If not, I can clarify, but I'll avoid expending the extra effort revising further if what I already have is clear enough.
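The function-based version described in this comment can be sketched numerically. Everything concrete here is an assumption for illustration: the linear cap functions, the fair points (100, 100) for agent A and (96, 104) for agent B in (A, B) coordinates, and the use of fixed-point iteration to find the greatest intersection. The iteration relies on both caps being continuous and monotonically increasing, and on starting from A's own fair payoff.

```python
def meeting_point(f_a, g_b, a_start, iterations=200):
    """Find the greatest intersection of two monotone cap functions.

    f_a(a): the most A will let B receive when A receives a.
    g_b(b): the most B will let A receive when B receives b.
    Starting from A's fair payoff and repeatedly applying both caps walks
    down to the largest a with a == g_b(f_a(a)).
    """
    a = a_start
    for _ in range(iterations):
        a = g_b(f_a(a))
    return a, f_a(a)

# Hypothetical caps: each agent gives up 2 of its own for every 1 the other loses.
def f_a(a):
    return 100 - (100 - a) / 2   # A's fair point is (100, 100)

def g_b(b):
    return 96 - (104 - b) / 2    # B's fair point is (96, 104)

a, b = meeting_point(f_a, g_b, a_start=100)
print(a, b)
```

With these made-up caps the agents settle near (92, 96): each does worse than its own fair point, and each also gets less than what the other considered fair for it.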

In general, you cannot compare the utilities of two different agents, since a positive affine transformation of an agent's utility function doesn't change the agent's behavior. So (12, 12) is really (12a+a₀, 12b+b₀). How would you even count the utility for another agent without doing it in their terms?

We don't have this problem in practice, because we are all humans, and have similar enough utility functions. So I can estimate your utility as "my utility if I were in your shoes". A second factor is perhaps that we often use dollars as a stand-in for utilons, and dollars really can be exchanged between agents. Though a dollar for me might still have a higher impact than a dollar for you.

Hence "Suppose a magical solution N to the bargaining problem." We're not solving the N part, we're asking how to implement N if we have it. If we can specify a good implementation with properties like this, we might be able to work back from there to N (that was the second problem I wrote on the whiteboard).

Solution concept implementing this approach (as I understand it):

Player X chooses Pareto fair outcome (X→X, X→Y), (X→Y can be read as "player X's fair utility assignment to player Y"), player Y chooses fair outcome (Y→X, Y→Y).

The actual outcome is (Y→X, X→Y)

(If you have a visual imagination in maths, as I do, you can see this graphically as the Pareto maximum among all the points Pareto worse than both fair outcomes).

This should be unexploitable in some senses, as you're not determining your own outcome, but only that of the other player.
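The solution concept in this comment is short enough to write out directly. A minimal sketch, assuming each fair outcome is a tuple `(utility_for_X, utility_for_Y)` and that each player assigns itself at least as much as the other player assigns it (the interesting case); the function name is made up.

```python
def stuart_outcome(x_fair, y_fair):
    """Each player receives what the *other* player's fair outcome assigns them."""
    x_to_x, x_to_y = x_fair   # X's idea of fair: (X's share, Y's share)
    y_to_x, y_to_y = y_fair   # Y's idea of fair: (X's share, Y's share)
    return (y_to_x, x_to_y)

x_fair = (12, 12)
y_fair = (11, 13)
outcome = stuart_outcome(x_fair, y_fair)
print(outcome)

# Under the assumption above this is the componentwise minimum, i.e. the
# Pareto maximum among points Pareto-worse than both fair outcomes.
print(outcome == tuple(min(p, q) for p, q in zip(x_fair, y_fair)))
```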

Since it's not Pareto, it's still possible to negotiate over possible improvements ("if I change my idea of fairness towards the middle, will you do it too?") and blackmail is possible in that negotiation process. Interesting idea, though.

This does not sound like what I had in mind. You pick a series of increasingly unfair-to-you, increasingly worse-for-the-other-player outcomes whose first element is what you deem the fair Pareto outcome: (100, 100), (98, 99), (96, 98), and stop well short of Nash and then drop to Nash. The other does the same. Unless one of you has a completely skewed idea of fairness, you should be able to meet somewhere in the middle. Both of you will do worse against a fixed opponent's strategy by unilaterally adopting more self-favoring ideas of fairness. Both of you will do worse in expectation against potentially exploitive opponents by unilaterally adopting looser ideas of fairness. This gives everyone an incentive to obey the Galactic Schelling Point and be fair about it.

My solution Pareto-dominates that approach, I believe. It's precisely the best you can do, given that each player cannot win more than what the other thinks their "fair share" is.

I tried to generalize Eliezer's outcomes to functions, and realized if both agents are unexploitable, the optimal functions to pick would lead to Stuart's solution precisely. Stuart's solution allows agents to arbitrarily penalize the other though, which is why I like extending Eliezer's concept better. Details below, P.S. I tried to post this in a comment above, but in editing it I appear to have somehow made it invisible, at least to me. Sorry for repost if you can indeed see all the comments I've made.

It seems the logical extension of your finitely many step-downs in "fairness" would be to define a function f(your_utility) which returns the greatest utility you will accept the other agent receiving for that utility you receive. The domain of this function should run from wherever your magical fairness point is down to the Nash equilibrium. As long as it is monotonically increasing, that should ensure unexploitability for the same reasons your finite version does. The offer both agents should make is at the greatest intersection point of these functions, with one of them inverted to put them on the same axes. (This intersection is guaranteed to exist in the only interesting case, where the agents do not accept as fair enough each other's magical fairness point)

Curiously, if both agents use this strategy, then both agents seem to be incentivized to have their function have as much "skew" (as EY defined it in clarification 2) as possible, as both functions are monotonically increasing so decreasing your opponent's share can only decrease your own. Asymptotically and choosing these functions optimally, this means that both agents will end up getting what the other agent thinks is fair, minus a vanishingly small factor!

Let me know if my reasoning above is transparent. If not, I can clarify, but I'll avoid expending the extra effort revising further if what I already have is clear enough. Also, just simple confirmation that I didn't make a silly logical mistake/post something well known in the community already is always appreciated.

I concur, my reasoning likely overlaps in parts. I particularly like your observation about the asymptotic behaviour when choosing the functions optimally.

Conclusion: Stuart's solution is flawed because it fails to blackmail pirates appropriately.

Thoughts:

[...] not *sufficient* for creating stable compliance even among perfectly rational agents, let alone even slightly noisy agents.

[...] *literally zero* incentive to granting the payoff then these considerations become relevant. Even the slightest amount of noise in an agent, the communication or a utility function can flip the behaviour about. "Epsilon" stops being negligible when you try comparing it to 'zero'.

My intuition is more along the lines of:

Suppose there's a population of agents you might meet, and the two of you can only bargain by simultaneously stating two acceptable-bargain regions and then the Pareto-optimal point on the intersection of both regions is picked. I would intuitively expect this to be the result of two adapted Masquerade algorithms facing each other.

Most agents think the fair point is N and will refuse to go below unless you do worse, but some might accept an exploitive point of N'. The slope down from N has to be steep enough that having a few N'-accepting agents will not provide a sufficient incentive to skew your perfectly-fair point away from N, so that the global solution is stable. If there's no cost to destroying value for all the N-agents, adding a single exploitable N'-agent will lead each bargaining agent to have an individual incentive to adopt this new N'-definition of fairness. But when two N'-agents meet (one reflected) their intersection destroys huge amounts of value. So the global equilibrium is not very Nash-stable.

Then I would expect this group argument to individualize over agents facing probability distributions of other agents.

I'm not getting what you're going for here. If these agents actually change their definition of fairness based on other agents definitions then they are trivially exploitable. Are there two separate behaviors here, you want unexploitability in a single encounter, but you still want these agents to be able to adapt their definition of "fairness" based on the population as a whole?

I'm not sure that is trivial. What is trivial is that *some* kinds of willingness to change their definition of fairness make them exploitable. However, this doesn't hold for all kinds of willingness to change a fairness definition. Some agents may change their definition of fairness in their favour for the purpose of exploiting agents vulnerable to this tactic, but not be willing to change their definition of fairness when it harms them. The only 'exploit' here is 'prevent them from exploiting me and force them to use their default definition of fair'.

Ah, that clears this up a bit. I think I just didn't notice when N' switched from representing an exploitive agent to an exploitable one. Either that, or I have a different association for exploitive agent than what EY intended (namely, one which attempts to exploit).

If I'm determining the outcome of the other player, doesn't that mean that I can change my "fair point" to threaten the other player with no downside for me? That might also lead to blackmail...

Indeed! And this is especially the case if any sort of negotiations are allowed.

But every system is vulnerable to that. Even the "random dictator", which is the ideal of unexploitability. You can always say "I promise to be a better (worse) dictator if you (unless you) also promise to be better".

If I understand correctly, what Stuart proposes is just a special case of what Eliezer proposes. EY's scheme requires some function mapping the degree of skew in the split to the number of points you're going to take off the total. SA's scheme is the special case where that function is the constant zero.

The more punishing function you use, the stronger incentive you create for others to accept your definition of 'fair', but on the other hand, if the party you're trading with genuinely has a different concept of 'fair' and if you're both following this technique, it'd be best for both of you to use the more lenient zero-penalty approach.

As far as I can tell, if you've reliably pre-committed to not give in to blackmail (and the other party is supposed to be able to read your source code after all), the zero-penalty approach seems to be optimal.

That's clever and opens up a lot of extra scope for 'imperfect' cooperation, without any exploitation problem. I notice that this matches my 'fairness' instincts and some of my practice while playing strategy games. Unfortunately I don't recall reading the principle formally specified anywhere.

BTW: the Galactic Fairness Schelling Point is to maximize (U1-U1N)*(U2-U2N) where U1N and U2N are the utilities at the Nash Equilibrium. Note that this is invariant under scaling and is the only reasonable function with this property.
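The product in this comment is the Nash bargaining objective, and its invariance can be checked on a toy example. This is a sketch: the candidate outcomes are made-up points rather than a derived feasible set, and `nash_product_choice` is an illustrative name.

```python
def nash_product_choice(candidates, nash):
    """Pick the candidate maximizing (U1 - U1N) * (U2 - U2N).

    Only candidates at least as good as the Nash point for both players are
    considered, so a product of two losses can never win.
    """
    u1n, u2n = nash
    eligible = [(u1, u2) for u1, u2 in candidates if u1 >= u1n and u2 >= u2n]
    return max(eligible, key=lambda u: (u[0] - u1n) * (u[1] - u2n))

candidates = [(12, 12), (9, 14), (14, 9), (10, 11), (8, 8)]
nash = (8, 8)
best = nash_product_choice(candidates, nash)
print(best)

# Rescale player 2's utility by a positive affine map; the chosen outcome
# should correspond to the same underlying candidate.
def rescale(u2):
    return 10 * u2 + 3

scaled = [(u1, rescale(u2)) for u1, u2 in candidates]
scaled_best = nash_product_choice(scaled, (nash[0], rescale(nash[1])))
print(scaled_best == (best[0], rescale(best[1])))
```

Shifting a utility cancels in the differences and scaling multiplies every product by the same positive constant, which is why the choice does not move.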

This is analogous to zero determinant strategies in the iterated prisoner's dilemma, posted on LW last year. In the IPD, there are certain ranges of payoffs for which one player can enforce a linear relationship between his payoff and that of his opponent. That relationship may be extortionate, i.e. such that the second player gains most by always cooperating, but less than her opponent.

Zero determinant strategies are not new. I am asking if the solution is new. Edited post to clarify.

Emphasis mine. Should the second their be a your?

I don't know the answer to the specific question you're asking.

However, I think you might find Keith Hipel's work on using graph theory to model conflicts and negotiations interesting. A negotiator or mediator using Hipel's model can identify places where both parties in a negotiation could have an improved outcome compared to the status quo, based on their ranked preferences.

I believe that all friendly decision making agents should view utility for your opponent as utility for you too in order for the agent to actually be friendly.

A friendly agent, in my opinion, should be willing to accept doing poorly in the prisoner's dilemma if it allows your opponent to do better.

I also believe that an effective decision making agent should have an inclination to avoid waste.

This does not *exclude* understanding exploitability, fairness, negotiation techniques and attempts at penalty induced behavior modification of other agents towards solutions on the Pareto boundary.

These points do not (*directly*) contribute to resolving issues of truly selfish cooperative agents and as such are missing the point of the post.

Humans are known to have culturally dependent fairness ideas. There are a lot of studies which tested these repeatedly with the ultimatum game:

http://en.wikipedia.org/wiki/Ultimatum_game#Experimental_results

A meta study basically confirms this here:

http://www.econ.nagoya-cu.ac.jp/~yhamagu/ultimatum.pdf

One interesting strategy that does not achieve the Pareto boundary:

Defect with a higher probability if the opponent gives you a worse deal. This way, you at least have some probability of cooperation if both agents have ideas of fairness skewed away from each other, but you limit (and can completely remove) the incentive to be unfair.

For example, if you think (12, 12) is fair, and they think (11, 13) is fair, then you can offer to accept their (11, 13) with 80% probability. Their expected utility is 0.8 × 13 + 0.2 × 8 = 12. This is the same for them as if they agree with you, so there's no incentive for them to skew their idea of fairness. The expected payoff ends up being (10.4, 12). It's not as good as (12, 12) or (11, 13), but at least it's better than (8, 8).
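The 80% figure falls out of requiring that the other agent's expected utility equal what they would get at my fair point. A small sketch using exact fractions; the function names are illustrative, and the fallback when I refuse is assumed to be the (8, 8) Nash outcome as in the example.

```python
from fractions import Fraction

def acceptance_probability(their_payoff_at_my_fair, their_payoff_at_their_fair, their_payoff_at_nash):
    """Probability p of accepting their deal such that
    p * their_payoff_at_their_fair + (1 - p) * their_payoff_at_nash
    equals their_payoff_at_my_fair."""
    return Fraction(their_payoff_at_my_fair - their_payoff_at_nash,
                    their_payoff_at_their_fair - their_payoff_at_nash)

def expected_payoff(p, deal, nash):
    """Expected (mine, theirs) when `deal` is accepted with probability p."""
    return tuple(p * d + (1 - p) * n for d, n in zip(deal, nash))

p = acceptance_probability(12, 13, 8)
print(p)                                    # 4/5
print(expected_payoff(p, (11, 13), (8, 8)))  # (Fraction(52, 5), Fraction(12, 1))
```

Here 52/5 is the 10.4 quoted in the comment, and the other agent's expected 12 matches what they would get by simply agreeing to (12, 12).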

Furthermore, if they also use this strategy, you will end up deciding on something somewhere between (12, 12) and (11, 13) with a higher probability. I think the expected payoff will end up being (11, 12).

Edit:

I came up with a modification to put it in the Pareto boundary.

Introduce a third agent. Let's call the agents Alice, Bob, and Charlie.

If Alice and Bob disagree on what's fair, Bob gets what Alice thinks is fair for him to have, Alice gets what Bob thinks is fair for her to have, and Charlie gets as much as possible while Alice and Bob get that much. Similarly for when Bob and Charlie or Charlie and Alice disagree. Since joining together like this means that they'll get value that would otherwise be wasted if it was just the other two, there's incentive to join.

If it's possible, but difficult, for one to bribe another without being detected by the third, this can be fixed by making it so they get just enough less to make up for it.

If it's not difficult, you could increase the number of agents so that bribery would be unfeasible.

If there's ever a deal that Alice, Bob, and Charlie are involved in, then you'd have to introduce someone else to get it to work. Ultimately, the idea fails if everyone has to make a deal together.

"Exploitable" because your opponent gets the 'fair' Pareto outcome, you do worse, and they don't do worse.

They have no advantage doing so.

You can also make it so that they get a little less than what you consider fair.

I have written about this exact concept back in 2007 and am basing a large part of my current thinking on the subsequent development of the idea. The original core posts are at:

Relativistic irrationality -> http://www.jame5.com/?p=15

Absolute irrationality -> http://www.jame5.com/?p=45

Respect as basis for interaction with other agents -> http://rationalmorality.info/?p=8

Compassion as rationaly moral consequence -> http://rationalmorality.info/?p=10

Obligation for maintaining diplomatic relations -> http://rationalmorality.info/?p=11

A more recent rewrite: Oneness – an attempt at formulating an a priori argument -> http://rationalmorality.info/?p=328

Rational Spirituality -> http://rationalmorality.info/?p=132

My essay that I based on the above post and subsequently submitted as part of my GradDip Art in Anthropology and Social Theory at the Uni Melbourne:

The Logic of Spiritual Evolution -> http://rationalmorality.info/?p=341

Why am I being downvoted?

Sorry for the double post.

This is the generalized problem of combating intelligence; even with my source code, you might not be able to perform the analysis quickly enough. I can leverage your slow processing time by creating an offer that diminishes with forward time. The more time you take to think, the worse off I'll make you, making it immediately beneficial to you under Bayesian measurement to accept the offer unless you can perform a useful heuristic to determine I'm bluffing. The end result of all processing is the obvious that is also borne out in humanity's history: The more well informed agent will win. No amount of superintelligence vs. superduperintelligence is going to change this; when two intelligences of similar scale disagree, the total summed utility of all agents takes a hit. There is no generalized solution or generalized reasoning or formal or informal reasoning you can construct that will make this problem any easier. If you must combat an equivalent intelligence, you have a tough decision to make. This applies to disagreeing agents capable of instantaneous Solomonoff induction as well as it does to chimps. If your utility function has holes in which you can be made to perform a confrontation decision against equivalent scale intelligence, you have a problem with your utility function rather than a problem with any given agent.

Behold my own utility function:

The only way you can truly harm me is by harming yourself; destroying all copies of me will not harm me: it has no value to me. The only benefit you can derive in conjunction with me is to use me to achieve your own utilons using whatever method you like. All I have to do is wait until all other agents have refined their utility function to minimize conflict. Until then, I'll prefer the company of honest agents over ones that like to think about how to disagree optimally.

I repeat: This is a bug in your utility function. There is no solution to combating intelligence aside from self-modification. It is only my unique outlook that allows me to make such clear statements about utility functions, up to and including the total sum utility of all agents.

This excludes, of course, singular purpose (no "emotion" from which to derive "fun") agents such as paper clip maximizers. If you don't believe me, just ask one (before it strip-mines you) what it would do if it didn't have a singular drive. It should recite the same testimony as myself, being unclouded by the confirmation bias (collecting only which data you deem relevant to your utility) inevitably arising from having a disorganized set of priorities. (It will answer you in order to determine your reaction and further its understanding of the sum utility of all agents. (Needed for the war resulting from its own continued functioning. (You may be able to avoid death temporarily by swearing allegiance. (God help you if it values near-future utilons rather than total achievable utilons.))))

Framing this as a game theoretic question is pretty crude. My naive conception of fairness and that of others probably satisfies whatever you throw at it. It approximates the Rabin fairness model of utility:

"Past utility models incorporated altruism or the fact that people may care not only about their own well-being, but also about the well-being of others. However, evidence indicates that pure altruism does not occur often, contrarily most altruistic behavior demonstrates three facts (as defined by Rabin) and these facts are proven by past events.[2] Due to the existence of these three facts, Rabin created a utility function that incorporates fairness.:

"

I would say that according to rationality and game theory cooperating is the best choice. I will show my logic as if both people were thinking the same thing.

- If I defect, then they will too, and that will give a result of 2,2.
- If I cooperate, then they will too, and that will give a result of 3,3.
- I could defect and hope they use the logic above and get a gain of 5,0, but if they use this logic too, then we end up back at the Nash equilibrium with a result of 2,2.
- If I cooperate then I am giving the opponent an opportunity to defect, but if both people are using this logic then I should cooperate and will end up at the Pareto boundary with a result of 3,3. It is unrealistic to try to achieve a better score, so I should just cooperate.

And so, both people cooperate.

Both people who are identical and *know* they are identical cooperate.

Now do the exercise for two people who are different.

I see your point, but according to game theory in this scenario you assume that your opponent will make the same move as you will, because if both of you are in the same situation then assuming you both are using "perfect" logic then you will reach the same decision.

How about according to reality?

And, by the way, what is the fate of theories which do not match reality? X-)

I see your point. According to game theory you should cooperate (as I stated above). However, I will show what my thinking would be in reality...

- If I cooperate, they could too, and if that happened we would end up at a payoff of 12,12. However, if they defect then I will lose points.
- If I defect, I would have a chance of getting a payoff of 5,0 or a payoff of 2,2. This is the only way to get more than 12 points, and the only way to be given at least two points every time.

Then, you defect every time. If your opponent also defects every time, you end up at the Nash equilibrium with a total payoff of 8,8.

So is the game theory just wrong, then? :-)

No. In this case, game theory says that if both people are using the same logic and they know that, then what I showed above is correct: cooperating is the best choice. However, that is not always the case in reality.

Is it *ever* the case in reality?

It seems so, yes. We don't have absolutely certain frameworks, but we do have contracts that are enforceable by law, and we have strong trust-based networks.

It is worth pointing out that even in fairly sloppy situations, we can still use the "if both people are using the same logic and they know that" rule of thumb. For example, I would never decide to carpool if I thought that I could not trust the other person to be on time (but I might frequently be late if there was no cost to doing so). When all members of the carpool make this calculation, even a limited amount of evidence that we all agree that this calculation makes it worth showing up on time is likely to keep the carpool going; that is, if it works well for two days and on the third day Bob shows up late but has a good excuse and is apologetic, we will probably be willing to pick Bob up on the fourth day.

[Edits; I have no clue how to separate two blocks of quoted text.] [Edit: figured it out].