TL;DR: using a simple mixed strategy, LDT can give in to threats, ultimatums, and commitments - while incentivizing cooperation and fair[1] splits instead.

This strategy made it much more intuitive to many people I've talked to that smart agents probably won't do weird everyone's-utility-eating things like threatening each other or participating in commitment races.

1. The Ultimatum game

This part is taken from planecrash[2][3].

You're in the Ultimatum game. You're offered 0-10 dollars. You can accept or reject the offer. If you accept, you get what's offered, and the offerer gets $(10-offer). If you reject, both you and the offerer get nothing.

The simplest strategy that incentivizes fair splits is to accept everything ≥ 5 and reject everything < 5. The offerer can't do better than by offering you 5. If you accepted offers of 1, the offerer that knows this would always offer you 1 and get 9, instead of being incentivized to give you 5. Being unexploitable in the sense of incentivizing fair splits is a very important property that your strategy might have.

With the simplest strategy, if you're offered 5..10, you get 5..10; if you're offered 0..4, you get 0 in expectation.
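As a quick sanity check of the threshold strategy, here is a minimal sketch. It assumes whole-dollar offers and a $10 pot, and the function names are made up for illustration.

```python
TOTAL = 10  # dollars to split, as in the example above
FAIR = 5    # the split this post treats as fair

def accepts(offer: int) -> bool:
    """Threshold responder: accept any offer of at least the fair share."""
    return offer >= FAIR

def offerer_payoff(offer: int) -> int:
    """What the offerer keeps when facing the threshold responder."""
    return TOTAL - offer if accepts(offer) else 0

# The offerer's best response is the fair offer.
best_offer = max(range(TOTAL + 1), key=offerer_payoff)
print(best_offer, offerer_payoff(best_offer))
```

Under these assumptions the offerer cannot do better than offering 5, but every offer below 5 leaves both players with nothing.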

Can you do better than that? What is a strategy that you could use that would get more than 0 in expectation if you're offered 1..4, while still being unexploitable (i.e., still incentivizing splits of at least 5)?

I encourage you to stop here and try to come up with a strategy before continuing.

The solution, explained by Yudkowsky in planecrash (children split 12 jellychips, so the offers are 0..12):

When the children return the next day, the older children tell them the correct solution to the original Ultimatum Game.

It goes like this:

When somebody offers you a 7:5 split, instead of the 6:6 split that would be fair, you should accept their offer with slightly less than 6/7 probability.  Their expected value from offering you 7:5, in this case, is 7 * slightly less than 6/7, or slightly less than 6.  This ensures they can't do any better by offering you an unfair split; but neither do you try to destroy all their expected value in retaliation.  It could be an honest mistake, especially if the real situation is any more complicated than the original Ultimatum Game.

If they offer you 8:4, accept with probability slightly-more-less than 6/8, so they do even worse in their own expectation by offering you 8:4 than 7:5.

It's not about retaliating harder, the harder they hit you with an unfair price - that point gets hammered in pretty hard to the kids, a Watcher steps in to repeat it.  This setup isn't about retaliation, it's about what both sides have to do, to turn the problem of dividing the gains, into a matter of fairness; to create the incentive setup whereby both sides don't expect to do any better by distorting their own estimate of what is 'fair'.

[The next stage involves a complicated dynamic-puzzle with two stations, that requires two players working simultaneously to solve.  After it's been solved, one player locks in a number on a 0-12 dial, the other player may press a button, and the puzzle station spits out jellychips thus divided.

The gotcha is, the 2-player puzzle-game isn't always of equal difficulty for both players.  Sometimes, one of them needs to work a lot harder than the other.]

They play the 2-station video games again.  There's less anger and shouting this time.  Sometimes, somebody rolls a continuous-die and then rejects somebody's offer, but whoever gets rejected knows that they're not being punished.  Everybody is just following the Algorithm.  Your notion of fairness didn't match their notion of fairness, and they did what the Algorithm says to do in that case, but they know you didn't mean anything by it, because they know you know they're following the Algorithm, so they know you know you don't have any incentive to distort your own estimate of what's fair, so they know you weren't trying to get away with anything, and you know they know that, and you know they're not trying to punish you.  You can already foresee the part where you're going to be asked to play this game for longer, until fewer offers get rejected, as people learn to converge on a shared idea of what is fair.

Sometimes you offer the other kid an extra jellychip, when you're not sure yourself, to make sure they don't reject you.  Sometimes they accept your offer and then toss a jellychip back to you, because they think you offered more than was fair.  It's not how the game would be played between dath ilan and true aliens, but it's often how the game is played in real life.  In dath ilan, that is.

This allows even very different agents with very different notions of fairness to cooperate most of the time.

So, if in the game with $0..10, you're offered $4 instead of the fair $5, you understand that if you accept, the other player will get $6 - and so you accept with a probability of slightly less than 5/6, making the other player receive, in expectation, slightly less than the fair $5. You still get $4 most of the time when you're offered this unfair split, but you're incentivizing fair splits. Even if you're offered $1, you accept in slightly less than 5/9 of cases - which is more than half of the time, but still incentivizes offering you the fair 5-5 split instead.
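The arithmetic above can be sketched in a few lines. This is only an illustration: the value of `EPSILON` and the function names are assumptions, and the fair split is taken to be 5, as in the example.

```python
from fractions import Fraction

TOTAL = 10
FAIR = 5
EPSILON = Fraction(1, 1000)  # hypothetical small margin; any small positive value works

def accept_probability(offer: int) -> Fraction:
    """Accept fair-or-better offers always; accept unfair offers with
    probability slightly below FAIR / (offerer's share)."""
    if offer >= FAIR:
        return Fraction(1)
    offerer_share = TOTAL - offer
    return Fraction(FAIR, offerer_share) - EPSILON

def offerer_expected(offer: int) -> Fraction:
    """Offerer's expected payoff against this responder."""
    return (TOTAL - offer) * accept_probability(offer)

def responder_expected(offer: int) -> Fraction:
    """Responder's expected payoff when following this rule."""
    return offer * accept_probability(offer)

for offer in range(TOTAL + 1):
    print(offer, float(accept_probability(offer)),
          float(offerer_expected(offer)), float(responder_expected(offer)))
```

In this sketch, every offer below 5 leaves the offerer with slightly less than 5 in expectation, while the responder still collects most unfair offers.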

If the other player makes a commitment to offer you $4 regardless of what you do, it simply doesn't change what you do when you're offered $4. You want to accept $4 with \(p=5/6-\epsilon\) regardless of what led to this offer. Otherwise, you'll incentivize offers of $4 instead of $5. This means other players don't make bad commitments (and if they do, you usually give in).

(This is symmetrical. If you're the offerer, and the other player accepts only offers of at least $6 and always rejects $5 or lower, you can offer $6 with \(p=5/6-\epsilon\) and otherwise offer less and be rejected.)

2. Threats, commitments, and ultimatums

You can follow the same procedure in all games. Figure out the fair split of gains, then try to coordinate on it; if the other agent is not willing to agree to the fair split and demands something else, agree to their ultimatum probabilistically, in a way that incentivizes the fair split instead.

2.1 Game of Chicken

Let's say the payoff matrix is:

(-100, -100)    (5, -1)
(-1, 5)         (0, 0)

Let's assume we consider the fair payoff in this game to be 2 for each player. You can achieve it by coordinating on a fair coin flip to determine who does what.
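To see where the 2 comes from, here is a small sketch. It assumes the payoff matrix reads as (row player, column player) pairs, with the first row and column meaning "go straight" and the second meaning "swerve"; those labels are my reading, not something stated above.

```python
# Payoffs as (row player, column player). The action labels are an assumption.
PAYOFFS = {
    ("straight", "straight"): (-100, -100),
    ("straight", "swerve"): (5, -1),
    ("swerve", "straight"): (-1, 5),
    ("swerve", "swerve"): (0, 0),
}

def coin_flip_expectation():
    """Expected payoffs when a fair coin decides which player swerves."""
    heads = PAYOFFS[("straight", "swerve")]
    tails = PAYOFFS[("swerve", "straight")]
    return tuple((h + t) / 2 for h, t in zip(heads, tails))

print(coin_flip_expectation())  # (2.0, 2.0)
```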

If the other player instead commits to not swerve, you calculate that if you give in, they get 5; the fair payoff is 2; so you simply give in and swerve with p=97%, making the other player get less than 2 in expectation; they would've done better by cooperating. Note that this decision procedure is much better than never giving in to threats - which would correspond to getting -100 every time instead of just 3% of the time - while still having the property that it's better for everyone to not threaten you at all.
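The 97% figure can be checked with a short sketch. The helper names are hypothetical, and the general formula is just the break-even point of the threatener's expected payoff, under the assumption that refusing costs them a fixed, known amount.

```python
from fractions import Fraction

def max_give_in_probability(fair, if_you_give_in, if_you_refuse):
    """Probability p at which the threatener's expectation equals the fair payoff:
    p * if_you_give_in + (1 - p) * if_you_refuse == fair.
    Giving in with any probability strictly below this keeps threats unprofitable."""
    return Fraction(fair - if_you_refuse, if_you_give_in - if_you_refuse)

def threatener_expected(p, if_you_give_in, if_you_refuse):
    """Threatener's expected payoff if you give in with probability p."""
    return p * if_you_give_in + (1 - p) * if_you_refuse

bound = max_give_in_probability(fair=2, if_you_give_in=5, if_you_refuse=-100)
p = Fraction(97, 100)
print(bound, float(bound))               # 34/35, roughly 0.971
print(threatener_expected(p, 5, -100))   # 37/20, i.e. 1.85 < 2
```

The same helper reproduces the Ultimatum numbers: with a fair payoff of 5, a give-in payoff of 6, and a refusal payoff of 0, the bound is 5/6.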

2.2 Stones

If the other player is a stone with "Threat" written on it, you should do the same thing, even if it looks like the stone's behavior doesn't depend on what you'll do in response. Responding to actions and ignoring the internals when threatened means you'll get a lot fewer stones thrown at you.

2.3 What if I don't know the other player's payoffs?

You want to make decisions that don’t incentivize threatening you. If you receive a threat and know nothing about the other agent’s payoffs, simply don’t give in to the threat! (If you have some information, you can transparently give in with a probability low enough that you're certain transparently making decisions this way isn't incentivizing this threat.)

2.4 What if the other player makes a commitment before I make any decisions?

Even without the above strategy, why would this matter? You can just make the right decisions you want to make. You can use information when you want to be using it and not use it when it doesn't make sense to use it. The time at which you receive the information doesn't have to be an input into what you consider if you think it doesn't matter when you receive it.

With the above algorithm, if you receive a threat, you simply look at it and give in to it most of the time in many games, all while incentivizing not threatening you, because the other player can get more utility if they don't threaten you.

(In reality, making decisions this way means you'll rarely receive threats. In most games, you'll coordinate with the other player on extracting the most utility. Agents will look at you, understand that threatening you means less utility, and you won't have to spend time googling random number generators and probabilistically giving in. It doesn't make sense for the other agent to make threatening commitments; and if they do, it's slightly bad for them.

It's never a good idea to threaten an LDT agent.)

  1. ^

    Humans might use the Shapley value, the ROSE value, or their intuitive feeling of fairness. Other agents might use very different notions of fairness.

  2. ^
  3. ^

    The idea of unexploitable cooperation with agents with different notions of fairness seems to have first been introduced by @Eliezer Yudkowsky in this 2013 post, with agents accepting unfair (according to them) bargains in which the other agent does worse than in the fair point on the Pareto frontier; but it didn’t suggest accepting unfair bargains probabilistically, to create new points where the other agent does just slightly worse in expectation than it would’ve in the fair point. One of the comments almost got there, but didn’t suggest adding  \(-\epsilon\)  to the giving-in probability, so the result was considered exploitable (as the other agent was indifferent between making a threat and accepting the fair bargain).

19 comments

It's definitely not clear to me that updatelessness + Yudkowsky's solution prevent threats. The core issue is that a target and a threatener face a prima facie symmetric decision problem of whether to use strategies that depend on their counterpart's strategy or strategies that do not depend on their counterpart's strategy.[1]

In other words, the incentive targets have to use non-dependent strategies that incentivise favourable (no-threat) responses from threateners is the same incentive threateners have to use non-dependent strategies that incentivise favourable (give-into-threat) responses from targets. This problem is discussed in more detail in parts of Responses to apparent rationalist confusions about game / decision theory and in Updatelessness doesn't solve most problems.

There are potential symmetry breakers that privilege a no-threat equilibrium, such as the potential for cooperation between different targets. However, there are also potential symmetry breakers in the other direction. I expect Yudkowsky is aware of the symmetry of this problem and either thinks the symmetry breakers in favour of no-threats seem very strong, or is just very confident in the superintelligences-should-figure-this-stuff-out heuristic. Relatedly, this post argues that mutually transparent agents should be able to avoid most of the harm of threats being executed, even if they are unable to avoid threats from being made.

But these are different arguments to the one you make here, and I'm personally unconvinced even these arguments are strong enough that it's not very important for us to work on preventing harmful threats from being made by or against AIs that humanity deploys.

FYI, a lot of Center on Long-Term Risk's research is motivated by this problem; I suggest people reach out to us if they're interested in working on it!

  1. ^

    Examples of non-dependent strategies would include

    • Refusing all threats regardless of why they were made
    • Refusing threats to the extent prescribed by Yudkowsky's solution regardless of why they were made
    • Making threats regardless of a target's refusal strategy when the target is incentivised to give in

    An example of a dependent strategy would be

    • Refusing threats more often when a threatener accurately predicted whether or not you would refuse in order to determine whether to make a threat; and refusing threats less often when they did not predict you, or did so less accurately

For posterity, and if it's of interest to you, my current sense on this stuff is that we should basically throw out the frame of "incentivizing" when it comes to respectful interactions between agents or agent-like processes. This is because regardless of whether it's more like a threat or a cooperation-enabler, there's still an element of manipulation that I don't think belongs in multi-agent interactions we (or our AI systems) should consent to.

I can't be formal about what I want instead, but I'll use the term "negotiation" for what I think is more respectful. In negotiation there is more of a dialogue that supports choices to be made in an informed way, and there is less this element of trying to get ahead of your trading partner by messing with the world such that their "values" will cause them to want to do what you want them to do.

I will note that this "negotiation" doesn't necessarily have to take place in literal time and space. There can be processes of agents thinking about each other that resemble negotiation and qualify to me as respectful, even without a physical conversation. What matters, I think, is whether the logical process that led to another agent's choices can be seen in this light.

And I think in cases when another agent is "incentivizing" my cooperation in a way that I actually like, it is exactly when the process was considering what the outcome would be of a negotiating process that respected me.

If the other player is a stone with “Threat” written on it, you should do the same thing, even if it looks like the stone’s behavior doesn’t depend on what you’ll do in response. Responding to actions and ignoring the internals when threatened means you’ll get a lot fewer stones thrown at you.

In order to "do the same thing" you either need the other player's payoffs, or, according to the next section, "If you receive a threat and know nothing about the other agent’s payoffs, simply don’t give in to the threat!" So if all you see is a stone, then presumably you don't know the other agent's payoffs, so presumably "do the same thing" means "don't give in".

But that doesn't make sense because suppose you're driving and suddenly a boulder rolls towards you. You're going to "give in" and swerve, right? What if it's an animal running towards you and you know they're too dumb to do LDT-like reasoning or model your thoughts in their head, you're also going to swerve, right? So there's still a puzzle here where agents have an incentive to make themselves look like a stone (i.e., part of nature or not an agent), or to never use LDT or model others in any detail.

Another problem is, do you know how to formulate/formalize a version of LDT so that we can mathematically derive the game outcomes that you suggest here?

do you know how to formulate/formalize a version of LDT so that we can mathematically derive the game outcomes that you suggest here?

I recall Eliezer saying this was an open problem, at a party about a year ago.

By a stone, I meant a player with very deterministic behavior in a game with known payoffs, named this way after the idea of cooperate-stones in prisoner’s dilemma (with known payoffs).

I think to the extent there’s no relationship between giving in to a boulder/implementing some particular decision theory and having this and other boulders thrown at you, UDT and FDT by default swerve (and probably don't consider the boulders to be threatening them, and it’s not very clear in what sense this is “giving in”); to the extent it sends more boulders their way, they don’t swerve.

If making decisions some way incentivizes other agents to become less like LDTs and more like uncooperative boulders, you can simply not make decisions that way. (If some agents actually have an ability to turn into animals and you can’t distinguish the causes behind an animal running at you, you can sometimes probabilistically take out your anti-animal gun and put them to sleep.)

Do you maybe have a realistic example where this would realistically be a problem?

I’d be moderately surprised if UDT/FDT consider something to be a better policy than what’s described in the post.

Edit: to add, LDTs don't swerve to boulders that were created to influence the LDT agent's responses. If you turn into a boulder because you expect some agents among all possible agents to swerve, this is a threat, and LDTs don't give in to those boulders (and it doesn't matter whether or not you tried to predict the behavior of LDTs in particular). If you believed LDT agents or agents in general would swerve against a boulder, and that made you become a boulder, LDT agents obviously don't swerve to that boulder. They might swerve to boulders that are actually natural boulders, caused by very simple physics that no one influenced in order to make the agents do something. They also pay their rent, because they'd be evicted otherwise: the eviction would happen not in order to extract rent from them under threat, but so that the landlord can get rent from someone else, and they're sure there were no self-modifications to make it look this way.

If making decisions some way incentivizes other agents to become less like LDTs and more like uncooperative boulders, you can simply not make decisions that way.

Another way that those agents might handle the situation is not to become boulders themselves, but to send boulders to make the offer. That is, send a minion to present the offer without any authority to discuss terms. I believe this often happens in the real world, e.g. customer service staff whose main goal, for their own continued employment, is to send the aggrieved customer away empty-handed and never refer the call upwards.

A smart agent can simply make decisions like a negotiator with restrictions on the kinds of terms it can accept, without having to spawn a "boulder" to do that.

You can just do the correct thing, without having to separate yourself into parts that do things correctly and a part that tries to not look at the world and spawns correct-thing-doers.

In Parfit's Hitchhiker, you can just pay once you're there, without precommitting/rewriting yourself into an agent that pays. You can just do the thing that wins.

Some agents can't do the things that win and would have to rewrite themselves into something better and still lose in some problems, but you can be an agent that wins, and gradient descent probably crystallizes something that wins into what is making the decisions in smart enough things.

If you receive a threat and know nothing about the other agent’s payoffs, simply don’t give in to the threat!

With an important caveat: if carrying out the threat doesn't cost the threatener utility relative to never making the threat, then it's not a threat, just a promise (a promise to do whatever is locally in their best interests, whether you do the thing they demanded or not).

You're going to have a bad time if you try to live out LDT by ignoring threats, and end up ignoring "threats" like "pay your mortgage or we'll repossess your house".

Yep! If someone is doing things because it's in their best interests and not to make you do something (and they're not a result of someone else shaping themselves into them to cause you to do something, where some previous agent wouldn't actually prefer the thing the new one prefers, which you don't want to happen), then this is not a threat.

Not having read that part of planecrash, the solution I immediately thought of, just because it seemed so neat, was that if offered a fraction \(x\) of the money, accept with probability \(x\). The other player’s expectation is \(x(1-x)\), maximised at \(x=1/2\). Is Eliezer’s solution better than mine, or mine better than his?

One way in which Eliezer’s is better is that mine does not have an immediate generalisation to all threat games.

Your solution works! It's not exploitable, and you get much more than 0 in expectation! Congrats!

Eliezer's solution is better/optimal in the sense that it accepts with the highest probability a strategy can use without becoming exploitable. If offered 4/10, you accept with p=40%; the optimal solution accepts with p=83% (or slightly less than 5/6); if offered 1/10, it's p=10% vs. p=55%. The other player's payout is still maximized at 5, but everyone gets the payout a lot more often!
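A rough side-by-side of the two responder strategies for the $10 game, as a sketch only: the function names and the value of `EPSILON` are made up, and the fair split is assumed to be 5.

```python
from fractions import Fraction

TOTAL, FAIR = 10, 5
EPSILON = Fraction(1, 1000)  # hypothetical small margin

def proportional(offer: int) -> Fraction:
    """Accept with probability equal to the fraction of the pot offered."""
    return Fraction(offer, TOTAL)

def near_fair(offer: int) -> Fraction:
    """Accept unfair offers with probability just under FAIR / (offerer's share)."""
    if offer >= FAIR:
        return Fraction(1)
    return Fraction(FAIR, TOTAL - offer) - EPSILON

def offerer_expected(strategy, offer: int) -> Fraction:
    """Offerer's expected payoff against a given acceptance rule."""
    return (TOTAL - offer) * strategy(offer)

for offer in (1, 4):
    print(offer, float(proportional(offer)), float(near_fair(offer)))
```

In both cases the offerer's expectation peaks at an offer of 5; the difference is how often the deal actually goes through.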

It's not how the game would be played between dath ilan and true aliens

This is a very important caveat. Many humans or CDT agents could be classified as “true aliens” by someone not part of their ingroup.

It's not how the game would be played between dath ilan and true aliens

This is a reference to "Sometimes they accept your offer and then toss a jellychip back to you". Between dath ilan and true aliens, you do the same except for tossing the jellychip when you think you got more than what would've been fair. See True Prisoner's Dilemma.

How should one signal their decision procedure in real life without getting their ass busted for "gambling with lives" etc.?

Make your decision unpredictable to your counterparty but not truly random. This happens all the time in e.g. nuclear deterrence in real life.

For singleton events (large-scale nuclear attack and counterattack), deception plays an important role. This isn't a problem, apparently, in dath ilan - everyone has common knowledge of others' rationality.

(It is pretty important to very transparently respond with a nuclear strike to a nuclear strike. I think both Russia and the US are not really unpredictable in this question. But yeah, if you have nuclear weapons and your opponents don't, you might want to be unpredictable, so your opponent is more scared of using conventional weapons to destroy you. In real-life cases with potentially dumb agents, it might make sense to do this.)

I think creating uncertainty in your adversary applies a bit more than you give it credit for, and assuring a second strike is an exception.

It has been crucial to Russia's strategy in Ukraine to exploit NATO's fear of escalation by making various counter-threats whenever NATO proposes expanding aid to Ukraine somehow. This has bought them 2 years without ATACMS missiles attacking targets inside Russia, and that hasn't required anyone to be irrational, just incapable of perfectly modeling the Kremlin.

Even when responding to a nuclear strike, you can essentially have a mixed strategy. I think China does not have enough missiles to assure a second strike, but builds extra decoy silos so they can't all be destroyed. They didn't have to roll a die, just be unpredictable.

I guess when criminals and booing bystanders are not as educated as dath ilani children, some real-world situations might get complicated. Possibly, transparent stats about the actions you've taken in similar situations might serve the same purpose even if you don't broadcast throwing your dice on live TV. Or it might make sense to transparently never give in to some kinds of threats in some sorts of real-life situations.