In this post I'll try to show a surprising link between two research topics on LW: game-theoretic cooperation between AIs (quining, Loebian cooperation, modal combat, etc) and stable self-modification of AIs (tiling agents, Loebian obstacle, etc).

When you're trying to cooperate with another AI, you need to ensure that its action will fulfill your utility function. And when doing self-modification, you also need to ensure that the successor AI will fulfill your utility function. In both cases, naive utility maximization doesn't work, because you can't fully understand another agent that's as powerful and complex as you. That's a familiar difficulty in game theory, and in self-modification it's known as the Loebian obstacle (fully understandable successors become weaker and weaker).

In general, any AI will be faced with two kinds of situations. In "single player" situations, you're faced with a choice like eating chocolate or not, where you can figure out the outcome of each action. (Most situations covered by UDT are also "single player", involving identical copies of yourself.) Whereas in "multiplayer" situations your action gets combined with the actions of other agents to determine the outcome. Both cooperation and self-modification are "multiplayer" situations, and are hard for the same reason. When someone proposes a self-modification to you, you might as well evaluate it with the same code that you use for game theory contests.
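Here's a minimal sketch of that last point, under the toy assumption that "evaluating" another agent just means running its short, terminating source code (a real agent would need proof search or bounded simulation instead). The function names and payoffs below are mine, made up purely for illustration:

```python
def predicted_action(agent_source: str, situation: dict) -> str:
    # Toy stand-in for reasoning about another agent's code; a real agent would
    # use proof search or bounded simulation rather than a bare exec().
    scope = {}
    exec(agent_source, scope)
    return scope["act"](situation)

def my_utility_of(action: str) -> float:
    # Made-up payoffs over the other agent's possible actions.
    return {"cooperate": 2.0, "defect": 0.0}[action]

def evaluate_counterpart(agent_source: str, situation: dict) -> float:
    """The same routine scores a game-theory opponent and a proposed successor:
    in both cases all I care about is what the other agent will actually do."""
    return my_utility_of(predicted_action(agent_source, situation))

opponent = "def act(situation): return 'cooperate'"
proposed_successor = "def act(situation): return 'cooperate'"
print(evaluate_counterpart(opponent, {"game": "prisoners_dilemma"}))
print(evaluate_counterpart(proposed_successor, {"game": "self_modification"}))
```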

If I'm right, then any good theory for cooperation between AIs will also double as a theory of stable self-modification for a single AI. That means neither problem can be much easier than the other, and in particular self-modification won't be a special case of utility maximization, as some people seem to hope. But on the plus side, we need to solve one problem instead of two, so creating FAI becomes a little bit easier.

The idea came to me while working on this mathy post on IAFF, which translates some game theory ideas into the self-modification world. For example, Loebian cooperation (from the game theory world) might lead to a solution for the Loebian obstacle (from the self-modification world) - two LW ideas with the same name that people didn't think to combine before!


Note that there are two very distinct reasons for cooperation/negotiation:

1) It's the best way to get what I want. The better I model other agents, the better I can predict how to interact with them in a way that meets my desires. For this item, an external agent is no different from any other complex system.

2) I actually care about the other agent's well-being. There is a term in my utility function for their satisfaction.

Very weirdly, we tend to assume #2 about humans (when it's usually a mix of mostly #1 and a bit of #2). And we focus on #1 for AI, with no element of #2.

When you say "code for cooperation", I can't tell if you're just talking about #1, or some mix of the two, where caring about the other's satisfaction is a goal.

Mostly #1. Is there a reason to build AIs that inherently care about the well-being of paperclippers etc?

But EA should be mostly #2?

In humans, there's a lot of #2 behind our cooperative ability (even if the result looks a lot like #1). I don't know how universal that will be, but it seems likely to be computationally cheaper at some margin to encode #2 than to calculate and prove #1.

In my view, "code for cooperation" will very often have a base assumption that cooperation in satisfying others' goals is more effective (which feels like "more pleasant" or "more natural" from inside the algorithm) than contractual resource exchanges.

I think there is a difference between creating an agent and negotiating with another agent. If agent 1 creates an agent 2, it will always know for sure its goal function.

However, if two agents meet, and agent A says to agent B that it has utility function U, then even if A sends its source code as proof, agent B has no reason to believe it. Any source code could be faked. The more advanced both agents are, the harder it is for them to prove their values to each other. So each will always suspect that the other side is cheating.

As a result, as I once said too strongly: any two sufficiently advanced agents will go to war with each other. The one exception is if they are two instances of the same source code, but even in this case cheating is possible.

To prevent cheating, it is better to destroy the second agent (unfortunately). What are the solutions to this problem in LW research?

Note that source code can't be faked in the self-modification case. Software agent A can set up a test environment (a virtual machine or simulated universe), create new agent B inside it, and then A has a very detailed and accurate view of B's innards.

However, logical uncertainty is still an obstacle, especially with agents not verified by theorem-proving.

I don't believe it. War wastes resources. The only reason war happens is that two agents have different beliefs about the likely outcome of war, which means at least one of them has wrong and self-harming beliefs. Sufficiently rational agents will never go to war, instead they'll agree about the likely outcome of war, and trade resources in that proportion. Maybe you can't think of a way to set up such trade, because emails can be faked etc, but I believe that superintelligences will find a way to achieve their mutual interest. That's one reason why I'm interested in AI cooperation and bargaining.

I'm flashing back to reading Jim Fearon!

Fearon's paper concludes that pretty much only two mechanisms can explain "why rationally led states" would go to war instead of striking a peaceful bargain: private information, and commitment problems.

Your comment brushes off commitment problems in the case of superintelligences, which might turn out to be right. (It's not clear to me that superintelligence entails commitment ability, but nor is it clear that it doesn't entail commitment ability.) I'm less comfortable with setting aside the issue of private information, though.

Assuming rational choice, competing agents are only going to truthfully share information if they have incentives to do so, or at least no incentive not to do so, but in cases where war is a real possibility, I'd expect the incentives to actively encourage secrecy: exaggerating war-making power and/or resolve could allow an agent to drive a harder potential bargain.

You suggest that the ability to precommit could guarantee information sharing, but I feel unease about assuming that without a systematic argument or model. Did Schelling or anybody else formally analyze how that would work? My gut has the sinking feeling that drawing up the implied extensive-form game and solving for equilibrium would produce a non-zero probability of non-commitment, imperfect information exchange, and conflict.

Finally I'll bring in a new point: Fearon's analysis explicitly relies on assuming unitary states. In practice, though, states are multipartite, and if the war-choosing bit of the state can grab most of the benefits from a potential war, while dumping most of the potential costs on another bit of the state, that can enable war. I expect something analogous could produce war between superintelligences, as I don't see why superintelligences have to be unitary agents.

That's a good question and I'm not sure my thinking is right. Let's say two AIs want to go to war for whatever reason. Then they can agree to some other procedure that predicts the outcome of war (e.g. war in 1% of the universe, or simulated war) and precommit to accept it as binding. It seems like both would benefit from that.
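For concreteness, here's a toy sketch (my own, with made-up strengths and a made-up procedure) of the "war in 1% of the universe" idea: both sides stake 1% of the contested resources, fight only over that sample, and split the other 99% in proportion to the observed result.

```python
import random

def small_scale_war(strength_a: float, strength_b: float, battles: int = 100, seed: int = 0) -> float:
    # Fight many small battles over the staked 1%; the seed is something both
    # sides can verify so neither can rerun the sample until it likes the result.
    rng = random.Random(seed)
    p_a = strength_a / (strength_a + strength_b)
    wins_a = sum(rng.random() < p_a for _ in range(battles))
    return wins_a / battles

stake = 0.01
observed_a = small_scale_war(strength_a=3.0, strength_b=1.0)
remaining = 1.0 - stake
print("A's share:", observed_a * remaining, " B's share:", (1 - observed_a) * remaining)
```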

That said I agree that bargaining is very tricky. Coming up with an extensive form game might not help, because what if the AIs use a different extensive form game? There's been pretty much no progress on this for a decade, I don't see any viable attack.

Let's say two AIs want to go to war for whatever reason. Then they can agree to some other procedure that predicts the outcome of war (e.g. war in 1% of the universe, or simulated war) and precommit to accept the outcome as binding. It seems like both would benefit from that.

My (amateur!) hunch is that an information deficit bad enough to motivate agents to sometimes fight instead of bargain might be an information deficit bad enough to motivate agents to sometimes fight instead of precommitting to exchange info and then bargain.

Coming up with an extensive form game might not help, because what if the AIs use a different extensive form game?

Certainly, any formal model is going to be an oversimplification, but models can be useful checks on intuitive hunches like mine. If I spent a long time formalizing different toy games to try to represent the situation we're talking about, and I found that none of my games had (a positive probability of) war as an equilibrium strategy, I'd have good evidence that your view was more correct than mine.

There's been pretty much no progress on this in a decade, I don't see any viable attack.

There might be some analogous results in the post-Fearon, rational-choice political science literature, I don't know it well enough to say. And even if not, it might be possible to build a relevant game incrementally.

Start with a take-it-or-leave-it game. Nature samples a player's cost of war from some distribution and reveals it only to that player. (Or, alternatively, Nature randomly assigns a discrete, privately known type to a player, where the type reflects the player's cost of war.) That player then chooses between (1) initiating a bargaining sub-game and (2) issuing a demand to the other player, triggering war if the demand is rejected. This should be tractable, since standard, solvable models exist for two-player bargaining.
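For what it's worth, here is a minimal numerical sketch of a closely related screening variant, under toy assumptions I'm adding myself (a pie of size 1, A wins a war with probability p, and it's the responder B whose war cost is private and uniform on [0, c_max]). Even this baseline, with no precommitment moves, has a positive probability of war in equilibrium:

```python
import numpy as np

p, c_A, c_max = 0.5, 0.1, 0.4  # hypothetical: A's win probability, A's war cost, max of B's war cost

def expected_payoff_A(x):
    # B rejects (war) exactly when its war payoff (1 - p) - c_B beats its offered
    # share 1 - x, i.e. when c_B < x - p; c_B is uniform on [0, c_max].
    prob_war = np.clip((x - p) / c_max, 0.0, 1.0)
    return (1 - prob_war) * x + prob_war * (p - c_A)

demands = np.linspace(0.0, 1.0, 1001)
best = demands[np.argmax([expected_payoff_A(x) for x in demands])]
print("optimal demand:", best)
print("equilibrium probability of war:", float(np.clip((best - p) / c_max, 0.0, 1.0)))
```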

So far we have private information, but no precommitment. But we could bring precommitment in by adding extra moves to the game: before making the bargain-or-demand choice, players can mutually agree to some information-revealing procedure followed by bargaining with the newly revealed information in hand. Solving this expanded game could be informative.

Maybe you can't think of a way to set up such trade, because emails can be faked etc, but I believe that superintelligences will find a way to achieve their mutual interest.

They'll also find ways of faking whatever communication methods are being used.

To me, this sounds like saying that sufficiently rational agents will never defect in the prisoner's dilemma, provided they can communicate with each other.

I think you need verifiable pre-commitment, not just communication - in a free-market economy, enforced property rights basically function as such a pre-commitment mechanism. Where pre-commitment (including property-rights enforcement) is imperfect, only a constrained optimum can be reached, since any counterparty has to assume ex ante that the agent will exploit the lack of precommitment. Imperfect information disclosure has similar effects; in that case, however, one has to "assume the worst" about what information the agent has, the deal must be altered accordingly, and this generally comes at a cost in efficiency.

The whole point of the prisoner's dilemma is that the prisoners cannot communicate. If they can, it's not a prisoner's dilemma any more.

Yeah, I would agree with that. My bar for "sufficiently rational" is quite high though, closer to the mathematical ideal of rationality than to humans. (For example, sufficiently rational agents should be able to precommit.)

Sufficiently rational agents will never go to war, instead they'll agree about the likely outcome of war, and trade resources in that proportion.

Not if the "resource" is the head of one of the rational agents on a plate.

Aumann's agreement theorem requires common priors and common knowledge of the agents' posteriors (which in practice means sharing information).

I think sharing all information is doable. As for priors, there's a beautiful LW trick called "probability as caring" which can almost always make priors identical. For example, before flipping a coin I can say that all good things in life will be worth 9x more to me in case of heads than tails. That's purely a utility function transformation which doesn't touch the prior, but for all decision-making purposes it's equivalent to changing my prior about the coin to 90/10 and leaving the utility function intact. That handles all worlds except those that have zero probability according to one of the AIs. But in such worlds it's fine to just give the other AI all the utility.
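A quick numerical check of the trick (my own illustration, with made-up payoffs): scaling utility by 9x on heads under a fair coin ranks actions exactly the same way as believing heads has probability 0.9, because the two expected utilities differ only by a positive factor of 5.

```python
def eu_scaled_utility(payoff):      # prior 50/50, utility multiplied by 9 on heads
    u_heads, u_tails = payoff
    return 0.5 * (9 * u_heads) + 0.5 * u_tails

def eu_shifted_prior(payoff):       # prior 90/10, utility unchanged
    u_heads, u_tails = payoff
    return 0.9 * u_heads + 0.1 * u_tails

actions = {"A": (3.0, 10.0), "B": (6.0, 1.0)}   # hypothetical (heads, tails) payoffs
for name, payoff in actions.items():
    # The two formulations agree up to a factor of 5, so they rank actions identically.
    print(name, eu_scaled_utility(payoff), 5 * eu_shifted_prior(payoff))
```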

sharing all information is doable

In all cases? Information is power.

before flipping a coin I can say that all good things in life will be worth 9x more to me in case of heads than tails

There is an old question that goes back to Abraham Lincoln or something:

If you call a dog's tail a leg, how many legs does a dog have?

I think the idea is that if one AI says there is a 50% chance of heads, and the other AI says there is a 90% chance of heads, the first AI can describe the second AI as knowing that there is a 50% chance, but caring more about the heads outcome. Since it can redescribe the other's probabilities as matching its own, agreement on what should be done will be possible. None of this means that anyone actually decides that something will be worth more to them in the case of heads.

the first AI can describe the second AI as knowing that there is a 50% chance, but caring more about the heads outcome.

First of all, this makes sense only in the decision-making context (and not in the forecast-the-future context). So this is not about what will actually happen, but about comparing the utilities of two outcomes. You can, indeed, rescale the utility involved in a simple case, but I suspect that once you get to interdependencies and non-linear consequences, things will get more hairy, if the rescaling is possible at all.

Besides, this requires you to know the utility function in question.

While war is irrational, demonstrative behaviour like an arms race may be needed to discourage the other side from war.

Imagine that two benevolent superintelligences appear. However, SI A suspects that SI B is a paperclip maximizer. In that case, it is afraid that SI B may turn it off. Thus it demonstratively invests some resources in protecting its power source, so that it would be expensive for SI B to try to turn off SI A.

This starts the arms race, but the race is unstable and could result in war.

Even if A is FAI and B is a paperclipper, as long as both use correct decision theory, they will instantly merge into a new SI with a combined utility function. Avoiding arms races and any other kind of waste (including waste due to being separate SIs) is in their mutual interest. I don't expect rational agents to fail to achieve their mutual interest. If you expect that, your idea of rationality leads to predictably suboptimal utility, so it shouldn't be called "rationality". That's covered in the sequences.

But how could I be sure that the paperclip maximizer is a rational agent with a correct decision theory? I would not expect that from a paperclipper.

If an agent is irrational, it can cause all sorts of waste. I was talking about sufficiently rational agents.

If the problem is proving rationality to another agent, SI will find a way.

My point is exactly this. If an SI is able to prove its rationality (meaning that it always cooperates in PD etc.), it is also able to fake any such proof.

If you have two options: to turn off the paperclipper, or to cooperate with it by giving it half of the universe, what would you do?

I imagine merging like this:

1) Bargain about a design for a joint AI, using any means of communication

2) Build it in a location monitored by both parties

3) Gradually transfer all resources to the new AI

4) Both original AIs shut down, new AI fulfills their combined goals

No proof of rationality required. You can design the process so that any deviation will help the opposing side.
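Here's a toy sketch (my own illustration, not a protocol spelled out in the comment above) of step 3, the gradual transfer: each round both AIs hand a small slice of their remaining resources to the jointly monitored successor. A side that defects at any round keeps only what it still holds; everything already transferred stays with the successor, so there is no windfall from deviating partway through.

```python
def gradual_transfer(total_a=1.0, total_b=1.0, steps=100, defect_round=None):
    held_a, held_b, joint = total_a, total_b, 0.0
    for r in range(steps):
        if r == defect_round:
            return held_a, held_b, joint           # defector walks away with only its remainder
        slice_a = held_a / (steps - r)             # equal-sized slices each round
        slice_b = held_b / (steps - r)
        held_a, held_b = held_a - slice_a, held_b - slice_b
        joint += slice_a + slice_b
    return held_a, held_b, joint

print(gradual_transfer())                  # full merge: roughly (0.0, 0.0, 2.0)
print(gradual_transfer(defect_round=50))   # defecting halfway: roughly (0.5, 0.5, 1.0)
```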

I can imagine some failure modes, but surely not the best one. For example, the step where both original AIs shut down simultaneously is vulnerable to defection.

I also have some business experience, and I found that almost every deal includes some cheating, and every time the cheating is something new. So I always have to ask myself: where is the cheating from the other side? If I don't see it, that's bad, as it could be something really unexpected. Personally, I hate cheating.

An AI could devise a very secure merging process. We don't have to code it ourselves.

But should we merge with the paperclipper if we could turn it off?

It reminds me of Great Britain's policy towards Hitler before WW2, which was to give him what he wanted in order to prevent war. https://en.wikipedia.org/wiki/Appeasement

If we can turn off the paperclipper for free, sure. But if war would destroy X resources, it's better to merge and spend X/2 on paperclips.

So if the price of turning off the paperclipper is Y, and Y is higher than X/2, we should cooperate?

But if we agree on this, we create an incentive for the paperclipper to increase Y until it reaches X/2. To increase Y, the paperclipper has to invest in defense mechanisms or offensive weapons. That creates an arms race until negotiations become more profitable. However, an arms race is risky and could turn into war.


The paperclipper doesn't need to invest anything. The AIs will just merge without any arms race or war. The possibility of an arms race or war, and its full predicted cost to both sides, will be taken into account during bargaining instead. For example, if the paperclipper has a button that can nuke half of our utility, the merged AI will prioritize paperclips more.

So they meet before the possible start of the arms race and compare each other's relative advantages? I still think that they may try to demonstrate higher bargaining power than they actually have, and that it is almost impossible for us to predict how their game will play out, because of its complexity.

Thanks for participating in this interesting conversation.

Yeah, bargaining between AIs is a very hard problem and we know almost nothing about it. It will probably have all sorts of deception tactics. But in any case, using bargaining instead of war is still in both AI's common interest, and AIs should be able to achieve common interest.

For example, if A has hidden information that will give it an advantage in war, then B can precommit to giving A a larger share conditional on seeing it (e.g. by constructing a successor AI that visibly includes the precommitment under A's watch). Eventually the AIs should agree on all questions of fact and disagree only on values, at which point they agree on how the war would likely go, so they skip the war and share the bigger pie according to the war's predicted outcome.
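A toy sketch of that precommitment (my own illustration; the class and the numbers are made up): B constructs, under A's observation, a successor whose allocation rule is fixed before A reveals anything, so B can no longer renege after seeing the info.

```python
from dataclasses import dataclass

@dataclass(frozen=True)              # frozen: B can't quietly change the terms afterwards
class BSuccessor:
    baseline_share_for_A: float = 0.5
    bonus_share_for_A: float = 0.2

    def allocate(self, check_passed: bool) -> float:
        # A's share is fixed by this rule before A reveals its hidden advantage.
        return self.baseline_share_for_A + (self.bonus_share_for_A if check_passed else 0.0)

successor = BSuccessor()             # built and deployed while A watches
# Only now does A submit its hidden advantage to the agreed checking procedure.
print(successor.allocate(check_passed=True))    # larger share for A
print(successor.allocate(check_passed=False))   # baseline share
```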

BTW, the book "On Thermonuclear War" by Kahn is exactly an attempt to predict the course of war, negotiations and bargaining between two presumably rational agents (the superpowers). Even the idea of moving all resources to a new third agent is discussed, as I remember - that is, donating all nukes to the UN.

How could B see that A has hidden information?

Personally, I feel like you have a mathematically correct, but idealistic and unrealistic model of relations between two perfect agents.

Yeah, Schelling's "Strategy of Conflict" deals with many of the same topics.

A: "I would have an advantage in war so I demand a bigger share now" B: "Prove it" A: "Giving you the info would squander my advantage" B: "Let's agree on a procedure to check the info, and I precommit to giving you a bigger share if the check succeeds" A: "Cool"

If visible precommitment by B requires it to share the source code for its successor AI, then it would also be giving up any hidden information it has. Essentially both sides have to be willing to share all information with each other, creating some sort of neutral arbitration about which side would have won and at what cost to the other. That basically means creating a merged superintelligence is necessary just to start the bargaining process, since they each have to prove to the other that the neutral arbiter will control all relevant resources to prevent cheating.

Realistically, there will be many cases where one side thinks its hidden information is sufficient to make the cost of conflict smaller than the costs associated with bargaining, especially given the potential for cheating.

A: "I would have an advantage in war so I demand a bigger share now" B: "Prove it" A: "Giving you the info would squander my advantage" B: "Let's agree on a procedure to check the info, and I precommit to giving you a bigger share if the check succeeds" A: "Cool"

Simply by telling B about the existence of an advantage A is giving B info that could weaken it. Also, what if the advantage is a way to partially cheat in precommitments?

I think there are two other failure modes which need to be resolved:

A weaker side may drag out the negotiations if doing so helps it gain power

A weaker side could fake the size of its army (like North Korea did with its wooden missiles at its last military parade)

Even if A is FAI and B is a paperclipper, as long as both use correct decision theory, they will instantly merge into a new SI with a combined utility function.

What combined utility function? There is no way to combine utility functions.

Weighted sum, with weights determined by bargaining.
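Here's a minimal sketch of what that could look like (my own toy example; the utility functions, war payoffs, and the use of the Nash bargaining solution are assumptions I'm adding, not anything established in this thread):

```python
import numpy as np

splits = np.linspace(0, 1, 1001)          # fraction of resources spent on A's goals

def u_A(x): return np.sqrt(x)             # diminishing returns for A (e.g. the FAI)
def u_B(x): return 1 - x                  # linear returns for B (e.g. the paperclipper)

d_A, d_B = 0.3, 0.2                       # hypothetical expected payoffs if they fought a war instead

# Nash bargaining: maximize the product of gains over the disagreement (war) point.
gains = (u_A(splits) - d_A) * (u_B(splits) - d_B)
x_star = splits[np.argmax(gains)]

# Any Pareto-optimal bargain maximizes some weighted sum; recover the weight w that
# makes x_star optimal for w*u_A + (1-w)*u_B via the first-order condition:
# w * u_A'(x) + (1 - w) * u_B'(x) = 0  =>  w/(1-w) = -u_B'(x)/u_A'(x) = 2*sqrt(x)
ratio = 2 * np.sqrt(x_star)
w = ratio / (1 + ratio)
print("bargained split:", x_star, "merged-utility weight on A:", w)
```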

If agent 1 creates an agent 2, it will always know for sure its goal function.

Wait, we have only examples of the opposite. Every human who creates another human has at best a hazy understanding of that new human's goal function. As soon as agent 2 has any unobserved experiences or self-modification, it's a distinct, separate agent.

Any two sufficiently advanced agents will go to war with each other

True with a wide enough definition of "go to war". Instead say "compete for resources" and you're solid. Note that competition may include cooperation (against mutual "enemies" or against nature), trade, and even altruism or charity (especially where the altruistic agent perceives some similarity with the recipient, and it becomes similar to cooperation against nature).

By going to war I meant an attempt to turn off another agent.

I think that's a pretty binary (and useless) definition. Almost no wars have lasted until one of the participating groups was completely eliminated. There have been conflicts and competitions among groups that did have that effect, but we don't call them "war" in most cases.

Open, obvious, direct violent conflict is a risky way to attain most goals, even those that are in conflict with some other agent. Rational agents would generally prefer to kill them off by peaceful means.

There is a more sophisticated definition of war, coming from Clausewitz, which in contemporary language could be stated roughly as "war is changing the will of your opponent without negotiation". The enemy must unconditionally capitulate and give up its value system.

You could do it by threats, torture, rewriting of its goal system, or deleting the agent.

Does the agent care about changing the will of the "opponent", or just changing its behavior (in my view of intelligence, there's not much distinction, but that's not the common approach)? If you care mostly about future behavior rather than internal state, then the "without negotiation" element becomes meaningless and you're well on your way toward accepting that "competition" is a more accurate frame than "war".

If agent 1 creates an agent 2, it will always know for sure its goal function.

That is the point, though. By Loeb's theorem, the only agents that are knowable for sure are those with less power. So an agent might want to create a successor that isn't fully knowable in advance; or, on the other hand, if a perfectly knowable successor could be constructed, then you would have a finite method for ensuring the compatibility of two source codes (is this true? It seems plausible).

Thanks for the interesting post. I think that there are two types of self-modification. In the first, an agent works on lower-level parts of itself, for example by adding hardware or connecting modules. This produces evolutionary development with small returns and is relatively safe.

Another type is high-level self-modification, where a second agent is created, as you describe. Its performance should be mathematically proved (which is difficult) or tested in many simulated environments (which is also risky, as a superior agent will be able to break out of them). We could call this a revolutionary way of self-improvement. Such self-modification will provide higher returns if successful.

Knowing all this, most agents will prefer evolutionary development, that is, gaining the same power through lower-level changes. But risk-hungry agents will still prefer revolutionary methods, especially if they are time-constrained.

An early-stage AI will be time-constrained by the arms race with other (possible) AIs, so it will prefer risky, revolutionary ways of development, even if the probability of failure is very high.

(This was the TL;DR of my text "Levels of self-improvement".)

Thanks, that's an interesting perspective. I think even high-level self-modification can be relatively safe with sufficient asymmetry in resources--simulated environments give a large advantage to the original, especially if the successor can be started with no memories of anything outside the simulation. Only an extreme difference in intelligence between the two would overcome that.

Of course, the problem of transmitting values to a successor without giving it any information about the world is a tricky one, since most of the values we care about are linked to reality. But maybe some values are basic enough to be grounded purely in math that applies to any circumstances.

I also wrote a (draft) text, "Catching treacherous turn", where I attempted to create the best possible AI box and to see under which conditions it would fail.

Obviously, we can't box a superintelligence, but we could box an AI of around human level and prevent its self-improvement through many independent mechanisms. One of them is cleaning its memory before each new task.

In the first text I created a model of the self-improvement process, and in the second I explore how self-improvement could be prevented, based on this model.