
Self-modification as a game theory problem

11 points · Post author: cousin_it · 26 June 2017 08:47PM

In this post I'll try to show a surprising link between two research topics on LW: game-theoretic cooperation between AIs (quining, Loebian cooperation, modal combat, etc) and stable self-modification of AIs (tiling agents, Loebian obstacle, etc).

When you're trying to cooperate with another AI, you need to ensure that its action will fulfill your utility function. And when doing self-modification, you also need to ensure that the successor AI will fulfill your utility function. In both cases, naive utility maximization doesn't work, because you can't fully understand another agent that's as powerful and complex as you. That's a familiar difficulty in game theory, and in self-modification it's known as the Loebian obstacle (fully understandable successors become weaker and weaker).

In general, any AI will be faced with two kinds of situations. In "single player" situations, you're faced with a choice like eating chocolate or not, where you can figure out the outcome of each action. (Most situations covered by UDT are also "single player", involving identical copies of yourself.) Whereas in "multiplayer" situations your action gets combined with the actions of other agents to determine the outcome. Both cooperation and self-modification are "multiplayer" situations, and are hard for the same reason. When someone proposes a self-modification to you, you might as well evaluate it with the same code that you use for game theory contests.
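To make the "same code" point concrete, here is a toy sketch of quining cooperation in the program-equilibrium setting, where each player's program gets to read the other's source. The bot below and its string stand-in for "own source" are invented for illustration; actual LW proposals use proof search or modal logic rather than literal textual equality.

```python
# Toy "program equilibrium": each submitted program receives the
# opponent's source code as input. The classic quining strategy
# cooperates exactly when the opponent's source is identical to
# its own, so two copies recognize each other and cooperate.

def clique_bot(my_source: str, their_source: str) -> str:
    """Cooperate iff the opponent runs literally the same program."""
    return "C" if their_source == my_source else "D"

# Stand-in for the program's own source (a real quining bot would
# construct this string from itself).
SOURCE = "clique_bot-v1"

assert clique_bot(SOURCE, SOURCE) == "C"        # mutual recognition
assert clique_bot(SOURCE, "defect_bot") == "D"  # defects against others
```

The same acceptance test works unchanged if `their_source` is a proposed successor rather than an opponent, which is the link this post is pointing at.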

If I'm right, then any good theory for cooperation between AIs will also double as a theory of stable self-modification for a single AI. That means neither problem can be much easier than the other, and in particular self-modification won't be a special case of utility maximization, as some people seem to hope. But on the plus side, we need to solve one problem instead of two, so creating FAI becomes a little bit easier.

The idea came to me while working on this mathy post on IAFF, which translates some game theory ideas into the self-modification world. For example, Loebian cooperation (from the game theory world) might lead to a solution for the Loebian obstacle (from the self-modification world) - two LW ideas with the same name that people didn't think to combine before!

Comments (50)

Comment author: Dagon 27 June 2017 03:37:02PM 4 points

Note that there are two very distinct reasons for cooperation/negotiation:

1) It's the best way to get what I want. The better I model other agents, the better I can predict how to interact with them in a way that meets my desires. For this item, an external agent is no different from any other complex system.

2) I actually care about the other agent's well-being. There is a term in my utility function for their satisfaction.

Very weirdly, we tend to assume #2 about humans (when it's usually a mix of mostly 1 and a bit of 2). And we focus on #1 for AI, with no element of #2.

When you say "code for cooperation", I can't tell if you're just talking about #1, or some mix of the two, where caring about the other's satisfaction is a goal.

Comment author: cousin_it 27 June 2017 03:40:17PM *  1 point

Mostly #1. Is there a reason to build AIs that inherently care about the well-being of paperclippers etc?

Comment author: turchin 27 June 2017 03:44:44PM 1 point

But EA should be mostly #2?

Comment author: Dagon 30 June 2017 03:14:15PM 0 points

In humans, there's a lot of #2 behind our cooperative ability (even if the result looks a lot like #1). I don't know how universal that will be, but it seems likely to be computationally cheaper at some margin to encode #2 than to calculate and prove #1.

In my view, "code for cooperation" will very often have a base assumption that cooperation in satisfying others' goals is more effective (which feels like "more pleasant" or "more natural" from inside the algorithm) than contractual resource exchanges.

Comment author: turchin 26 June 2017 10:05:55PM 1 point

I think there is a difference between creating an agent and negotiating with another agent. If agent 1 creates an agent 2, it will always know for sure its goal function.

However, if two agents meet and agent A tells agent B that it has utility function U, then even if A sends its source code as proof, agent B has no reason to believe it: any source code could be faked. The more advanced both agents are, the more difficult it is for them to prove their values to each other. So each will always suspect the other side of cheating.

As a result, as I once said (perhaps too strongly): any two sufficiently advanced agents will go to war with each other. The one exception is if they are two instances of the same source code, but even in this case cheating is possible.

To prevent cheating, it is better (unfortunately) to destroy the second agent. What are the solutions to this problem in LW research?

Comment author: cousin_it 27 June 2017 08:24:46AM *  2 points

I don't believe it. War wastes resources. The only reason war happens is that two agents have different beliefs about the likely outcome of war, which means at least one of them has wrong and self-harming beliefs. Sufficiently rational agents will never go to war; instead, they'll agree about the likely outcome of war and trade resources in that proportion. Maybe you can't think of a way to set up such trade, because emails can be faked etc, but I believe that superintelligences will find a way to achieve their mutual interest. That's one reason why I'm interested in AI cooperation and bargaining.

Comment author: satt 28 June 2017 08:06:18PM 4 points

I'm flashing back to reading Jim Fearon!

Fearon's paper concludes that pretty much only two mechanisms can explain "why rationally led states" would go to war instead of striking a peaceful bargain: private information, and commitment problems.

Your comment brushes off commitment problems in the case of superintelligences, which might turn out to be right. (It's not clear to me that superintelligence entails commitment ability, but nor is it clear that it doesn't entail commitment ability.) I'm less comfortable with setting aside the issue of private information, though.

Assuming rational choice, competing agents are only going to truthfully share information if they have incentives to do so, or at least no incentive not to do so, but in cases where war is a real possibility, I'd expect the incentives to actively encourage secrecy: exaggerating war-making power and/or resolve could allow an agent to drive a harder potential bargain.

You suggest that the ability to precommit could guarantee information sharing, but I feel unease about assuming that without a systematic argument or model. Did Schelling or anybody else formally analyze how that would work? My gut has the sinking feeling that drawing up the implied extensive-form game and solving for equilibrium would produce a non-zero probability of non-commitment, imperfect information exchange, and conflict.

Finally I'll bring in a new point: Fearon's analysis explicitly relies on assuming unitary states. In practice, though, states are multipartite, and if the war-choosing bit of the state can grab most of the benefits from a potential war, while dumping most of the potential costs on another bit of the state, that can enable war. I expect something analogous could produce war between superintelligences, as I don't see why superintelligences have to be unitary agents.

Comment author: cousin_it 28 June 2017 09:02:16PM *  0 points

That's a good question and I'm not sure my thinking is right. Let's say two AIs want to go to war for whatever reason. Then they can agree to some other procedure that predicts the outcome of war (e.g. war in 1% of the universe, or simulated war) and precommit to accept it as binding. It seems like both would benefit from that.

That said I agree that bargaining is very tricky. Coming up with an extensive form game might not help, because what if the AIs use a different extensive form game? There's been pretty much no progress on this for a decade, I don't see any viable attack.

Comment author: satt 28 June 2017 11:11:11PM 1 point

Let's say two AIs want to go to war for whatever reason. Then they can agree to some other procedure that predicts the outcome of war (e.g. war in 1% of the universe, or simulated war) and precommit to accept the outcome as binding. It seems like both would benefit from that.

My (amateur!) hunch is that an information deficit bad enough to motivate agents to sometimes fight instead of bargain might be an information deficit bad enough to motivate agents to sometimes fight instead of precommitting to exchange info and then bargain.

Coming up with an extensive form game might not help, because what if the AIs use a different extensive form game?

Certainly, any formal model is going to be an oversimplification, but models can be useful checks on intuitive hunches like mine. If I spent a long time formalizing different toy games to try to represent the situation we're talking about, and I found that none of my games had (a positive probability of) war as an equilibrium strategy, I'd have good evidence that your view was more correct than mine.

There's been pretty much no progress on this in a decade, I don't see any viable attack.

There might be some analogous results in the post-Fearon, rational-choice political science literature, I don't know it well enough to say. And even if not, it might be possible to build a relevant game incrementally.

Start with a take-it-or-leave-it game. Nature samples a player's cost of war from some distribution and reveals it only to that player. (Or, alternatively, Nature randomly assigns a discrete, privately known type to a player, where the type reflects the player's cost of war.) That player then chooses between (1) initiating a bargaining sub-game and (2) issuing a demand to the other player, triggering war if the demand is rejected. This should be tractable, since standard, solvable models exist for two-player bargaining.

So far we have private information, but no precommitment. But we could bring precommitment in by adding extra moves to the game: before making the bargain-or-demand choice, players can mutually agree to some information-revealing procedure followed by bargaining with the newly revealed information in hand. Solving this expanded game could be informative.
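A numerical sketch of the take-it-or-leave-it stage, without the precommitment moves, already shows Fearon's risk-return tradeoff. All numbers below (the cost distribution, player 1's own war cost, the 50/50 war split) are invented for illustration, not taken from Fearon's paper.

```python
# Toy Fearon-style take-it-or-leave-it game. Player 1 demands a share
# d of a pie of size 1. Player 2's cost of war c is private, uniform
# on [0, C_MAX]. War splits the pie 50/50 in expectation, minus each
# side's own cost, so player 2 rejects exactly when c < d - 0.5.

C_MAX = 0.4   # upper bound on player 2's private war cost (assumed)
C_1 = 0.1     # player 1's own cost of fighting (assumed)

def p_war(d):
    """Probability that demand d is rejected, triggering war."""
    return min(max((d - 0.5) / C_MAX, 0.0), 1.0)

def eu_player1(d):
    q = p_war(d)
    return d * (1 - q) + (0.5 - C_1) * q

# Grid search over demands from 0.5 to 1.0.
best_d = max((0.5 + i / 1000 for i in range(501)), key=eu_player1)
print(best_d, p_war(best_d))  # the optimal demand risks war with positive probability
```

Even though war destroys value for both sides, player 1's optimal demand here exceeds the safe demand of 0.5, accepting a positive probability of war in exchange for a bigger share when the demand is accepted.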

Comment author: Lumifer 27 June 2017 03:15:33PM 2 points

Sufficiently rational agents will never go to war, instead they'll agree about the likely outcome of war, and trade resources in that proportion.

Not if the "resource" is the head of one of the rational agents on a plate.

The Aumann theorem requires identical priors and identical sets of available information.

Comment author: cousin_it 27 June 2017 03:30:50PM *  0 points

I think sharing all information is doable. As for priors, there's a beautiful LW trick called "probability as caring" which can almost always make priors identical. For example, before flipping a coin I can say that all good things in life will be worth 9x more to me in case of heads than tails. That's purely a utility function transformation which doesn't touch the prior, but for all decision-making purposes it's equivalent to changing my prior about the coin to 90/10 and leaving the utility function intact. That handles all worlds except those that have zero probability according to one of the AIs. But in such worlds it's fine to just give the other AI all the utility.
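The coin example can be checked directly: scaling utilities by 9x on heads, under a 50/50 prior, induces exactly the same preferences over actions as keeping the utilities and moving to a 90/10 prior. A minimal sketch (payoff numbers invented):

```python
# "Probability as caring": two descriptions of the same agent.
# Agent 1: prior 50/50 over heads/tails, utilities scaled 9x on heads.
# Agent 2: prior 90/10, original utilities.

def eu_scaled_utility(u_heads, u_tails):
    return 0.5 * (9 * u_heads) + 0.5 * u_tails

def eu_scaled_prior(u_heads, u_tails):
    return 0.9 * u_heads + 0.1 * u_tails

# The two expected utilities differ by a constant positive factor (5x),
# so every pairwise comparison between actions comes out identically.
for u_h, u_t in [(1.0, 0.0), (0.2, 0.9), (0.5, 0.5)]:
    assert abs(eu_scaled_utility(u_h, u_t) - 5 * eu_scaled_prior(u_h, u_t)) < 1e-9
```

Since expected utility is only meaningful up to a positive affine transformation, the constant factor of 5 changes nothing about which action gets chosen.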

Comment author: Lumifer 27 June 2017 03:47:27PM 1 point

sharing all information is doable

In all cases? Information is power.

before flipping a coin I can say that all good things in life will be worth 9x more to me in case of heads than tails

There is an old question that goes back to Abraham Lincoln or something:

If you call a dog's tail a leg, how many legs does a dog have?

Comment author: entirelyuseless 27 June 2017 05:37:52PM 1 point

I think the idea is that if one AI says there is a 50% chance of heads, and the other AI says there is a 90% chance of heads, the first AI can describe the second AI as knowing that there is a 50% chance, but caring more about the heads outcome. Since it can redescribe the other's probabilities as matching its own, agreement on what should be done will be possible. None of this means that anyone actually decides that something will be worth more to them in the case of heads.

Comment author: Lumifer 27 June 2017 05:57:40PM 1 point

the first AI can describe the second AI as knowing that there is a 50% chance, but caring more about the heads outcome.

First of all, this makes sense solely in the decision-taking context (and not in the forecast-the-future context). So this is not about what will actually happen but about comparing the utilities of two outcomes. You can, indeed, rescale the utility involved in a simple case, but I suspect that once you get to interdependencies and non-linear consequences things will get more hairy, if possible at all.

Besides, this requires you to know the utility function in question.

Comment author: turchin 27 June 2017 10:00:20AM *  2 points

While war is irrational, demonstrative behaviour like an arms race may be needed to discourage the other side from war.

Imagine that two benevolent superintelligences appear. However, SI A suspects that SI B is a paperclip maximizer. In that case, it is afraid that SI B may turn off SI A. Thus it demonstratively invests some resources in protecting its power source, so that it would be expensive for SI B to try to turn off SI A.

This starts the arms race, but the race is unstable and could result in war.

Comment author: cousin_it 27 June 2017 12:31:27PM *  1 point

Even if A is FAI and B is a paperclipper, as long as both use correct decision theory, they will instantly merge into a new SI with a combined utility function. Avoiding arms races and any other kind of waste (including waste due to being separate SIs) is in their mutual interest. I don't expect rational agents to fail achieving mutual interest. If you expect that, your idea of rationality leads to predictably suboptimal utility, so it shouldn't be called "rationality". That's covered in the sequences.

Comment author: turchin 27 June 2017 12:42:02PM 1 point

But how could I be sure that the paperclip maximiser is a rational agent with a correct decision theory? I would not expect that from the paperclipper.

Comment author: cousin_it 27 June 2017 12:54:11PM *  0 points

If an agent is irrational, it can cause all sorts of waste. I was talking about sufficiently rational agents.

If the problem is proving rationality to another agent, SI will find a way.

Comment author: turchin 27 June 2017 01:01:17PM 1 point

My point is exactly this. If an SI is able to prove its rationality (meaning that it always cooperates in PD etc.), it is also able to fake any such proof.

If you have two options, to turn off the paperclipper or to cooperate with it by giving it half of the universe, what would you do?

Comment author: cousin_it 27 June 2017 01:07:24PM *  1 point

I imagine merging like this:

1) Bargain about a design for a joint AI, using any means of communication

2) Build it in a location monitored by both parties

3) Gradually transfer all resources to the new AI

4) Both original AIs shut down, new AI fulfills their combined goals

No proof of rationality required. You can design the process so that any deviation will help the opposing side.

Comment author: turchin 27 June 2017 01:29:18PM 1 point

I could imagine some failure modes, but surely I can't imagine the best one. For example, "both original AIs shut down" simultaneously is vulnerable to defection.

I also have some business experience, and I found that almost every deal includes some cheating, and every time the cheating is something new. So I always have to ask myself: where is the cheating from the other side? If I don't see it, that's bad, as it could be something really unexpected. Personally, I hate cheating.

Comment author: cousin_it 27 June 2017 01:35:32PM *  0 points

An AI could devise a very secure merging process. We don't have to code it ourselves.

Comment author: lmn 28 June 2017 03:50:08AM 0 points

Even if A is FAI and B is a paperclipper, as long as both use correct decision theory, they will instantly merge into a new SI with a combined utility function.

What combined utility function? There is no way to combine utility functions.

Comment author: cousin_it 28 June 2017 06:52:36AM *  1 point

Weighted sum, with weights determined by bargaining.
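A minimal sketch of what that could look like. The outcomes, utility numbers, and disagreement point below are all made up for illustration; the only real content is the structure (maximize the product of gains over the disagreement point, then implement the result as a fixed weighted sum).

```python
# Sketch: merging two utility functions via bargaining.
# Joint outcomes map to a pair (U_A, U_B) of the agents' utilities.
outcomes = {
    "all_staples":    (0.0, 1.0),
    "split_universe": (0.6, 0.6),
    "all_flourish":   (1.0, 0.0),
}
disagreement = (0.2, 0.2)  # payoffs if merging fails (e.g. costly conflict)

def nash_product(payoffs):
    """Product of each agent's gain over its disagreement payoff."""
    u_a, u_b = payoffs
    return max(u_a - disagreement[0], 0) * max(u_b - disagreement[1], 0)

# The (symmetric) Nash bargaining solution picks the outcome that
# maximizes this product; the successor AI can then be built to
# maximize a fixed weighted sum w*U_A + (1-w)*U_B that selects it.
best = max(outcomes, key=lambda o: nash_product(outcomes[o]))
assert best == "split_universe"
```

Note that the weights are doing all the work: different bargaining solutions (Nash, Kalai-Smorodinsky, threat-point shifts) pick different weights, which is where the "determined by bargaining" part gets hard.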

Comment author: lmn 28 June 2017 03:47:58AM 1 point

Maybe you can't think of a way to set up such trade, because emails can be faked etc, but I believe that superintelligences will find a way to achieve their mutual interest.

They'll also find ways of faking whatever communication methods are being used.

Comment author: ChristianKl 27 June 2017 09:08:47AM 1 point

To me, this sounds like saying that sufficiently rational agents will never defect in the prisoner's dilemma, provided they can communicate with each other.

Comment author: bogus 27 June 2017 09:33:45PM 0 points

I think you need verifiable pre-commitment, not just communication - in a free-market economy, enforced property rights basically function as such a pre-commitment mechanism. Where pre-commitment (including property right enforcement) is imperfect, only a constrained optimum can be reached, since any counterparty has to assume ex-ante that the agent will exploit the lack of precommitment. Imperfect information disclosure has similar effects, however in that case one has to "assume the worst" about what information the agent has; the deal must be altered accordingly, and this generally comes at a cost in efficiency.

Comment author: Lumifer 27 June 2017 03:19:26PM *  0 points

The whole point of the prisoner's dilemma is that the prisoners cannot communicate. If they can, it's not a prisoner's dilemma any more.

Comment author: cousin_it 27 June 2017 09:22:49AM *  0 points

Yeah, I would agree with that. My bar for "sufficiently rational" is quite high though, closer to the mathematical ideal of rationality than to humans. (For example, sufficiently rational agents should be able to precommit.)

Comment author: scarcegreengrass 28 June 2017 05:50:00PM 1 point

Note that source code can't be faked in the self modification case. Software agent A can set up a test environment (a virtual machine or simulated universe), create new agent B inside that, and then A has a very detailed and accurate view of B's innards.

However, logical uncertainty is still an obstacle, especially with agents not verified by theorem-proving.

Comment author: Dagon 28 June 2017 04:20:41PM 1 point

If agent 1 creates an agent 2, it will always know for sure its goal function.

Wait, we have only examples of the opposite. Every human who creates another human has at best a hazy understanding of that new human's goal function. As soon as agent 2 has any unobserved experiences or self-modification, it's a distinct separate agent.

Any two sufficiently advanced agents will go to war with each other

True with a wide enough definition of "go to war". Instead say "compete for resources" and you're solid. Note that competition may include cooperation (against mutual "enemies" or against nature), trade, and even altruism or charity (especially where the altruistic agent perceives some similarity with the recipient, and it becomes similar to cooperation against nature).

Comment author: turchin 28 June 2017 05:10:34PM 0 points

By going to war I meant an attempt to turn off another agent.

Comment author: Dagon 28 June 2017 08:56:11PM 0 points

I think that's a pretty binary (and useless) definition. Almost no wars have ended with one of the participating groups being completely eliminated. There have been conflicts and competition among groups that did have that effect, but we don't call them "war" in most cases.

Open, obvious, direct violent conflict is a risky way to attain most goals, even those that are in conflict with some other agent. Rational agents would generally prefer to kill them off by peaceful means.

Comment author: turchin 28 June 2017 09:28:45PM 1 point

There is a more sophisticated definition of war, coming from Clausewitz, which in contemporary language could be put something like this: "war is changing the will of your opponent without negotiation". The enemy must unconditionally capitulate and give up its value system.

You could do it by threat, torture, rewriting of the goal system or deleting the agent.

Comment author: Dagon 29 June 2017 04:51:16PM 0 points

Does the agent care about changing the will of the "opponent", or just changing behavior (in my view of intelligence, there's not much distinction, but that's not the common approach)? If you care mostly about future behavior rather than internal state, then the "without negotiation" element becomes meaningless and you're well on your way toward accepting that "competition" is a more accurate frame than "war".

Comment author: MrMind 27 June 2017 02:34:34PM 1 point

If agent 1 creates an agent 2, it will always know for sure its goal function.

That is the point, though. By Loeb's theorem, the only agents that are knowable for sure are those with less power. So an agent might want to create a successor that isn't fully knowable in advance, or, on the other hand, if a perfectly knowable successor could be constructed, then you would have a finite method to ensure the compatibility of two source codes (is this true? It seems plausible).

Comment author: turchin 26 June 2017 09:50:47PM *  1 point

Thanks for the interesting post. I think that there are two types of self-modification. In the first, an agent works on lower-level parts of itself, for example by adding hardware or connecting modules. It produces evolutionary development with small returns and is relatively safe.

Another type is high-level self-modification, where a second agent is created, as you describe. Its performance should be mathematically proved (which is difficult) or tested in many simulated environments (which is also risky, as a superior agent will be able to break through them). We could call it a revolutionary way of self-improvement. Such self-modification will provide higher returns if successful.

Knowing all this, most agents will prefer evolutionary development, that is, gaining the same power by lower-level changes. But risk-hungry agents will still prefer revolutionary methods, especially if they are time-constrained.

An early-stage AI will be time-constrained by the arms race with other (possible) AIs, so it will prefer risky revolutionary ways of development, even if the probability of failure is very high.

(It was TL;DR of my text "Levels of self-improvement".)

Comment author: dogiv 28 June 2017 08:22:55PM 1 point

Thanks, that's an interesting perspective. I think even high-level self-modification can be relatively safe with sufficient asymmetry in resources--simulated environments give a large advantage to the original, especially if the successor can be started with no memories of anything outside the simulation. Only an extreme difference in intelligence between the two would overcome that.

Of course, the problem of transmitting values to a successor without giving it any information about the world is a tricky one, since most of the values we care about are linked to reality. But maybe some values are basic enough to be grounded purely in math that applies to any circumstances.

Comment author: turchin 28 June 2017 09:21:39PM 0 points

I also wrote a (draft) text, "Catching treacherous turn", where I attempted to create the best possible AI box and see the conditions under which it will fail.

Obviously, we can't box a superintelligence, but we could box an AI of around human level and prevent it from self-improving by many independent mechanisms. One of them is cleaning its memory before each of its new tasks.

In the first text I created a model of the self-improvement process, and in the second I explore, based on this model, how self-improvement could be prevented.