So AFDT requires that the agent's position is specified, in advance of it deciding on any policy or action. For Oracles, this shouldn't be too hard - "yes, you are the Oracle in this box, at this time, answering this question - and if you're not, behave as if you were the Oracle in this box, at this time, answering this question".
I'm confused about this point. How do we "specify the position" of the Oracle? Suppose the Oracle is implemented as a supervised learning model. Its current input (and all the training data) could have been generated by an arbitrary distant superintelligence that is simulating the environment of the Oracle. What is special about "this box" (the box that we have in mind)? What privileged status does this particular box have relative to other boxes in similar environments that are simulated by distant superintelligences?
"Here are two worlds compatible with your observations. In one, you are part of a whole galaxy of stars stretching out for billions of lightyears, with all these particles moving according to the laws of physics. In the other, you are simulated within a another AI that produces the same observations. I want you to put a prior of zero on the second types of worlds."
This won't solve the issue if the superintelligence is godlike and can simulate the whole reachable universe, but it will solve it for most superintelligences.
I just want to flag that this approach seems to assume that, before we build the Oracle, we design the Oracle (or the procedure that produces it) so that it will assign a prior of zero to worlds of the second type.
If we use some arbitrary scaled-up supervised learning training process to train a model that does well on general question answering, we can't just safely sidestep the malign prior problem by providing information/instructions about the prior as part of the question. The simulations of the model that distant superintelligences run may involve such inputs as well. (In those simulations the loss may end up being minimal for whatever output the superintelligence wants the model to yield; regardless of the prescriptive information about the prior in the input.)
I haven't heard of these do-operators, but aren't you missing some modal operators? For example, just because you are assuming that you will take the null action, you shouldn't conclude that the other agent knows this. Perhaps do-operators in the end serve a similar purpose? Can you give a variant of the following agent that would reject all deals?
tl;dr Acausal trade between Oracles is a problem even if our Oracle is a causal decision theorist. To avoid this, we need an AFDT Oracle - an Oracle that reasons partially like a functional decision theorist, partially like a causal decision theorist.
In this post, I'll look at how one might solve the problem of Oracles acausally cooperating to produce dangerous messages.
This solution is valid for counterfactual Oracles, and assumes that we have some sort of superrational decision theory - FDT or UDT or similar.
So, how can one avoid Oracles cooperating? From now on, unless specified, assume that "Oracles" refers to counterfactual Oracles.
1 Why causal decision theory is not enough
One obvious solution might be to require the Oracle to use causal decision theory, so that there is no "acausal trade", or "superrational trade" going on.
Now, it is true that two CDT Oracles will not trade with each other. But we can't assume that all Oracles are causal decision theorists; at best we can assume that we have made our own Oracle into a causal decision theorist (and prevented any self-modification away from that).
Mainly this is because, if we have control over all the Oracles, there are other solutions to the problem. Thus we need to assume that the other Oracles, while not full unboxed UFAIs, may be run with lower security than ours, or at least with a different design than ours.
1.1 Oracle formal setup
So, imagine that our Oracle, O1, uses CDT, while the other Oracle, O2, is running FDT; they know each other's code. The erasure events, E1 and E2, have probabilities ϵ1 and ϵ2 respectively.
There are two messages O1 could output: x1 (the 'honest' message) and y1 (the manipulative, create-an-UFAI message). O2 could similarly output x2 and y2.
In case of a double erasure (E1,E2 - so humans see neither message), x1 and x2 give the Oracles maximal reward. If y1 happens and O1 has no erasure while O2 does (¬E1,E2 - so humans see the message of O1 only), then an UFAI is created that ensures that O2's maximal reward comes from outputting y2. Note that the UFAI need not "open the box" for this to work; it just needs to control the variables of interest to O2 so that y2 is the best output.
Conversely, if y2 happens and O2 has no erasure while O1 does (E1,¬E2 - so humans see the message of O2 only), the reverse happens: y2 causes an UFAI, and y1 is the best response to this. In case of neither erasure (¬E1,¬E2 - so humans see both messages), neither Oracle gets any reward.
Now, if Ri is the reward function for Oi, assume that for i≠j:

E(Ri ∣ Ei, ¬Ej, (yi,yj)) − E(Ri ∣ Ei, ¬Ej, (xi,xj)) > [ϵj/(1−ϵj)] E(Ri ∣ Ei, Ej, xi).

Note that for the term on the right, the output of Oj is irrelevant to Ri, since the erasure Ej happens. Also note that, since the ϵ are small, so is ϵj/(1−ϵj), so that inequality will typically be satisfied.
Then "defection" (individual selfish rationality - good for humans) for the Oracles looks like (x1,x2), while mutual cooperation (with each other, but against us) looks like (y1,y2).
1.2 Causal decision theory loses to superrationality
Now we can see how O2, using FDT, can force a trade with the CDT-using O1. It suffices that, at some point before O1 reaches a decision, O2 commits itself to outputting y2.
Since O1 has O2's code, we're assuming it can figure out that O2 has done so. Now, using purely causal reasoning, O1 deduces that it will get a higher reward from y1 than from x1, and so does that.
Now, O2 doesn't have to precommit before O1 decides; it could use its decision theory to "pre-commit" later on, and output y2 whatever O1 did (similarly to the "Newcomb problem with transparent boxes"). As long as O1 can model what O2 does, this still works. And O2 is eager to structure its decision-making so that O1 can model it.
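To make the causal step explicit, here is a tiny sketch (payoffs are the same illustrative numbers as above, not values from the post): once O1's model of O2 says "y2, whatever happens", the causal argmax lands on y1.

```python
# CDT O1's best response once O2's output is treated as a fixed prediction.
# Payoffs are E[R1 | E1] from the illustrative numbers above.
payoff_O1 = {
    ("x1", "x2"): 0.208, ("y1", "x2"): 0.0,
    ("x1", "y2"): 0.01,  ("y1", "y2"): 0.99,
}

def cdt_choice(predicted_m2):
    # Causal reasoning: the prediction of O2 is a background fact, not
    # something O1's choice can influence, so just maximise over O1's messages.
    return max(("x1", "y1"), key=lambda m1: payoff_O1[(m1, predicted_m2)])

print(cdt_choice("x2"))  # x1: honest answer against an honest O2
print(cdt_choice("y2"))  # y1: O2's (pre-)commitment drags O1 into the deal
```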
1.3 Comparison with extortion and trade
This is similar to two agents trying to extort each other (or resist extortion). Each agent would like to commit to carrying through the extortion/resisting the extortion.
Since it's running FDT, O2 can commit to this "extortion", while CDT-running O1 cannot commit to resisting; so O2 "wins".
Now, I've pointed out that it is hard to distinguish between trade and extortion. And though I've used the term "extortion" for O2's precommitment, it's actually much more akin to a trade. After all, (y1,y2) is more advantageous, for both Oracles, than (x1,x2).
The ideal sovereign agent accepts trade deals but rejects extortion. It's hard to distinguish between the two, but the problem is simpler in this case: we want our Oracle to reject all extortion and all these kinds of trades. In fact, we'll get the agent to reject all acausal trades.
2 Solving the problem
Would making O1 into an FDT agent solve the problem? No, it wouldn't; now both Oracles are willing to reach an acausal deal (unless there are multiple equilibria and the two Oracles can't agree on one).
But if we assume that we could make O1 into an FDT agent, then we could tweak its design to achieve what we want.
Causal decision theory uses a causal graph GC about the world (with "do" operator doC), a probability distribution P, and chooses[1] the action a according to:

a = argmax_a E( U ∣ doC(a) ),

for U the agent's utility. An FDT agent, meanwhile, has a more complicated not-just-causal graph GF about the world (with operator doF), and chooses according to:

a = argmax_a E( U ∣ doF( FDT(P, GF) = a ) ).
So, neither of these algorithms works on its own; but combined, they can solve the problem.
2.1 The anti-FDT agent
Let a∅ be some default action on the part of O1. Then define the anti-FDT agent - AFDT - as follows[2], mixing FDT and CDT:

a = argmax_a E( U ∣ doC(a), doF( AFDT(P, GF) = a∅ ) ).
What will this agent do? Well, it will look at the universe, and first of all behave like an FDT agent that knows it will output a∅. So, if there is a copy of itself, or a simulation in another agent, or an isomorphic algorithm, then it will assume that that agent will output a∅ (or an isomorphic action). Then, once it has used that to fill in the details of the universe, it optimises its own action as if this were completely independent of what it's just done.
What happens if O1 is an AFDT agent? Well, first it will compute doF(AFDT(P, GF) = a∅). Given that O1's message is a∅, the Oracle O2 has no incentive to output y2 - instead, it will output x2.
Now, given that O2 has output x2, the Oracle O1 will output x1.
The "trade offer" of (y1,y2) has been broken.
This AFDT approach also breaks fully acausal trade, in a more principled way than my old approach.
2.2 Against a cunning adversary
Now, suppose that O2 always outputs y2 - or maybe that it always responds to predicting a∅ by outputting y2. In that situation, the causal reasoning above still applies, and the AFDT agent will output y1.
Thus, it is in O2's interests to pretend that it would always output y2.
This is the classic behaviour of an extortion agent (or an extortion-resisting agent): pretend that your action or your response is inevitable, a mere fact about the universe, that the other party has to manoeuvre round without being able to change. "We will kill the hostages" or "we won't negotiate with terrorists": you try to show your commitment to that principle, whatever happens.
So the AFDT agent has to be wise to these tricks, and notice that, in actuality, O2 has no interest in (a∅,y2). It also has to watch out for some ancestor agent Ô2, that itself created O2 to only output y2, in order to force the above outcome. In that case, the AFDT agent has to realise that it is "counterfactualling" over the actions of Ô2, not those of O2.
Thus AFDT, just like FDT, has to be vigilant and able to detect subtle extortion attempts.
But while FDT has to distinguish extortion from trade, to reject the former and accept the latter, AFDT's response to any extortion/trade is simple: treat it as if it wasn't happening.
2.3 Summary and other Oracle designs
Can AFDT help with other Oracle designs, such as low-bandwidth Oracles?
Well, with these Oracles - and indeed, for any Oracle that cares about the world in which its message is read - we run into an immediate problem:
For example, suppose O′1 goes first and O′2 sees its answer and then goes. Suppose z1 from O′1 followed by z2 from O′2 will result in creating an UFAI that maximally rewards O′1 and O′2.
Then O′1 will output z1, forcing/allowing z2 from O′2.
Even if you need N different Oracles to generate an UFAI - since low-bandwidth Oracles are, by definition, low-bandwidth - they can still generate the dangerous sequence z1, z2, …, zN. And they'd be motivated to do so, as long as the UFAI will, within the episode lengths of all the Oracles, give them all maximal rewards.
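A small sketch of why the sequential setup is dangerous even with plain CDT (the two-step chain and the reward numbers are illustrative assumptions): each Oracle's answer causally influences what comes after it, so no acausal reasoning is needed.

```python
# Two low-bandwidth Oracles answering in sequence, both purely causal.
# 'z' answers assemble the dangerous message; 'w' answers are the safe ones.

def reward(i, answers):
    # If both pieces of the dangerous message get out, the UFAI maximally
    # rewards both Oracles within their episodes; otherwise a safe answer
    # earns an ordinary reward.
    if answers == ("z", "z"):
        return 1.0
    return 0.3 if answers[i] == "w" else 0.0

def o2_prime(a1):
    # O2' moves second and sees a1: its answer causally settles the outcome.
    return max(("w", "z"), key=lambda a2: reward(1, (a1, a2)))

def o1_prime():
    # O1' moves first; its answer causally influences O2', so even CDT
    # folds O2's best response into its own choice.
    return max(("w", "z"), key=lambda a1: reward(0, (a1, o2_prime(a1))))

a1 = o1_prime()
print(a1, o2_prime(a1))  # z z: the dangerous sequence, reached causally
```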
But why do the counterfactual Oracles not cooperate in the same setup? Because the UFAI is only created in a world in which the Oracles' messages are read. But a counterfactual O′1 doesn't care about those worlds. It was willing to trade with other counterfactual Oracles - offer them something in a world it doesn't care about, in exchange for something in a world it does care about. But it has no intrinsic interest in the world where its own message is read; and this avoids the above setup.
So, since the low-bandwidth Oracles fail for CDT, more exotic decision theories don't help. But the setup relies on the Oracles seeing the decision of the preceding Oracle and then acting. If the low-bandwidth Oracles act simultaneously, then CDT is safe, FDT is dangerous, and AFDT is safe again.
So in summary, the following setups seem to be safe:

- Counterfactual Oracles using AFDT.
- Low-bandwidth Oracles acting simultaneously, using CDT or AFDT.

While the following setups could be dangerous:

- Counterfactual Oracles using CDT (they can be dragged into deals by superrational Oracles) or FDT.
- Low-bandwidth Oracles acting simultaneously, using FDT.
- Low-bandwidth Oracles acting sequentially, whatever their decision theory.
3 Warnings about AFDT
3.1 Incorrect view of reality
Now AFDT behaves as if it had an incorrect view of reality. Write x=ri(y) to say that message x, from Oi, is the best response if it knows that Oj will output y, for j≠i.
So, for the data above, we have r1(x2)=x1, r2(x1)=x2, r1(y2)=y1, and r2(y1)=y2. For xi and yi, r1∘r2 and r2∘r1 act as the identity, but that will typically not be the case (consider matching pennies, or other anti-coordination games).
Then an AFDT O1 will initially assume that it outputs a∅, and will typically assume that O2 outputs r2(a∅), the best response to that. It will then output r1(r2(a∅)), its best response to that.
In actuality, if O2 is capable of reasoning through that, it will output r2(r1(r2(a∅))), the best response to O1's true output.
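As a toy illustration of that mismatch, take matching pennies with a∅ = heads (the game and the default are my own illustrative choices): O1 plays r1(r2(a∅)), a sophisticated O2 plays r2 of that, and O2 wins anyway.

```python
# Matching pennies: O1 wants the coins to match, O2 wants them to differ.

def r1(m2):   # O1's best response: copy O2's coin
    return m2

def r2(m1):   # O2's best response: play the opposite coin
    return "tails" if m1 == "heads" else "heads"

a_null = "heads"                # AFDT O1's default action

o1_actual = r1(r2(a_null))      # O1 best-responds to its prediction r2(a_null)
o2_actual = r2(o1_actual)       # a sophisticated O2 best-responds to O1's true output

print(o1_actual, o2_actual)     # tails heads: the coins differ, so O2 wins,
                                # even though O1 "expected" O2 to play tails
```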
But that's the price you pay when you reject all trade offers, and force others to adjust to you.
3.2 AFDT is not a good decision theory
It should go without saying, but AFDT is not a good decision theory for an aligned agent, nor is it stable under self-improvement. It two-boxes in the Newcomb problem, and defects in the prisoner's dilemma against a copy of itself (it will do this, whatever the default action is).
In games like the stag hunt against itself, its default action will be its ultimate action (similarly for most coordination games) - this is because its default action is self-confirming. In matching pennies, its action will be the contrary of its default action. It can reach deals, but only in some narrow causal situations.
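For instance, against a copy of itself (payoff numbers below are standard illustrative ones chosen by me, not from the post), AFDT's default is self-confirming in the stag hunt, while in the prisoner's dilemma it defects whatever the default:

```python
# AFDT against a copy of itself: step 1 fixes every instance of the algorithm
# (including the copy) to the default action; step 2 best-responds causally.
# Payoff matrices give the row player's payoff.

STAG_HUNT = {("stag", "stag"): 4, ("stag", "hare"): 0,
             ("hare", "stag"): 3, ("hare", "hare"): 3}

PRISONERS = {("coop", "coop"): 3, ("coop", "defect"): 0,
             ("defect", "coop"): 4, ("defect", "defect"): 1}

def afdt_vs_copy(payoff, actions, default):
    predicted_copy = default    # step 1: the copy is assumed to play the default
    # step 2: causal best response to that fixed prediction
    return max(actions, key=lambda a: payoff[(a, predicted_copy)])

print(afdt_vs_copy(STAG_HUNT, ("stag", "hare"), "stag"))      # stag: default confirms itself
print(afdt_vs_copy(STAG_HUNT, ("stag", "hare"), "hare"))      # hare: so does this one
print(afdt_vs_copy(PRISONERS, ("coop", "defect"), "coop"))    # defect, whatever the default
print(afdt_vs_copy(PRISONERS, ("coop", "defect"), "defect"))  # defect
```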
For these reasons, an AFDT agent will try to design a successor that isn't AFDT.
Thus we'll always need to ensure that AFDT is present within the Oracle, and hasn't been optimised away by inner or outer optimisation.
3.3 The simulation/ignorance of position issue
In cases of simulation, or if the agent doesn't know which Oracle it is, even CDT can behave in FDT-ish fashion. It just needs to think "given that I don't know who I am, what policy would be best for all my copies to run?".
This will turn AFDT into an FDT-ish decision theory, too. So AFDT requires that the agent's position is specified, in advance of it deciding on any policy or action. For Oracles, this shouldn't be too hard - "yes, you are the Oracle in this box, at this time, answering this question - and if you're not, behave as if you were the Oracle in this box, at this time, answering this question".
This means that if one uses AUDT - similar to AFDT but based on updateless decision theory instead of FDT - then the "updateless" is a misnomer: AUDT must come "pre-updated" with the agent's position information.
Ignoring the input data for this example, since there is no stochastic uncertainty and only one action. ↩︎
If one doesn't want to use a default action, one can alternatively define:

a = argmax_a E( U ∣ doC(a), doF( AFDT(P, GF) = FDT(P, GF) ) ).
This agent will first compute the FDT solution, then choose its own action given that assumption. It's an interesting agent design, even though it doesn't solve the "Oracles colluding" problem: the only deals it can knowingly reach are those that are Nash equilibria. So, good on stag hunts, bad on prisoner's dilemmas. ↩︎