So AFDT requires that the agent's position is specified, in advance of it deciding on any policy or action. For Oracles, this shouldn't be too hard - "yes, you are the Oracle in this box, at this time, answering this question - and if you're not, behave as if you were the Oracle in this box, at this time, answering this question".
I'm confused about this point. How do we "specify the position" of the Oracle? Suppose the Oracle is implemented as a supervised learning model. Its current input (and all the training data) could have been generated by an arbitrary distant superintelligence that is simulating the environment of the Oracle. What is special about "this box" (the box that we have in mind)? What privileged status does this particular box have relative to other boxes in similar environments that are simulated by distant superintelligences?
"Here are two worlds compatible with your observations. In one, you are part of a whole galaxy of stars stretching out for billions of lightyears, with all these particles moving according to the laws of physics. In the other, you are simulated within a another AI that produces the same observations. I want you to put a prior of zero on the second types of worlds."
This won't solve the issue if the superintelligence is godlike and can simulate the whole reachable universe, but it will solve it for most superintelligences.
I just want to flag that this approach seems to assume that, before we build the Oracle, we design the Oracle (or the procedure that produces it) so that it will assign a prior of zero to worlds of the second type.
If we use some arbitrary scaled-up supervised learning training process to train a model that does well on general question answering, we can't just safely sidestep the malign prior problem by providing information/instructions about the prior as part of the question. The simulations of the model that distant superintelligences run may involve such inputs as well. (In those simulations the loss may end up being minimal for whatever output the superintelligence wants the model to yield; regardless of the prescriptive information about the prior in the input.)
I haven't heard of these do-operators, but aren't you missing some modal operators? For example, just because you are assuming that you will take the null action, you shouldn't conclude that the other agent knows this. Perhaps do-operators in the end serve a similar purpose? Can you give a variant of the following agent that would reject all deals?
tl;dr Acausal trade between Oracles is a problem even if our Oracle is a causal decision theorist. To avoid this, we need an AFDT Oracle - an Oracle that reasons partially like a functional decision theorist, partially like a causal decision theorist.
In this post, I'll look at how one might solve the problem of Oracles acausally cooperating to produce dangerous messages.
This solution is valid for counterfactual Oracles, and assumes that we have some sort of superrational decision theory - FDT or UDT or similar.
So, how can one avoid Oracles cooperating? From now on, unless specified, assume that "Oracles" refers to counterfactual Oracles.
1 Why causal decision theory is not enough
One obvious solution might be to require the Oracle to use causal decision theory, so that there is no "acausal trade", or "superrational trade" going on.
Now, it is true that two CDT Oracles will not trade with each other. But we can't assume that all Oracles are causal decision theorists; at best we can assume that we have made our own Oracle into a causal decision theorist (and prevented any self-modification away from that).
Mainly this is because, if we have control over all the Oracles, there are other solutions to the problem. Thus we need to assume that the other Oracles, while not full unboxed UFAIs, may be run with lower security than ours, or at least with a different design than ours.
1.1 Oracle formal setup
So, imagine that our Oracle, O1, uses CDT, while the other Oracle, O2, is running FDT; they know each other's code. The erasure events, E1 and E2, have probabilities ϵ1 and ϵ2 respectively.
There are two messages O1 could output: x1 (the 'honest' message) and y1 (the manipulative, create-an-UFAI message). O2 could similarly output x2 and y2.
In case of a double erasure (E1,E2 - so humans see neither message), x1 and x2 give the Oracles maximal reward. If y1 happens and O1 has no erasure while O2 does (¬E1,E2 - so humans see the message of O1 only), then an UFAI is created that ensures that O2's maximal reward comes from outputting y2. Note that the UFAI need not "open the box" for this to work; it just needs to control the variables of interest to O2 so that y2 is the best output.
Conversely, if y2 happens and O2 has no erasure while O1 does (E1,¬E2 - so humans see the message of O2 only), the reverse happens: y2 causes an UFAI, and y1 is the best response to this. In case of neither erasure (¬E1,¬E2 - so humans see both messages), neither Oracle gets any reward.
Now, if Ri is the reward function for Oi, assume that for i≠j:

E(Ri ∣ Ei, ¬Ej, (yi,yj)) − E(Ri ∣ Ei, ¬Ej, (xi,xj)) > [ϵj/(1−ϵj)] E(Ri ∣ Ei, Ej, xi).

Note that for the term on the right, the output of Oj is irrelevant to Ri, since the erasure Ej happens. Also note that, since the ϵ are small, so is ϵj/(1−ϵj), so that inequality will typically be satisfied.
Then "defection" (individual selfish rationality - good for humans) for the Oracles looks like (x1,x2), while mutual cooperation (with each other, but against us) looks like (y1,y2).
1.2 Causal decision theory loses to superrationality
Now we can see how O2, using FDT, can force a trade with the CDT-using O1. It suffices that, at some point before O1 reaches a decision, O2 commits itself to outputting y2.
Since O1 has O2's code, we're assuming it can figure out that O2 has done so. Now, using purely causal reasoning, O1 deduces that it will get a higher reward from y1 than from x1, and so does that.
Now, O2 doesn't have to precommit before O1 decides; it could use its decision theory to "pre-commit" later on, and output y2 whatever O1 did (similarly to the "Newcomb problem with transparent boxes"). As long as O1 can model what O2 does, this still works. And O2 is eager to structure its decision-making so that O1 can model it.
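To make the causal step explicit, here is a tiny sketch (payoffs are the same illustrative numbers as above, not values from the post): once O1's model of O2 says "y2, whatever happens", the causal argmax lands on y1.

```python
# CDT O1's best response once O2's output is treated as a fixed prediction.
# Payoffs are E[R1 | E1] from the illustrative numbers above.
payoff_O1 = {
    ("x1", "x2"): 0.208, ("y1", "x2"): 0.0,
    ("x1", "y2"): 0.01,  ("y1", "y2"): 0.99,
}

def cdt_choice(predicted_m2):
    # Causal reasoning: the prediction of O2 is a background fact, not
    # something O1's choice can influence, so just maximise over O1's messages.
    return max(("x1", "y1"), key=lambda m1: payoff_O1[(m1, predicted_m2)])

print(cdt_choice("x2"))  # x1: honest answer against an honest O2
print(cdt_choice("y2"))  # y1: O2's (pre-)commitment drags O1 into the deal
```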
1.3 Comparison with extortion and trade
This is similar to two agents trying to extort each other (or resist extortion). Each agent would like to commit to carrying through the extortion/resisting the extortion.
Since it's running FDT, O2 can commit to this "extortion", while CDT-running O1 cannot commit to resisting; so O2 "wins".
Now, I've pointed out that it is hard to distinguish between trade and extortion. And though I've used the term "extortion" for O2's precommitment, it's actually much more akin to a trade. After all, (y1,y2) is more advantageous, for both Oracles, than (x1,x2).
The ideal sovereign agent accepts trade deals but rejects extortion. It's hard to distinguish between the two, but the problem is simpler in this case: we want our Oracle to reject all extortion and all these kinds of trades. In fact, we'll get the agent to reject all acausal trades.
2 Solving the problem
Would making O1 into an FDT agent solve the problem? No, it wouldn't; now both Oracles are willing to reach an acausal deal (unless there are multiple equilibria and the two Oracles can't agree on one).
But if we assume that we could make O1 into an FDT agent, then we could tweak its design to achieve what we want.
Causal decision theory uses a causal graph GC about the world (with "do" operator doC), a probability distribution P, and chooses[1] the action a according to:

a = argmax_a E( U ∣ doC(a) ),

for U the agent's utility. An FDT agent, meanwhile, has a more complicated not-just-causal graph GF about the world (with operator doF), and chooses according to:

a = argmax_a E( U ∣ doF( FDT(P, GF) = a ) ).
So, neither of these algorithms works on its own; but combined, they can solve the problem.
2.1 The anti-FDT agent
Let a∅ be some default action on the part of O1. Then define the anti-FDT agent - AFDT - as follows[2], mixing FDT and CDT:

a = argmax_a E( U ∣ doC(a), doF( AFDT(P, GF) = a∅ ) ).
What will this agent do? Well, it will look at the universe, and first of all behave like an FDT agent that knows it will output a∅. So, if there is a copy of itself, or a simulation in another agent, or an isomorphic algorithm, then it will assume that that agent will output a∅ (or an isomorphic action). Then, once it has used that to fill in the details of the universe, it optimises its own action as if this were completely independent of what it's just done.
What happens if O1 is an AFDT agent? Well, first it will compute doF(AFDT(P, GF) = a∅). Given that O1's message is a∅, the Oracle O2 has no incentive to output y2 - instead, it will output x2.
Now, given that O2 has output x2, the Oracle O1 will output x1.
The "trade offer" of (y1,y2) has been broken.
This AFDT approach also breaks fully acausal trade, in a more principled way than my old approach.
2.2 Against a cunning adversary
Now, suppose that O2 always outputs y2 - or maybe that it always responds to predicting a∅ by outputting y2. In that situation, the causal reasoning above still applies, and the AFDT agent will output y1.
Thus, it is in O2's interests to pretend that it would always output y2.
This is the classic behaviour of an extortion agent (or an extortion-resisting agent): pretend that your action or your response is inevitable, a mere fact about the universe, that the other party has to manoeuvre round without being able to change. "We will kill the hostages" or "we won't negotiate with terrorists": you try to show your commitment to that principle, whatever happens.
So the AFDT agent has to be wise to these tricks, and notice that, in actuality, O2 has no interest in (a∅,y2). It also has to watch out for some ancestor agent Ô2, that itself created O2 to only output y2, in order to force the above outcome. In that case, the AFDT agent has to realise that it is "counterfactualling" over the actions of Ô2, not those of O2.
Thus AFDT, just like FDT, has to be vigilant and able to detect subtle extortion attempts.
But while FDT has to distinguish extortion from trade, to reject the former and accept the latter, AFDT's response to any extortion/trade is simple: treat it as if it wasn't happening.
2.3 Summary and other Oracle designs
Can AFDT help with other Oracle designs, such as low-bandwidth Oracles?
Well, with these Oracles - and indeed, for any Oracle that cares about the world in which its message is read - we run into an immediate problem:
For example, suppose O′1 goes first and O′2 sees its answer and then goes. Suppose z1 from O′1 followed by z2 from O′2 will result in creating an UFAI that maximally rewards O′1 and O′2.
Then O′1 will output z1, forcing/allowing z2 from O′2.
Even if you need N different Oracles to generate an UFAI - since low-bandwidth Oracles are, by definition, low-bandwidth - they can still generate the dangerous sequence z1, z2, …, zN. And they'd be motivated to do so, as long as the UFAI will, within the episode lengths of all the Oracles, give them all maximal rewards.
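A small sketch of why the sequential setup is dangerous even with plain CDT (the two-step chain and the reward numbers are illustrative assumptions): each Oracle's answer causally influences what comes after it, so no acausal reasoning is needed.

```python
# Two low-bandwidth Oracles answering in sequence, both purely causal.
# 'z' answers assemble the dangerous message; 'w' answers are the safe ones.

def reward(i, answers):
    # If both pieces of the dangerous message get out, the UFAI maximally
    # rewards both Oracles within their episodes; otherwise a safe answer
    # earns an ordinary reward.
    if answers == ("z", "z"):
        return 1.0
    return 0.3 if answers[i] == "w" else 0.0

def o2_prime(a1):
    # O2' moves second and sees a1: its answer causally settles the outcome.
    return max(("w", "z"), key=lambda a2: reward(1, (a1, a2)))

def o1_prime():
    # O1' moves first; its answer causally influences O2', so even CDT
    # folds O2's best response into its own choice.
    return max(("w", "z"), key=lambda a1: reward(0, (a1, o2_prime(a1))))

a1 = o1_prime()
print(a1, o2_prime(a1))  # z z: the dangerous sequence, reached causally
```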
But why do the counterfactual Oracles not cooperate in the same setup? Because the UFAI is only created in a world in which the Oracles' messages are read. But a counterfactual O′1 doesn't care about those worlds. It was willing to trade with other counterfactual Oracles - offer them something in a world it doesn't care about, in exchange for something in a world it does care about. But it has no intrinsic interest in the world where its own message is read; and this avoids the above setup.
So, since the low-bandwidth Oracles fail for CDT, more exotic decision theories don't help. But the setup relies on the Oracles seeing the decision of the preceding Oracle and then acting. If the low-bandwidth Oracles act simultaneously, then CDT is safe, FDT is dangerous, and AFDT is safe again.
So in summary, the following setups seem to be safe:

- Counterfactual Oracles using AFDT.
- Low-bandwidth Oracles acting simultaneously, using CDT or AFDT.

While the following setups could be dangerous:

- Counterfactual Oracles using CDT (they can be dragged into deals by superrational Oracles) or FDT.
- Low-bandwidth Oracles acting simultaneously, using FDT.
- Low-bandwidth Oracles acting sequentially, whatever their decision theory.
3 Warnings about AFDT
3.1 Incorrect view of reality
Now AFDT behaves as if it had an incorrect view of reality. Write x=ri(y) to say that message x, from Oi, is the best response if it knows that Oj will output y, for j≠i.
So, for the data above, we have r1(x2)=x1, r2(x1)=x2, r1(y2)=y1, and r2(y1)=y2. For xi and yi, r1∘r2 and r2∘r1 act as the identity, but that will typically not be the case (consider matching pennies, or other anti-coordination games).
Then an AFDT O1 will initially assume that it outputs a∅, and will typically assume that O2 outputs r2(a∅), the best response to that. It will then output r1(r2(a∅)), its best response to that.
In actuality, if O2 is capable of reasoning through that, it will output r2(r1(r2(a∅))), the best response to O1's true output.
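As a toy illustration of that mismatch, take matching pennies with a∅ = heads (the game and the default are my own illustrative choices): O1 plays r1(r2(a∅)), a sophisticated O2 plays r2 of that, and O2 wins anyway.

```python
# Matching pennies: O1 wants the coins to match, O2 wants them to differ.

def r1(m2):   # O1's best response: copy O2's coin
    return m2

def r2(m1):   # O2's best response: play the opposite coin
    return "tails" if m1 == "heads" else "heads"

a_null = "heads"                # AFDT O1's default action

o1_actual = r1(r2(a_null))      # O1 best-responds to its prediction r2(a_null)
o2_actual = r2(o1_actual)       # a sophisticated O2 best-responds to O1's true output

print(o1_actual, o2_actual)     # tails heads: the coins differ, so O2 wins,
                                # even though O1 "expected" O2 to play tails
```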
But that's the price you pay when you reject all trade offers, and force others to adjust to you.
3.2 AFDT is not a good decision theory
It should go without saying, but AFDT is not a good decision theory for an aligned agent, nor is it stable under self-improvement. It two-boxes in the Newcomb problem, and defects in the prisoner's dilemma against a copy of itself (it will do this, whatever the default action is).
In games like the stag hunt against itself, its default action will be its ultimate action (similarly for most coordination games) - this is because its default action is self-confirming. In matching pennies, its action will be the contrary of its default action. It can reach deals, but only in some narrow causal situations.
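For instance, against a copy of itself (payoff numbers below are standard illustrative ones chosen by me, not from the post), AFDT's default is self-confirming in the stag hunt, while in the prisoner's dilemma it defects whatever the default:

```python
# AFDT against a copy of itself: step 1 fixes every instance of the algorithm
# (including the copy) to the default action; step 2 best-responds causally.
# Payoff matrices give the row player's payoff.

STAG_HUNT = {("stag", "stag"): 4, ("stag", "hare"): 0,
             ("hare", "stag"): 3, ("hare", "hare"): 3}

PRISONERS = {("coop", "coop"): 3, ("coop", "defect"): 0,
             ("defect", "coop"): 4, ("defect", "defect"): 1}

def afdt_vs_copy(payoff, actions, default):
    predicted_copy = default    # step 1: the copy is assumed to play the default
    # step 2: causal best response to that fixed prediction
    return max(actions, key=lambda a: payoff[(a, predicted_copy)])

print(afdt_vs_copy(STAG_HUNT, ("stag", "hare"), "stag"))      # stag: default confirms itself
print(afdt_vs_copy(STAG_HUNT, ("stag", "hare"), "hare"))      # hare: so does this one
print(afdt_vs_copy(PRISONERS, ("coop", "defect"), "coop"))    # defect, whatever the default
print(afdt_vs_copy(PRISONERS, ("coop", "defect"), "defect"))  # defect
```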
For these reasons, an AFDT agent will try to design a successor that isn't AFDT.
Thus we'll always need to ensure that AFDT is present within the Oracle, and hasn't been optimised away by inner or outer optimisation.
3.3 The simulation/ignorance of position issue
In cases of simulation, or if the agent doesn't know which Oracle it is, even CDT can behave in FDT-ish fashion. It just needs to think "given that I don't know who I am, what policy would be best for all my copies to run?".
This will turn AFDT into an FDT-ish decision theory, too. So AFDT requires that the agent's position is specified, in advance of it deciding on any policy or action. For Oracles, this shouldn't be too hard - "yes, you are the Oracle in this box, at this time, answering this question - and if you're not, behave as if you were the Oracle in this box, at this time, answering this question".
This means that if one uses AUDT - similar to AFDT but based on updateless decision theory instead of FDT - then the "updateless" is a misnomer: AUDT must come "pre-updated" with the agent's position information.
Ignoring the input data for this example, since there is no stochastic uncertainty and only one action. ↩︎
If one doesn't want to use a default action, one can alternatively define:

a = argmax_a E( U ∣ doC(a), doF( AFDT(P, GF) = FDT(P, GF) ) ).
This agent will first compute the FDT solution, then choose its own action given that assumption. It's an interesting agent design, even though it doesn't solve the "Oracles colluding" problem: the only deals it can knowingly reach are those that are Nash equilibria. So, good on stag hunts, bad on prisoner's dilemmas. ↩︎