Namely, the agent will two-box if it predicts that the predictor decides that the agent one-boxes
Then won't the predictor predict that the agent two-boxes?
There is a bit of terminological confusion here. The predictor makes a decision to either one-box or two-box, and the decision to one-box could be called "predicting that the agent one-boxes", but the predictor isn't necessarily able to predict this unconditionally, so it's more like "predicting that the agent one-boxes given that the predictor one-boxes".
The predictor is supposed to one-box, but the agent isn't supposed to infer (within M steps) that the predictor one-boxes (just as with the finite diagonal step, the objective is to make a decision unpredictable, not impossible). On the other hand, if this plan works out, the agent should be able to infer that the predictor one-boxes conditional on the agent one-boxing.
A common theme is that the diagonal step can make events unpredictable while leaving them conditionally predictable. It's an interesting question whether it's possible to navigate this edge well.
I don't understand this answer and cousin_it doesn't either (I just asked him).
The predictor is supposed to one-box, but the agent isn't supposed to infer (within M steps) that the predictor one-boxes (just as with the finite diagonal step, the objective is to make a decision unpredictable, not impossible).
If the predictor outputs "one-box", then the agent must prove this if it enumerates all proofs within a certain length because there is a proof-by-simulation of this fact, of length proportional to the predictor's run time, right? I don't see how the diagonal step can prevent the agent from finding this proof, unless it makes the predictor not output "one-box".
The agent doesn't use enough steps to simulate the predictor; it decides early (because it finds, early on, a proof that the predictor conditionally one-boxes), which is also what might allow the predictor to conditionally predict the agent's one-boxing within the predictor's limited computational resources. Here, M (the number of steps during which the agent protects the predictor from being unconditionally predictable) is a very small number compared to the agent's potential capability.
How is the agent supposed to know which parts of the world are "opponents" that it should make unpredictable? This seems like a potentially show-stopping problem with your approach and I'm surprised it's not listed under open problems. Is there something obvious that I'm missing?
This post doesn't propose a complete decision algorithm; it merely collects several applications of what appears to be the same trick. Your question seems to fall in the scope of the third open problem: perhaps it's a good idea to make all sorts of facts (events in the world) unconditionally unpredictable (unpredictable when not assuming what the agent's own decision would be), if at the same time it's possible to retain their conditional predictability (predictability when an assumption about the agent's own decision is made).
(Making an event unconditionally unpredictable but conditionally predictable looks like a good candidate for what gaining control over that event means, and an agent would want to gain control over as many events as possible. In particular, gaining control over your opponent's decisions is probably a good idea.)
gaining control over your opponent's decisions is probably a good idea.
(It might be good to keep in mind Eliezer's point about how discorrelation is potentially a scarce resource (because correlation is a ratchet).)
That depends on the opponent's computational power. There are certainly cryptographic approaches to this problem.
Sufficiently increasing the threshold N makes it possible to prove that no other statements of the form [A=A1 => U=U1] will be proved before a decision is made (and so that the decision will be as expected). . . . This is the same guarantee that the countable diagonal step gives, albeit at the cost of having to know L.
Doesn't increasing N also change L? If N is part of the specification of the agent, then its value can affect the length of proofs about the agent. This indicates that there are probably agent designs for which we cannot just sufficiently increase N in order to make this diagonalization argument work.
It's possible. In the example, L doesn't depend on N, but it could. What we need is an L that works as an upper bound, even if we use an M>N based on 2*L plus a small constant in the diagonal step.
I don't see what the problem is. You compare 'what happens if I do X?' with 'what happens if I do Y?' Not, 'If I do X, what would have happened if I did Y?'
You split the evaluation into cases with constant past decisions, then pick the best outcome. Then the AI doesn't even notice when it gets into a catch-22 condition.
This post collects a few situations where agents might want to make their decisions either predictable or unpredictable to certain methods of prediction, and considers a method of making a decision unpredictable by "diagonalizing" a hypothetical prediction of that decision. The last section takes a stab at applying this tool to the ASP (agent simulates predictor) problem.
The diagonal step
To start off, consider the halting problem, interpreted in terms of agents and predictors. Suppose that there is a Universal Predictor, an algorithm that is able to decide whether any given program halts or runs forever. Then it's easy for a program (agent) to evade its gaze by including a diagonal step in its decision procedure: the agent checks (by simulation) whether the Universal Predictor comes to some decision about the agent, and if it does, the agent acts contrary to the Predictor's decision. This makes the prediction wrong, and Universal Predictors impossible.
The same trick can be performed against something that could actually exist, ordinary non-universal Predictors, which allows an agent to make itself immune to their predictions. In particular, the ability of other agents to infer the decisions of our agent may be thought of as a prediction that our agent might want to hinder. This is possible so long as the predictors in question can be simulated in enough detail, that is, it's known what they do (what they know) and our agent has enough computational resources to anticipate their hypothetical conclusions. (If an agent does perform the diagonal step with respect to other agents, the predictions of those other agents don't necessarily become wrong, since they could be formally correct by construction, but they cease to be possible, which could mean that the predictions won't be made at all.)
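As a toy illustration (all names here are made up for the example, and the predictor's internals are irrelevant), the diagonal step against a simulable predictor might look like this:

```python
# A toy rendering of the diagonal step (illustrative names only).

def predict(agent_source: str) -> str:
    """Stand-in for a concrete, simulable predictor: returns its guess about
    which of two actions ("A" or "B") the given agent will take."""
    # ... some bounded analysis of agent_source would go here ...
    return "A"  # placeholder guess

def agent(own_source: str) -> str:
    """Diagonal step: simulate the predictor on our own source and act
    contrary to its guess, making the prediction wrong."""
    guess = predict(own_source)  # possible only because predict() terminates
    return "B" if guess == "A" else "A"
```

The construction only works because predict() is a fixed, terminating program that the agent has the resources to run; against a hypothetical Universal Predictor the same construction yields the usual contradiction.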
Knowing own decision
In UDT/ADT, an agent is given definitions of its action (strategy) and utility value, and evaluates an action (strategy) by inferring the logical implications of taking it for the utility value (see this post for a more detailed description of the decision procedure and this post for an explanation of how it might be better to optimize the whole strategy, rather than individual decisions). However, this algorithm breaks down if an agent becomes able to infer its own decision (or that a certain decision is impossible): in that case, it can infer anything from the hypothesis that it makes an impossible decision. This is one case where an agent might want to make its decision unpredictable to itself!
It seems that normally an agent won't be able to predict its own decision (see these posts for some discussion), but it's unclear how to set up an agent so that it provably doesn't do that. Worse, an ADT agent is a program that reasons using first-order logic and knows its own source code, which means that a statement declaring the agent's actual decision is provable (by a standard inference system that enumerates all valid statements), and so the best we can hope for is that the agent won't stumble upon such a proof before it has made a decision (see the section of this post on provability by the agent).
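Here is a rough sketch of the evaluation loop just described, and of where self-knowledge would break it. The predicate proves(), the finite lists of actions and utilities, and the statement format are all illustrative assumptions, not part of ADT as defined elsewhere:

```python
from typing import Callable, Optional

ACTIONS = [1, 2]
UTILITIES = [1000000, 1000, 0]  # listed from best to worst

def decide(proves: Callable[[str], bool]) -> Optional[int]:
    """A sketch of the evaluation loop; proves(s) stands in for proof search
    in the agent's formal theory (the axioms defining A and U)."""
    best_action, best_value = None, None
    for a in ACTIONS:
        for u in UTILITIES:
            # Look for a proof of the implication [A=a => U=u]; since the
            # utilities are tried best-first, the first hit is the best one.
            if proves(f"A={a} => U={u}"):
                if best_value is None or u > best_value:
                    best_action, best_value = a, u
                break
    # Danger: if `proves` can also establish which action is actually taken,
    # then for any action a that is provably NOT taken, [A=a => U=u] is
    # provable for every u, and the comparison above becomes meaningless
    # (the "spurious" implications discussed below).
    return best_action
```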
Countable diagonal step
Adding the diagonal step to ADT agents, to make them unable to predict (infer) their own decision, in the simplest case requires the agents to simulate themselves up until the point where they make a decision. In general this is too hard, so one way out is to grant the agents provability oracles. The resulting setting is described in this post.
The inference (prediction) system of an agent with a provability oracle can be completely described as a formal system S, which the agent uses (with the help of the oracle) to check the validity of any statement (in one step). Trouble with predicting its own decision then corresponds to S being too strong, able to infer the agent's decision. The ("countable") diagonal step can be implemented by the agent checking, for each possible action A, whether S proves that the actual action doesn't equal A, and performing A if S proves so. If S is sound, that situation is never realized, which means that S is weak enough to fail to predict the agent's decision. The agent makes itself just complicated enough for its own inference system to become unable to reason about (the origins of) its decision. This condition is then sufficient to prove that the agent satisfactorily resolves some thought experiments, such as Newcomb's Problem.
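A minimal sketch of the countable diagonal step, with the oracle for S represented by a hypothetical function S_proves (each call corresponds to one oracle query; the statement format is again illustrative):

```python
from typing import Callable, Optional

ACTIONS = [1, 2]

def countable_diagonal_step(S_proves: Callable[[str], bool]) -> Optional[int]:
    """S_proves(s) stands in for one oracle query: does the formal system S prove s?

    Returns an action if the diagonal step fires, or None if S cannot rule out
    any action (which, if S is sound, is what actually happens)."""
    for a in ACTIONS:
        # If S proves that the actual action is not a, perform a anyway.
        if S_proves(f"A != {a}"):
            return a
    # S failed to predict the decision; proceed with the usual evaluation of
    # implications [A=a => U=u].
    return None
```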
Avoiding spurious statements (without oracles)
An ADT agent without a provability oracle only needs to hold off on predicting its decision until the decision is made. This can be arranged by use of the diagonal step if it's known when the decision can be made. Let's consider an example.
Newcomb's problem can be formalized (as in these posts) as follows:
Define constant symbols A=agent(), U=world(). It's easy to see that the statements [A=1 => U=1000000] and [A=2 => U=1000] are valid, no matter what agent() is (assuming it terminates and always outputs 1 or 2). But if actually A=2, then A=1 is false, and so the statement [A=1 => U=0] is true as well. How do we know that the agent won't find specifically the statements [A=1 => U=0] and [A=2 => U=1000], and choose A=2 as a result, with A=1 indeed becoming false?
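Before addressing that question, for concreteness, here is one illustrative way agent() and world() might be rendered (not canonical code; for simplicity the perfect predictor is modeled as another run of agent() itself):

```python
# An illustrative rendering of the formalization above.

def agent() -> int:
    """Returns 1 (one-box) or 2 (two-box). The body is the decision procedure
    under discussion, e.g. proof search over statements [A=a => U=u]."""
    return 1  # placeholder

def world() -> int:
    """Computes the payoff U from the action A = agent()."""
    prediction = agent()                       # the predictor's guess
    action = agent()                           # A
    box_b = 1000000 if prediction == 1 else 0  # contents of the opaque box
    # Since prediction == action for any terminating, deterministic agent(),
    # both [A=1 => U=1000000] and [A=2 => U=1000] hold whatever agent() decides.
    return box_b if action == 1 else box_b + 1000
```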
The problem is that even when we know that in the intended situation certain statements assigning value to possible actions are true, that doesn't automatically guarantee that the agent will find these same statements, and thus that it will make the decision implied by the statements we know to be true. In Newcomb's problem, an agent could in principle encounter true statements [A=1 => U=0] and [A=2 => U=1000] listed at the very beginning of a contrived inference system and two-box as a result, which is what makes those statements true.
Thus, it matters how the agent's inference system works, even if we know that it's correct (i.e. that it enumerates all valid first-order statements following from the stated axioms). It has to be sufficiently "natural" in some sense, so that we can rely on [A=1 => U=1000000] coming before statements like [A=1 => U=0], with the agent acting on the more "natural" arguments about the value of a possible action, and not on one of the spurious ones (implied by the possible action not being actually taken).
Finite diagonal step
In the preceding example, we can point out specific proofs of [A=1 => U=1000000] and [A=2 => U=1000] that don't depend on the definition of A. Knowing the length of such proofs (say, L) and using an inference system that enumerates all valid proofs in order of their length makes it possible to predict when those particular proofs will be encountered, and so when, at the latest, the decision will be made (if it's not made before that using some other, shorter proof). It is then only until that point that the agent has to hold off on predicting its own decision.
If we know the number of steps N for the duration of which the inference system needs to be protected (while it considers the proofs no longer than L), the ("finite") diagonal step consists in running the agent's inference system S for N steps, and contradicting (by action) any prediction about the agent's action reached during these N steps. This step is decidable and doesn't require an oracle. Implementing it guarantees that (if S is sound and N is selected as discussed above) the agent won't be able to predict its decision before it's made.
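A sketch of the finite diagonal step, assuming the inference system is exposed as a stream of theorems enumerated in order of proof length (the function names and the statement format are illustrative):

```python
from typing import Iterable, Optional

ACTIONS = [1, 2]

def finite_diagonal_step(theorems: Iterable[str], N: int) -> Optional[int]:
    """Run the inference system for N steps and contradict any self-prediction.

    `theorems` yields the statements proved by S in order of increasing proof
    length; one yielded statement counts as one step here."""
    for step, statement in enumerate(theorems):
        if step >= N:
            break
        for a in ACTIONS:
            if statement == f"A != {a}":   # a prediction about our own action
                return a                   # diagonal step: take a anyway
    # No self-prediction surfaced within the protected N steps; the agent can
    # now decide as usual, acting on proved statements of the form [A=a => U=u].
    return None
```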
Sufficiently increasing the threshold N makes it possible to prove that no other statements of the form [A=A1 => U=U1] will be proved before a decision is made (and so that the decision will be as expected). Specifically, let M>N be the new threshold. We know that (for example) [A=1 => U=1000000] is proved by S within N steps, so if (say) [A=1 => U=0] is also proved within N steps, then the proof of the negation of [A=1] is only a few steps longer than 2*L, and so if we choose M to include all proofs of that length, the disproof of [A=1] will be encountered within the first M steps, which would trigger the diagonal step. Since the diagonal step can't be triggered if S is sound, we've thus shown that the presence of the diagonal step guarding the first M steps of the inference system allows us to conclude that the agent won't infer any spurious statements of the form [A=A1 => U=U1], and will behave as expected.
This is the same guarantee that the countable diagonal step gives, albeit at the cost of having to know L.
Prisoner's dilemma
The Prisoner's dilemma presents a situation where some measure of predictability is advantageous to both agents. Specifically, if an agent can prove that cooperating implies cooperation of the opponent, and defecting implies defection of the opponent, it can then cooperate, reaching the outcome (C,C) that is better than the equilibrium solution (D,D). It's harder to arrange cooperation without being able to predict the opponent's reaction to a hypothetical decision, so taking away all predictability of decisions will hurt both agents.
The trick is that UDT only needs statements like [A1=C] to be unprovable (in N steps); statements like [A1=C => A2=C] may remain provable, and it's conditional statements like these that are needed to broker cooperation in the PD. It seems that cooperation is possible both for simple agents without oracles or use of the diagonal step, and for agents with oracles that use the countable diagonal step.
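A sketch of how such conditional statements might be used by an agent in the PD, with proves() again standing in for a (possibly step-limited) inference system (the statement format is illustrative):

```python
from typing import Callable

def pd_agent(proves: Callable[[str], bool]) -> str:
    """Cooperate if cooperation provably elicits cooperation and defection
    provably elicits defection; otherwise fall back to the equilibrium move.

    Only the unconditional statements [A1=C], [A1=D] need to stay unprovable
    within the protected number of steps; the conditionals below may remain
    provable, and they are what brokers cooperation."""
    if proves("A1=C => A2=C") and proves("A1=D => A2=D"):
        return "C"
    return "D"
```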
Agent simulates predictor
Finally, there is the ASP problem, which involves a weak predictor and a powerful agent that need to cooperate on Newcomb's problem. The difficulty is that since the agent can anticipate the predictor's decision by simulating it, it seems that it would take both boxes no matter what, and so the predictor has to predict two-boxing (which is unfortunate). The predictor is defenseless: it can't insist on making itself unpredictable, and as a result the agent's predictions of the predictor's decision (given either hypothetical action of the agent) coincide.
Just as it's necessary to entertain multiple hypothetical actions to gauge their consequences, it's necessary for there to be multiple hypothetical outcomes for their comparison to yield a nontrivial decision. The preceding sections addressed the problem of making the agent's own decision unpredictable, and in this case it seems that the agent has to make the decision of its opponent (the weak predictor) unpredictable (to the agent).
One way to patch this up seems to be requiring that the weak predictor try to prove that the agent one-boxes (with however little computational power it has), and predict two-boxing otherwise. This way, if the predictor says that the agent one-boxes, the agent must really be one-boxing, and so we can use this condition in a diagonal step in the agent, making this situation unpredictable to the agent. Namely, the agent will two-box if it predicts that the predictor decides that the agent one-boxes (while running the agent's inference system for at most M steps, as with the finite diagonal step). As a result, the agent won't be able to predict that the predictor one-boxes (within M steps of the agent's inference system), which might make the agent simple enough for the predictor to predict, and the two players seem to regain an equal footing.
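Putting the pieces together, here is an illustrative sketch of the proposed arrangement. All names are made up: weak_proof_search stands for the predictor's limited proof search, and S_proves_within for running the agent's inference system for a bounded number of steps.

```python
from typing import Callable

def predictor(weak_proof_search: Callable[[str], bool]) -> int:
    """The weak predictor: predict one-boxing (1) only if it can *prove* that
    the agent one-boxes; otherwise predict two-boxing (2). This way,
    "the predictor predicts one-boxing" guarantees "the agent one-boxes"."""
    return 1 if weak_proof_search("A=1") else 2

def agent(S_proves_within: Callable[[str, int], bool], M: int) -> int:
    """Diagonal step against the prediction itself: if the agent can show
    within M steps that the predictor predicts one-boxing, it two-boxes."""
    if S_proves_within("P=1", M):   # P=1: the predictor predicts one-boxing
        # This branch never actually fires if the inference system is sound,
        # which is what keeps P=1 unprovable to the agent within M steps.
        return 2
    # Otherwise decide as usual, e.g. by comparing provable implications
    # [A=1 => U=...] and [A=2 => U=...] (omitted from this sketch).
    return 1
```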
Open problems