This post requires some knowledge of mathematical logic and computability theory. The basic idea is due to Vladimir Nesov and me.
Let the universe be a computer program U that can make calls to a halting oracle. Let the agent be a subprogram A within U that can also make calls to the oracle. The source code of both A and U is available to A.
def U():
# Fill boxes, according to predicted action.
box1 = 1000
box2 = 1000000 if (A() == 1) else 0
# Compute reward, based on actual action.
return box2 if (A() == 1) else (box1 + box2)
A complete definition of U should also include the definition of A, so let's define it. We will use the halting oracle only as a provability oracle for some formal system S, e.g. Peano arithmetic. Here's the algorithm of A:
- Play chicken with the universe: if S proves that A()≠a for some action a, then return a.
- For every possible action a, find some utility value u such that S proves that A()=a ⇒ U()=u. If such a proof cannot be found for some a, break down and cry because the universe is unfair.
- Return the action that corresponds to the highest utility found on step 2.
Now we want to prove that the agent one-boxes, i.e. A()=1 and U()=1000000. That will follow from two lemmas.
Lemma 1: S proves that A()=1 ⇒ U()=1000000 and A()=2 ⇒ U()=1000. Proof: you can derive that from just the source code of U, without looking at A at all.
Lemma 2: S doesn't prove any other utility values for A()=1 or A()=2. Proof: assume, for example, that S proves that A()=1 ⇒ U()=42. But S also proves that A()=1 ⇒ U()=1000000, therefore S proves that A()≠1. According to the first step of the algorithm, A will play chicken with the universe and return 1, making S inconsistent unsound (thx Misha). So if S is sound, that can't happen.
We see that the agent defined above will do the right thing in Newcomb's problem. And the proof transfers easily to many other toy problems, like the symmetric Prisoner's Dilemma.
But why? What's the point of this result?
There's a big problem about formalizing UDT. If the agent chooses a certain action in a deterministic universe, then it's a true fact about the universe that choosing a different action would have caused Santa to appear. Moreover, if the universe is computable, then such silly logical counterfactuals are not just true but provable in any reasonable formal system. When we can't compare actual decisions with counterfactual ones, it's hard to define what it means for a decision to be "optimal".
For example, one previous formalization searched for formal proofs up to a specified length limit. Problem is, that limit is a magic constant in the code that can't be derived from the universe program alone. And if you try searching for proofs without a length limit, you might encounter a proof of a "silly" counterfactual which will make you stop early before finding the "serious" one. Then your decision based on that silly counterfactual can make it true by making its antecedent false... But the bigger problem is that we can't say exactly what makes a "silly" counterfactual different from a "serious" one.
In contrast, the new model with oracles has a nice notion of optimality, relative to the agent's formal system. The agent will always return whatever action is proved by the formal system to be optimal, if such an action exists. This notion of optimality matches our intuitions even though the universe is still perfectly deterministic and the agent is still embedded in it, because the oracle ensures that determinism is just out of the formal system's reach.
P.S. I became a SingInst research associate on Dec 1. They did not swear me to secrecy, and I hope this post shows that I'm still a fan of working in the open. I might just try to be a little more careful because I wouldn't want to discredit SingInst by making stupid math mistakes in public :-)
Wouldn't the same job be done by the agent using proper counterfactuals instead of logic ones, which seems like something that would also be needed for other purposes?
I don't know who (if anyone) has done any work on this, but when a human considers a counterfactual statement like "If Gore won in 2000" that is very underspecified, because the implicit assumption is to discard contradicting knowledge, but how to do that exactly is left open. Humans just know that they should assume something like "Bush didn't win Florida" instead of "266 > 271".
If an example agent needs to be able to use precisely defined proper counterfactuals I think it might be possible to do that with an ordering function for its current knowledge. The agent would start with the couterfactual under consideration, add items from its knowledge base in the order specified for that counterfactual, test for each item whether it can find a contradiction, and discard the current item whenever it finds a contradiction (from consideration for evaluating the counterfactual).
For the example I think the order would look like this: A=a, S, U except for A, the source code of A.
That would do much the same thing as "playing chicken with the universe" with respect of not being impressed by proofs about its output, no?
More generally I think the items would be further split up, particularly "A behaves like a program with this source code in all other cases" would come before "A behaves like a program with this source code in the case under consideration". Other instances of A would also have to be treated as instances of A and not as programs with the source code (i.e. statements like "A' has the same source code as A" would come before either of their source codes in the ordering).
Does that make sense?
The idea of using an ordering function for knowledge is new to me, thanks!
The hard part is getting "U except for A". Given the source code of U and the source code for A, we don't know how to "factor" one program by another (or equivalently, find all correlates of the agent in the universe). If we knew how to do that, it would help a lot with UDT in general.