Benja Fallenstein was the first to point out that spurious proofs pose a problem for UDT. Vladimir Nesov and orthonormal asked for a formalization of that intuition. In this post I will give an example of a UDT-ish agent that fails due to having a malicious proof searcher, which feeds the agent a spurious but valid proof.
The basic idea is to have an agent A that receives a proof P as input and checks P for validity. If P is a valid proof that a certain action a is best in the current situation, then A outputs a; otherwise A tries to solve the current situation by its own means. Here's a first naive formalization, where U is the world program that returns a utility value, A is the agent program that returns an action, and P is the proof given to A:
def U():
    if A(P)==1:
        return 5
    else:
        return 10

def A(P):
    if P is a valid proof that A(P)==a implies U()==u, and A(P)!=a implies U()<=u:
        return a
    else:
        do whatever
This formalization cannot work, because the statement to be proved mentions A(P), so the proof P would have to contain a quoted copy of itself, and a proof can never be long enough to do that. To fix that problem, let's introduce a function Q that generates the proof P:
def U():
    if A(Q())==1:
        return 5
    else:
        return 10

def A(P):
    if P is a valid proof that A(Q())==a implies U()==u, and A(Q())!=a implies U()<=u:
        return a
    else:
        do whatever
In this case it's possible to write a function Q that returns a proof that makes A return the suboptimal action 1, which leads to utility 5 instead of 10. Here's how:
Let X be the statement "A(Q())==1 implies U()==5, and A(Q())!=1 implies U()<=5". Let Q be the program that enumerates all possible proofs trying to find a proof of X, and returns that proof if found. (The definitions of X and Q are mutually quined.) If X is provable at all, then Q will find that proof, and X will become true (by inspection of U and A). That reasoning is formalizable in our proof system, so the statement "if X is provable, then X" is provable. Therefore, by Löb's theorem, X is provable. So Q will find a proof of X, and A will return 1.
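To make the circular construction more concrete, here is a rough Python-style sketch of Q. The helpers enumerate_proofs and is_valid_proof_of, and the constant X, are hypothetical stand-ins introduced only for illustration; an actual construction would obtain the mutual reference between X and Q by quining (or the diagonal lemma) rather than by writing X out literally:

def Q():
    # Hypothetical helpers, assumed for illustration:
    #   enumerate_proofs()       -- yields every candidate proof, shortest first
    #   is_valid_proof_of(p, s)  -- checks p against the agent's proof system
    #   X                        -- the quined statement "A(Q())==1 implies U()==5,
    #                               and A(Q())!=1 implies U()<=5"
    # If X is provable at all, this loop eventually returns a proof of X;
    # otherwise Q never halts.
    for p in enumerate_proofs():
        if is_valid_proof_of(p, X):
            return p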
One possible conclusion is that a UDT agent cannot use just any proof searcher or "mathematical intuition module" merely because it's guaranteed to return valid mathematical arguments, since valid mathematical arguments can make the agent choose arbitrary actions. The proof searchers from some previous posts were well-behaved by construction, but not every proof searcher is.
The troubling thing is that you may end up with a badly behaved proof searcher by accident. For example, consider a variation of U that adds some long and complicated computation to the "else" branch of U, before returning 10. That increases the length of the "natural" proof that a=2 is optimal, but the spurious proof for a=1 stays about the same length, because it can simply ignore the "else" branch of U. This way the spurious proof can become much shorter than the natural proof. So if (for example) your math intuition module makes the innocuous design decision of looking first at actions that are likely to have shorter proofs, you may end up acting on a spurious proof. And as a further plot twist, if we make U return 0 rather than 10 in the long-to-compute branch, you might choose the correct action due to a spurious proof rather than the natural one.
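As a concrete sketch of that variation (long_complicated_computation is a hypothetical placeholder for the added work), the modified world program might look like this:

def U():
    if A(Q()) == 1:
        return 5
    else:
        # This extra work lengthens the "natural" proof that a=2 is optimal,
        # while the spurious proof for a=1 can ignore this branch entirely.
        long_complicated_computation()
        return 10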