Benja Fallenstein was the first to point out that spurious proofs pose a problem for UDT. Vladimir Nesov and orthonormal asked for a formalization of that intuition. In this post I will give an example of a UDT-ish agent that fails due to having a malicious proof searcher, which feeds the agent a spurious but valid proof.
The basic idea is to have an agent A that receives a proof P as input, and checks P for validity. If P is a valid proof that a certain action a is best in the current situation, then A outputs a, otherwise A tries to solve the current situation by its own means. Here's a first naive formalization, where U is the world program that returns a utility value, A is the agent program that returns an action, and P is the proof given to A:
def U():
if A(P)==1:
return 5
else:
return 10
def A(P):
if P is a valid proof that A(P)==a implies U()==u, and A(P)!=a implies U()<=u:
return a
else:
do whatever
This formalization cannot work because a proof P can never be long enough to contain statements about A(P) inside itself. To fix that problem, let's introduce a function Q that generates the proof P:
def U():
if A(Q())==1:
return 5
else:
return 10
def A(P):
if P is a valid proof that A(Q())==a implies U()==u, and A(Q())!=a implies U()<=u:
return a
else:
do whatever
In this case it's possible to write a function Q that returns a proof that makes A return the suboptimal action 1, which leads to utility 5 instead of 10. Here's how:
Let X be the statement "A(Q())==1 implies U()==5, and A(Q())!=1 implies U()<=5". Let Q be the program that enumerates all possible proofs trying to find a proof of X, and returns that proof if found. (The definitions of X and Q are mutually quined.) If X is provable at all, then Q will find that proof, and X will become true (by inspection of U and A). That reasoning is formalizable in our proof system, so the statement "if X is provable, then X" is provable. Therefore, by Löb's theorem, X is provable. So Q will find a proof of X, and A will return 1.
One possible conclusion is that a UDT agent cannot use just any proof searcher or "mathematical intuition module" that's guaranteed to return valid mathematical arguments, because valid mathematical arguments can make the agent choose arbitrary actions. The proof searchers from some previous posts were well-behaved by construction, but not all of them are.
The troubling thing is that you may end up with a badly behaved proof searcher by accident. For example, consider a variation of U that adds some long and complicated computation to the "else" branch of U, before returning 10. That increases the length of the "natural" proof that a=2 is optimal, but the spurious proof for a=1 stays about the same length as it was, because the spurious proof can just ignore the "else" branch of U. This way the spurious proof can become much shorter than the natural proof. So if (for example) your math intuition module made the innocuous design decision of first looking at actions that are likely to have shorter proofs, you may end up with a spurious proof. And as a further plot twist, if we make U return 0 rather than 10 in the long-to-compute branch, you might choose the correct action due to a spurious proof instead of the natural one.
My next attempt :) Below, "PA" means Peano Arithmetic.
Proposition: A() returns 1 (if PA is consistent).
Proof:
Let X = "there exist a!=1 and u, such that: A()==a => U()==u and A()!=a => U()<=u".
Let S = "there exists u, such that: A()==1 => U()==u and A()!=1 => U()<u".
Let T = "A()==1".
Let ProvableU(P) = the provability formula of U (PA+Consistent(PA), with proofs of length < N).
Let ProvableA(P) = the provability formula of A (PA).
Let "U⊦ P" mean P is a theorem of U's proof system.
Then:
QED
Sorry for not noticing your comment earlier. It would be nice to have a way to subscribe to comment threads.
I don't understand why the definitions of X and S imply that S => ~X. Does that rely on specific features of your A and U? For example, if we take a different A and U so that A()==2 and U()==0, then X is true for a=2 and u=0, and S is true for u=1.