Benja Fallenstein was the first to point out that spurious proofs pose a problem for UDT. Vladimir Nesov and orthonormal asked for a formalization of that intuition. In this post I will give an example of a UDT-ish agent that fails due to having a malicious proof searcher, which feeds the agent a spurious but valid proof.
The basic idea is to have an agent A that receives a proof P as input, and checks P for validity. If P is a valid proof that a certain action a is best in the current situation, then A outputs a, otherwise A tries to solve the current situation by its own means. Here's a first naive formalization, where U is the world program that returns a utility value, A is the agent program that returns an action, and P is the proof given to A:
def U():
    if A(P)==1:
        return 5
    else:
        return 10
def A(P):
    if P is a valid proof that A(P)==a implies U()==u, and A(P)!=a implies U()<=u:
        return a
    else:
        do whatever
This formalization cannot work: any statement about A(P) must spell out the proof P, so P would have to contain a copy of itself and thus be longer than itself. To fix that problem, let's introduce a function Q that generates the proof P:
def U():
    if A(Q())==1:
        return 5
    else:
        return 10
def A(P):
    if P is a valid proof that A(Q())==a implies U()==u, and A(Q())!=a implies U()<=u:
        return a
    else:
        do whatever
In this case it's possible to write a function Q that returns a proof that makes A return the suboptimal action 1, which leads to utility 5 instead of 10. Here's how:
Let X be the statement "A(Q())==1 implies U()==5, and A(Q())!=1 implies U()<=5". Let Q be the program that enumerates all possible proofs trying to find a proof of X, and returns that proof if found. (The definitions of X and Q are mutually quined.) If X is provable at all, then Q will find that proof, and X will become true (by inspection of U and A). That reasoning is formalizable in our proof system, so the statement "if X is provable, then X" is provable. Therefore, by Löb's theorem, X is provable. So Q will find a proof of X, and A will return 1.
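Schematically, the Löbian step looks like this (writing \Box for provability in the agent's proof system; the notation is mine, the argument is the one above):

% X is the quined statement "A(Q())==1 implies U()==5,
%                            and A(Q())!=1 implies U()<=5"
\vdash \Box X \to X        % if X is provable, Q finds the proof, A returns 1,
                           % and inspection of U then verifies both halves of X
\vdash \Box(\Box X \to X)  % by necessitation
\vdash \Box X              % by Löb's theorem: \Box(\Box X \to X) \to \Box X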
One possible conclusion is that a UDT agent cannot use just any proof searcher or "mathematical intuition module" that's guaranteed to return valid mathematical arguments, because valid mathematical arguments can make the agent choose arbitrary actions. The proof searchers from some previous posts were well-behaved by construction, but not every proof searcher is.
The troubling thing is that you may end up with a badly behaved proof searcher by accident. For example, consider a variation of U that adds some long and complicated computation to the "else" branch of U, before returning 10. That increases the length of the "natural" proof that a=2 is optimal, but the spurious proof for a=1 stays about the same length as it was, because the spurious proof can just ignore the "else" branch of U. This way the spurious proof can become much shorter than the natural proof. So if (for example) your math intuition module made the innocuous design decision of first looking at actions that are likely to have shorter proofs, you may end up with a spurious proof. And as a further plot twist, if we make U return 0 rather than 10 in the long-to-compute branch, you might choose the correct action due to a spurious proof instead of the natural one.
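For concreteness, here is one hypothetical variation of U along those lines; long_computation and its dummy body are my own stand-ins for "some long and complicated computation":

def long_computation():
    # stands in for any long and complicated computation
    total = 0
    for i in range(10**8):
        total = total + i
    return total

def U():
    if A(Q())==1:
        return 5
    else:
        long_computation()  # the "natural" proof that a=2 yields 10 must reason about this
        return 10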
Not when the statements are already given with A() in them, as in Newcomb's problem. Newcomb's problem doesn't say: here is Omega, it has a brain sim with such-and-such neurons connected so-and-so with such-and-such weights, it exposes this brain sim to this entire string, and if the brain sim answers one way it puts stuff into one box, otherwise it puts stuff into the other box. No, it simply says that Omega predicts you, using the symbol "you", which already means A(). Nobody has ever presented Newcomb's problem to you with an implicit, rather than explicit, self-reference.
One could, of course, insert a quine into the algorithm to let it recognize itself; such an algorithm could start by replacing the "brain sim with such-and-such neurons connected so-and-so" with the symbol A() in the input statements (or perhaps run a prover there to try to replace any obscured instances of A()), and then solve the problem without further using knowledge of the internals of A().
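A minimal sketch of that substitution step, assuming the agent can quine its own source and that the copy appears verbatim (the names here are illustrative; catching obscured copies would need the prover mentioned above):

def recognize_self(statement, own_source):
    # Replace any verbatim copy of the agent's own source code with the
    # opaque symbol A(). Obscured instances would require a prover to find.
    return statement.replace(own_source, "A()")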
More philosophically, this may be what our so-called reflection is about: fully opaque black-box substitution. One shouldn't read too much into it, though, because non-opaque substitution not only "pollutes" the theorem prover with contradictions every time you think "okay, suppose I give the answer a", but is also terribly impractical.
One could also fix it by assuming that your output is slightly uncertain, with an epsilon probability of changing your mind, which kills the wrong proofs (and also stops the prediction nonsense and other entirely unphysical ideas, with the side effect that the agent might not act ideally in an ideal world).
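A minimal sketch of that fix, assuming a two-action setting and an illustrative value of epsilon:

import random

EPSILON = 1e-6  # assumed small probability of "changing the mind"

def A(P):
    # With probability epsilon, ignore P and act at random. A is then no
    # longer a deterministic function of P, so statements of the form
    # "A(P)==a implies U()==u" can no longer be proven with certainty,
    # which kills the spurious proofs.
    if random.random() < EPSILON:
        return random.choice([1, 2])
    # otherwise check P and act on it as before
    ...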
The usual problem with adding randomness is that all copies of the agent must have perfectly correlated sources of randomness (otherwise you end up with something like CDT), and that amounts to explicitly pointing out all copies of the agent.
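To illustrate why, note that one way to give all copies perfectly correlated randomness is to derive the random bits deterministically from something every copy shares, such as the agent's own source code; this construction is my own illustration:

import hashlib
import random

def correlated_rng(own_source: str) -> random.Random:
    # Every copy that seeds from the same source code gets the same
    # stream of "random" bits, so all copies tremble identically.
    # Anything not sharing the seed is, by construction, not a copy:
    # the scheme explicitly points out all copies of the agent.
    seed = hashlib.sha256(own_source.encode()).digest()
    return random.Random(seed)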