# An example of self-fulfilling spurious proofs in UDT

Benja Fallenstein was the first to point out that spurious proofs pose a problem for UDT. Vladimir Nesov and orthonormal asked for a formalization of that intuition. In this post I will give an example of a UDT-ish agent that fails due to having a malicious proof searcher, which feeds the agent a spurious but valid proof.

The basic idea is to have an agent A that receives a proof P as input, and checks P for validity. If P is a valid proof that a certain action a is best in the current situation, then A outputs a, otherwise A tries to solve the current situation by its own means. Here's a first naive formalization, where U is the world program that returns a utility value, A is the agent program that returns an action, and P is the proof given to A:

def U():

if A(P)==1:

return 5

else:

return 10

def A(P):

if P is a valid proof that A(P)==a implies U()==u, and A(P)!=a implies U()<=u:

return a

else:

do whatever

This formalization cannot work because a proof P can never be long enough to contain statements about A(P) inside itself. To fix that problem, let's introduce a function Q that generates the proof P:

def U():

if A(Q())==1:

return 5

else:

return 10

def A(P):

if P is a valid proof that A(Q())==a implies U()==u, and A(Q())!=a implies U()<=u:

return a

else:

do whatever

In this case it's possible to write a function Q that returns a proof that makes A return the suboptimal action 1, which leads to utility 5 instead of 10. Here's how:

Let X be the statement "A(Q())==1 implies U()==5, and A(Q())!=1 implies U()<=5". Let Q be the program that enumerates all possible proofs trying to find a proof of X, and returns that proof if found. (The definitions of X and Q are mutually quined.) If X is provable at all, then Q will find that proof, and X will become true (by inspection of U and A). That reasoning is formalizable in our proof system, so the statement "if X is provable, then X" is provable. Therefore, by Löb's theorem, X is provable. So Q will find a proof of X, and A will return 1.

One possible conclusion is that a UDT agent cannot use just any proof searcher or "mathematical intuition module" that's guaranteed to return valid mathematical arguments, because valid mathematical arguments can make the agent choose arbitrary actions. The proof searchers from some previous posts were well-behaved by construction, but not all of them are.

The troubling thing is that you may end up with a badly behaved proof searcher by accident. For example, consider a variation of U that adds some long and complicated computation to the "else" branch of U, before returning 10. That increases the length of the "natural" proof that a=2 is optimal, but the spurious proof for a=1 stays about the same length as it was, because the spurious proof can just ignore the "else" branch of U. This way the spurious proof can become much shorter than the natural proof. So if (for example) your math intuition module made the innocuous design decision of first looking at actions that are likely to have shorter proofs, you may end up with a spurious proof. And as a further plot twist, if we make U return 0 rather than 10 in the long-to-compute branch, you might choose the *correct* action due to a spurious proof instead of the natural one.

## Comments (40)

Best*5 points [-]More generally, what makes Q dangerous is that (1) it only settles for a spurious moral argument, doesn't accept natural ones, and (2) what it finds is taken seriously by the agent, acted out. As a result, provability of a spurious moral argument provably implies its truth, which by Loeb's theorem makes it true and forces the agent to be thus misled.

The only difference between Q and normal proof search procedures is that the normal procedures are OK with any proof, while Q dislikes natural proofs and ignores them. And this bit of "motivated skepticism" is sufficient to make the preferred spurious proofs come true, Q doesn't just loop without finding anything.

This is a whole new level of Oracle AI unsafety... Take what it says seriously, and it can argue you into doing anything at all. :-)

*1 point [-]You mean Q instead of P, right? (Edit: Fixed.)

*0 points [-]Right, cousin_it changed some terminology from the post on the list, I didn't notice. (Fixed.)

Excellent! I wonder how far the malicious behavior can be extended.

Here the problem is that A directly uses a valid but malicious inference module Q. If A were built to enumerate proofs by length and act on the first one of the form "A()==a implies U()==u, and A()!=a implies U()<=u", can U be rewritten to guarantee that a particular spurious proof comes up first?

Or if A uses a halting oracle to check through statements of that form, can U be written (incorporating that oracle) so that the halting oracle fails to return for the genuine counterfactual and only returns "True" for a particular spurious one?

*4 points [-]These are reasonable questions. I don't know the answers. Would be cool if someone else figured them out ;-)

*2 points [-]My next attempt :) Below, "PA" means Peano Arithmetic.

Proposition: A() returns 1 (if PA is consistent).

Proof:

Let X = "there exist a!=1 and u, such that: A()==a => U()==u and A()!=a => U()<=u".

Let S = "there exists u, such that: A()==1 => U()==u and A()!=1 => U()<u".

Let T = "A()==1".

Let ProvableU(P) = the provability formula of U (PA+Consistent(PA), with proofs of length < N).

Let ProvableA(P) = the provability formula of A (PA).

Let "U⊦ P" mean P is a theorem of U's proof system.

Then:

QED

*0 points [-]Sorry for not noticing your comment earlier. It would be nice to have a way to subscribe to comment threads.

I don't understand why the definitions of X and S imply that S => ~X. Does that rely on specific features of your A and U? For example, if we take a different A and U so that A()==2 and U()==0, then X is true for a=2 and u=0, and S is true for u=1.

Use "Subscribe to RSS Feed" on the right.

*1 point [-]Here's another interesting followup question. My implementation of Q relies on enumerating all proofs in order, so running Q requires exponential time. Is there a "Henkin-style" implementation of Q that constructs the self-referential proof quickly and directly, maybe using the steps of Löb's theorem or something? That's how I first tried to construct Q after reading your comments, and failed, but it might still be possible.

*0 points [-]Maybe something like this:

where IsValidProof(Statement, Proof) is Goedel's Bew function, Proof1 is a formalization of the argument that "if Q() returns a valid proof of X, then A() will return 1, and X will be true", and Proof2 is a formalization of the argument that "a valid proof of A=>B, followed by a valid proof of A, followed by B, is a valid proof".

*0 points [-]The proof of the statement S: "A()==1 implies U()==5, and A()!=1 implies U()<=5" exists, so it will be found by both A() and U(). Any proof of another statement of the same form would imply "there was no proof of length <N of statement S", which would be false given high enough N, so it couldn't be found.

*1 point [-]I can't parse your source code for U, does it return a value in every case?

To write code in comments, prefix lines with four spaces, like this:

Thanks, edited.

*1 point [-]Why is S provable? As far as I can see, you can't easily prove it by Löb's theorem, because "if S is provable then S" is not apparent by inspection of A and U, because A could find a different proof earlier.

I thought it is similar to your proof here.

*1 point [-]In Lobian cooperation, the agents search for proofs of only one statement, never stopping early because they found a proof of something else. Your implementation of A doesn't seem to work like that. Or did I misunderstand and your version of A only looks for proofs where A()==1?

I see what you're trying to do, but can you explain why A would return 1?

The intuition is that the proof of "A()==1 implies U()==5, and A()!=1 implies U()<=5", if it exists, would not depend on N, whereas any proof of "A()==2 implies U()==10..." would have to be longer than N, so by making N large enough...

The setup should make "if S has a proof of length < N, then S" apparent by inspection, answering your earlier objection: if S is (<N)-provable, then U() will find the proof (because finding any other proof first would imply U() is unsound), and then return (A()==1 ? 5 : 0), which requires A() to return 1, otherwise U() is unsound again.

*0 points [-]I thought it couldn't find any other proofs of length < N, because it would imply there was no proof of S. But this is not a problem if S is false... Ok, modification:

EDIT: Wait, this is not good, now if(A()==2) is unreachable...

EDIT2: No, not actually unreachable, but any proof for a statement of the form "A()==2 => U()==10..." must be of length > N, which is what was needed, I suppose. Still feels like cheating, but I'm not sure why...

*1 point [-]What's the intended consequence of A()==2 in your implementation of U? Is it U()==0 or U()==10? And which of those would actually happen?

*0 points [-]What is your definition of UDT?

If you use the definition given in this paper, then that definition of A is not UDT, simply because it is not guaranteed to "[find] a function mapping sense data to actions that will maximize the utility of outcomes weighted by the probability that the outcome will be caused by the procedure returning that mapping."

It is implicit in the definition that the agent must be able to choose from a set of counter-factual action-selection-functions

{f: S -> A}, because it is in this choice the optimisation happens.P can only be a valid proof that

ais the best action exactly whenais the only option, because all other actions where excluded, by the use of non-counter-factual logic.I know that counter-factual calculations are problematic. But if you take the counter-factual-nes out of UDP, then there are literally nothing left to the definition. Without the counter-factual-nes, everything goes. Any choice is the optimal one, because it was the only one, because that was the choice that actually happened.

*0 points [-]Other conclusion is: don't feed the agent it's source code when you feed it a proof that it's output effects a change. Blackbox the output as a variable. When you output a, the world gives u, which is the largest U of any a. Works on newcomb's too (note that this makes no distinction between stuff being put into the boxes before, or after the decision; if the 100% accurate prediction is a given. It simply knows nothing of the causality beyond that implied in the problem)

*0 points [-]To check the supplied proofs, A needs to be able to recognize statements of the form "A()==a implies U()==u, and A()!=a implies U()<=u", which seems to require knowing the source code of A. Or do you propose trying to check proofs of some different form instead, that would only mention U? What would such statements look like?

*-1 points [-]Not when the statements are already given with A() in them, as in the Newcomb's problem. The newcomb's problem doesn't say - here is the omega, and it has a brain sim with such and such neurons connected so and so with such and such weights, and it exposes this brain sim to this entire string, and if this brain sim tells it so and so it puts stuff into one box, otherwise it puts stuff into other box. No, it already says omega predicts you, using the symbol you which already means A() . Nobody has ever presented newcomb to you with an implicit, rather than explicit, self reference.

One could insert a quine, of course, into the algorithm, to allow the algorithm to recognize self; then such algorithm could start off by replacing the 'brain sim with such and such neurons connected so and so' with symbol A() in the input statements (or perhaps running a prover there to try and replace any obscured instances of A() ) and then solve it without further using knowledge of internals of A() .

More philosophically, this may be what our so called reflection is about: fully opaque black box substitution. Albeit one must not read much into it, because the non-opaque substitution does not only lead to 'pollution' of the theorem prover with contradictions any time you think "okay, suppose I give out answer a", but is also terribly impractical.

One could also fix it by assuming that your output is slightly uncertain - epsilon probability of changing the mind - which kills the wrong proofs (and also stops the prediction nonsense and other entirely un-physical ideas, with a side effect that the agent might not work ideally in an ideal world).

*1 point [-]The usual problem with adding randomness is that all copies of the agent must have perfectly correlated sources of randomness (otherwise you end up with something like CDT), and that amounts to explicitly pointing out all copies of the agent.

*-1 points [-]Well too bad for all the theories that can't handle slight randomness, because in our universe you can't implement an agent which does not have slight uncorrelated randomness to it's output; that's just how our universe is.

*1 point [-]I don't understand your comment. Our proof-theoretic algorithms can handle worlds with randomness just fine, because the logical fact of the algorithm's return value will be more or less correlated with many physical facts. The proof-theoretic algorithms don't

requireslight randomness to work, and my grandparent comment gave a reason why it seems hard to write a version that reliescompletelyon slight randomness to kill spurious proofs, as you suggested. Of course if you do find a way, it will be cool.*0 points [-]Well, your grandparent's comment spoke of a problem with adding randomness, rather than with lack of necessity to add randomness. Maybe i simply misunderstood it.

Note btw that the self description will have to include randomness, to be an accurate description of an imperfect agent.

*2 points [-]Let me try to explain. Assume you have an agent that's very similar to our proof-theoretic ones, but also has a small chance of returning a random action. Moreover, the agent's self-description includes a mention of the random variable V that makes the agent return a random action.

The first problem is that statements like "A()==a" are no longer logical statements. That seems easy to fix by making the agent look at conditional expected values instead of logical implications.

The second and more serious problem is this: how many independent copies of V does the world contain? Imagine the agent is faced with Newcomb's Problem. If the world program contains only one copy of V that's referenced by both instances of the agent, that amounts to abusing the problem formulation to tell the agent "your copies are located here and here". And if the two copies of the agent use uncorrelated instances of V, then looking at conditional expected values based on V buys you nothing. Figuring out the best action reduces to same logical question that the proof-theoretic algorithms are faced with.