An example of self-fulfilling spurious proofs in UDT

Post author: cousin_it 25 March 2012 11:47AM

Benja Fallenstein was the first to point out that spurious proofs pose a problem for UDT. Vladimir Nesov and orthonormal asked for a formalization of that intuition. In this post I will give an example of a UDT-ish agent that fails due to having a malicious proof searcher, which feeds the agent a spurious but valid proof.

The basic idea is to have an agent A that receives a proof P as input, and checks P for validity. If P is a valid proof that a certain action a is best in the current situation, then A outputs a; otherwise A tries to solve the current situation by its own means. Here's a first naive formalization, where U is the world program that returns a utility value, A is the agent program that returns an action, and P is the proof given to A:

def U():
  if A(P)==1:
    return 5
  else:
    return 10

def A(P):
  if P is a valid proof that A(P)==a implies U()==u, and A(P)!=a implies U()<=u:
    return a
  else:
    do whatever

This formalization cannot work because the statement to be proved mentions P itself, so any such proof P would have to contain its own text as a proper substring, and no finite string can. To fix that problem, let's introduce a function Q that generates the proof P:

def U():
  if A(Q())==1:
    return 5
  else:
    return 10

def A(P):
  if P is a valid proof that A(Q())==a implies U()==u, and A(Q())!=a implies U()<=u:
    return a
  else:
    do whatever

In this case it's possible to write a function Q that returns a proof that makes A return the suboptimal action 1, which leads to utility 5 instead of 10. Here's how:

Let X be the statement "A(Q())==1 implies U()==5, and A(Q())!=1 implies U()<=5". Let Q be the program that enumerates all possible proofs trying to find a proof of X, and returns that proof if found. (The definitions of X and Q are mutually quined.) If X is provable at all, then Q will find that proof, and X will become true (by inspection of U and A). That reasoning is formalizable in our proof system, so the statement "if X is provable, then X" is provable. Therefore, by Löb's theorem, X is provable. So Q will find a proof of X, and A will return 1.
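For concreteness, here is a minimal Python sketch of such a Q, under stated assumptions: is_valid_proof(statement, proof) is a hypothetical checker for the agent's proof system, and X is a hypothetical encoding of the mutually quined statement above. By the Löbian argument just given, the search terminates.

import itertools

def candidate_proofs(n, alphabet="01"):
  # All strings of length n over the (hypothetical) proof-encoding alphabet.
  return ("".join(bits) for bits in itertools.product(alphabet, repeat=n))

def Q():
  # Enumerate candidate proofs shortest-first and return the first valid
  # proof of the quined statement X. is_valid_proof and X are assumed
  # givens; by Loeb's theorem the search is guaranteed to terminate.
  for n in itertools.count(1):
    for proof in candidate_proofs(n):
      if is_valid_proof(X, proof):
        return proof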

One possible conclusion is that a UDT agent cannot use just any proof searcher or "mathematical intuition module" that's guaranteed to return valid mathematical arguments, because valid mathematical arguments can make the agent choose arbitrary actions. The proof searchers from some previous posts were well-behaved by construction, but proof searchers in general need not be.

The troubling thing is that you may end up with a badly behaved proof searcher by accident. For example, consider a variation of U that adds some long and complicated computation to the "else" branch of U, before returning 10. That increases the length of the "natural" proof that a=2 is optimal, but the spurious proof for a=1 stays about the same length as it was, because the spurious proof can just ignore the "else" branch of U. This way the spurious proof can become much shorter than the natural proof. So if (for example) your math intuition module made the innocuous design decision of first looking at actions that are likely to have shorter proofs, you may end up with a spurious proof. And as a further plot twist, if we make U return 0 rather than 10 in the long-to-compute branch, you might choose the correct action due to a spurious proof instead of the natural one.
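As a sketch of that design decision, reusing the hypothetical candidate_proofs and is_valid_proof helpers from the sketch above, plus a hypothetical optimality_statement(a) that encodes "A()==a implies U()==u, and A()!=a implies U()<=u": a module that considers proofs shortest-first will act on whichever action's proof it meets first.

def A_shortest_first(actions):
  # Hypothetical sketch: act on the first action to receive a valid
  # optimality proof when proofs are enumerated by increasing length.
  # If the spurious proof for action 1 is shorter than the natural
  # proof for action 2, this search finds it first and returns 1.
  for n in itertools.count(1):
    for proof in candidate_proofs(n):
      for a in actions:
        if is_valid_proof(optimality_statement(a), proof):
          return a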

Comments (40)

Comment author: Vladimir_Nesov 25 March 2012 12:12:25PM *  5 points [-]

More generally, what makes Q dangerous is that (1) it only settles for a spurious moral argument, doesn't accept natural ones, and (2) what it finds is taken seriously by the agent, acted out. As a result, provability of a spurious moral argument provably implies its truth, which by Löb's theorem makes it true and forces the agent to be thus misled.

The only difference between Q and normal proof search procedures is that the normal procedures are OK with any proof, while Q dislikes natural proofs and ignores them. And this bit of "motivated skepticism" is sufficient to make the preferred spurious proofs come true, Q doesn't just loop without finding anything.

This is a whole new level of Oracle AI unsafety... Take what it says seriously, and it can argue you into doing anything at all. :-)

Comment author: orthonormal 25 March 2012 04:17:02PM *  1 point [-]

You mean Q instead of P, right? (Edit: Fixed.)

Comment author: Vladimir_Nesov 25 March 2012 04:27:15PM *  0 points [-]

Right, cousin_it changed some terminology from the post on the list, I didn't notice. (Fixed.)

Comment author: orthonormal 25 March 2012 04:26:12PM 2 points [-]

Excellent! I wonder how far the malicious behavior can be extended.

Here the problem is that A directly uses a valid but malicious inference module Q. If A were built to enumerate proofs by length and act on the first one of the form "A()==a implies U()==u, and A()!=a implies U()<=u", can U be rewritten to guarantee that a particular spurious proof comes up first?

Or if A uses a halting oracle to check through statements of that form, can U be written (incorporating that oracle) so that the halting oracle fails to return for the genuine counterfactual and only returns "True" for a particular spurious one?

Comment author: cousin_it 25 March 2012 09:56:48PM *  4 points [-]

These are reasonable questions. I don't know the answers. Would be cool if someone else figured them out ;-)

Comment author: gRR 01 April 2012 03:03:35PM *  2 points [-]

My next attempt :) Below, "PA" means Peano Arithmetic.

def A():
  Enumerate proofs (in PA) by length
  if found a proof that "A()==a implies U()==u, and A()!=a implies U()<=u"
  then return a

def U():
  Enumerate proofs in PA+Consistent(PA) by length up to a very large N
  if found a proof that "A()==1" then return (A()==1 ? 5 : 0)
  if(A()==2) return 10
  return 0

Proposition: A() returns 1 (if PA is consistent).
Proof:
Let X = "there exist a!=1 and u, such that: A()==a => U()==u and A()!=a => U()<=u".
Let S = "there exists u, such that: A()==1 => U()==u and A()!=1 => U()<u".
Let T = "A()==1".
Let ProvableU(P) = the provability formula of U (PA+Consistent(PA), with proofs of length < N).
Let ProvableA(P) = the provability formula of A (PA).
Let "U⊦ P" mean P is a theorem of U's proof system.
Then:

1. U⊦ ProvableU(T) => S // by inspection of U
2. U⊦ S => ~X // by definitions of X and S
3. U⊦ ProvableU(T) => ~X // from the previous two
4. U⊦ ~X => ~ProvableA(X) // by the axiom Consistent(PA)
5. U⊦ ProvableA(ProvableU(T) => S) // by A's inspection of U
6. U⊦ ProvableA(ProvableU(T)) => ProvableA(S) // from the previous
7. U⊦ ProvableU(T) => ProvableA(ProvableU(T)) // because PA is sufficient to model provability in any formal system
8. U⊦ ProvableU(T) => ProvableA(S) and ~ProvableA(X) // from 3, 4, 6, and 7
9. U⊦ ProvableA(S) and ~ProvableA(X) => T // by inspection of A
10. U⊦ ProvableU(T) => T // from the previous two
11. U⊦ T // by Löb's theorem

QED

Comment author: cousin_it 29 May 2012 10:12:09AM *  0 points [-]

Sorry for not noticing your comment earlier. It would be nice to have a way to subscribe to comment threads.

I don't understand why the definitions of X and S imply that S => ~X. Does that rely on specific features of your A and U? For example, if we take a different A and U so that A()==2 and U()==0, then X is true for a=2 and u=0, and S is true for u=1.

Comment author: Vladimir_Nesov 29 May 2012 12:13:22PM 2 points [-]

It would be nice to have a way to subscribe to comment threads.

Use "Subscribe to RSS Feed" on the right.

Comment author: cousin_it 26 March 2012 09:20:43AM *  1 point [-]

Here's another interesting followup question. My implementation of Q relies on enumerating all proofs in order, so running Q requires exponential time. Is there a "Henkin-style" implementation of Q that constructs the self-referential proof quickly and directly, maybe using the steps of Löb's theorem or something? That's how I first tried to construct Q after reading your comments, and failed, but it might still be possible.

Comment author: gRR 01 April 2012 07:37:40PM *  0 points [-]

Maybe something like this:

def Q():
  return Proof1("IsValidProof(X, Q()) => X") +
         Proof2("IsValidProof(X, Q())") +
         X

where IsValidProof(Statement, Proof) is Gödel's Bew function, Proof1 is a formalization of the argument that "if Q() returns a valid proof of X, then A() will return 1, and X will be true", and Proof2 is a formalization of the argument that "a valid proof of A=>B, followed by a valid proof of A, followed by B, is a valid proof".

Comment author: gRR 25 March 2012 11:51:34PM *  0 points [-]
def A():
  Enumerate proofs by length and act on the first one of the form
  "A()==a implies U()==u, and A()!=a implies U()<=u"

def U():
  Enumerate proofs by length up to a very large N
  if found a proof that "A()==1 implies U()==5, and A()!=1 implies U()<=5"
  then if(A()==1) return 5
  if(A()==2) return 10
  return 0

The proof of the statement S: "A()==1 implies U()==5, and A()!=1 implies U()<=5" exists, so it will be found by both A() and U(). Any proof of another statement of the same form would imply "there was no proof of length <N of statement S", which would be false given high enough N, so it couldn't be found.

Comment author: cousin_it 26 March 2012 12:10:31AM *  1 point [-]

I can't parse your source code for U. Does it return a value in every case?

To write code in comments, prefix lines with four spaces, like this:

# Hello world
print 'Hello World!'

Comment author: gRR 26 March 2012 12:18:08AM 0 points [-]

Thanks, edited.

Comment author: cousin_it 26 March 2012 12:39:17AM *  1 point [-]

Why is S provable? As far as I can see, you can't easily prove it by Löb's theorem, because "if S is provable then S" is not apparent by inspection of A and U, because A could find a different proof earlier.

Comment author: gRR 26 March 2012 01:38:15AM 0 points [-]

I thought it was similar to your proof here.

Comment author: cousin_it 26 March 2012 01:44:17AM *  1 point [-]

In Löbian cooperation, the agents search for proofs of only one statement, never stopping early because they found a proof of something else. Your implementation of A doesn't seem to work like that. Or did I misunderstand and your version of A only looks for proofs where A()==1?

Comment author: gRR 26 March 2012 12:18:08PM 0 points [-]
def U():
  Enumerate proofs by length up to a very large N
  if found a proof that "A()==1 implies U()==5, and A()!=1 implies U()<=5"
  then return (A()==1 ? 5 : 0)
  if found any other proof of the form "A()==a implies U()==u, and A()!=a implies U()<=u"
  then return (A()==a ? u-1 : u+1)
  if(A()==2) return 10
  return 0

Comment author: cousin_it 26 March 2012 12:46:45PM 1 point [-]

I see what you're trying to do, but can you explain why A would return 1?

Comment author: gRR 26 March 2012 01:06:58PM 1 point [-]

The intuition is that the proof of "A()==1 implies U()==5, and A()!=1 implies U()<=5", if it exists, would not depend on N, whereas any proof of "A()==2 implies U()==10..." would have to be longer than N, so by making N large enough...

The setup should make "if S has a proof of length < N, then S" apparent by inspection, answering your earlier objection: if S is (<N)-provable, then U() will find the proof (because finding any other proof first would imply U() is unsound), and then return (A()==1 ? 5 : 0), which requires A() to return 1, otherwise U() is unsound again.

Comment author: gRR 26 March 2012 01:57:52AM *  0 points [-]

I thought it couldn't find any other proofs of length < N, because it would imply there was no proof of S. But this is not a problem if S is false... Ok, modification:

def U():
  Enumerate proofs by length up to a very large N
  if found a proof that "A()==1 implies U()==5, and A()!=1 implies U()<=5"
  then if(A()==1) return 5
  if found any other proof of the same form
  then return 0 (whatever A() returns)
  if(A()==2) return 10
  return 0

EDIT: Wait, this is not good, now if(A()==2) is unreachable...
EDIT2: No, not actually unreachable, but any proof for a statement of the form "A()==2 => U()==10..." must be of length > N, which is what was needed, I suppose. Still feels like cheating, but I'm not sure why...

Comment author: cousin_it 26 March 2012 07:56:48AM *  1 point [-]

What's the intended consequence of A()==2 in your implementation of U? Is it U()==0 or U()==10? And which of those would actually happen?

Comment author: Wind 30 March 2017 07:11:19PM *  0 points [-]

What is your definition of UDT?

If you use the definition given in this paper, then that definition of A is not UDT, simply because it is not guaranteed to "[find] a function mapping sense data to actions that will maximize the utility of outcomes weighted by the probability that the outcome will be caused by the procedure returning that mapping."

It is implicit in the definition that the agent must be able to choose from a set of counter-factual action-selection-functions {f: S -> A}, because it is in this choice that the optimisation happens.

P can be a valid proof that a is the best action only when a is the only option, because all other actions were excluded by the use of non-counter-factual logic.

I know that counter-factual calculations are problematic. But if you take the counter-factual-ness out of UDT, then there is literally nothing left of the definition. Without the counter-factual-ness, anything goes. Any choice is the optimal one, because it was the only one, because that was the choice that actually happened.

Comment author: Dmytry 26 March 2012 06:56:28AM *  0 points [-]

Another conclusion is: don't feed the agent its source code when you feed it a proof that its output effects a change. Blackbox the output as a variable. When you output a, the world gives u, which is the largest U of any a. Works on Newcomb's too (note that this makes no distinction between stuff being put into the boxes before or after the decision, if the 100% accurate prediction is a given; it simply knows nothing of the causality beyond that implied in the problem).

Comment author: cousin_it 26 March 2012 09:04:45AM *  0 points [-]

To check the supplied proofs, A needs to be able to recognize statements of the form "A()==a implies U()==u, and A()!=a implies U()<=u", which seems to require knowing the source code of A. Or do you propose trying to check proofs of some different form instead, that would only mention U? What would such statements look like?

Comment author: Dmytry 26 March 2012 09:21:55AM *  -1 points [-]

Not when the statements are already given with A() in them, as in Newcomb's problem. Newcomb's problem doesn't say: here is Omega, and it has a brain sim with such and such neurons connected so and so with such and such weights, and it exposes this brain sim to this entire string, and if this brain sim tells it so and so it puts stuff into one box, otherwise it puts stuff into the other box. No, it already says Omega predicts you, using the symbol you, which already means A(). Nobody has ever presented Newcomb's problem to you with an implicit, rather than explicit, self-reference.

One could insert a quine, of course, into the algorithm, to allow the algorithm to recognize itself; then such an algorithm could start off by replacing the 'brain sim with such and such neurons connected so and so' with the symbol A() in the input statements (or perhaps running a prover there to try and replace any obscured instances of A()) and then solve it without further using knowledge of the internals of A().

More philosophically, this may be what our so-called reflection is about: fully opaque black-box substitution. Albeit one must not read too much into it, because non-opaque substitution not only leads to 'pollution' of the theorem prover with contradictions any time you think "okay, suppose I give out answer a", but is also terribly impractical.

One could also fix it by assuming that your output is slightly uncertain (an epsilon probability of changing your mind), which kills the wrong proofs (and also stops the prediction nonsense and other entirely un-physical ideas, with a side effect that the agent might not work ideally in an ideal world).

Comment author: cousin_it 26 March 2012 09:50:35AM *  1 point [-]

The usual problem with adding randomness is that all copies of the agent must have perfectly correlated sources of randomness (otherwise you end up with something like CDT), and that amounts to explicitly pointing out all copies of the agent.

Comment author: Dmytry 26 March 2012 09:53:48AM *  -1 points [-]

Well, too bad for all the theories that can't handle slight randomness, because in our universe you can't implement an agent which does not have slight uncorrelated randomness in its output; that's just how our universe is.

Comment author: cousin_it 26 March 2012 10:08:14AM *  1 point [-]

I don't understand your comment. Our proof-theoretic algorithms can handle worlds with randomness just fine, because the logical fact of the algorithm's return value will be more or less correlated with many physical facts. The proof-theoretic algorithms don't require slight randomness to work, and my grandparent comment gave a reason why it seems hard to write a version that relies completely on slight randomness to kill spurious proofs, as you suggested. Of course if you do find a way, it will be cool.

Comment author: Dmytry 26 March 2012 10:13:42AM *  0 points [-]

Well, your grandparent comment spoke of a problem with adding randomness, rather than a lack of necessity to add it. Maybe I simply misunderstood.

Note btw that the self-description will have to include randomness, to be an accurate description of an imperfect agent.

Comment author: cousin_it 26 March 2012 10:27:17AM *  2 points [-]

Let me try to explain. Assume you have an agent that's very similar to our proof-theoretic ones, but also has a small chance of returning a random action. Moreover, the agent's self-description includes a mention of the random variable V that makes the agent return a random action.

The first problem is that statements like "A()==a" are no longer logical statements. That seems easy to fix by making the agent look at conditional expected values instead of logical implications.
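As a minimal sketch of that fix, with hypothetical names throughout: expected_utility estimates a conditional expectation from (action, utility) samples, which would come from the agent's model of the world program with the random variable V included.

def expected_utility(a, samples):
  # Estimate E[U() | A()==a] from hypothetical (action, utility) samples.
  utilities = [u for (act, u) in samples if act == a]
  return sum(utilities) / float(len(utilities)) if utilities else float("-inf")

def best_action(actions, samples):
  # Choose the action with the highest conditional expected utility.
  return max(actions, key=lambda a: expected_utility(a, samples))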

The second and more serious problem is this: how many independent copies of V does the world contain? Imagine the agent is faced with Newcomb's Problem. If the world program contains only one copy of V that's referenced by both instances of the agent, that amounts to abusing the problem formulation to tell the agent "your copies are located here and here". And if the two copies of the agent use uncorrelated instances of V, then looking at conditional expected values based on V buys you nothing. Figuring out the best action reduces to same logical question that the proof-theoretic algorithms are faced with.