# Naturalistic trust among AIs: The parable of the thesis advisor's theorem

Eliezer and Marcello's article on tiling agents and the Löbian obstacle discusses several things that you intuitively would expect a rational agent to be able to do that, because of Löb's theorem, are problematic for an agent using logical reasoning. One of these desiderata is *naturalistic trust*: Imagine that you build an AI that uses PA for its mathematical reasoning, and this AI happens to find in its environment an automated theorem prover which, the AI carefully establishes, *also* uses PA for its reasoning. Our AI looks at the theorem prover's display and sees that it flashes a particular lemma that would be very useful for our AI in its own reasoning; the fact that it's on the prover's display means that the prover has just completed a formal proof of this lemma. Can our AI now use the lemma? Well, even if it can establish in its own PA-based reasoning module that *there exists a proof* of the lemma, by Löb's theorem this doesn't imply in PA that the lemma is in fact true; as Eliezer would put it, our agent treats proofs checked inside the boundaries of its own head different from proofs checked somewhere in the environment. (The above isn't fully formal, but the formal details can be filled in.)

At the MIRI's December workshop (which started today), we've been discussing a suggestion by Nik Weaver for how to handle this problem. Nik starts from a simple suggestion (which he doesn't consider to be *entirely* sufficient, and his linked paper is mostly about a much more involved proposal that addresses some remaining problems, but the simple idea will suffice for this post): Presumably there's some instrumental reason that our AI proves things; suppose that in particular, the AI will only take an action after it has proven that it is "safe" to take this action (e.g., the action doesn't blow up the planet). Nik suggests to relax this a bit: The AI will only take an action after it has (i) proven in PA that taking the action is safe; **OR** (ii) proven in PA that it's provable in PA that the action is safe; **OR** (iii) proven in PA that it's provable in PA that it's provable in PA that the action is safe; etc.

Now suppose that our AI sees that lemma, A, flashing on the theorem prover's display, and suppose that our AI can prove that A implies that action X is safe. Then our AI can also prove that *it's provable* that A -> safe(X), and it can prove that A is provable because it has established that the theorem prover works correctly; thus, it can prove that *it's provable* that safe(X), and therefore take action X.

Even if the theorem prover has only proved that A is provable, so that the AI only knows that it's provable that A is provable, it can use the same sort of reasoning to prove that it's provable that it's provable that safe(X), and again take action X.

But on hearing this, Eliezer and I had the same skeptical reaction: It seems that our AI, in an informal sense, "trusts" that A is true if it finds (i) a proof of A, or (ii) a proof that A is provable, or -- etc. Now suppose that the theorem prover our AI is looking at flashes statements on its display after it has established that they are "trustworthy" in this sense -- if it has found a proof, or a proof that there is a proof, etc. Then when A flashes on the display, our AI can only prove that there exists some *n* such that it's "provable^*n*" that A, and that's *not* enough for it to use the lemma. If the theorem prover flashed *n* on its screen together with A, everything would be fine and dandy; but if the AI doesn't know *n*, it's not able to use the theorem prover's work. So it still seems that the AI is unwilling to "trust" another system that reasons just like the AI itself.

I want to try to shed some light on this obstacle by giving an intuition for why the AI's behavior here could, in some sense, be considered to be the right thing to do. Let me tell you a little story.

One day you talk with a bright young mathematician about a mathematical problem that's been bothering you, and she suggests that it's an easy consequence of a theorem in cohistonomical tomolopy. You haven't heard of this theorem before, and find it rather surprising, so you ask for the proof.

"Well," she says, "I've heard it from my thesis advisor."

"Oh," you say, "fair enough. Um--"

"Yes?"

"You're sure that your advisor checked it carefully, right?"

"Ah! Yeah, I made quite sure of that. In fact, I established very carefully that my thesis advisor uses exactly the same system of mathematical reasoning that I use myself, and only states theorems after she has checked the proof beyond any doubt, so as a rational agent I am compelled to accept anything as true that she's convinced herself of."

"Oh, I see! Well, fair enough. I'd still like to understand why this theorem is true, though. You wouldn't happen to know your advisor's proof, would you?"

"Ah, as a matter of fact, I do! She's heard it from her thesis advisor."

"..."

"Something the matter?"

"Er, have you considered..."

"*Oh!* I'm glad you asked! In fact, I've been curious myself, and yes, it *does* happen to be the case that there's an infinitely descending chain of thesis advisors all of which have established the truth of this theorem solely by having heard it from the previous advisor in the chain." *(This parable takes place in a world without a big bang -- human history stretches infinitely far into the past.)* "But never to worry -- they've all checked very carefully that the previous person in the chain used the same formal system as themselves. Of course, that was obvious by induction -- my advisor wouldn't have accepted it from her advisor without checking his reasoning first, and he would have accepted it from his advisor without checking, etc."

"Uh, doesn't it bother you that nobody has ever, like, actually *proven* the theorem?"

"Whatever in the world are you talking about? I've proven it myself! In fact, I just told you that *infinitely many* people have each proved it in slightly different ways -- for example my own proof made use of the fact that my advisor had proven the theorem, whereas her proof used *her* advisor instead..."

This can't *literally* happen with a sound proof system, but the reason is that that a system like PA can only accept things as true if they have been proven in a system *weaker* than PA -- i.e., because we have Löb's theorem. Our mathematician's advisor would have to use a weaker system than the mathematician herself, and the advisor's advisor a weaker system still; this sequence would have to terminate after a finite time *(I don't have a formal proof of this, but I'm fairly sure you can turn the above story into a formal proof that something like this has to be true of sound proof systems)*, and so *someone* will actually have to have proved the actual theorem on the object level.

So here's my intuition: A satisfactory solution of the problems around the Löbian obstacle will have to make sure that the buck doesn't get passed on indefinitely -- you can accept a theorem because someone reasoning like you has established that someone else reasoning like you has proven the theorem, but there can only be a finite number of links between you and someone who has actually done the object-level proof. We know how to do this by decreasing the mathematical strength of the proof system, and that's not satisfactory, but my intuition is that a satisfactory solution will still have to make sure that there's *something* that decreases when you go up the chain of thesis advisors, and when that thing reaches zero you've found the thesis advisor that has actually proven the theorem. (I sense ordinals entering the picture.)

...aaaand in fact, I *can* now tell you one way to do *something* like this: Nik's idea, which I was talking about above. Remember how our AI "trusts" the theorem prover that flashes the number *n* which says how many times you have to iterate "that it's provable in PA that", but doesn't "trust" the prover that's exactly the same except it doesn't tell you this number? That's the thing that decreases. If the theorem prover actually establishes A by observing a *different* theorem prover flashing A and the number 1584, then it can flash A, but only with a number at least 1585. And hence, if you go 1585 thesis advisors up the chain, you find the gal who actually proved A.

The cool thing about Nik's idea is that it *doesn't* change mathematical strength while going down the chain. In fact, it's not hard to show that if PA proves a sentence A, then it also proves that PA proves A; and the other way, we believe that everything that PA proves is actually true, so if PA proves *PA proves A*, then it follows that PA proves A.

I can guess what Eliezer's reaction to my argument here might be: The problem I've been describing can only occur in infinitely large worlds, which have all sorts of other problems, like utilities not converging and stuff.

We settled for a large finite TV screen, but we could have had an arbitrarily larger finite TV screen. #infiniteworldproblems

We have Porsches for every natural number, but at every time t we have to trade down the Porsche with number t for a BMW. #infiniteworldproblems

We have ever-rising expectations for our standard of living, but the limit of our expectations doesn't equal our expectation of the limit. #infiniteworldproblems

*-- Eliezer, not coincidentally after talking to me*

I'm not going to be able to resolve that argument in this post, but briefly: I agree that we *probably* live in a finite world, and that finite worlds have many properties that make them nice to handle mathematically, but we can formally reason about infinite worlds of the kind I'm talking about here using standard, extremely well-understood mathematics.

Because proof systems like PA (or more conveniently ZFC) allow us to formalize this standard mathematical reasoning, a solution to the Löbian obstacle has to "work" properly in these infinite worlds, or we would be able to turn our story of the thesis advisors' proof that 0=1 into a formal proof of an inconsistency in PA, say. To be concrete, consider the system PA*, which consists of PA + the axiom schema "if PA* proves phi, then phi" for every formula phi; this is easily seen to be inconsistent by Löb's theorem, but if we didn't know that yet, we could translate the story of the thesis advisors (which are using PA* as their proof system this time) into a formal proof of the inconsistency of PA*.

Therefore, thinking intuitively in terms of infinite worlds can give us insight into why many approaches to the Löbian family of problems fail -- as long as we make sure that these infinite worlds, and their properties that we're using in our arguments, really *can* be formalized in standard mathematics, of course.

## Comments (20)

Best*8 points [-]Forgive me if this is a stupid question, but all of this seems to be very fragile and dependent on subtle properties. It seems like even the slightest modifications, like giving the theorem prover a (possibly very tiny, but nonzero) probability of error, which would always occur in a real-life scenario, would ruin both the Lobian argument and the solution given here. How is this argument at all relevant to real-life ai development?

*9 points [-]Yes, a real-life reasoner would have to use probabilistic reasoning to carry out these sorts of inference. We do not have a real understanding yet of how to do probabilistic reasoning about logical statements, though, although there has been a bit of work about it in the past. This is one topic MIRI is currently doing research on. In the meantime, we also examine problems of self-reference in ordinary deductive logic, since we understand it very well. It's not certain that the results there will carry over in any way into the probabilistic setting, and it's

possiblethat these problems simply disappear if you go to probabilistic reasoning, but there's no reason to consider this overwhelmingly likely, and if they don't it seems likely that at least some insights gained while thinking about deductive logic will carry over. In addition, when an AI reasons about another AI, it seems likely that it will use deductive logic when reasoning about the other AI's source code, even if it also has to use probabilistic reasoning to connect the results obtained that way to the real world, where the other AI runs on an imperfect processor and its source code isn't known with certainty.More here.

I'll accept that doing

everythingprobabilistically is expensive, but I really don't see how it wouldn't solve the problem to at least assign probabilities toimportedstatements. The more elements in the chain of trust, the weaker it is. Eventually, someone needs it reliably enough that it becomes necessary to check it.And of course any chain of trust like that ought to have a system for providing proof upon demand, which will be invoked roughly every N steps of trust. The recipients of the proof would then become nodes of authority on the issue.

This seems rather how actual people operate (though we often skip the 'where to get proof of this' step), and so any proof that it will become unworkable has a bit of a steep hill to climb.

I see. Thanks for the link.

*3 points [-]I'm under the impression that all these "Löbian obstacles" essentially boil down to the "you can't be sure that you are not insane" kind of stuff.

IIUC, your thesis advisors scenario can be solved by having a first thesis advisor who only outputs theorems which are provable within some formal system and every other thesis advisor trusting by axiom his predecessor.

This doesn't work in your (MIRI's) framework because you want the agent to trust not its ancestors but its successors, and you want this to hold for an unbounded number of generations.

But this may just be a problem of your framework, not some fundamental propriety of computational agents. At least, you haven't produced a satisfactory argument for that.

Anyway, I think that framing these problems in terms of provability logic might be inappropriate, as concepts like "beliefs", "trust" or "agents" are difficult to formalize in that language, hence you get a mix of formal and informal reasoning, which is the traditional recipe for paradoxes and errors.

It's much better to frame these problems directly in terms of computer programs. This way agents, internal states, world physics, etc., can be defined in a much more precise way.

In this framework the main "Gödelian" result is Rice's theorem, which, among other things, implies that functional equivalence between arbitrary programs is uncomputable. Nevertheless, there exist under-approximations of functional equivalence which are computable, and this can be exploited by "cliquing" strategies in scenarios where programs have to "trust" each other (e.g. program-equilibrium prisoner's dilemma).

Cliquing based on a lexical equivalence predicate is restrictive, obviously, but it can be generalized on any predicate that under-approximates functional equivalence.

Of course, this assumes that you start with a program that is free of bugs, OS and compilers are free of bugs, hardware never glitches, and so on. But, unrealistic as it is, this seems to be an unavoidable assumption: no program will ever be able to debug itself, or to certify that the hardware it runs on works as advertised.

*1 point [-]Yeah, good points. Previously discussed in this post and this comment. (I see you also commented there.) I'd love to see more progress in that direction.

Thanks. Added a comment in the WD/EY subthread.

Great post. Thank you for improving my intuition.

Probably a stupid question: What is the difference between a proof, and a proof that something is provable? I understand the difference between a constructive proof and an existence proof, but in the case of proofs it seems that they should be the same.

Do we have any actual examples of proof-by-proving-provability?

There is a way to write a predicate Proves(p,f) in the language of PA which is true if f is the Gödel number of a formula and p is the Gödel number of a proof of that formula from the axioms of PA. You can then define a predicate Provable(f) := exists p. Proves(p,f); then Provable(f) says that f is the Gödel number of a provable formula. Writing "A" for the Gödel number of the formula A, we can then write

PA |- Provable("A")

to say that there's a proof that A is provable, and

PA |- Provable("Provable("A")")

to say that there's a proof that there's a proof that A is provable. (Of course, PA |- A just says that there is a proof of A.)

On the metalevel, we know that

PA |- A

if and only if

PA |- Provable("A").

On the other hand, PA proves neither A -> Provable("A") nor Provable("A") -> A for general A.

(Hope that helps?)

Let me see if I can put that in my own words; if not, I didn't understand it. You are saying that humans, who do not operate strictly by PA, know that a proof of the existence of a proof is itself a proof; but a reasoner strictly limited to PA would not know any such thing, because it's not a theorem of PA. (PA being just an example - it could be any formal system, or at least any formal system that doesn't include the concept of proofs among its atoms, or concepts.) So such a reasoner can be shown a proof that a proof of A exists, but will not know that A is therefore a theorem of PA. Correct?

To me this seems more like a point about limitations of PA than about AI or logic per se; my conclusion would be "therefore, any serious AI needs a formal system with more oomph than PA". Is this a case of looking at PA "because that's where the light is", ie it's easy to reason about; or is there a case that solving such problems can inform reasoning about more realistic systems?

Strictly speaking, Lob's Theorem doesn't show that PA doesn't prove that the provability of any statement implies that statement. It just shows that if you have a statement in PA of the form (If S is provable, then S), you can use this to prove S. The part about PA not proving any implications of that form for a false S only follows if we assume that PA is sound.

Therefore, replacing PA with a stronger system or adding primitive concepts of provability in place of PA's complicated arithmetical construction won't help. As long as it can do everything PA can do (for example, prove that it can prove things it can prove), it will always be able to get from (If S is provable, then S) to S, even if S is 3*5=56..

*0 points [-]Let me see what happens if I put in a specific example. Suppose that

is a theorem of PA. Let me refer to "3*5=35 is a theorem" as sentence 1; "3*5=35" as sentence 2, and the implication 1->2 as sentence 3. Now, if 3 is a theorem, then you can use PA to prove 2 even without actually showing 1; and then PA has proven a falsehood, and is inconsistent. Is that a correct statement of the problem?

If so... I seem to have lost track of the original difficulty, sorry. Why is it a worry that PA will assert that it can prove something false, but not a worry that it will assert something false? If you're going to worry that sentence 3 is a theorem, why not go straight to worrying that sentence 2 is a theorem?

We generally take it for granted that sentence 2 (because it is false) is not a theorem (and therefore sentence 1 is false), and the argument is meant to show that sentence 3 is therefore also not a theorem (even though it is true, since sentence 1 is false). This is a problem because we would like to use reasoning along the lines of sentence 3.

To see the real issue, you should replace sentence 2 with a conjecture whose truth you don't yet know but would like to know (and modify sentences 1 and 3 correspondingly). Now wouldn't you like sentence 3 to be a true? If you knew that sentence 1 was true, wouldn't you like to conclude that sentence 2 is true? Yet if you're PA, then you can't do that.

The problem with that is that the same argument goes through in exactly the same way with any stronger system replacing PA. You might first try something like adding a rule "if PA proves that PA proves S, then S". This solves your original problem, but introduces new ones: there are now new statements that your system can prove that it can prove, but that it can't prove. Eliezer discusses this system, under the name PA+1, in You Provably Can't Trust Yourself .

Why is that a problem? Except in the same sense that gravity is a problem when you want to lift something.

It's just a fact that people and machines make mistakes. The ideal approach to results proved by other beings is to treat them as less than certain. Indeed, an ideal reasoner might want to extend this principle to its own thoughts too, if only a sensible way to do this could be found.

If you want an ideal reasoner to treat conclusions it has derived itself, and conclusions derived by another reasoner which is probably also ideal, with the same degree of credence, then the ideal approach is for the first reasoner to doubt its own conclusions in the way that it doubts the conclusions of the other reasoner.

Throughout, I have meant "ideal" from a perspective of epistemic caution, where you value not believing wrong things. If you're in a hurry, or really do trust certain other reasoners, then of course you can simply design a reasoner to treat the output of certain other reasoners as apriori valid.

*1 point [-]The "tiling agents" issue mostly wasn't relevant to this article, but that

isthe goal of getting around Lob's Theorem. If and only if you can get around Lob's Theorem and prove things regarding formal systems as complex as yourself,then and only thencan you construct a reasoning system more powerful than yourself and set it to work on your own task.Otherwise, you can't prove that an agent with a higher-order logic than yourself will function according to your goals or, put another way, retain your beliefs (ie: I don't want to replace myself with an agent who will reason that the sky is green when I'm quite sure it's blue).

Is biting the bullet on the thesis advisor analogy really such a problem? Given an infinite human history, it seems like if the proposition was actually false then at some point someone would have noticed.

You're thinking of real human beings, when this is just a parable used to make a mathematical point. The "advisors" are formal deterministic algorithms without the ability to jump out of the system and question their results.

*0 points [-]Can the concept of "my" reasoning, as opposed to "someone else's" reasoning, be formalized? Suppose I manage to write an AI that can run on my laptop. I boot my laptop up, run the program, and it finds a proof of X. It writes "a proof of X exists" on my hard drive in the "what statements has proofs database". I shut my laptop down, reboot my laptop, and restart the program. It checks the "what statements has proofs database", and finds the "a proof of X exists" statement. Is that "its" reasoning? It was created by that program. But it wasn't created by

that instanceof the program. What if a bunch of people are running AIs, all with the same source code, on different laptops. If one of those AIs queries another, is that "its own reasoning", or another AI's? What if Saturday, an AI says "I knew on Friday that X is true, I didn't record my proof, but I know that I am a rational thinker, so I know it's true". But the reason it knew on Friday that X is true was that it knew on Thursday that X is true, and so on. Is this the same paradox?Also, there is a difference between

creatingan AI that you trust, andfindingone. One can imagine a universe in which an infinite past chain of AIs exists, but the idea ofcreatingan infinite chain of reasoning is a lot harder to imagine.There are some problems with the idea that PA can be perfectly substantiated physically, but putting that aside, there's even further problems in extrapolating into the past. What's to stop me from creating an AI with the same source code as you, and putting "2+2 =5 is true" in its memory banks? Can you tell the difference between an AI that has put "there is a proof of X" in its memory banks according to its programming, and an AI that was simply created with "there is a proof of X" in its memory banks to begin with? Even if it "remembers" going through the process of coming up with a proof, that memory could have been there to begin with. If it remembers all of its reasoning, you could check that that reasoning is consistent with its source code, but that would take as much work as just running the calculations over again.