Followup to: Where Recursive Justification Hits Bottom, Löb's Theorem

Peano Arithmetic seems pretty trustworthy.  We've never found a case where Peano Arithmetic proves a theorem T, and yet T is false in the natural numbers.  That is, we know of no case where []T ("T is provable in PA") and yet ~T ("not T").

We also know of no case where first-order logic is invalid:  We know of no case where first-order logic produces false conclusions from true premises. (Whenever first-order statements H are true of a model, and we can syntactically deduce C from H, checking C against the model shows that C is also true.)

Combining these two observations, it seems like we should be able to get away with adding a rule to Peano Arithmetic that says:

All T:  ([]T -> T)

But Löb's Theorem seems to show that as soon as we do that, everything becomes provable.  What went wrong?  How can we do worse by adding a true premise to a trustworthy theory?  Is the premise not true—does PA prove some theorems that are false?  Is first-order logic not valid—does it sometimes prove false conclusions from true premises?
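For reference, here is the standard statement of Löb's Theorem, in the same notation, but with a subscript added to the box to keep track of which system's provability it talks about (the subscript is an annotation added for this discussion, not notation from the earlier posts):

For any sentence T:  if S proves ([]_S T -> T), then S proves T

This holds not only for S = PA but for any recursively axiomatized theory extending PA that satisfies the usual derivability conditions; keeping the subscript explicit is what makes the resolution below easy to state.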

Actually, there's nothing wrong with reasoning from the axioms of Peano Arithmetic plus the axiom schema "Anything provable in Peano Arithmetic is true."  But the result is a different system from PA, which we might call PA+1.  PA+1 does not reason from identical premises to PA; something new has been added.  So we can evade Löb's Theorem because PA+1 is not trusting itself—it is only trusting PA.

If you are not previously familiar with mathematical logic, you might be tempted to say, "Bah!  Of course PA+1 is trusting itself! PA+1 just isn't willing to admit it!  Peano Arithmetic already believes anything provable in Peano Arithmetic—it will already output anything provable in Peano Arithmetic as a theorem, by definition! How does moving to PA+1 change anything, then?  PA+1 is just the same system as PA, and so by trusting PA, PA+1 is really trusting itself. Maybe that dances around some obscure mathematical problem with direct self-reference, but it doesn't evade the charge of self-trust."

But PA+1 and PA really are different systems; in PA+1 it is possible to prove true statements about the natural numbers that are not provable in PA.  If you're familiar with mathematical logic, you know this is because some nonstandard models of PA are ruled out in PA+1. Otherwise you'll have to take my word that Peano Arithmetic doesn't fully describe the natural numbers, and neither does PA+1, but PA+1 characterizes the natural numbers slightly better than PA.

The deeper point is the enormous gap, the tremendous difference, between having a system just like PA except that it trusts PA, and a system just like PA except that it trusts itself.

If you have a system that trusts PA, that's no problem; we're pretty sure PA is trustworthy, so the system is reasoning from true premises. But if you have a system that looks like PA—having the standard axioms of PA—but also trusts itself, then it is trusting a self-trusting system, something for which there is no precedent.  In the case of PA+1, PA+1 is trusting PA which we're pretty sure is correct.  In the case of Self-PA it is trusting Self-PA, which we've never seen before—it's never been tested, despite its misleading surface similarity to PA.  And indeed, Self-PA collapses via Löb's Theorem and proves everything—so I guess it shouldn't have trusted itself after all!  All this isn't magic; I've got a nice Cartoon Guide to how it happens, so there's no good excuse for not understanding what goes on here.
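To make the difference concrete, here is one way of writing the two systems side by side in the subscripted notation from above (the names are the ones used in this post; actually pinning down Self-PA as a recursively axiomatized theory whose axioms mention its own provability predicate takes some care, and is not attempted here):

PA+1  =  PA  +  the schema  ([]_PA T -> T),  for every sentence T
Self-PA  =  PA  +  the schema  ([]_Self-PA T -> T),  for every sentence T

In PA+1 the box refers to the old, fixed system PA, so Löb's Theorem gets no purchase: nothing of the form ([]_PA+1 T -> T) has been asserted as an axiom. In Self-PA the box refers to Self-PA itself, so Self-PA asserts ([]_Self-PA T -> T) for every sentence T, and Löb's Theorem then hands back a proof of T itself, for every T, including 0 = 1.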

I have spoken of the Type 1 calculator that asks "What is 2 + 3?" when the buttons "2", "+", and "3" are pressed; versus the Type 2 calculator that asks "What do I calculate when someone presses '2 + 3'?"  The first calculator answers 5; the second calculator can truthfully answer anything, even 54.

But this doesn't mean that all calculators that reason about calculators are flawed.  If I build a third calculator that asks "What does the first calculator answer when I press '2 + 3'?", perhaps by calculating out the individual transistors, it too will answer 5. Perhaps this new, reflective calculator will even be able to answer some questions faster, by virtue of proving that some faster calculation is isomorphic to the first calculator.
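A minimal code sketch of the three calculators may help; the function names, and the modeling of the second calculator as "any self-consistent answer counts as truthful," are my own illustrative gloss, not anything more than that:

    def calc1(expr):
        """Type 1: just computes the arithmetic expression it was given."""
        left, op, right = expr.split()
        assert op == "+"
        return int(left) + int(right)

    def calc3(expr):
        """The reflective calculator: answers "what does calc1 output on expr?"
        by simulating calc1 (here, simply by calling it)."""
        return calc1(expr)

    def calc2_is_truthful(expr, claimed):
        """Type 2: asks "what do I calculate when someone presses expr?"
        The only fact the question is about is its own answer, so any answer,
        once given, makes itself true: every claimed value is a fixed point."""
        def calc2(e):
            return claimed if e == expr else None
        return calc2(expr) == claimed

    print(calc1("2 + 3"))                  # 5
    print(calc3("2 + 3"))                  # 5
    print(calc2_is_truthful("2 + 3", 5))   # True
    print(calc2_is_truthful("2 + 3", 54))  # True: 54 is just as "truthful"

The point of the sketch is only that calc1 and calc3 are pinned down by a fixed external computation, while calc2's question places no constraint on its answer at all.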

PA is the equivalent of the first calculator; PA+1 is the equivalent of the third calculator; but Self-PA is like unto the second calculator.

As soon as you start trusting yourself, you become unworthy of trust.  You'll start believing any damn thing that you think, just because you thought it.  This wisdom of the human condition is pleasingly analogous to a precise truth of mathematics.

Hence the saying:  "Don't believe everything you think."

And the math also suggests, by analogy, how to do better:  Don't trust thoughts because you think them, but because they obey specific trustworthy rules.

PA only starts believing something—metaphorically speaking—when it sees a specific proof, laid out in black and white.  If you say to PA—even if you prove to PA—that PA will prove something, PA still won't believe you until it sees the actual proof.  Now, this might seem to invite inefficiency, but PA+1 will believe you—if you prove that PA will prove something—because PA+1 trusts the specific, fixed framework of Peano Arithmetic; not itself.

As far as any human knows, PA does happen to be sound; which means that whatever PA proves to be provable in PA really is provable in PA, and so PA will eventually prove it and believe it.  Likewise, anything PA+1 can prove that it proves, it will eventually prove and believe.  It seems so tempting to just make PA trust itself—but then it becomes Self-PA and implodes.  Isn't that odd?  PA believes everything it proves, but it doesn't believe "Everything I prove is true."  PA trusts a fixed framework for how to prove things, and that framework doesn't happen to talk about trust in the framework.

You can have a system that trusts the PA framework explicitly, as well as implicitly: that is PA+1.  But the new framework that PA+1 uses makes no mention of itself; and the specific proofs that PA+1 demands make no mention of trusting PA+1, only PA.  You might say that PA implicitly trusts PA, PA+1 explicitly trusts PA, and Self-PA trusts itself.

For everything that you believe, you should always find yourself able to say, "I believe because of [specific argument in framework F]", not "I believe because I believe".

Of course, this gets us into the +1 question of why you ought to trust or use framework F.  Human beings, not being formal systems, are too reflective to get away with being unable to think about the problem.  Got a superultimate framework U?  Why trust U?

And worse: as far as I can tell, using induction is what leads me to explicitly say that induction seems to often work, and my use of Occam's Razor is implicated in my explicit endorsement of Occam's Razor.  Despite my best efforts, I have been unable to prove that this is inconsistent, and I suspect it may be valid.

But it does seem that the distinction between using a framework and mentioning it, or between explicitly trusting a fixed framework F and trusting yourself, is at least important to unraveling foundational tangles—even if Löb turns out not to apply directly.

Which gets me to the reason why I'm saying all this in the middle of a sequence about morality.

I've been pondering the unexpectedly large inferential distances at work here—I thought I'd gotten all the prerequisites out of the way for explaining metaethics, but no.  I'm no longer sure I'm even close.  I tried to say that morality was a "computation", and that failed; I tried to explain that "computation" meant "abstracted idealized dynamic", but that didn't work either.  No matter how many different ways I tried to explain it, I couldn't get across the distinction my metaethics drew between "do the right thing", "do the human thing", and "do my own thing".  And it occurs to me that my own background, coming into this, may have relied on having already drawn the distinction between PA, PA+1 and Self-PA.

Coming to terms with metaethics, I am beginning to think, is all about distinguishing between levels.  I first learned to do this rigorously back when I was getting to grips with mathematical logic, and discovering that you could prove complete absurdities, if you lost track even once of the distinction between "believe particular PA proofs", "believe PA is sound", and "believe you yourself are sound".  If you believe any particular PA proof, that might sound pretty much the same as believing PA is sound in general; and if you use PA and only PA, then trusting PA (that is, being moved by arguments that follow it) sounds pretty much the same as believing that you yourself are sound.  But after a bit of practice with the actual math—I did have to practice the actual math, not just read about it—my mind formed permanent distinct buckets and built walls around them to prevent the contents from slopping over.

Playing around with PA and its various conjugations gave me the notion of what it meant to trust arguments within a framework that defined justification.  It gave me practice keeping track of specific frameworks, and holding them distinct in my mind.

Perhaps that's why I expected to communicate more sense than I actually succeeded in doing, when I tried to describe right as a framework of justification that involved being moved by particular, specific terminal values and moral arguments; analogous to an entity who is moved by encountering a specific proof from the allowed axioms of Peano Arithmetic.  As opposed to a general license to do whatever you prefer, or a morally relativistic term like "utility function" that can eat the values of any given species, or a neurological framework contingent on particular facts about the human brain.  You can make good use of such concepts, but I do not identify them with the substance of what is right.

Gödelian arguments are inescapable; you can always isolate the framework-of-trusted-arguments if a mathematical system makes sense at all.  Maybe the adding-up-to-normality-ness of my system will become clearer, after it becomes clear that you can always isolate the framework-of-trusted-arguments of a human having a moral argument.

 

Part of The Metaethics Sequence

Next post: "No License To Be Human"

Previous post: "The Bedrock of Morality: Arbitrary?"

18 comments

A puzzle: How can one rigorously construct Self-PA as a recursively axiomatized first-order theory in the language of PA?

IL:

Eliezer, I have an objection to your metaethics and I don't think it's because I mixed levels:

If I understood your metaethics correctly, then you claim that human morality consists of two parts: a list of things that we value (like love, friendship, fairness, etc.), and what we can call "intuitions" that govern how our terminal values change when we face moral arguments. So we have a kind of strange loop (in the Hofstadterian sense); our values judge if a moral argument is valid or not, and the valid moral arguments change our terminal values. I think I accept this. It explains quite nicely a lot of questions, like where moral progress comes from. What I am skeptical about is the claim that if a person hears enough moral arguments, their values will always converge to a single set of values, so you could say that his morality approximates some ideal morality that can be found if you look deep enough into his brain. I think it's plausible that the initial set of moral arguments that the person hears will considerably change his list of values, so that his morality will diverge rather than converge, and there won't be any "ideal morality" that he is approximating.

Note that I am talking about a single human that hears different sets of moral arguments, and not about the convergence of moralities across all humans (which is a different matter altogether)

Also note that this is a purely empirical objection; I am asking for empirical evidence that supports your metaethics

IL: My understanding was that Terminal Values are not something you ever observe directly (nobody can simply list their Terminal Values). Moral arguments change what we use as our approximation to the Moral Calculation. However, if moral arguments did make our actual moral calculations diverge (that is, if our actual moral calculation is not a state function with respect to moral arguments), then that does disprove Eliezer's meta-ethics (along with any hope for a useful notion of morality, it seems to me).

I agree— and I balk at the concept of "the" Coherent Extrapolated Volition precisely because I suspect there are many distinct attractors for a moral framework like ours. Since our most basic moral impulses come from the blind idiot god, there's no reason for them to converge under extrapolation; we have areas of agreement today on certain extrapolations, but the convergence seems to be more a matter of cultural communication. It's not at all inconceivable that other Everett branches of Earth have made very different forms of moral progress from us, no less consistent with reason or consequences or our moral intuitions.

I'd be very interested, of course, to hear Eliezer's reasons for believing the contrary.

Patrick,

It looks doubtful that different moral attractors would look equally right. You don't just dive into an attractor, ending up in whatever one you happened to come across, walking it all the way. The decision to advance from right to right1 and not right2 is invalid if it is not right, and will be corrected where possible. Premeditation should minimize such errors.

Vladimir,

Just to clarify (perhaps unnecessarily): by an attractor I mean a moral framework from which you wouldn't want to self-modify radically in any direction. There do exist many distinct attractors in the space of 'abstracted idealized dynamics', as Eliezer notes for the unfortunate Pebblesorters: they might modify their subgoals, but never approach a morality indifferent to the cardinality of pebble heaps.

Eliezer's claim of moral convergence and the CEV, as I understand it, is that most humans are psychologically constituted so that our moral frameworks lie in the 'basin' of a single attractor; thus the incremental self-modifications of cultural history have an ultimate destination which a powerful AI could deduce.

I suspect, however, that the position is more chaotic than this; that there are distinct avenues of moral progress which will lead us to different attractors. In your terms, since our current right is after all not entirely comprehensive and consistent, we could find that right1 and right2 are both right extrapolations from right, and that right can't judge unequivocally which one is better.

Especially given that exposure to different fact patterns could push you in different directions. E.g. suppose right now I try to do what is right_1 (subscripts on everything to avoid appearance of claim to universality). Now, suppose that if I experience fact pattern facts_1 I conclude that it is right_1 to modify my 'moral theory' to right_2, but if I experience fact pattern facts_2 I conclude that it is right_1 to modify to right_3.

Now, that's all well and good. Eliezer would have no problem with that, as long as the diagram commutes: that is, if it's true that (if I've experienced facts_1 and moved to right_2, and then I experience facts_2, I will move to right_4), it must also be true that (if I've experienced facts_2 and moved to right_3, and then experience facts_1, I will move to right_4).
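In symbols, writing U(right_i, facts_j) for the moral framework you end up with after starting from right_i and encountering fact pattern facts_j (U is just shorthand introduced here, not anything defined above), the requirement is:

U(U(right_1, facts_1), facts_2)  =  U(U(right_1, facts_2), facts_1)

That is, the framework you land in shouldn't depend on the order in which the fact patterns arrive.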

I suppose that at least in some cases this is true, but I see no reason why in all cases it ought to be. Especially if you allow human cognitive biases to influence the proceedings; but even if you don't (and I'm not sure how you avoid it), I don't see any argument why all such diagrams should commute. (this doesn't mean they don't, of course. I invite Eliezer to provide such an argument).

I still hold that Eliezer's account of morality is correct, except his claim that all humans would reflectively arrive at the same morality. I think foundations and priors are different enough that, functionally, each person has his own morality.

Jadagul: I suppose that at least in some cases this is true, but I see no reason why in all cases it ought to be.

A particular property of moral progress is a property of the algorithm that is morality. You can implement many possible algorithms, and for most of them any given property won't hold. If you consider the order in which moral arguments are presented an arbitrary factor, the outcome shouldn't depend on it. If it's found that it does, it is an error that should be corrected, the same way that your morality should be reverted if it was changed by an external factor you didn't approve of, such as a red pill, swallowed by mistake, that makes you want to kill people.

Here is a construction of a theory T with the properties of Self-PA. That is, 1) T extends PA and 2) T can prove the consistency of T. Of course, by Godel's second incompleteness theorem, T must be inconsistent, but it is not obviously inconsistent.

In addition to the axioms of PA, T will have one additional axiom PHI, to be chosen presently.

By the devices (due to Godel) used to formalize "PA is consistent" in PA we can find a formula S(x) with one free variable, x, in the language of PA which expresses the following:

a) x is the Godel number of a sentence, s, of the language of PA;

b) The theory "PA + s" is consistent.

By Godel's self-referential lemma, there is a sentence PHI, with Godel number q such that PA proves:

PHI if and only if S([q]).

(Loosely speaking, PHI says the theory obtained by adjoining me to PA is consistent.)

If we take T to be the theory "PA + PHI" then T has the properties stated at the start of this posting.
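In compressed form, writing Con(X) for the arithmetized statement "the theory X is consistent" (which is what S([q]) unwinds to, given that q is in fact the Godel number of the sentence PHI):

PA proves:  PHI <-> Con(PA + PHI)

T = PA + PHI proves PHI, since PHI is an axiom, and T proves PHI -> Con(PA + PHI), since T extends PA; so T proves Con(T). That gives property 2) above, and property 1) is immediate.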

Of course, to fully understand the argument just given one needs some familiarity with mathematical logic. Enderton's text "A Mathematical Introduction to Logic" covers all the necessary background material.

If you go back and check, you will find that I never said that extrapolating human morality gives you a single outcome. Be very careful about attributing ideas to me on the basis that others attack me as having them.

The "Coherent" in "Coherent Extrapolated Volition" does not indicate the idea that an extrapolated volition is necessarily coherent.

The "Coherent" part indicates the idea that if you build an FAI and run it on an extrapolated human, the FAI should only act on the coherent parts. Where there are multiple attractors, the FAI should hold satisficing avenues open, not try to decide itself.

The ethical dilemma arises if large parts of present-day humanity are already in different attractors.

I think this is your most important post.

@ eli: nice series on lob's theorem, but I still don't think you've added any credibility to claims like "I favor the human one because it is h-right". You can do your best to record exactly what h-right is, and think carefully about convergence (or lack of) under self modification, but I think you'd do a lot better to just state "human values" as a preference, and be an out-of-the-closet-relativist.

We've got moral intuitions; our initial conditions.

We've got values, which are computed, and morality, our decision making computation.

Our morality updates our values, our values inform our morality, and the particular framework we use, evolves over time.

Do all locally-moral explorations of framework-space converge, even assuming the psychological unity of humans? Our morals influence our psychology; can we ignore the effect of our local morality on the psychology of our future-selves?

Eliezer isn't tempting us with a perfect morality; he's unraveling a meta-ethics, a computation for COMPUTING the evolution of morality, i.e. a framework for iteratively building better moralities.

Why assume that even with this meta-ethics, our morality-evolution converges, rather than diverges (or merely remains as diverse as it currently is)? Maybe it doesn't matter. We've already been warned against the dichotomy between "morality-as-given" and "morality-as-preference". Morality is not a fixed, immutable structure to which our moral utility-functions will all inevitably converge. But there is a general framework within which we can evaluate moralities, analogous to the framework within which mathematicians explore various formal theories (which seems basically correct). But neither is morality merely a preference, again analogous in my mind to the fact that not all possible mathematical theories are 'interesting'. I think Eliezer needs to fill us in on what makes a morality 'interesting'. Oddly enough, in mathematics at least, there is an intuitive notion of 'interesting' based on the consequences of a formal theory; what theorems does said theory generate?

Certainly, we should be able to exclude certain moralities easily; we've got bedrock 'neath us, right?

My previous post was actually written last night. I was unable to post it until just now, and unfortunately had not read the most recent comments ...

Ah, thanks Eliezer, that comment explains a lot. I think I mostly agree with you, then. I suspect (on little evidence) that each one of us would, extrapolated, wind up at his own attractor (or at least at a sparsely populated one). But I have no real evidence for this, and I can't imagine off the top of my head how I would find it (nor how I would find contradictory evidence), and since I'm not trying to build fAI I don't need to care. But what you've just sketched out is basically the reason I think we can still have coherent moral arguments; our attractors have enough in common that many arguments I would find morally compelling, you would also find morally compelling (as in, most of us have different values but we (almost) all agree that the random slaughter of innocent three-year-olds is bad). Thanks for clearing that up.


Am I correct in supposing that you can substitute any consistent calculus, such as ZF set theory, for PA?

What about a Turing-complete system such as SKI calculus?

How come anthropomorphizing axioms sounds so weird to me? I don't get the same feeling when anthropomorphizing software, or even stones.

Gödelian arguments are inescapable; you can always isolate the framework-of-trusted-arguments if a mathematical system makes sense at all. Maybe the adding-up-to-normality-ness of my system will become clearer, after it becomes clear that you can always isolate the framework-of-trusted-arguments of a human having a moral argument.

If you hadn't qualified the two statements beginning with, "you can always isolate the framework..." then it seems they would not escape Gödelian arguments. In other words, there is no reason to believe that there isn't a non-isolate-able, general moral Framework, but I suspect that you are right that it would have to be neither mathematical (small 'm') nor of-a-human. ^^

Do I represent well the principles discussed when I say this? ;)