You Provably Can't Trust Yourself

Eliezer Yudkowsky

Followup to: Where Recursive Justification Hits Bottom, Löb's Theorem

Peano Arithmetic seems pretty trustworthy. We've never found a case where Peano Arithmetic proves a theorem T, and yet T is false in the natural numbers. That is, we know of no case where []T ("T is provable in PA") and yet ~T ("not T").

We also know of no case where first order logic is invalid: We know of no case where first-order logic produces false conclusions from true premises. (Whenever first-order statements H are true of a model, and we can syntactically deduce C from H, checking C against the model shows that C is also true.)

Combining these two observations, it seems like we should be able to get away with adding a rule to Peano Arithmetic that says:

All T: ([]T -> T)

But Löb's Theorem seems to show that as soon as we do that, everything becomes provable. What went wrong? How can we do worse by adding a true premise to a trustworthy theory? Is the premise not true—does PA prove some theorems that are false? Is first-order logic not valid—does it sometimes prove false conclusions from true premises?

Actually, there's nothing wrong with reasoning from the axioms of Peano Arithmetic plus the axiom schema "Anything provable in Peano Arithmetic is true." But the result is a different system from PA, which we might call PA+1. PA+1 does not reason from identical premises to PA; something new has been added. So we can evade Löb's Theorem because PA+1 is not trusting itself—it is only trusting PA.

If you are not previously familiar with mathematical logic, you might be tempted to say, "Bah! Of course PA+1 is trusting itself! PA+1 just isn't willing to admit it! Peano Arithmetic already believes anything provable in Peano Arithmetic—it will already output anything provable in Peano Arithmetic as a theorem, by definition! How does moving to PA+1 change anything, then? PA+1 is just the same system as PA, and so by trusting PA, PA+1 is really trusting itself. Maybe that dances around some obscure mathematical problem with direct self-reference, but it doesn't evade the charge of self-trust."

But PA+1 and PA really are different systems; in PA+1 it is possible to prove true statements about the natural numbers that are not provable in PA. If you're familiar with mathematical logic, you know this is because some nonstandard models of PA are ruled out in PA+1. Otherwise you'll have to take my word that Peano Arithmetic doesn't fully describe the natural numbers, and neither does PA+1, but PA+1 characterizes the natural numbers slightly better than PA.

The deeper point is the enormous gap, the tremendous difference, between having a system just like PA except that it trusts PA, and a system just like PA except that it trusts itself.

If you have a system that trusts PA, that's no problem; we're pretty sure PA is trustworthy, so the system is reasoning from true premises. But if you have a system that looks like PA—having the standard axioms of PA—but also trusts itself, then it is trusting a self-trusting system, something for which there is no precedent. In the case of PA+1, PA+1 is trusting PA which we're pretty sure is correct. In the case of Self-PA it is trusting Self-PA, which we've never seen before—it's never been tested, despite its misleading surface similarity to PA. And indeed, Self-PA collapses via Löb's Theorem and proves everything—so I guess it shouldn't have trusted itself after all! All this isn't magic; I've got a nice Cartoon Guide to how it happens, so there's no good excuse for not understanding what goes on here.

I have spoken of the Type 1 calculator that asks "What is 2 + 3?" when the buttons "2", "+", and "3" are pressed; versus the Type 2 calculator that asks "What do I calculate when someone presses '2 + 3'?" The first calculator answers 5; the second calculator can truthfully answer anything, even 54.

But this doesn't mean that all calculators that reason about calculators are flawed. If I build a third calculator that asks "What does the first calculator answer when I press '2 + 3'?", perhaps by calculating out the individual transistors, it too will answer 5. Perhaps this new, reflective calculator will even be able to answer some questions faster, by virtue of proving that some faster calculation is isomorphic to the first calculator.

PA is the equivalent of the first calculator; PA+1 is the equivalent of the third calculator; but Self-PA is like unto the second calculator.

As soon as you start trusting yourself, you become unworthy of trust. You'll start believing any damn thing that you think, just because you thought it. This wisdom of the human condition is pleasingly analogous to a precise truth of mathematics.

Hence the saying: "Don't believe everything you think."

And the math also suggests, by analogy, how to do better: Don't trust thoughts because you think them, but because they obey specific trustworthy rules.

PA only starts believing something—metaphorically speaking—when it sees a specific proof, laid out in black and white. If you say to PA—even if you prove to PA—that PA will prove something, PA still won't believe you until it sees the actual proof. Now, this might seem to invite inefficiency, and PA+1 will believe you—if you prove that PA will prove something, because PA+1 trusts the specific, fixed framework of Peano Arithmetic; not itself.

As far as any human knows, PA does happen to be sound; which means that what PA proves is provable in PA, PA will eventually prove and will eventually believe. Likewise, anything PA+1 can prove that it proves, it will eventually prove and believe. It seems so tempting to just make PA trust itself—but then it becomes Self-PA and implodes. Isn't that odd? PA believes everything it proves, but it doesn't believe "Everything I prove is true." PA trusts a fixed framework for how to prove things, and that framework doesn't happen to talk about trust in the framework.

You can have a system that trusts the PA framework explicitly, as well as implicitly: that is PA+1. But the new framework that PA+1 uses, makes no mention of itself; and the specific proofs that PA+1 demands, make no mention of trusting PA+1, only PA. You might say that PA implicitly trusts PA, PA+1 explicitly trusts PA, and Self-PA trusts itself.

For everything that you believe, you should always find yourself able to say, "I believe because of [specific argument in framework F]", not "I believe because I believe".

Of course, this gets us into the +1 question of why you ought to trust or use framework F. Human beings, not being formal systems, are too reflective to get away with being unable to think about the problem. Got a superultimate framework U? Why trust U?

And worse: as far as I can tell, using induction is what leads me to explicitly say that induction seems to often work, and my use of Occam's Razor is implicated in my explicit endorsement of Occam's Razor. Despite my best efforts, I have been unable to prove that this is inconsistent, and I suspect it may be valid.

But it does seem that the distinction between using a framework and mentioning it, or between explicitly trusting a fixed framework F and trusting yourself, is at least important to unraveling foundational tangles—even if Löb turns out not to apply directly.

Which gets me to the reason why I'm saying all this in the middle of a sequence about morality.

I've been pondering the unexpectedly large inferential distances at work here—I thought I'd gotten all the prerequisites out of the way for explaining metaethics, but no. I'm no longer sure I'm even close. I tried to say that morality was a "computation", and that failed; I tried to explain that "computation" meant "abstracted idealized dynamic", but that didn't work either. No matter how many different ways I tried to explain it, I couldn't get across the distinction my metaethics drew between "do the right thing", "do the human thing", and "do my own thing". And it occurs to me that my own background, coming into this, may have relied on having already drawn the distinction between PA, PA+1 and Self-PA.

Coming to terms with metaethics, I am beginning to think, is all about distinguishing between levels. I first learned to do this rigorously back when I was getting to grips with mathematical logic, and discovering that you could prove complete absurdities, if you lost track even once of the distinction between "believe particular PA proofs", "believe PA is sound", and "believe you yourself are sound". If you believe any particular PA proof, that might sound pretty much the same as believing PA is sound in general; and if you use PA and only PA, then trusting PA (that is, being moved by arguments that follow it) sounds pretty much the same as believing that you yourself are sound. But after a bit of practice with the actual math—I did have to practice the actual math, not just read about it—my mind formed permanent distinct buckets and built walls around them to prevent the contents from slopping over.

Playing around with PA and its various conjugations, gave me the notion of what it meant to trust arguments within a framework that defined justification. It gave me practice keeping track of specific frameworks, and holding them distinct in my mind.

Perhaps that's why I expected to communicate more sense than I actually succeeded in doing, when I tried to describe right as a framework of justification that involved being moved by particular, specific terminal values and moral arguments; analogous to an entity who is moved by encountering a specific proof from the allowed axioms of Peano Arithmetic. As opposed to a general license to do whatever you prefer, or a morally relativistic term like "utility function" that can eat the values of any given species, or a neurological framework contingent on particular facts about the human brain. You can make good use of such concepts, but I do not identify them with the substance of what is right.

Gödelian arguments are inescapable; you can always isolate the framework-of-trusted-arguments if a mathematical system makes sense at all. Maybe the adding-up-to-normality-ness of my system will become clearer, after it becomes clear that you can always isolate the framework-of-trusted-arguments of a human having a moral argument.

Part of The Metaethics Sequence

Next post: "No License To Be Human"

Previous post: "The Bedrock of Morality: Arbitrary?"

A puzzle: How can one rigorously construct Self-PA as recursively axiomatized first order theory in the language of PA?

Eliezer, I have an objection to your metaethics and I don't think it's because I mixed levels:

If I understood your metaethics correctly, then you claim that human morality consists of two parts: a list of things that we value(like love, friendship, fairness etc), and what we can call "intuitions" that govern how our terminal values change when we face moral arguments. So we have a kind of strange loop (in the Hofstadterian sense); our values judge if a moral argument is valid or not, and the valid moral arguments change our terminal values. I think I accept this. It explains quite nicely a lot of questions, like where does moral progress comes from. What I am skeptic about is the claim that if a person hears enough moral arguments, their values will always converge to a single set of values, so you could say that his morality approximates some ideal morality that can be found if you look deep enough into his brain. I think it's plausible that the initial set of moral arguments that the person hears will change considerably his list of values, so that his morality will diverge rather than converge, and there won't be any "ideal morality" that he is approximating.

Note that I am talking about a single human that hears different sets of moral arguments, and not about the convergence of moralities across all humans (which is a different matter altogether)

Also note that this is a purely empirical objection; I am asking for empirical evidence that supports your metaethics

IL My understanding was that Terminal Values are not something you ever observe directly (nobody can simply list their Terminal Values). Moral arguments change what use as our approximation to the Moral Calculation. However, if moral arguments did make our actual moral calculations diverge (that is, if our actual moral calculation is not a state function with respect to moral arguments) then that does disprove Eliezer's meta-ethics (along with any hope for a useful notion of morality it seems to me).

I agree— and I balk at the concept of "the" Coherent Extrapolated Volition precisely because I suspect there are many distinct attractors for a moral framework like ours. Since our most basic moral impulses come from the blind idiot god, there's no reason for them to converge under extrapolation; we have areas of agreement today on certain extrapolations, but the convergence seems to be more a matter of cultural communication. It's not at all inconceivable that other Everett branches of Earth have made very different forms of moral progress from us, no less consistent with reason or consequences or our moral intuitions.

I'd be very interested, of course, to hear Eliezer's reasons for believing the contrary.

Patrick,

It looks doubtful that different moral attractors would look equally right. You don't just dive into an attractor, ending up in whatever one you happened to come across, walking it all the way. The decision to advance from right to right1 and not right2 is invalid if it is not right, and will be corrected as possible. Premeditation should minimize such errors.

Vladimir,

Just to clarify (perhaps unnecessarily): by an attractor I mean a moral framework from which you wouldn't want to self-modify radically in any direction. There do exist many distinct attractors in the space of 'abstracted idealized dynamics', as Eliezer notes for the unfortunate Pebblesorters: they might modify their subgoals, but never approach a morality indifferent to the cardinality of pebble heaps.

Eliezer's claim of moral convergence and the CEV, as I understand it, is that most humans are psychologically constituted so that our moral frameworks lie in the 'basin' of a single attractor; thus the incremental self-modifications of cultural history have an ultimate destination which a powerful AI could deduce.

I suspect, however, that the position is more chaotic than this; that there are distinct avenues of moral progress which will lead us to different attractors. In your terms, since our current right is after all not entirely comprehensive and consistent, we could find that both right1 and right2 are both right extrapolations from right, and that right can't judge unequivocally which one is better.

Especially given that exposure to different fact patterns could push you in different directions. E.g. suppose right now I try to do what is right_1 (subscripts on everything to avoid appearance of claim to universality). Now, suppose that if I experience fact pattern facts_1 I conclude that it is right_1 to modify my 'moral theory' to right_2. but if I experience fact pattern facts_2 I conclude that it is right_1 to modify to right_3.

Now, that's all well and good. Eliezer would have no problem with that, as long as the diagram commutes: that is, if it's true that ( if I've experienced facts_1 and moved to right_2, and then I experience facts_2, I will move to right_4), it must also be true that ( if I've experienced facts_2 and moved to right_3, and then experience facts_1, I will move to right_4).

I suppose that at least in some cases this is true, but I see no reason why in all cases it ought to be. Especially if you allow human cognitive biases to influence the proceedings; but even if you don't (and I'm not sure how you avoid it), I don't see any argument why all such diagrams should commute. (this doesn't mean they don't, of course. I invite Eliezer to provide such an argument).

I still hold that Eliezer's account of morality is correct, except his claim that all humans would reflectively arrive at the same morality. I think foundations and priors are different enough that, functionally, each person has his own morality.

Jadagul: I suppose that at least in some cases this is true, but I see no reason why in all cases it ought to be.

A particular property of moral progress is a property of algorithm that is morality. You can implement many possible algorithms, and for most of them any given property won't hold. If you consider the order in which moral arguments are presented an arbitrary factor, the outcome shouldn't depend on it. If it's found that it does, it is an error that should be corrected, the same way that your morality should be reverted back if it was changed by external factor which you didn't approve, by a red pill that makes you want to kill people that you swallowed by mistake.

Here is a construction of a theory T with the properties of Self-PA. That is, 1) T extends PA and 2) T can prove the consistency of T. Of course, by Godel's second incompleteness theorem, T must be inconsistent, but it is not obviously inconsistent.

In addition to the axioms of PA, T will have one additional axiom PHI, to be chosen presently.

By the devices (due to Godel) used to formalize "PA is consistent" in PA we can find a formula S(x) with one free variable, x, in the language of PA which expresses the following:

a) x is the Godel number of a sentence, S, of the language of PA;

b) The theory "PA + S" is consistent.

By Godel's self-referential lemma, there is a sentence PHI, with Godel number q such that PA proves:

PHI if and only if S([q]).

(Loosely speaking, PHI says the theory obtained by adjoining me to PA is consistent.)

If we take T to be the theory "PA + PHI" then T has the properties stated at the start of this posting.

Of course, to fully understand the argument just given one needs some familiarity with mathematical logic. Enderton's text "A Mathematical Introduction to Logic" covers all the necessary background material.

If you go back and check, you will find that I never said that extrapolating human morality gives you a single outcome. Be very careful about attributing ideas to me on the basis that others attack me as having them.

The "Coherent" in "Coherent Extrapolated Volition" does not indicate the idea that an extrapolated volition is necessarily coherent.

The "Coherent" part indicates the idea that if you build an FAI and run it on an extrapolated human, the FAI should only act on the coherent parts. Where there are multiple attractors, the FAI should hold satisficing avenues open, not try to decide itself.

The ethical dilemma arises if large parts of present-day humanity are already in different attractors.

I think this is your most important post.

@ eli: nice series on lob's theorem, but I still don't think you've added any credibility to claims like "I favor the human one because it is h-right". You can do your best to record exactly what h-right is, and think carefully about convergence (or lack of) under self modification, but I think you'd do a lot better to just state "human values" as a preference, and be an out-of-the-closet-relativist.

We've got moral intuitions; our initial conditions.

We've got values, which are computed, and morality, our decision making computation.

Our morality updates our values, our values inform our morality, and the particular framework we use, evolves over time.

Do all locally-moral explorations of framework-space converge, even assuming the psychological unity of humans? Our morals influence our psychology; can we ignore the effect of our local morality on the psychology of our future-selves?

Eliezer isn't tempting us with a perfect morality; he's unraveling a meta-ethics, a computation for COMPUTING the evolution of morality, i.e. a framework for iteratively building better moralities.

Why assume that even with this meta-ethics, our morality-evolution converges, rather than diverges (or merely remaining as diverse as it currently is)? Maybe it doesn't matter. We've already been warned against the dichotomy between "morality-as-given" and "morality-as-preference". Morality is not a fixed, immutable structure to which our moral utility-functions will all inevitably converge. But there is a general framework within which we can evaluate moralities, analogous to the framework within which mathematicians explore various formal theories (which seems basically correct). But neither is morality merely a preference, again analogous in my mind to the fact that not all possible mathematical theories are 'interesting'. I think Eliezer needs to fill us in on what makes a morality 'interesting'. Oddly enough, in mathematics at least, there is an intuitive notion of 'interesting' based on the consequences of a formal theory; what theorems does said theory generate?

Certainly, we should be able to exclude certain moralities easily; we've got bedrock 'neath us, right?

My previous post was actually written last night. I was unable to post it until just now, and unfortunately had not read the most recent comments ...

Ah, thanks Eliezer, that comment explains a lot. I think I mostly agree with you, then. I suspect (on little evidence) that each one of us would, extrapolated, wind up at his own attractor (or at least at a sparsely populated one). But I have no real evidence for this, and I can't imagine off the top of my head how I would find it (nor how I would find contradictory evidence), and since I'm not trying to build fAI I don't need to care. But what you've just sketched out is basically the reason I think we can still have coherent moral arguments; our attractors have enough in common that many arguments I would find morally compelling, you would also find morally compelling (as in, most of us have different values but we (almost) all agree that the random slaughter of innocent three-year-olds is bad). Thanks for clearing that up.

Am I correct in supposing that you can substitute PA for any consistent calculus such as ZF set theory?

What about a Turing-complete system such as SKI calculus?

How comes anthropomorphizing axioms sounds so weird to me? I don't get the same feeling when anthropomorphizing software, or even stones.

Gödelian arguments are inescapable; you can always isolate the framework-of-trusted-arguments if a mathematical system makes sense at all. Maybe the adding-up-to-normality-ness of my system will become clearer, after it becomes clear that you can always isolate the framework-of-trusted-arguments of a human having a moral argument.

If you hadn't qualified the two statements beginning with, "you can always isolate the framework..." then it seems they would not escape Gödelian arguments. In other words, there is no reason to believe that there isn't a non-isolate-able, general moral Framework, but I suspect that you are right that it would have to be neither mathematical (small 'm') nor of-a-human. ^^

Do I represent well the principles discussed when I say this? ;)

How/does this square with https://arxiv.org/abs/1902.07404?
IIUC, Gödel's Second Incompleteness Theorem was overinterpreted, and a different operationalization of consistency is provable.

I talked to Mihály Bárász about that, and he didn't think it was crazy.

A puzzle: How can one rigorously construct Self-PA as recursively axiomatized first order theory in the language of PA?

Eliezer, I have an objection to your metaethics and I don't think it's because I mixed levels:

Note that I am talking about a single human that hears different sets of moral arguments, and not about the convergence of moralities across all humans (which is a different matter altogether)

Also note that this is a purely empirical objection; I am asking for empirical evidence that supports your metaethics

I'd be very interested, of course, to hear Eliezer's reasons for believing the contrary.

Patrick,

Vladimir,

Jadagul: I suppose that at least in some cases this is true, but I see no reason why in all cases it ought to be.

In addition to the axioms of PA, T will have one additional axiom PHI, to be chosen presently.

By the devices (due to Godel) used to formalize "PA is consistent" in PA we can find a formula S(x) with one free variable, x, in the language of PA which expresses the following:

a) x is the Godel number of a sentence, S, of the language of PA;

b) The theory "PA + S" is consistent.

By Godel's self-referential lemma, there is a sentence PHI, with Godel number q such that PA proves:

PHI if and only if S([q]).

(Loosely speaking, PHI says the theory obtained by adjoining me to PA is consistent.)

If we take T to be the theory "PA + PHI" then T has the properties stated at the start of this posting.

The "Coherent" in "Coherent Extrapolated Volition" does not indicate the idea that an extrapolated volition is necessarily coherent.

The ethical dilemma arises if large parts of present-day humanity are already in different attractors.

I think this is your most important post.

We've got moral intuitions; our initial conditions.

We've got values, which are computed, and morality, our decision making computation.

Our morality updates our values, our values inform our morality, and the particular framework we use, evolves over time.

Eliezer isn't tempting us with a perfect morality; he's unraveling a meta-ethics, a computation for COMPUTING the evolution of morality, i.e. a framework for iteratively building better moralities.

Certainly, we should be able to exclude certain moralities easily; we've got bedrock 'neath us, right?

My previous post was actually written last night. I was unable to post it until just now, and unfortunately had not read the most recent comments ...

Am I correct in supposing that you can substitute PA for any consistent calculus such as ZF set theory?

What about a Turing-complete system such as SKI calculus?

How comes anthropomorphizing axioms sounds so weird to me? I don't get the same feeling when anthropomorphizing software, or even stones.

Gödelian arguments are inescapable; you can always isolate the framework-of-trusted-arguments if a mathematical system makes sense at all. Maybe the adding-up-to-normality-ness of my system will become clearer, after it becomes clear that you can always isolate the framework-of-trusted-arguments of a human having a moral argument.

Do I represent well the principles discussed when I say this? ;)

How/does this square with https://arxiv.org/abs/1902.07404?
IIUC, Gödel's Second Incompleteness Theorem was overinterpreted, and a different operationalization of consistency is provable.

I talked to Mihály Bárász about that, and he didn't think it was crazy.

49

You Provably Can't Trust Yourself

49

49

49