So, why can't Lucy just assume that it is sane?
Lucy reasons using formal mathematical logic. A formal mathematical logic cannot "assume it is sane" because of Loeb's theorem. As long as Lucy isn't self-modifying, all is fine and dandy. She just proves theorems and acts accordingly, it's not like she has a built-in compulsion to prove her own sanity. However, once she begins to self-modify (or encounters another agent of similar complexity), she has to start proving theorems about Lucy2. Now, since Lucy2 is at least as smart as Lucy and preferably even smarter, Lucy has hard time predicting the specifics of how Lucy2 will behave. What Lucy wants is to reason generally, i.e. prove something along the lines of "Lucy2 will only do good things" rather than analyze what Lucy2 will do on case-by-case basis (Yudkowsky & Herreshoff call it "the Vingean principle" after Vernor Vinge). Unfortunately, Lucy is barred from proving such a general theorem. Now, we can try to hack Lucy's system of reasoning to accept self-modifications on weaker grounds than proving they are "good" (e.g. allow proving their goodness in a stronger formal system). But that would involve treating self-modification as something ontologically different than all other phenomena in the universe, which doesn't seem to be a good thing. We would like Lucy to be naturalistic i.e. to reason about herself and her descendants in the same way she reasons about everything else in the universe. Otherwise, she will be handicapped. For example, imagine that instead of self-modifying she considers to break into another much more powerful computer and programming that computer to be Lucy2. Such a scenario should be treated just like a self-modification, but since it doesn't fall under the definition of self-modification hard-coded by the programmers, Lucy is back to using her "general purpose" reasoning mode rather than her special self-modification mode. Same goes for encountering another agent. We can try to introduce "agent" as a special ontological category but defining it would be hard.
It seems to work well for humans.
I have a suspicion human brains use something like the evidence logic I'm proposing. It feels that the concept of "proof" is much less native to human thinking than the concept of "argument for/against".
She just proves theorems and acts accordingly, it's not like she has a built-in compulsion to prove her own sanity.
I suppose she could reason that if she can't prove that her formal system is consistent, then she can't trust that the actions she can prove to be "good" are actually "good". Or, even if she trusts her present self, she can't trust her future selves, even if they are identical (well, they can never be completely identical if they accumulate memories).
Your argument works even for Lucy = Lucy_2.
...However, once she begins t
In this post I intend to:
Background
Logic
When can we consider a mathematical theorem to be established? The obvious answer is: when we proved it. Wait, proved it in what theory? Well, that's debatable. ZFC is popular choice for mathematicians, but how do we know it is consistent (let alone sound, i.e. that it only proves true sentences)? All those spooky infinite sets, how do you know it doesn't break somewhere along the line? There's lots of empirical evidence, but we can't prove it, and it's proofs we're interesting in, not mere evidence, right?
Peano arithmetic seems like a safer choice. After all, if the natural numbers don't make sense, what does? Let's go with that. Suppose we have a sentence s in the language of PA. If someone presents us with a proof p in PA, we believe s is true. Now consider the following situations: instead of giving you a proof of s, someone gave you a PA-proof p1 that p exists. After all, PA admits defining "PA-proof" in PA language. Common sense tells us that p1 is a sufficient argument to believe s. Maybe, we can prove it within PA? That is, if we have a proof of "if a proof of s exists then s" and a proof of R(s)="a proof of s exists" then we just proved s. That's just modus ponens.
There are two problems with that.
First, there's no way to prove the sentence L:="for all s if R(s) then s", since it's not a PA-sentence at all. The problem is that "for all s" references s as a natural number encoding a sentence. On the other hand, "then s" references s as the truth-value of the sentence. Maybe we can construct a PA-formula T(s) which means "the sentence encoded by the number s is true"? Nope, that would get us in trouble with the liar paradox (it would be possible to construct a sentence saying "this sentence is false").
Second, Loeb's theorem says that if we can prove L(s):="if R(s) exists then s" for a given s, then we can prove s. This is a problem since it means there can be no way to prove L(s) for all s in any sense, since it's unprovable for s which are unprovable. In other words, if you proved not-s, there is no way to conclude that "no proof of s exists".
What if we add an inference rule Q to our logic allowing to go from R(s) to s? Let's call the new formal system PA1. p1 appended by a Q-step becomes an honest proof of s in PA1. Problem solved? Not really! Now someone can give you a proof of
R1(s):="a PA1-proof of s exists". Back to square one! Wait a second, what if we add a new rule Q1 allowing to go from R1(s) to s? OK, but now we got R2(s):="a PA2-proof of s exists". Hmm, what if add an infinite number of rules Qk? Fine, but now we got Rω(s):="a PAω-proof of s exists". And so on, and so forth, the recursive ordinals are a plenty...
Bottom line, Loeb's theorem works for any theory containing PA, so we're stuck.
AI
Suppose you're trying to build a self-modifying AGI called "Lucy". Lucy works by considering possible actions and looking for formal proofs that taking one of them will increase expected utility. In particular, it has self-modifying actions in its strategy space. A self-modifying action creates essentially a new agent: Lucy2. How can Lucy decide that becoming Lucy2 is a good idea? Well, a good step in this direction would be proving that Lucy2 would only take actions that are "good". I.e., we would like Lucy to reason as follows "Lucy2 uses the same formal system as I, so if she decides to take action a, it's because she has a proof p of the sentence s(a) that 'a increases expected utility'. Since such a proof exits, a does increase expected utility, which is good news!" Problem: Lucy is using L in there, applied to her own formal system! That cannot work! So, Lucy would have a hard time self-modifying in a way which doesn't make its formal system weaker.
As another example where this poses a problem, suppose Lucy observes another agent called "Kurt". Lucy knows, by analyzing her sensory evidence, that Kurt proves theorems using the same formal system as Lucy. Suppose Lucy found out that Kurt proved theorem s, but she doesn't know how. We would like Lucy to be able to conclude s is, in fact, true (at least with the probability that her model of physical reality is correct). Alas, she cannot.
See MIRI's paper for more discussion.
Evidence Logic
Here, cousin_it explains a method to assign probabilities to sentences in an inconsistent theory T. It works as follows. Consider sentence s. Since T is inconsistent, there are T-proofs both of s and of not-s. Well, in a courtroom both sides are allowed to have arguments, why not try the same approach here? Let's weight the proofs as a function of their length, analogically to weighting hypotheses in Solomonoff induction. That is, suppose we have a prefix-free encoding of proofs as bit sequences. Then, it makes sense to consider a random bit sequence and ask whether it is a proof of something. Define the probability of s to be
P(s) := (probability of a random sequence to be a proof of s) / (probability of a random sequence to be a proof of s or not-s)
Nice, but it doesn't solve the Loebian obstacle yet.
I will now formulate an extension of this idea that allows assigning an interval of probabilities [Pmin(s), Pmax(s)] to any sentence s. This interval is a sort of "Knightian uncertainty". I have some speculations how to extract a single number from this interval in the general case, but even without that, I believe that Pmin(s) = Pmax(s) in many interesting cases.
First, the general setting:
[Pmin(s), Pmax(s)] by definition.