Problem: Lucy is using L in there, applied to her own formal system! That cannot work!
Lucy can assume a priori that it only takes "good" actions and use this to prove properties about Lucy_2, without having to prove the consistency of its own formal system.
I might be missing something, but it seems to me that this thing that you people call "Löbian obstacle" isn't really about self-modification (it occurs even if Lucy = Lucy_2), but about internal consistency: it's the old question of: "how can I know that I'm not insane?"
Well, you can't know, and there fundamentally is no way around it: Gödel's incompleteness theorem, Löb's theorem, Rice's theorem, etc. are all formalized restatements of this fact.
So, why can't Lucy just assume that it is sane? It seems to work well for humans.
So, why can't Lucy just assume that it is sane?
Lucy reasons using formal mathematical logic. A formal mathematical logic cannot "assume it is sane" because of Loeb's theorem. As long as Lucy isn't self-modifying, all is fine and dandy. She just proves theorems and acts accordingly, it's not like she has a built-in compulsion to prove her own sanity. However, once she begins to self-modify (or encounters another agent of similar complexity), she has to start proving theorems about Lucy2. Now, since Lucy2 is at least as smart as Lucy and prefera...
In this post I intend to:
Background
Logic
When can we consider a mathematical theorem to be established? The obvious answer is: when we proved it. Wait, proved it in what theory? Well, that's debatable. ZFC is popular choice for mathematicians, but how do we know it is consistent (let alone sound, i.e. that it only proves true sentences)? All those spooky infinite sets, how do you know it doesn't break somewhere along the line? There's lots of empirical evidence, but we can't prove it, and it's proofs we're interesting in, not mere evidence, right?
Peano arithmetic seems like a safer choice. After all, if the natural numbers don't make sense, what does? Let's go with that. Suppose we have a sentence s in the language of PA. If someone presents us with a proof p in PA, we believe s is true. Now consider the following situations: instead of giving you a proof of s, someone gave you a PA-proof p1 that p exists. After all, PA admits defining "PA-proof" in PA language. Common sense tells us that p1 is a sufficient argument to believe s. Maybe, we can prove it within PA? That is, if we have a proof of "if a proof of s exists then s" and a proof of R(s)="a proof of s exists" then we just proved s. That's just modus ponens.
There are two problems with that.
First, there's no way to prove the sentence L:="for all s if R(s) then s", since it's not a PA-sentence at all. The problem is that "for all s" references s as a natural number encoding a sentence. On the other hand, "then s" references s as the truth-value of the sentence. Maybe we can construct a PA-formula T(s) which means "the sentence encoded by the number s is true"? Nope, that would get us in trouble with the liar paradox (it would be possible to construct a sentence saying "this sentence is false").
Second, Loeb's theorem says that if we can prove L(s):="if R(s) exists then s" for a given s, then we can prove s. This is a problem since it means there can be no way to prove L(s) for all s in any sense, since it's unprovable for s which are unprovable. In other words, if you proved not-s, there is no way to conclude that "no proof of s exists".
What if we add an inference rule Q to our logic allowing to go from R(s) to s? Let's call the new formal system PA1. p1 appended by a Q-step becomes an honest proof of s in PA1. Problem solved? Not really! Now someone can give you a proof of
R1(s):="a PA1-proof of s exists". Back to square one! Wait a second, what if we add a new rule Q1 allowing to go from R1(s) to s? OK, but now we got R2(s):="a PA2-proof of s exists". Hmm, what if add an infinite number of rules Qk? Fine, but now we got Rω(s):="a PAω-proof of s exists". And so on, and so forth, the recursive ordinals are a plenty...
Bottom line, Loeb's theorem works for any theory containing PA, so we're stuck.
AI
Suppose you're trying to build a self-modifying AGI called "Lucy". Lucy works by considering possible actions and looking for formal proofs that taking one of them will increase expected utility. In particular, it has self-modifying actions in its strategy space. A self-modifying action creates essentially a new agent: Lucy2. How can Lucy decide that becoming Lucy2 is a good idea? Well, a good step in this direction would be proving that Lucy2 would only take actions that are "good". I.e., we would like Lucy to reason as follows "Lucy2 uses the same formal system as I, so if she decides to take action a, it's because she has a proof p of the sentence s(a) that 'a increases expected utility'. Since such a proof exits, a does increase expected utility, which is good news!" Problem: Lucy is using L in there, applied to her own formal system! That cannot work! So, Lucy would have a hard time self-modifying in a way which doesn't make its formal system weaker.
As another example where this poses a problem, suppose Lucy observes another agent called "Kurt". Lucy knows, by analyzing her sensory evidence, that Kurt proves theorems using the same formal system as Lucy. Suppose Lucy found out that Kurt proved theorem s, but she doesn't know how. We would like Lucy to be able to conclude s is, in fact, true (at least with the probability that her model of physical reality is correct). Alas, she cannot.
See MIRI's paper for more discussion.
Evidence Logic
Here, cousin_it explains a method to assign probabilities to sentences in an inconsistent theory T. It works as follows. Consider sentence s. Since T is inconsistent, there are T-proofs both of s and of not-s. Well, in a courtroom both sides are allowed to have arguments, why not try the same approach here? Let's weight the proofs as a function of their length, analogically to weighting hypotheses in Solomonoff induction. That is, suppose we have a prefix-free encoding of proofs as bit sequences. Then, it makes sense to consider a random bit sequence and ask whether it is a proof of something. Define the probability of s to be
P(s) := (probability of a random sequence to be a proof of s) / (probability of a random sequence to be a proof of s or not-s)
Nice, but it doesn't solve the Loebian obstacle yet.
I will now formulate an extension of this idea that allows assigning an interval of probabilities [Pmin(s), Pmax(s)] to any sentence s. This interval is a sort of "Knightian uncertainty". I have some speculations how to extract a single number from this interval in the general case, but even without that, I believe that Pmin(s) = Pmax(s) in many interesting cases.
First, the general setting:
[Pmin(s), Pmax(s)] by definition.