I tried to write down my idea a few times, but it was badly wrong each time. Now, instead of solving the problem, I'm just going to give a more conservative summary of what the problem is.
-
Eliezer's talk at the 2011 Singularity Summit focused largely on the AI reflection problem (how to build an AI that can prove things about its own proofs, and execute self-modifications on the basis of those proofs, without thereby reducing its self-modification mojo). To that end, it would be nice to have a "reflection principle" by which an AI (or its theorem prover) can know, in a self-referential way, that its theorem-proving activities are working as they should.
The naive way to do this is to use the standard provability predicate, ◻, which can be thought of as asking whether a proof of a given formula exists. Using this, we can try to formalize our intuition that a fully reflective AI, one that can reason about itself in order to improve itself, should understand that its proof-deriving behavior does in fact produce sentences derivable from its axioms:
AI ⊢ ◻P → P,
which is intended to be read as "the formal system AI understands, in general, that if a sentence is provable then it is true", though literally it means something a bit more like "for each sentence P, it is derivable in the formal system AI that if there exists a proof of P, then P".
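To unpack the notation: ◻P abbreviates the arithmetized claim "there exists a proof of P from AI's axioms", i.e. something like

◻P := ∃x Proof_AI(x, ⌜P⌝),

where ⌜P⌝ is the Gödel number of P and Proof_AI(x, y) is the arithmetical formula saying "x encodes a derivation, from AI's axioms, of the sentence coded by y" (the exact formula depends on how you arithmetize syntax, so treat Proof_AI as a schematic name, not official notation). So the schema above really is a statement inside AI about AI's own proofs.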
Surprisingly, adding this to a formal system like Peano Arithmetic doesn't work so well. In particular, Löb showed that adding this reflection principle for all sentences P lets us derive any statement whatsoever, including contradictions.
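Here is a sketch of how that goes wrong, using only the standard facts (assuming AI extends Peano Arithmetic and ◻ is AI's own provability predicate):

1. Löb's theorem: for any sentence P, if AI ⊢ ◻P → P, then AI ⊢ P.
2. The proposed reflection principle gives AI ⊢ ◻P → P for every P, including P = "0=1".
3. So by Löb's theorem, AI ⊢ 0=1, and from that, AI ⊢ Q for every sentence Q.

The special case P = "0=1" is also illuminating on its own: ◻(0=1) → (0=1) is equivalent (given ¬(0=1)) to ¬◻(0=1), i.e. "AI is consistent", and by Gödel's second incompleteness theorem a consistent system of this kind cannot prove its own consistency, so a system that does prove it must be inconsistent.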
So our nice reflection principle is broken, and we don't understand reflective reasoning as well as we'd like. At this point we can brainstorm some new reflection principles: maybe a reflection principle should be derivable within our formal system, instead of being tacked on. We can also try to figure out in a deeper way why "AI ⊢ ◻P → P" doesn't work. If we can derive all sentences, does that mean proofs of contradictions actually do exist? If so, why weren't those proofs breaking our theorem provers before we added the reflection principle? Or do proofs of all sentences exist only for the augmented theorem prover, not for the initial formal system? That would suggest our reflection principle allows the AI to trust itself too much, letting it derive things simply because deriving them makes them derivable, though it doesn't really look like that if you just stare at the old reflection principle. We are confused. Let's unconfuse ourselves.
Thanks for the link! Now I understand Eliezer's question a little better. Let me try to think out loud and see if I'm right...
The Gödel machine works like this: it outputs a stream of actions according to some not-very-good algorithm, and keeps looping through proofs on the side. As soon as it finds a proof that replacing its whole source code leads to higher utility (from the next turn onwards) than keeping it as is, it performs the replacement. There's no talk of replacing just the proof verifier.
Any such improvement is "not greedy" by Schmidhuber's "Global optimality theorem", which notes that the utility of staying with the old source code for another turn already takes into account all other possible self-improvements that the machine could do in the future. So the very first self-improvement found by the machine will likely look like an immediate jump to optimality.
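For concreteness, here is a rough Python sketch of that control loop as I understand it; all the names (initial_policy, proof_search, proves_improvement, and so on) are my own placeholders rather than anything from Schmidhuber's formalism, and the real machine's rewrite also replaces the proof searcher itself, which this sketch glosses over:

```python
from typing import Callable, Iterator, Tuple

Policy = Callable[[object], object]   # maps an observation to an action
Proof = object                        # opaque stand-in for a formal proof


def godel_machine(initial_policy: Policy,
                  proof_search: Iterator[Tuple[Policy, Proof]],
                  proves_improvement: Callable[[Proof, Policy, Policy], bool],
                  observe: Callable[[], object],
                  act: Callable[[object], None]) -> None:
    """Schematic loop: act with the current (initially not-very-good) policy
    while searching for a proof that rewriting the whole program yields higher
    utility, from the next turn onwards, than keeping it as is."""
    policy = initial_policy
    for new_policy, proof in proof_search:
        # Keep interacting with the environment while the proof search runs.
        act(policy(observe()))

        # Switch as soon as *any* such proof is found; the Global Optimality
        # Theorem says this isn't greedy, because the utility of keeping the
        # old code for another turn already accounts for every later
        # self-improvement the old code could have found.
        if proves_improvement(proof, policy, new_policy):
            policy = new_policy  # replace the whole source code
```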
The question arises: if the first self-improvement is guaranteed to be so awesome, will the machine ever find it at all, or will the proof search loop forever?
In some simple scenarios the proof search will in fact terminate. For example, if the machine has the option of pressing a lever that maxes out the reward, it can easily prove that self-modifying to always press the lever has optimal utility compared to any other self-improvement it can come up with. Note that in such simple scenarios the machine doesn't actually get to upgrade its proof verifier to a more advanced one, it just jumps to the best strategy.
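To make the lever example concrete, with some toy notation of my own: suppose each press of the lever yields the maximum possible per-step reward r_max, and T steps remain. Then

u(always press the lever) = r_max · T ≥ r_1 + r_2 + … + r_T = u(any other program),

so the statement the proof search needs is just this elementary inequality about bounded sums, not anything self-referential about the verifier.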
There are also more complex scenarios. For example, sometimes you can improve your starting algorithm a lot but the ultimate optimal strategy is uncomputable, or maybe it's computable but the proof of optimality is out of your axiom system's reach. In such cases it's not clear to me why the machine's proof search should ever terminate, and I couldn't find a convincing answer to that in Schmidhuber's writings, but Löb's theorem seems to be only one of the many possible obstacles here.
Eliezer's question seems to be mostly about such complex scenarios. I think it's more enlightening to ask directly whether the Gödel machine will execute even one self-improvement, without dragging in Löb's theorem. This gives us a hard math question that I'm too lazy and/or stupid to answer :-(
Well, I don't think I can solve this either, at least not quickly enough to justify the time. The combination of Löb's theorem and the requirement for absolute proof (it couldn't even tentatively justify the use of, say, RSA without a proof that it is unbreakable) severely limits its abilities. Rereading the paper, I notic...