AI reflection problem

I tried to write down my idea a few times, but it was badly wrong each time. Now, Instead of solving the problem, I'm just going to give a more conservative summary of what the problem is.

Eliezer's talk at the 2011 Singularity Summit focused largely on the AI reflection problem (how to build AI that can prove things about its own proofs, and execute self modifications on the basis of those proofs, without thereby reducing its self modification mojo). To that end, it would be nice to have a "reflection principle" by which an AI (or its theorem prover) can know in a self-referential way that its theorem proving activities are working as they should.

The naive way to do this is to use the standard provability predicate, ◻, which can be thought of as asking whether a proof of a given formula exists. Using this we can try to formalize our intuition that a fully reflective AI, one that can reason about itself in order to improve itself, should understand that its proof deriving behavior does in fact produce sentences derivable from its axioms:

AI ⊢ ◻P → P,

which is intended to be read as "The formal system "AI" understand in general that "If a sentence is provable then it is true" ", though literally it means something a bit more like "It is derivable from the formal system "AI" that "if there exists a proof of sentence P, then P", in general for P".

Surprisingly, attempting to add this to a formal system, like Peano Arithmetic, doesn't work so well. In particular, it was shown by Löb that adding this reflection principle in general lets us derive any statement, including contradictions.

So our nice reflection principle is broken. We don't understand reflective reasoning as well as we'd like. At this point, we can brainstorm some new reflection principles: Maybe our reflection principles should be derivable within our formal system, instead of being tacked on. Also, we can try to figure out in a deeper way why "AI ⊢ ◻P → P" doesn't work: If we can derive all sentences, does that mean that proofs of contradictions actually do exist? If so, why weren't those proofs breaking our theorem provers before we added the reflection principle? Or do the proofs for all sentences only exist for the augmented theorem prover, not for the initial formal system? That would suggest our reflection principle is allowing the AI to trust itself too much, allowing it to derive things because deriving them allows them to be derived. Though it doesn't really look like that if you just stare at the old reflection principle. We are confused. Let's unconfuse ourselves.

I tried to write down my idea a few times, but it was badly wrong each time. Now, Instead of solving the problem, I'm just going to give a more conservative summary of what the problem is.

AI ⊢ ◻P → P,

the AI has to prove that this is true for every possible proof

Scott Aaronson wrote

Consider the "self-hating theory" PA+Not(Con(PA)), or Peano Arithmetic plus the assertion of its own inconsistency. We know that if PA is consistent, then this strange theory must be consistent as well -- since otherwise PA would prove its own consistency, which the Incompleteness Theorem doesn't allow. It follows, by the Completeness Theorem, that PA+Not(Con(PA)) must have a model. But what could such a model possibly look like? In particular, what you happen if, within that model, you just asked to see the proof that PA was inconsistent?

I'll tell you what would happen: the axioms would tell you that proof of PA's inconsistency is encoded by a positive integer X. And then you would say, "but what is X?" And the axioms would say, "X." And you would say, "But what is X, as an ordinary positive integer?"

"No, no, no! Talk to the axioms."

"Alright, is X greater or less than 10500,000?"

"Greater." (The axioms aren't stupid: they know that if they said "smaller", then you could simply try every smaller number and verify that none of them encode a proof of PA's inconsistency.)

"Alright then, what's X+1?"

"Y."

And so on. The axioms will keep cooking up fictitious numbers to satisfy your requests, and assuming that PA itself is consistent, you'll never be able to trap them in an inconsistency.

Taking this seriously, it seems that if PA is consistent, then PA + ¬Cons(PA) ⊢ ◻¬Cons(PA) is true, i.e. there exists a term, X, encoding a proof of ¬Cons(PA). It also seems, from "you'll never be able to trap them", that we can't construct this proof. Which is to say, within some PA-like systems of interest, there are things (existentially) provable which are not (constructively) provable. So I agree that an AI should be able to accept statements provable in the second system before modifying to use that system. Our difference is between requiring existential proofs (which are not interesting to me, since things can be provable and yet never proved) and constructive proofs (a set whose membership is undecidable, yet which, by showing equivalence between the two systems or between the first and a subsystem of the second, we can accept). Again however, this is not a real limitation, because the constructive proof of a statement lets us infer that the statement is existentially provable. Thus we recover the "sensible" existentially provable statements, and discard the ones which are provable and yet unprovable.

PA + ¬Cons(PA) ⊢ ◻¬Cons(PA)

I don't think this is quite what you meant. This means PA + ¬Cons(PA) allows you to prove that you can prove ¬Cons(PA), where the second use of the word 'prove' does not make reference to any specific formal system. The turnstile refers to the specific theory that is being used, here PA + ¬Cons(PA), and the box symbol is used when the meaning is clear from context, but you've used both.

Our difference is between requiring existential proofs (which are not interesting to me, since things can be provable and yet never proved) a

... (read more)

9

AI reflection problem

9

9

9

AI reflection problem

9

9