AI reflection problem — LessWrong

9 AI reflection problem

by [anonymous]

27th Nov 2011

2 min read

9

I tried to write down my idea a few times, but it was badly wrong each time. Now, Instead of solving the problem, I'm just going to give a more conservative summary of what the problem is.

Eliezer's talk at the 2011 Singularity Summit focused largely on the AI reflection problem (how to build AI that can prove things about its own proofs, and execute self modifications on the basis of those proofs, without thereby reducing its self modification mojo). To that end, it would be nice to have a "reflection principle" by which an AI (or its theorem prover) can know in a self-referential way that its theorem proving activities are working as they should.

The naive way to do this is to use the standard provability predicate, ◻, which can be thought of as asking whether a proof of a given formula exists. Using this we can try to formalize our intuition that a fully reflective AI, one that can reason about itself in order to improve itself, should understand that its proof deriving behavior does in fact produce sentences derivable from its axioms:

AI ⊢ ◻P → P,

which is intended to be read as "The formal system "AI" understand in general that "If a sentence is provable then it is true" ", though literally it means something a bit more like "It is derivable from the formal system "AI" that "if there exists a proof of sentence P, then P", in general for P".

Surprisingly, attempting to add this to a formal system, like Peano Arithmetic, doesn't work so well. In particular, it was shown by Löb that adding this reflection principle in general lets us derive any statement, including contradictions.

So our nice reflection principle is broken. We don't understand reflective reasoning as well as we'd like. At this point, we can brainstorm some new reflection principles: Maybe our reflection principles should be derivable within our formal system, instead of being tacked on. Also, we can try to figure out in a deeper way why "AI ⊢ ◻P → P" doesn't work: If we can derive all sentences, does that mean that proofs of contradictions actually do exist? If so, why weren't those proofs breaking our theorem provers before we added the reflection principle? Or do the proofs for all sentences only exist for the augmented theorem prover, not for the initial formal system? That would suggest our reflection principle is allowing the AI to trust itself too much, allowing it to derive things because deriving them allows them to be derived. Though it doesn't really look like that if you just stare at the old reflection principle. We are confused. Let's unconfuse ourselves.

Personal Blog

9

New Comment

27 comments, sorted by

top scoring

Click to highlight new comments since: Today at 2:52 PM

[-]cousin_it14y60

It's nice that more people are starting to think about AI-related logic issues. But I don't understand this part:

When we set them equivalent, we get A1 ⊢ ∀t. ∀P. [[t]]P → P which is okay.

Why is it okay? It seems to me that if the AI can prove ∀t. ∀P. [[t]]P → P, then it can prove that ∀P. (∃t. [[t]]P) → P and blow up by Löb's Theorem again. Like this:

Assume that ∃t. [[t]]P
Apply your statement and obtain ∃t. P
Note that t is unused and obtain P

Also, here's Wei Dai's solution from the original SL4 thread, which seems correct to me:

I think you probably misunderstood the paper. The purpose of searching for new theorem provers is to speed up the theorem proving, not to be able to prove new theorems that were unprovable using the old theorem prover. So the Gödel Machine only needs to prove that the new prover will prove a theorem if and only if the current prover will. I don't think this should be a problem.

[-]endoself14y40

This solution does not work. In order to implement it, you'd need your AI to preform actions that are judged to be good after conditioning on its own consistency. It could then replicate its theorem prover, but it could not replicate the decision rule that allows it to replicate the theorem prover, since it can't prove that actions that are good conditional on its consistency are good without conditioning on consistency.

Edit: I found Eliezer's response.

[-]cousin_it14y60

Thanks for the link! Now I understand Eliezer's question a little better. Let me try to think out loud and see if I'm right...

The Gödel machine works like this: it outputs a stream of actions according to some not-very-good algorithm, and keeps looping through proofs on the side. As soon as it finds a proof that replacing its whole source code leads to higher utility (from the next turn onwards) than keeping it as is, it replaces its whole source code. There's no talk about replacing just the proof verifier.

Any such improvement is "not greedy" by Schmidhuber's "Global optimality theorem", which notes that the utility of staying with the old source code for another turn already takes into account all other possible self-improvements that the machine could do in the future. So the very first self-improvement found by the machine will likely look like an immediate jump to optimality.

The question arises, if the first self-improvement is guaranteed to be so awesome, will the machine ever find it at all, or will the proof search loop forever?

In some simple scenarios the proof search will in fact terminate. For example, if the machine has the option of pressing a lever that maxes out the reward, it can easily prove that self-modifying to always press the lever has optimal utility compared to any other self-improvement it can come up with. Note that in such simple scenarios the machine doesn't actually get to upgrade its proof verifier to a more advanced one, it just jumps to the best strategy.

There are also more complex scenarios. For example, sometimes you can improve your starting algorithm a lot but the ultimate optimal strategy is uncomputable, or maybe it's computable but the proof of optimality is out of your axiom system's reach. In such cases it's not clear to me why the machine's proof search should ever terminate, and I couldn't find a convincing answer to that in Schmidhuber's writings, but Löb's theorem seems to be only one of the many possible obstacles here.

Eliezer's question seems to be mostly about such complex scenarios. I think it's more enlightening to ask directly whether the Gödel machine will execute even one self-improvement, without dragging in Löb's theorem. This gives us a hard math question that I'm too lazy and/or stupid to answer :-(

[-]endoself14y40

I think it's more enlightening to ask directly whether the Gödel machine will execute even one self-improvement, without dragging in Löb's theorem. This gives us a hard math question that I'm too lazy and/or stupid to answer :-(

Well I don't think I can solve this either, at least not quickly enough to justify the use of time. The combination of Loeb's theorem and the requirement for absolute proof (it couldn't even tentatively justify the use of, say, RSA without a proof that it is unbreakable) severely limits its abilities. Rereading the paper, I noticed that, since it explicitly compares the new design with the old one, it gets around a few Loebian difficulties, but this is clearly not a solution because we end up with a weird asymmetry between agent that the Goedel machine is willing to self-modify into and agents that the Godel machine is willing to replace itself with through any mechanism other that the self-improvement module.

On a completely different note, this post made me realize how horribly our knowledge is organized. This is an open problem that Eliezer believes is critical to the creation of FAI and in order to find information on it we have to read 7 year old sl4 posts and think about aspects of papers that have clearly been thought about before by very smart people, and yet we are still left with questions for which it is reasonably likely that someone knows the answer. You're one of our decision theory people - you've been thinking about these things for years! If this was sanely organized, you would have heard about the ideas that someone already understand many times over the course of the decision theory discussions.

[-]cousin_it14y40

After thinking about the problem some more, I no longer understand why Eliezer wants the machine to be fully meta and self-modifying. Imagine the following design:

1) An object-level program O that outputs actions into the world. Initially it just sits still and does nothing.

2) A meta-level program M that contains a prior about the world and a utility function, and slowly enumerates all possible proofs in some formal system. When M finds a proof that some object-level program O' achieves higher expected utility than O, it replaces O with O' and goes on searching. When M finds a proof that the current O is optimal, it stops.

In this setup M can't modify itself, but that doesn't matter. If it were able to self-modify, it would spend perhaps a century on finding the initial self-modification, but now it can spend the same century on making O self-modifying instead, so it seems to me that we keep most of the hypothetical speedup. Another bonus is that the combination of O and M looks much easier to reason about than a fully meta machine, e.g. M will obviously find many easy speedups for O without running into Löb or anything. Is there any reason to want a fully meta machine over this?

(Of course all such models share the drawback of AIXI: they think programs live outside the universe, so O can inadvertently destroy M's or its own hardware with its mining claws. Still, these things are interesting to figure out.)

[-]endoself14y00

Of course all such models share the drawback of AIXI: they think programs live outside the universe, so O can inadvertently destroy M's or its own hardware with its mining claws. Still, these things are interesting to figure out.

Wouldn't any solution to this problem enable M to reason about itself, forcing it to consider the effects of being modified by O?

[-]cousin_it14y20

Yeah. This family of questions is the most important one we don't know how to answer. Maybe Eliezer and Marcello have solved it, but they're keeping mum.

[-]endoself14y00

I don't think so. For one thing, Eliezer keeps talking about how important it is to solve this. I don't think his personal ethics would let him try to spread awareness of an open problem if he had actually solved it. Also, he seems to think that if the public knew the answer, it would reduce the chance of the creation of uFAI, so, unless the solution was very different than he expected - I'm thinking of it providing a huge clue toward AGI, but that sounds unlikely - he would have incentive to make it public.

[-]cousin_it14y50

Does Eliezer keep talking about the thing I called "the drawback of AIXI"? It seems to me that he keeps talking about the "AI reflection problem", which is different. And yeah, solving any of these problems would make AGI easier without making FAI much easier.

[-]endoself14y00

No, I was referring to the AI reflection problem in the grandparent.

I don't know if that would make AGI much easier. Even with a good reflective decision theory, you'd still need a more efficient framework for inference than an AIXI-style brute force algorithm. On the other hand, if you could do inference well, you might still make a working AI without solving reflection, but it would be harder to understand its goal system, making it less likely to be friendly. The lack of reflectivity could be an obstacle, but I think that it is more likely that, given a powerful inference algorithm and a lack of concern for the dangers of AI, it wouldn't be that hard to make something dangerous.

[-][anonymous]14y20

In that reply he says the the meta-system would need to make an assumption that the meta system works, which isn't much different from requiring that A1 proves A2 consistent before accepting it, which he mentioned in the first post briefly. That requirement of consistency is not what he originally asked for: that A1 proves that if A2 accepts P, then P, which Wei Dai does solve for the case of equivalent A1 and A2. It's true he doesn't show that such a modification is necessarily an improvement in all environments, but if Schmidhuber's usefulness means "This modification has expected utility greater than or equal to all other modifications including not modifying", as Eliezer used in his talk, then we don't require that, any more than we would require a Bayesian to always make true predictions. The proof of usefulness, like any full expected utility calculation, would require considering the probabilities and values of consequences as they follow from the AIs actions across possible worlds consistent with the AIs observations. That is a huge computation, but I don't see how it's logically impossible.

However, getting back to the separate criterion of consistency, why? Where does consist(A2) ever enter into a calculation? What do we gain by it? If A2 is inconsistent, I can see how that would usually limit the expected utility of becoming A2, but that's just not the same. "Good" shouldn't be conditional on consistency, it's just expectation of valuable consequences.

[-]endoself14y10

if Schmidhuber's usefulness means "This modification has expected utility greater than or equal to all other modifications including not modifying"

Schmidhuber's Goedel machine can handle decision rules other than the expected value, but apart from that, this is correct.

That is a huge computation, but I don't see how it's logically impossible.

It is required to compute its own expected actions. The naive way of doing this collapses into infinite recursion. We can imagine more sophisticated ways that usually work, but usually is not a proof; the Goedel machine only takes expectations over possible states of the environment, not over possible truth-values of mathematical statements. It is unclear how one could preform estimates over the latter; if you have an algorithm that can estimate the probability of, say, Goldbach's conjecture, you have made quite a significant amount of progress in AGI.

Where does consist(A2) ever enter into a calculation? What do we gain by it?

Without either a proof of consist(A2) or some way of estimating the probability of consist(A2), we will not be able to calculate expected utilities.

[-][anonymous]14y20

I agree with Wei Dai's solution, though somehow I didn't see it before. The formula I wrote was incorrect. Do you know why Eliezer is still talking about the problem seven years after Wei Dai provided the solution? Or more specifically, why he wants to have A1 ⊢ ∀P . ◻(A2⊢P) → P instead of A1 ⊢ ∀P . (A2⊢P) → P ?

I've rewritten the post, giving credit to Wei Dai, and expanding on my contribution. Does that look okay?

[-]cousin_it14y30

Your rewrite seems to have the same error as the previous version. A correct formalization of Wei's solution could look like this:

A1 ⊢ ∀P. (A1 ⊢ P) ↔ (A2 ⊢ P)

If A1=A2, that equation collapses into a tautology, as it should.

[-]Manfred14y30

Cool post.

What I thought of when reading this was the idea that the best way to learn Occam's razor is to get born with the right genes, because if you think unlikely things are better beliefs you're already screwed. The analogous "hope you got born right" approach wouldn't care about proving P, but instead wants to prove that the new system and the old system prove the same things.

[-]Nisan14y20

What's the difference between ◻_A2(P) and (A2 ⊢P) ? Do they represent different strings in first-order logic?

[-]endoself14y10

A1 ⊢ ∀P. [[t]]_A2(P) → P

I don't get this. Is t supposed to be a free variable here? How does this mean anything without quantifying over t?

[-][anonymous]14y00

I suppose in addition to confirming that it's wrong I should tell you what it was supposed to mean. For that I'll appeal to some concepts that I considered when I was first working on the problem last Wednesday. Imagine a formula taking encoding terms to encoding terms, such that whenver we're given t which codes a proof for P from A2 then the formula applied to t produces s, a term encoding a proof of P from A1. If the A1 can construct that formula, then it should trust A2. That isn't hugely helpful for solving the problem, because when A1 and A2 are equivalent, we can just use t-type codes as s-type codes, and when they're not then we need to impose heavy restrictions on allowable t-type codes. Also that absolutely does require limiting the set of P. Once you satisfy both of those, you basically can embed a fragment of A2 in A1 and A1 in the fragment, though that's not quite how you go about doing it. There, epistemic duty mostly satisfied for the night.

[This comment is no longer endorsed by its author]Reply

[-][anonymous]14y00

I just wrote it incorrectly. Too tired right now, I'll fix it in the morning.

[-]endoself14y00

If instead we formalize the criterion as

A1 ⊢ ∀P. (A2 ⊢ P) → P

Then it doesn't break down in the case of logical equivalence,

A1 ⊢ ∀P. (A1 ⊢ P) → P

which is okay.

Umm, isn't this exactly the sentence that Loeb's theorem says should not be provable in general?

[-][anonymous]14y00

No, Löb's theorem is about ◻P, "there exists a proof of P", while this is about A1⊢P, "we have proved P". In particular, Löb's theorem says that if we assume "the existence of a proof of P gives P" then we're already assuming P, even if we haven't proved that existence. Which is weird, but correct.

[-]endoself14y00

A1⊢P means that A1 can prove P, not that a proof has been found.

What you're proposing is essentially replacing the axiom (using your notation [[t]]P for "t is a proof of P")

∀P. (∃t. [[t]]P) → P

with an axiom schema

∀P. ([[t]]P) → P

where each value of t gives a different axiom. I'll have to think about whether this works - my initial feeling is no, but I'll look into it more when I have a chance.

Edit: The ∀P should also be interpreted as a schema; it's rather nonstandard to be able to quantify over statements.

[-][anonymous]14y00

In On Explicit Reflection in Theorem Proving and Formal Verification S. Artemov claims that for any t and P, it is derivable that [[t]]P→P, provided that the system has a few properties: the underlying logic is classical or intuitionistic, proofhood in the system is decidable / theoremhood is computably enumerable, the system can represent any computable function and decidable relation, and there is a function rep that maps syntactic terms to ground terms. PatrickRobotham tells me the paper is very messy, so maybe that doesn't matter. I'm clearly over my head.

[-]endoself14y10

The statement is definitely provable for any t and P; it's easy to check whether a proof proves a statement in any normal axiomatic system.

The thing that doesn't work about this is that we can't use it to express our confidence in a proposed self-modification. Given any proof that an AI's modified self can come up with, the original AI can prove that it is valid, but in order to have reason to want to self-modify, the AI has to prove that this is true for every possible proof, which, by Loeb's theorem, is impossible (assuming, of course, that the AI does not think that it would gain utility if it self-modified so as to be inconsistent).

[-][anonymous]14y20

the AI has to prove that this is true for every possible proof

Scott Aaronson wrote

Consider the "self-hating theory" PA+Not(Con(PA)), or Peano Arithmetic plus the assertion of its own inconsistency. We know that if PA is consistent, then this strange theory must be consistent as well -- since otherwise PA would prove its own consistency, which the Incompleteness Theorem doesn't allow. It follows, by the Completeness Theorem, that PA+Not(Con(PA)) must have a model. But what could such a model possibly look like? In particular, what you happen if, within that model, you just asked to see the proof that PA was inconsistent?

I'll tell you what would happen: the axioms would tell you that proof of PA's inconsistency is encoded by a positive integer X. And then you would say, "but what is X?" And the axioms would say, "X." And you would say, "But what is X, as an ordinary positive integer?"

"No, no, no! Talk to the axioms."

"Alright, is X greater or less than 10500,000?"

"Greater." (The axioms aren't stupid: they know that if they said "smaller", then you could simply try every smaller number and verify that none of them encode a proof of PA's inconsistency.)

"Alright then, what's X+1?"

"Y."

And so on. The axioms will keep cooking up fictitious numbers to satisfy your requests, and assuming that PA itself is consistent, you'll never be able to trap them in an inconsistency.

Taking this seriously, it seems that if PA is consistent, then PA + ¬Cons(PA) ⊢ ◻¬Cons(PA) is true, i.e. there exists a term, X, encoding a proof of ¬Cons(PA). It also seems, from "you'll never be able to trap them", that we can't construct this proof. Which is to say, within some PA-like systems of interest, there are things (existentially) provable which are not (constructively) provable. So I agree that an AI should be able to accept statements provable in the second system before modifying to use that system. Our difference is between requiring existential proofs (which are not interesting to me, since things can be provable and yet never proved) and constructive proofs (a set whose membership is undecidable, yet which, by showing equivalence between the two systems or between the first and a subsystem of the second, we can accept). Again however, this is not a real limitation, because the constructive proof of a statement lets us infer that the statement is existentially provable. Thus we recover the "sensible" existentially provable statements, and discard the ones which are provable and yet unprovable.

[-]endoself14y00

PA + ¬Cons(PA) ⊢ ◻¬Cons(PA)

I don't think this is quite what you meant. This means PA + ¬Cons(PA) allows you to prove that you can prove ¬Cons(PA), where the second use of the word 'prove' does not make reference to any specific formal system. The turnstile refers to the specific theory that is being used, here PA + ¬Cons(PA), and the box symbol is used when the meaning is clear from context, but you've used both.

Our difference is between requiring existential proofs (which are not interesting to me, since things can be provable and yet never proved) and constructive proofs (a set whose membership is undecidable, yet which, by showing equivalence between the two systems or between the first and a subsystem of the second, we can accept). Again however, this is not a real limitation, because the constructive proof of a statement lets us infer that the statement is existentially provable. Thus we recover the "sensible" existentially provable statements, and discard the ones which are provable and yet unprovable.

This seems like a description of some constructive logic version of PA. I'm not sure exactly how that would work or if there is a standard constructive axiomatization of the natural numbers, but the set of theorems of constructive logic is a subset of those of classical logic, so this will not be able to prove that some given axiom system is trustable if the original system cannot do so.

You could add constructive theorems if you also add axioms, which I'm not sure whether you're proposing. For example, given a universal Turing machine U and a number x, there is a shortest program p such that U(p) = x and it could be the case, depending on U and x, that PA can prove that such a p exists, but not what it is. Then, by adding axioms that U never halts on certain inputs, it can be possible to deduce the value of p.

I don't see how that could help with the reflection problem though. Do you propose adding soundness of some formal system as an axiom, but preventing Loeb's theorem from applying by switching to constructive logic? I think Loeb's theorem is still valid constructively; it doesn't use negation, but I'm not sure if that is a sufficient criterion to be provable constructively.

[-][anonymous]14y00

It's called Heyting arithmetic, but no, nowhere above do I mean intuitionistic logic when I say constructive. Or I don't think I do. I recognize Löb's theorem and am not trying restrict any system's expressive power to avoid it (i.e. using Dan Willard's self verifying theories).

Maybe I'll try to prove some sufficient conditions for a Gödel machine to self modify in a toy universe given this perspective. That should make it more convincing / concrete.

Moderation Log

Curated and popular this week

27Comments

Comment Permalink

[anonymous]14y20

the AI has to prove that this is true for every possible proof

Scott Aaronson wrote

Consider the "self-hating theory" PA+Not(Con(PA)), or Peano Arithmetic plus the assertion of its own inconsistency. We know that if PA is consistent, then this strange theory must be consistent as well -- since otherwise PA would prove its own consistency, which the Incompleteness Theorem doesn't allow. It follows, by the Completeness Theorem, that PA+Not(Con(PA)) must have a model. But what could such a model possibly look like? In particular, what you happen if, within that model, you just asked to see the proof that PA was inconsistent?

I'll tell you what would happen: the axioms would tell you that proof of PA's inconsistency is encoded by a positive integer X. And then you would say, "but what is X?" And the axioms would say, "X." And you would say, "But what is X, as an ordinary positive integer?"

"No, no, no! Talk to the axioms."

"Alright, is X greater or less than 10500,000?"

"Greater." (The axioms aren't stupid: they know that if they said "smaller", then you could simply try every smaller number and verify that none of them encode a proof of PA's inconsistency.)

"Alright then, what's X+1?"

"Y."

And so on. The axioms will keep cooking up fictitious numbers to satisfy your requests, and assuming that PA itself is consistent, you'll never be able to trap them in an inconsistency.

endoself14y00

PA + ¬Cons(PA) ⊢ ◻¬Cons(PA)

Our difference is between requiring existential proofs (which are not interesting to me, since things can be provable and yet never proved) a

... (read more)

See in context