Consider an agent A, aware of its own embedding in some lawful universe, able to reason about itself and use that reasoning to inform action.  By interacting with the world, A is able to modify itself or construct new agents, and using these abilities effectively is likely to be an important component of AGI.  Our current understanding appears to be inadequate for guiding such an agent's behavior, for (at least) the following reason:

If A does not believe "A's beliefs reflect reality," then A will lose interest in creating further copies of itself, improving its own reasoning, or performing natural self-modifications. Indeed, if A's beliefs don't reflect reality then creating more copies of A or spending more time thinking may do more harm than good. But if A does believe "A's beliefs reflect reality," then A runs immediately into Gödelian problems: for example, does A become convinced of the sentence Q = "A does not believe Q"? We need to find a way for A to have some confidence in its own behavior without running into these fundamental difficulties with reflection. 
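(To make the Gödelian difficulty explicit, here is the standard machinery, in notation not used elsewhere in this post: write Bel for A's belief predicate, assumed to satisfy the usual derivability conditions. The diagonal lemma produces the sentence Q above, and Löb's theorem rules out adopting the blanket self-trust schema "whatever I believe is true.")

```latex
% The diagonal lemma yields a sentence asserting its own unbelievability:
Q \;\leftrightarrow\; \neg\,\mathrm{Bel}(\ulcorner Q \urcorner)

% Loeb's theorem: if a theory T (extending enough arithmetic) proves the
% self-trust instance for a sentence \varphi, then it already proves \varphi:
T \vdash \mathrm{Bel}(\ulcorner \varphi \urcorner) \rightarrow \varphi
  \quad\Longrightarrow\quad T \vdash \varphi
```

In particular, if A endorsed "A's beliefs reflect reality" as the full schema Bel(⌜φ⌝) → φ, Löb's theorem would commit A to believing every sentence φ.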

This problem has been discussed occasionally at Less Wrong, but I would like to clarify and lay out some examples before trying to start in on a resolution. 

Gödel Machines

The Gödel machine is a formalism described by Schmidhuber for principled self-modification. A Gödel machine is designed to solve some particular object level problem in its allotted time. I will describe one Gödel machine implementation.

The initial machine A has an arbitrary object level problem solver. Before running the object level problem solver, however, A spends half of its time enumerating pairs of strings (A', P); for each one, if A' is a valid description of an agent and P is a proof that A' does better on the object level task than A, then A transforms into A'. 
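The following is a minimal Python sketch of this search loop. The helper callables (is_valid_agent, checks_improvement_proof, and so on) are hypothetical placeholders for machinery left abstract here, and the explicit accounting of the allotted time is elided.

```python
from typing import Any, Callable, Iterable, Tuple

def godel_machine_step(
    current_agent: Any,
    candidates: Iterable[Tuple[Any, Any]],            # enumerated pairs (A', P)
    is_valid_agent: Callable[[Any], bool],
    checks_improvement_proof: Callable[[Any, Any, Any], bool],
    object_level_solver: Callable[[], Any],
) -> Any:
    """Spend the search phase looking for a provably better agent; if none is
    found, fall back to the object level problem solver."""
    for candidate, proof in candidates:
        # The proof is checked in A's own axiom system: it must establish that
        # `candidate` does better on the object level task than `current_agent`.
        if is_valid_agent(candidate) and checks_improvement_proof(
            proof, candidate, current_agent
        ):
            return candidate                           # transform into A'
    return object_level_solver()                       # no justified rewrite found
```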

Now suppose that A's initial search for self-modifications is inefficient: a new candidate agent A' has a more efficient proof checker, and so is able to examine more proofs before defaulting to the original object level problem solver. Intuitively, A' is an improvement over A, and this is (a very mild version of) the sort of self-modification which we would like our agents to perform. For example, a human is easily able to recognize that it can achieve higher utility by modifying itself to think faster. But we will see that under most conditions A cannot prove that A' is an improvement, because A is not able to assume its own consistency (as we do in our justification that A' is an improvement).

Suppose A' considers some new self-modification (A'', P) not considered by A. We would like A to be able to prove that if A' executes this self-modification--if P is a proof that A'' is an improvement over A'--then A'' is really an improvement over A' (and therefore over A). We say that a pair (A'', P) is safe if it has this property, i.e., if either P is not a proof that A'' is an improvement over A', or if A'' is really an improvement over A'.
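Written out as a schema (again in notation of my own choosing), safety of a pair is just this implication:

```latex
\mathrm{Safe}(A'', P) \;:\equiv\;
  \mathrm{ProofOf}\!\big(P,\ \ulcorner A'' \text{ improves on } A' \urcorner\big)
  \;\rightarrow\; \big(A'' \text{ improves on } A'\big)
```

Note that asserting Safe(A'', P) for every pair that A' might consider is a soundness claim about A's own proof system restricted to those sentences, which is exactly the kind of claim that Löb's theorem constrains.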

A can prove that particular pairs are safe, but suppose A' considers many candidate modifications (A'', P1), (A'', P2), ..., (A'', Pk). It is now much harder for A to show that all of these self-modifications are safe, without relying directly on the consistency of A's axiom system (which A can never assume, by the incompleteness theorem). If the set of considered proofs is sufficiently rich, then by a bounded version of Löb's theorem A can only prove that all of these self-modifications are safe if it can prove directly that A'' is an improvement over A'. But in that case A might as well dispense with theorem proving altogether and just commit to transforming into A'' after exhausting all other self-modifications.

(Originally this post was centered around a formal version of the above argument, but although I believe the argument works it turns out to be simultaneously very difficult and very unenlightening. Here is the main difficulty: in order to justify converting into A', A only needs to prove that A' won't execute any destructive modifications if A' doesn't make some other modification before reaching any of the new proposals, and this additional assumption is very difficult to work with. In general, it is very hard to prove anything at all about a Gödel machine's behavior, which is unfortunate because a Gödel machine must do exactly this in order to self-modify.) 

This informal argument does not show that a Gödel machine will not execute any self-improvements, but it does show that if it does execute any non-trivial self-improvements then it must appeal to some justification other than the intuitive one. Other modifications, for example any modification which is so radical as to change the set of self-modifications considered in a more complicated way or to change the architecture fundamentally, seem to be even harder to justify than the one described here. Gödel machines may be able to self-modify, but this argument shows that this conclusion is not supported by intuitive arguments about the existence of beneficial self-modifications. 

Creating New Agents

We can also try a non-reflective technique for designing an "optimal problem solver," in the spirit of Hutter search. Again, suppose we have a particular object level problem and some allotted time.

Consider the agent A which behaves as follows: in the first half of the allotted time, it enumerates pairs (B, P), and if B is a valid description of an algorithm and P is a proof that B has utility at least U on the object level problem (using at most half of the allotted time), then A stores the pair (B, U). After half of the allotted time, the pair (B, U) with maximal U is selected, and then the algorithm B is run using the remaining time. 
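Here is a sketch of this two-phase procedure in the same schematic style as before, with the proof checking and time accounting abstracted into hypothetical callables.

```python
from typing import Any, Callable, Iterable, Optional, Tuple

def select_and_run_best(
    candidates: Iterable[Tuple[Any, Any, float]],      # enumerated triples (B, P, U)
    checks_utility_proof: Callable[[Any, Any, float], bool],
    run: Callable[[Any], Any],
) -> Optional[Any]:
    """Phase 1: record every algorithm B with a verified proof P that it achieves
    utility at least U.  Phase 2: run the algorithm with the largest proven bound."""
    best_algorithm: Optional[Any] = None
    best_bound = float("-inf")
    for algorithm, proof, bound in candidates:
        # P must prove that B achieves utility at least U on the object level
        # problem within its share of the allotted time.
        if checks_utility_proof(proof, algorithm, bound) and bound > best_bound:
            best_algorithm, best_bound = algorithm, bound
    return run(best_algorithm) if best_algorithm is not None else None
```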

This solution seems like it may avoid some of the trouble encountered by the Gödel machine, in exchange for being somewhat less ambitious in its goals. But we will see that the same reflection problem arises.

If the allotted time is very large, and if self-modification is an important component of general intelligence, then we expect A to select a new algorithm B which is itself capable of (provably useful) self-modification. In principle A could simply run long enough to become a very powerful non-self-modifying agent; but it may be that finding a very powerful agent B directly is much more difficult than proceeding through a sequence of successively more powerful agents, each intelligent enough to design the next but not to leap directly to B. 

But suppose A creates a new algorithm B which uses the same axioms as A to prove theorems, and uses these theorems to guide its own self-modification. Again, by Löb's theorem we see that A cannot guarantee that B's self-modifications are really beneficial. This means that A is unable to rule out the possibility that B executes a very bad self-modification (or creates a new agent with very bad behavior); but without ruling out the possibility that B performs very badly, A cannot run B instead of some more conservative algorithm which does not do any self-modification. 

This does not rule out the possibility that A creates a new algorithm B' which pursues self-modification using a similar strategy but with a more restricted set of axioms--one which can be proven consistent from within A. And in turn, if B' takes the same form as A, then any algorithm C which B' creates or modifies into will use a smaller still set of axioms by the same argument. Eliezer describes the proof-theoretic strength of such an agent as a 'telomere'--with every self-modification it becomes weaker, and for Gödelian reasons it cannot be increased without sacrificing provable correctness. 

(The distinction between the Gödel machine and creating new agents is inessential; for example, we could imagine a Gödel machine attempting to self-modify by improving the efficiency of its proof searcher and then using the additional time to consider some rewrites which are justified by a more restricted axiom system.)

Possible Solutions

There are some natural solutions which have been considered. 

Long Telomeres

Recognizing that an agent must decrease its proof-theoretic strength when performing a self-modification, we may just try to construct an initial agent with substantial proof-theoretic strength. For example, if we believe that PA probably has sufficient logical expressiveness to facilitate useful superintelligence, we may create an agent with axioms PA + "PA is consistent" + "PA + 'PA is consistent' is consistent" and so on. We may iterate this process a finite number of times to allow for a few self-modifications, or perhaps add an entire countable chain so that the original AI can decide how much proof-theoretic strength its first descendant needs, or perhaps add a few countable chains so that the agent can self-modify a handful of times before needing to make that decision, or so on. 
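Concretely, writing Con(T) for the arithmetized consistency statement of a theory T, the "long telomere" is an iterated consistency tower along the lines of:

```latex
T_0 = \mathrm{PA}, \qquad
T_{n+1} = T_n + \mathrm{Con}(T_n), \qquad
T_\omega = \bigcup_{n < \omega} T_n, \qquad
T_{\omega+1} = T_\omega + \mathrm{Con}(T_\omega), \;\ldots
```

Since T_{n+1} proves Con(T_n), an agent reasoning in T_{n+1} can license a successor that reasons in T_n; each justified self-modification steps down the tower by at least one level.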

The primary objection to this solution is philosophical: humans appear to be able to reason about self-modification much more robustly (or at least with better results) than an AI using any of the logical frameworks discussed so far, and it is quite ambiguous whether or not a system with 'long telomeres' can ever justify becoming a system which is actually able to reason about self-modification correctly (or, if we have doubts about humans' capacities, whether such a system even exists). Minimally, if we want to understand the form an AI is likely to take after self-modification, we need to understand how such a system would eventually escape from problems that currently seem inescapable; without such understanding the resulting behavior is hard to get a handle on (although we might still be satisfied, if that behavior is backed up by formal justification produced by the AI). 

Lesser objections focus on the additional complexity required to maintain such a long telomere. Each additional consistency axiom added to the system essentially requires recursively specifying the entire agent as it exists so far, and the complexity of the mathematical manipulations required seems to grow rapidly. This objection seems fairly unimportant, and so if the philosophical objections were dispatched I would consider building agents with very strong proof systems a feasible path forward. 

Beliefs rather than Proofs

Humans seem to be able to avoid Gödelian problems with reflection, but they may do this by having confidence in statements which they cannot prove, and by manipulating uncertainty about logical facts rather than dealing in absolute assertions of truth. The set of "statements I strongly believe" seems to differ from the set of "statements I can prove" in several important ways; for example, if I am confident of a statement at one point in time, I am not committed to remaining confident indefinitely. 

I suspect that a satisfactory theory of reflective beliefs would go far towards resolving the AI reflection problem, but this is hardly more than a restatement of the problem. Translating from proofs to beliefs does not immediately resolve the problems with self-reference, it just replaces them with subtly different issues. For example, it is easy to see that an agent should not believe that its own beliefs are well-calibrated on all questions, and so we must attempt to formalize a version of the self-consistency hypothesis which is weak enough to be defensible but still strong enough to escape the shortcomings described above. I hope to make several posts on this topic in the near future. 
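As one illustration of those subtly different issues (my reconstruction, not an argument made above): let P be the agent's belief function, assigning probabilities to sentences, and diagonalize against it.

```latex
R \;\leftrightarrow\; \Big( P(\ulcorner R \urcorner) < \tfrac{1}{2} \Big)
```

If the agent had exact introspective access to P and were certain that those introspective reports are accurate, both cases would fail: P(R) < 1/2 would make R true and known, pushing P(R) up, while P(R) ≥ 1/2 would make R false and known, pushing P(R) down. So any formal version of "my beliefs are well-calibrated" seems to require either approximate or restricted self-knowledge.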

 

Ignorance

Understanding this issue may not be necessary for building safe AGI. Indeed, self-modification may ultimately play a minimal role in intelligence, or we may settle for executing self-modifications using weaker justification. However, if we accept the usual arguments about the importance of FAI, then we should not be satisfied with this solution. 

The importance of self-modification is an open question which has received much discussion here and elsewhere. It is worth adding that, to the extent that we are concerned with influencing probable outcomes for humanity, the highest leverage scenarios seem to be those in which self-modification tends to result in positive feedback loops and takeoff (if we assign such scenarios significant weight). That is, in such scenarios we should be particularly cautious about building self-modifying systems, but there is also a much greater imperative to understand how to design safe and stable AI. 

Standard arguments surrounding FAI (particularly, the importance of early AI goal systems and the fragility of humane value) suggest that agents should have high degrees of confidence in a change before executing it. If an agent's beliefs are not correctly related to reality, the resulting behavior may be as dangerous as if the agent's values were modified: for example, incorrect beliefs about logical structure which cause the agent to fail to preserve its own values in subsequent rewrites, or incorrect beliefs about the relationships between value and reality.

 

Comments

I would like to request some clarifications.

  • Given an agent A as you describe, let AO be its object-level algorithm, and AM, for meta, be its agent-search algorithm. A itself stands for "execute AM for 1/2 the time, and if that didn't transform yourself, execute AO for the remaining time".

    When you say "if A' is a valid description of an agent and P is a proof that A' does better on the object level task than A", do you mean that P proves: 1) that A'O is better than AO at the task T; 2) that A' is better than A at T; 3) that A' is better than AO at T?

    If you mean 2), doesn't it follow that you can get at most 1/2 speedup from any transformation? After all, if A solves T through transforming into A', that is still a case of A solving T, and so the only way A' could be better at it than A is due to the penalty A incurs by searching for (A',P); that penalty is at most 1/2 the time spent by A. It also depends a lot on the search strategy and looks "brittle" as a measure of difference between A and A'.

  • It's not clear to me what you mean in "but suppose A' considers many candidate modifications (A'', P1), (A'', P2), ..., (A'', Pk). It is now much harder for A to show that all of these self-modifications are safe..." - are these different agents A"1, A"2 etc., or the same agent with different proofs? If the latter, it's not clear what "these self-modifications" refer to, as there's just one modification, with different potential proofs; if the former, the language later on that seems to refer to one unique A" is confusing.

  • If we look at the simplest case, where A' runs twice faster than A but as a program is equivalent to A (e.g. suppose they run on the same underlying hardware platform, but A has a systematic slowdown in its code that's trivial to remove and trivial to show it doesn't affect its results), and A is able to prove that, do you claim that this case falls under your scenario, and A in fact should not be able to self-modify into A' w/o assuming its consistency? If you do, are you able, for this specific simple case, to debug the reasoning and show how it carries through?

  • I mean (following Schmidhuber) that A' is better than A. Of course if A transforms into A' then A' and A must be about equally good, but what we should really ask is "how much better is A than some less reflective approach?" Then we could imagine large improvements: for example, A could use the first fourth of its time doubling the speed of its proof searcher, the next eighth doubling it again, the next sixteenth doubling it again, and end up finding a better object level solver that would have taken much longer than the entire allotted time to discover using the original proof searcher. The argument I gave suggests that this can't happen in any obvious way, so that A is actually only a factor of 2 faster than a particular non-reflective proof searcher, but that seems to be a different issue.
  • When I wrote "these self-modifications" I meant "these (self-modifications + justifications)", as safety was a property of a justified self-modification. I'll try and clarify.
  • If A can prove that AO always does better with more time, then A can modify into A' and simultaneously commit to spending less time doing proof search. If it spends much more time doing proof search, it runs into the sort of issue I highlighted (consider the additional justified self-modifications it looks at; it can't prove they are all safe, so it can't prove that considering them is good). I could try and make this argument more clear/explicit.

Thanks, this whole part of your post is clearer to me now. I think the post would benefit from integrating these explanations to some degree into the text.

For example, it is easy to see that an agent should not believe that its own beliefs are well-calibrated on all questions

Hmm. If it believed itself to be well-calibrated on questions where it is certain, we have Löb's paradox; but are there any obvious problems with an agent that thinks it is well-calibrated on all questions where it is not certain?

Long Telomeres

Nice choice of phrase.

Ignorance is my favorite :P


Your third paragraph under "Gödel Machines" is cut off at the end:

its own consistency (as we must when proving

[This comment is no longer endorsed by its author]

The paragraph under "Beliefs rather than Proofs" seems to function as a comprehensive reply to Penrose's Emperor's New Mind.

[. . .] incorrect beliefs about the relationships between value and reality.

To anyone who would help, what are some examples of both correct and incorrect beliefs about the relationships between value and reality?

A simple, maybe somewhat silly, example: I care about making as many happy humans as possible, but believe that I am in a simulation and everyone is a zombie, so spend all of my time trying to escape the simulation (thereby killing all humans).