Tiling Agents for Self-Modifying AI (OPFAI #2)

Eliezer Yudkowsky

An early draft of publication #2 in the Open Problems in Friendly AI series is now available: Tiling Agents for Self-Modifying AI, and the Lobian Obstacle. ~20,000 words, aimed at mathematicians or the highly mathematically literate. The research reported on was conducted by Yudkowsky and Herreshoff, substantially refined at the November 2012 MIRI Workshop with Mihaly Barasz and Paul Christiano, and refined further at the April 2013 MIRI Workshop.

Abstract:

We model self-modication in AI by introducing 'tiling' agents whose decision systems will approve the construction of highly similar agents, creating a repeating pattern (including similarity of the offspring's goals). Constructing a formalism in the most straightforward way produces a Godelian difficulty, the Lobian obstacle. By technical methods we demonstrate the possibility of avoiding this obstacle, but the underlying puzzles of rational coherence are thus only partially addressed. We extend the formalism to partially unknown deterministic environments, and show a very crude extension to probabilistic environments and expected utility; but the problem of finding a fundamental decision criterion for self-modifying probabilistic agents remains open.

Commenting here is the preferred venue for discussion of the paper. This is an early draft and has not been reviewed, so it may contain mathematical errors, and reporting of these will be much appreciated.

The overall agenda of the paper is introduce the conceptual notion of a self-reproducing decision pattern which includes reproduction of the goal or utility function, by exposing a particular possible problem with a tiling logical decision pattern and coming up with some partial technical solutions. This then makes it conceptually much clearer to point out the even deeper problems with "We can't yet describe a probabilistic way to do this because of non-monotonicity" and "We don't have a good bounded way to do this because maximization is impossible, satisficing is too weak and Schmidhuber's swapping criterion is underspecified." The paper uses first-order logic (FOL) because FOL has a lot of useful standard machinery for reflection which we can then invoke; in real life, FOL is of course a poor representational fit to most real-world environments outside a human-constructed computer chip with thermodynamically expensive crisp variable states.

As further background, the idea that something-like-proof might be relevant to Friendly AI is not about achieving some chimera of absolute safety-feeling, but rather about the idea that the total probability of catastrophic failure should not have a significant conditionally independent component on each self-modification, and that self-modification will (at least in initial stages) take place within the highly deterministic environment of a computer chip. This means that statistical testing methods (e.g. an evolutionary algorithm's evaluation of average fitness on a set of test problems) are not suitable for self-modifications which can potentially induce catastrophic failure (e.g. of parts of code that can affect the representation or interpretation of the goals). Mathematical proofs have the property that they are as strong as their axioms and have no significant conditionally independent per-step failure probability if their axioms are semantically true, which suggests that something like mathematical reasoning may be appropriate for certain particular types of self-modification during some developmental stages.

Thus the content of the paper is very far off from how a realistic AI would work, but conversely, if you can't even answer the kinds of simple problems posed within the paper (both those we partially solve and those we only pose) then you must be very far off from being able to build a stable self-modifying AI. Being able to say how to build a theoretical device that would play perfect chess given infinite computing power, is very far off from the ability to build Deep Blue. However, if you can't even say how to play perfect chess given infinite computing power, you are confused about the rules of the chess or the structure of chess-playing computation in a way that would make it entirely hopeless for you to figure out how to build a bounded chess-player. Thus "In real life we're always bounded" is no excuse for not being able to solve the much simpler unbounded form of the problem, and being able to describe the infinite chess-player would be substantial and useful conceptual progress compared to not being able to do that. We can't be absolutely certain that an analogous situation holds between solving the challenges posed in the paper, and realistic self-modifying AIs with stable goal systems, but every line of investigation has to start somewhere.

Parts of the paper will be easier to understand if you've read Highly Advanced Epistemology 101 For Beginners including the parts on correspondence theories of truth (relevant to section 6) and model-theoretic semantics of logic (relevant to 3, 4, and 6), and there are footnotes intended to make the paper somewhat more accessible than usual, but the paper is still essentially aimed at mathematically sophisticated readers.

Abstract:

The opportunity to kill yourself in exchange for $10 is a prototypical case. It's well and good to say "this is only a problem for an agent who uses proofs," but that's not a scenario, that's a type of agent. Yes, real agents will probably not use mathematical logic in the naive way. But as I pointed out in response to Wei Dai, probabilism doesn't make the issues go away. It softens the impossibility proofs, but we still lack possibility proofs (which is what MIRI is working on). So this seems like a weak objection.

Thank you for the example. I do want to say "this is only a problem for an agent who uses proofs" if that's indeed true. It sounds like you agree, but are saying that some analogous but more complicated problem might arise for probabilistic agents, and that it might not be resolved be whoever else is making AI unless this research is done by MIRI. If you have an example of a complication that you think would plausibly arise in practice and have further thoughts on why we shouldn't expect this complication to be avoided by default in the course of the ordinary development of AI, I would be interested in hearing more. These do seem like crucial questions to me if we want to argue that this is an important line of research for the future of AI. Do you agree that these questions are crucial?

Creation process A [natural and cultural evolution] led to agents who don't require a formalized deductive system By analogy, creation process C [people making AGI] will lead to agents who don't require a formalized deductive system Therefore, it is not necessary to take special precautions to ensure that deductive systems are formalized. Do you object to the analogy?

I do object to this analogy, though I now have a better idea of where you are coming from. Here's a stab at how the arguments are different (first thing that came to mind):

My argument says that if creation process A led to agents who overcome obstacle X to doing Z, then the ordinary development of AGI will lead to agents who overcome obstacle X to doing Z.
Your argument says that if creation process A led to agents who overcome obstacle X to doing Z in way Y, then the ordinary development of AGI will lead to agents who overcome obstacle X to doing Z in way Y.

We might want to insert some qualifiers like "obstacle X needs to be essential to the proper functioning of the agent" or something along those lines, and other conditions I haven't thought of may be relevant as well (often the case with analogies). But, basically, though I think the analogy suggests that the ordinary development of AI will overcome Lobian obstacles, I think it is much less supported that AGIs will overcome these obstacles in the same way as humans overcome them. Likewise, humans overcome obstacles to reasoning effectively in certain ways, and I don't think there is much reason to suspect that AGIs will overcome these obstacles in the same ways. Therefore, I don't think that the line of argument I was advancing supports the view that formalizing math and doing mathematical logic will be unhelpful in developing AI.

No one thinks that the world will be destroyed because people built AI's that couldn't handle the Lobian obstruction. That doesn't seem like a sensible position, and I think Eliezer explicitly disavows it in the writeup. The point is that we have some frameworks for reasoning about reasoning. Those formalisms don't capture reflective reasoning, i.e. they don't provide a formal account of how reflective reasoning could work in principle. The problem Eliezer points to is an obvious problem that any consistent framework for reflective reasoning must resolve.

I think what you're saying is that getting a good framework for reasoning about reasoning could be important for making AGI go well. This is plausible to me. And then you're also saying that working on this Lobian stuff is a reasonable place to start. This is not obvious to me, but this seems like something that could be subtle, and I understand the position better now. I also don't think that however you're doing it should necessarily seem reasonable to me right now, even if it is.

Big picture: the big questions I had about this were:

What are some plausible concrete examples of self-modifications where Lob issues might cause you to stumble?
Do you think that people building AGI in the future will stumble over Lob issues if MIRI doesn't work on those issues? If so, why?

I would now ask those questions differently:

What are some plausible concrete examples of cases where machines might fail to reason about self-modification properly if this research isn't done? Why do you think it might happen in these cases?
Do you think that people building AGI in the future will fail to do this research, if it is in fact necessary for building well-functioning AIs? If so, why?

I now have a better understanding of what your answer the first question might look like, though I'm still struggling to imagine what plausibly go wrong in practice if the research isn't done. As far as I can tell, there hasn't been any effort directed at addressing the second question in this specific context so far. Maybe that's because it's thought that it's just part of the general question of whether future elites will handle AI development just fine. I'm not sure it is because it sounds like this may be part of making an AGI work at all, and the arguments I've heard for future elites not navigating it properly seems to turn on safety issues rather than basic functionality issues.

(As usual, I have somewhat less extreme views here than Eliezer.)

saying that some analogous but more complicated problem might arise for probabilistic agents

There is a problem here, we have an impossibility proof for a broad class of agents, and we know of no agents that don't have the problem. Indeed, this limits the relevance of the impossibility proof, but it doesn't limit the realness of the problem.

If you have an example of a complication that you think would plausibly arise in practice and have further thoughts on why we shouldn't expect this

... (read more)

13Eliezer Yudkowsky13y

That's not it, rather: [...] Yep. We have reasoning frameworks like the currently dominant forms of decision theory, but they don't handle reflectivity well. The Lob Problem isn't a top-priority scary thing that is carved upon the tombstones of worlds, it's more like, "Look! We managed to crisply exhibit something very precise that would go wrong with standard methods and get started on analyzing and fixing it! Before we just saw in a more intuitive sense that something would go wrong when we applied standard theories to reflective problems but now we can state three problems very precisely!" (Lob and coherent quantified belief sec. 3, nonmonotonicity of probabilistic reasoning sec. 5.2 & 7, maximizing / satisficing not being good-enough idioms for bounded agents sec. 8.) Problems with reflectivity in general are expectedly carved upon the tombstones of worlds because they expectedly cause problems with goal stability during self-modification. But to make progress on that you need crisp problems to provide fodder for getting started on finding a good shape for a reflective decision theory / tiling self-improving agent.

88

Tiling Agents for Self-Modifying AI (OPFAI #2)

88

88

88

Tiling Agents for Self-Modifying AI (OPFAI #2)

88

88