An early draft of publication #2 in the Open Problems in Friendly AI series is now available: Tiling Agents for Self-Modifying AI, and the Löbian Obstacle. It runs ~20,000 words and is aimed at mathematicians or the highly mathematically literate. The research it reports was conducted by Yudkowsky and Herreshoff, substantially refined at the November 2012 MIRI Workshop with Mihaly Barasz and Paul Christiano, and refined further at the April 2013 MIRI Workshop.
Abstract:
We model self-modification in AI by introducing 'tiling' agents whose decision systems will approve the construction of highly similar agents, creating a repeating pattern (including similarity of the offspring's goals). Constructing a formalism in the most straightforward way produces a Gödelian difficulty, the Löbian obstacle. By technical methods we demonstrate the possibility of avoiding this obstacle, but the underlying puzzles of rational coherence are thus only partially addressed. We extend the formalism to partially unknown deterministic environments, and show a very crude extension to probabilistic environments and expected utility; but the problem of finding a fundamental decision criterion for self-modifying probabilistic agents remains open.
Comments on this post are the preferred venue for discussion of the paper. This is an early draft and has not been reviewed, so it may contain mathematical errors; reports of any errors will be much appreciated.
The overall agenda of the paper is to introduce the conceptual notion of a self-reproducing decision pattern which includes reproduction of the goal or utility function, by exposing a particular possible problem with a tiling logical decision pattern and coming up with some partial technical solutions. This then makes it conceptually much clearer to point out the even deeper problems with "We can't yet describe a probabilistic way to do this because of non-monotonicity" and "We don't have a good bounded way to do this because maximization is impossible, satisficing is too weak, and Schmidhuber's swapping criterion is underspecified." The paper uses first-order logic (FOL) because FOL has a lot of useful standard machinery for reflection which we can then invoke; in real life, FOL is of course a poor representational fit to most real-world environments outside a human-constructed computer chip with thermodynamically expensive crisp variable states.
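For readers who want the obstacle in one line, here is the standard logical background the paper builds on. The statement of Löb's theorem is standard; the Safe(a) predicate and the phrasing of the naive tiling license below are my illustrative gloss, not the paper's exact formalism:

```latex
% Standard background, not quoted from the paper. Assume T is a consistent,
% recursively axiomatized theory extending Peano Arithmetic and \Box_T is its
% provability predicate; \mathrm{Safe}(a) is an illustrative safety predicate.
%
% Loeb's theorem:
\[
  T \vdash \bigl(\Box_T\ulcorner\varphi\urcorner \rightarrow \varphi\bigr)
  \quad\Longrightarrow\quad
  T \vdash \varphi .
\]
% Hence a parent agent reasoning in T cannot adopt the blanket license
% "whatever my successor (also reasoning in T) proves safe is in fact safe":
\[
  T \vdash \bigl(\Box_T\ulcorner\mathrm{Safe}(a)\urcorner \rightarrow \mathrm{Safe}(a)\bigr)
  \quad\Longrightarrow\quad
  T \vdash \mathrm{Safe}(a)
  \qquad\text{for every action } a,
\]
% so the naive tiling criterion collapses: the successor must be restricted to
% a strictly weaker proof system, or the formalism must be modified.
```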
As further background, the idea that something-like-proof might be relevant to Friendly AI is not about achieving some chimera of absolute safety-feeling, but rather about the idea that the total probability of catastrophic failure should not have a significant conditionally independent component on each self-modification, and that self-modification will (at least in initial stages) take place within the highly deterministic environment of a computer chip. This means that statistical testing methods (e.g. an evolutionary algorithm's evaluation of average fitness on a set of test problems) are not suitable for self-modifications which can potentially induce catastrophic failure (e.g. of parts of code that can affect the representation or interpretation of the goals). Mathematical proofs have the property that they are as strong as their axioms and have no significant conditionally independent per-step failure probability if their axioms are semantically true, which suggests that something like mathematical reasoning may be appropriate for certain particular types of self-modification during some developmental stages.
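To make the compounding concern concrete, here is a back-of-the-envelope illustration; the specific numbers (p and n) are mine, not from the paper:

```latex
% Illustrative numbers only: suppose each of n self-modifications carried an
% independent catastrophic-failure probability p.
\[
  \Pr[\text{no catastrophe after } n \text{ steps}] = (1-p)^n \approx e^{-pn},
  \qquad
  p = 10^{-3},\; n = 10^{4}
  \;\Rightarrow\;
  e^{-10} \approx 4.5 \times 10^{-5}.
\]
% A proof-based step, by contrast, fails only if the axioms are false or the
% verifier is broken -- risks that do not add a fresh independent component
% at every self-modification.
```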
Thus the content of the paper is very far off from how a realistic AI would work, but conversely, if you can't even answer the kinds of simple problems posed within the paper (both those we partially solve and those we only pose), then you must be very far off from being able to build a stable self-modifying AI. Being able to say how to build a theoretical device that would play perfect chess given infinite computing power is very far off from the ability to build Deep Blue. However, if you can't even say how to play perfect chess given infinite computing power, you are confused about the rules of chess or the structure of chess-playing computation in a way that would make it entirely hopeless for you to figure out how to build a bounded chess-player. Thus "In real life we're always bounded" is no excuse for not being able to solve the much simpler unbounded form of the problem, and being able to describe the infinite chess-player would be substantial and useful conceptual progress compared to not being able to do that. We can't be absolutely certain that an analogous situation holds between solving the challenges posed in the paper, and realistic self-modifying AIs with stable goal systems, but every line of investigation has to start somewhere.
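As an illustration of what an "unbounded solution" looks like in the chess analogy, here is a minimal sketch of exhaustive minimax; the Game interface (is_terminal, score, legal_moves, play) is hypothetical and nothing here comes from the paper:

```python
# A sketch of an unbounded solution in the above sense: exhaustive minimax over
# a finite two-player zero-sum game. The Game interface (is_terminal, score,
# legal_moves, play) is hypothetical; the point is only that the unbounded
# criterion is simple to state, while no bounded agent could run it on chess.

def perfect_value(game, maximizing):
    """Exact game value under optimal play; exponential time, no pruning."""
    if game.is_terminal():
        return game.score()  # e.g. +1 / 0 / -1 from the maximizer's viewpoint
    child_values = (perfect_value(game.play(move), not maximizing)
                    for move in game.legal_moves())
    return max(child_values) if maximizing else min(child_values)

def perfect_move(game):
    """Best move for the player to move, assumed here to be the maximizer."""
    return max(game.legal_moves(),
               key=lambda move: perfect_value(game.play(move), maximizing=False))
```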
Parts of the paper will be easier to understand if you've read Highly Advanced Epistemology 101 For Beginners, including the parts on correspondence theories of truth (relevant to section 6) and model-theoretic semantics of logic (relevant to sections 3, 4, and 6), and there are footnotes intended to make the paper somewhat more accessible than usual, but the paper is still essentially aimed at mathematically sophisticated readers.
Sorry for the long-delayed reply, Wei!
Yup.
I have two main reasons in mind. First, suppose you are willing to grant that (a) this is a problem that would require humans years of serial research to solve and (b) it looks much easier to build this into an AI designed from scratch rather than bolting it on to an existing AI design that was created without taking these considerations into account, but you still think that (c) it would be a good plan to have the first-generation FAI solve this problem when building the next-generation FAI. Then it seems that you need to assume that the FAI will be much better at AGI design than its human designers before it executes its first self-rewrite, since the human team would by assumption still need years to solve the problem at that point, and the plan wouldn't be particularly helpful if the first-generation FAI needed a similar amount of time or longer. But it seems unlikely to me that we first need to build ultraintelligent machines à la I. J. Good, far surpassing humans, before we can get an intelligence explosion: it seems to me that most of the probability mass should be on the required level of AGI research ability being at or below the level of the human research team working on the AGI. I admit that one possible strategy could be to continue having humans improve the initial FAI until it is superintelligent and then ask it to write a successor from scratch, solving the Löbstacle in the process, but it doesn't seem particularly likely that this is cheaper than solving the problem beforehand.
Second, if we followed this plan, then when building the initial FAI we would be unable to use mathematical logic (or other tools sufficiently similar to be subject to the same issues) in a straightforward way when having it reason about its potential successor. This cuts off a large part of the design space that I'd naturally be looking at. Yes, if we can do it then it's possible in principle to get an FAI to do it, but mimicking human reasoning doesn't seem likely to me to be the easiest way to build a safe AGI.
I agree with you that relying on an FAI team to solve a large number of philosophical problems correctly seems dangerous, although I'm sympathetic to Eliezer's criticism of your outside-view arguments -- I essentially agree with your conclusions, but I think I use more inside-view reasoning to arrive at them (I would need to think longer to tease this apart). I agree with Paul that something like CEV for philosophy, in addition to values, should probably be part of an FAI design. I agree with you that progress in metaphilosophy would be very valuable, but I do not have any concrete leads to follow. Still, I think that having good solutions to some of these problems is not unlikely to be helpful for FAI design (and more helpful to FAI than to uFAI), so some amount of work allocated to these philosophical problems looks like a good thing; and I also think that working on these problems does on average reduce the probability of making a bad mistake, even if we manage to have the FAI do philosophy itself and have it checked by "coherent extrapolated philosophy".
You quoted my earlier comment that making object-level progress seems important enough to be a net positive despite making AGI research more interesting, but I don't really feel that your post or the discussion below it contains much in the way of argument about that point -- could you elaborate on the connection?