An early draft of publication #2 in the Open Problems in Friendly AI series is now available: Tiling Agents for Self-Modifying AI, and the Lobian Obstacle. ~20,000 words, aimed at mathematicians or the highly mathematically literate. The research reported on was conducted by Yudkowsky and Herreshoff, substantially refined at the November 2012 MIRI Workshop with Mihaly Barasz and Paul Christiano, and refined further at the April 2013 MIRI Workshop.
Abstract:
We model self-modication in AI by introducing 'tiling' agents whose decision systems will approve the construction of highly similar agents, creating a repeating pattern (including similarity of the offspring's goals). Constructing a formalism in the most straightforward way produces a Godelian difficulty, the Lobian obstacle. By technical methods we demonstrate the possibility of avoiding this obstacle, but the underlying puzzles of rational coherence are thus only partially addressed. We extend the formalism to partially unknown deterministic environments, and show a very crude extension to probabilistic environments and expected utility; but the problem of finding a fundamental decision criterion for self-modifying probabilistic agents remains open.
Commenting here is the preferred venue for discussion of the paper. This is an early draft and has not been reviewed, so it may contain mathematical errors, and reporting of these will be much appreciated.
The overall agenda of the paper is introduce the conceptual notion of a self-reproducing decision pattern which includes reproduction of the goal or utility function, by exposing a particular possible problem with a tiling logical decision pattern and coming up with some partial technical solutions. This then makes it conceptually much clearer to point out the even deeper problems with "We can't yet describe a probabilistic way to do this because of non-monotonicity" and "We don't have a good bounded way to do this because maximization is impossible, satisficing is too weak and Schmidhuber's swapping criterion is underspecified." The paper uses first-order logic (FOL) because FOL has a lot of useful standard machinery for reflection which we can then invoke; in real life, FOL is of course a poor representational fit to most real-world environments outside a human-constructed computer chip with thermodynamically expensive crisp variable states.
As further background, the idea that something-like-proof might be relevant to Friendly AI is not about achieving some chimera of absolute safety-feeling, but rather about the idea that the total probability of catastrophic failure should not have a significant conditionally independent component on each self-modification, and that self-modification will (at least in initial stages) take place within the highly deterministic environment of a computer chip. This means that statistical testing methods (e.g. an evolutionary algorithm's evaluation of average fitness on a set of test problems) are not suitable for self-modifications which can potentially induce catastrophic failure (e.g. of parts of code that can affect the representation or interpretation of the goals). Mathematical proofs have the property that they are as strong as their axioms and have no significant conditionally independent per-step failure probability if their axioms are semantically true, which suggests that something like mathematical reasoning may be appropriate for certain particular types of self-modification during some developmental stages.
Thus the content of the paper is very far off from how a realistic AI would work, but conversely, if you can't even answer the kinds of simple problems posed within the paper (both those we partially solve and those we only pose) then you must be very far off from being able to build a stable self-modifying AI. Being able to say how to build a theoretical device that would play perfect chess given infinite computing power, is very far off from the ability to build Deep Blue. However, if you can't even say how to play perfect chess given infinite computing power, you are confused about the rules of the chess or the structure of chess-playing computation in a way that would make it entirely hopeless for you to figure out how to build a bounded chess-player. Thus "In real life we're always bounded" is no excuse for not being able to solve the much simpler unbounded form of the problem, and being able to describe the infinite chess-player would be substantial and useful conceptual progress compared to not being able to do that. We can't be absolutely certain that an analogous situation holds between solving the challenges posed in the paper, and realistic self-modifying AIs with stable goal systems, but every line of investigation has to start somewhere.
Parts of the paper will be easier to understand if you've read Highly Advanced Epistemology 101 For Beginners including the parts on correspondence theories of truth (relevant to section 6) and model-theoretic semantics of logic (relevant to 3, 4, and 6), and there are footnotes intended to make the paper somewhat more accessible than usual, but the paper is still essentially aimed at mathematically sophisticated readers.
Neither 2 nor 3 is the sort of argument I would ever make (there's such a thing as an attempted steelman which by virtue of its obvious weakness doesn't really help). You already know(?) that I vehemently reject all attempts to multiply tiny probabilities by large utilities in real-life practice, or to claim that the most probable assumption about a background epistemic question leads to a forgone doom and use this to justify its improbable negation. The part at which you lost me is of course part 1.
I still don't understand what you could be thinking here, and feel like there's some sort of basic failure to communicate going on. I could guess something along the lines of "Maybe Jonah is imagining that Friendly AI will be built around principles completely different from modern decision theory and any notion of a utility function..." (but really, is something like that one of just 10^10 equivalent candidates?) "...and more dissimilar to that than logical AI is from decision theory" (that's a lot of dissimilarity but we already demonstrated conceptual usefulness over a gap that size). Still, that's the sort of belief someone might end up with if their knowledge of AI was limited to popular books extolling the wonderful chaos of neural networks, but that idea is visibly stupid so my mental model of Anna warns me not to attribute it to you. Or I could guess, "Maybe Jonah is Holden-influenced and thinks that all of this discussion is irrelevant because we're going to build a Google Maps AGI", where in point of fact it would be completely relevant, not a tiniest bit less relevant, if we were going to build a planning Oracle. (The experience with Holden does give me pause and make me worry that EA people may think they already know how to build FAI using their personal wonderful idea, just like vast numbers of others think they already know how to build FAI.) But I still can't think of any acceptable steel version of what you mean, and I say again that it seems to me that you're saying something that a good mainstream AI person would also be staring quizzically at.
What would be one of the other points in the 10^10-sized space? If it's something along the lines of "an economic model" then I just explained why if you did something analogous with an economic model it could also be interesting progress, just as AIXI was conceptually important to the history of ideas in the field. I could explain your position by supposing that you think that mathematical ideas never generalize across architectures and so only analyzing the exact correct architecture of a real FAI could be helpful even at the very beginning of work, but this sounds like a visibly stupid position so the model of Anna in my head is warning me not to attribute it to you. On the other hand, some version of, "It is easier to make progress than Jonah thinks because the useful generalization of mathematical ideas does not require you to select correct point X out of 10^10 candidates" seems like it would almost have to be at work here somewhere.
I seriously don't understand what's going on in your head here. It sounds like any similar argument should Prove Too Much by showing that no useful work or conceptual progress could have occurred due to AI work in the last 60 years because there would be 10^10 other models for AI. Each newly written computer program is unique but the ideas behind them generalize, the resulting conceptual space can be usefully explored, that's why we don't start over with every new computer program. You can do useful things once you've collected enough treasure nuggets and your level of ability builds up, it's not a question of guessing the one true password out of 10^10 tries with nothing being progress until then. This is true on a level of generality which applies across computer science and also to AI and also to FAI and also to decision theory and also to math. Everyone takes this for granted as an obvious background fact of doing research which is why I would expect a good mainstream AI person to also be staring quizzically at your statements here. I do not feel like the defense I'm giving here is in any way different from the defense I'd give of a randomly selected interesting AI paper if you said the same thing about it. "That's just how research works," I'd say.
Please amplify point 1 in much greater detail using concrete examples and as little abstraction as possible.
I continue to appreciate your cordiality.
A number of people have recently told me that they have trouble understanding me unless I elaborate further, because I don't spell out my reasoning in sufficient detail. I think that this is more a matter of the ideas involved being complicated, and there being a lot of inferential distance, than it is lack of effort on my part, but I can see how it would be frustrating to my interlocutors. It seems that I'm subject to the illusion of transparency. I appreciate your patience.
... (read more)