The framework
Here, I will briefly introduce what I hope is a fundamental and potentially comprehensive set of questions that an AGI safety research agenda would need to answer correctly in order to be successful. In other words, I am claiming that a research agenda that neglects these questions would probably not actually be viable for the goal of AGI safety work arrived at in the previous post: to minimize the risk of AGI-induced existential threat.
I have tried to make this set of questions hierarchical, by which I simply mean that particular questions make sense to ask—and attempt to answer—before other questions; that there is something like a natural progression to building an AGI safety research agenda. As such, each question in this framework basically accepts the (hypothesized) answer from the previous question as input. Here are the questions:
1. What is the predicted architecture of the learning algorithm(s) used by AGI?
2. What are the most likely bad outcomes of this learning architecture?
3. What are the control proposals for minimizing these bad outcomes?
4. What are the implementation proposals for these control proposals?
5. What is the predicted timeline for the development of AGI?
Some immediate notes and qualifications:
- As stated above, each question Q builds directly on one’s hypothesized answer to question Q-1. This is why I am calling this question-framework hierarchical.
- Unlike questions 1-4, question 5 is not strictly hierarchical in this sense. I consider one’s hypothesized AGI development timeline to serve as an important ‘hyperparameter’ that calibrates the search strategies that researchers adopt to answer questions 1-4.
- I do not intend to rigidly claim that it is impossible to say anything useful about bad outcomes, for example, without first knowing an AGI’s learning algorithm architecture. In fact, most of the outcomes I will actually discuss in this sequence will be architecture-independent (I discuss them for that very reason). I do claim, however, that it is probably impossible to exhaustively mitigate bad outcomes without knowing the AGI’s learning algorithm architecture. Surely, the devil will be at least partly in the details.
- I also do not intend to claim that AGI must consist entirely of learning algorithms (as opposed to learning algorithms being just one component of AGI). Rather, I claim that what makes the AGI safety control problem hard is that the AGI will presumably build many of its own internal algorithms through whatever learning architecture is instantiated. If there are other static or ‘hardcoded’ algorithms present in the AGI, these probably will not meaningfully contribute to what makes the control problem hard (largely because we will know about them in advance).
- If we interpret the aforementioned goal of AGI safety research (minimize existential risk) as narrowly as possible, then we should consider “bad outcomes” in question 2 to be shorthand for “any outcomes that increase the likelihood of existential risk.” However, it seems totally conceivable that some researchers might wish to expand the scope of “bad outcomes” so that existential risk avoidance remains the priority, while clearly-suboptimal-but-non-existential risks are also treated as worth figuring out how to avoid.
- Control proposals ≠ implementation proposals. I will be using the former to refer to things like imitative amplification, safety via debate, etc., while I’m using the latter to refer to the distinct problem of getting the people who build AGI to actually adopt these control proposals (i.e., to implement them).
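The hierarchical structure described above can be made concrete with a minimal sketch. Everything in it is an illustrative assumption rather than part of the framework itself: the names `Answer`, `run_framework`, and `placeholder` are hypothetical, the hypotheses are placeholder strings, and the 20-year timeline is an arbitrary value. It shows only the shape of the dependency: each answer to questions 1-4 takes the previous answer as input, while the timeline from question 5 is passed to every step as a shared hyperparameter.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Questions 1-4, in the order the framework asks them (labels are shorthand).
QUESTIONS = [
    "architecture",              # Q1
    "bad outcomes",              # Q2
    "control proposals",         # Q3
    "implementation proposals",  # Q4
]


@dataclass(frozen=True)
class Answer:
    """A hypothesized answer to one question (hypothetical structure)."""
    question: str
    hypothesis: str
    built_on: Optional["Answer"]  # the answer to question Q-1; None for Q1
    timeline_years: float         # Q5, shared by every step


# A function that proposes a hypothesis for a question, given the
# previous answer (if any) and the assumed timeline.
Hypothesize = Callable[[str, Optional[Answer], float], str]


def run_framework(hypothesize: Hypothesize, timeline_years: float) -> List[Answer]:
    """Answer Q1-Q4 in order, feeding each answer into the next question."""
    chain: List[Answer] = []
    previous: Optional[Answer] = None
    for question in QUESTIONS:
        hypothesis = hypothesize(question, previous, timeline_years)
        previous = Answer(question, hypothesis, previous, timeline_years)
        chain.append(previous)
    return chain


def placeholder(question: str, previous: Optional[Answer], timeline_years: float) -> str:
    """Stand-in for real research: it only records what it was built on."""
    basis = previous.question if previous else "nothing earlier"
    return f"hypothesis about {question}, built on {basis}"


# 20.0 is an arbitrary illustrative timeline, not a prediction.
chain = run_framework(placeholder, timeline_years=20.0)
for answer in chain:
    print(answer.hypothesis)
```

Changing the hypothesis for question 1 changes the input to every later question, whereas changing `timeline_years` changes the conditions under which all four are answered, which is the sense in which question 5 sits outside the chain.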
Prescriptive vs. descriptive interpretations
I have tailored the order of the proposed question progression to be both logically necessary and methodologically useful. Because of this, I think that this framework can be read in two different ways. The first is prescriptive: read with its intended purpose in mind, the framework sharpens how the goals of AGI safety research constrain the space of plausible research frameworks from which technical work can subsequently emerge. The second is descriptive: the framework can be read as a kind of low-resolution prediction about what the holistic progression of AGI safety research will ultimately end up looking like. Because each step in the question-hierarchy is logically predicated on the previous step, I believe this framework could serve as a plausible end-to-end story for how AGI safety research will move all the way from its current preparadigmatic state to achieving its goal of successfully implementing control proposals that mitigate AGI-induced existential risks. From this prediction-oriented perspective, then, these questions might also be thought of as the relevant anticipated ‘checkpoints’ for actualizing the goal of AGI safety research.
Let’s now consider each of the five questions in turn. Because they build upon one another, it makes sense to begin with the first question and work down the list.
Thanks for taking the time to write up your thoughts! I appreciate your skepticism. Needless to say, I don't agree with most of what you've written—I'd be very curious to hear if you think I'm missing something:
Surely understanding generic strong optimization is necessary for alignment (as I also spend most of Q1 discussing). How can you be so sure, however, that zooming into something narrower would effectively only add noise? You assert this, but this doesn't seem at all obvious to me. I write in Q2: "It is also worth noting immediately that even if particular [alignment problems] are architecture-independent [your point!], it does not necessarily follow that the optimal control proposals for minimizing those risks would also be architecture-independent! For example, just because an SL-based AGI and an RL-based AGI might both hypothetically display tendencies towards instrumental convergence does not mean that the way to best prevent this outcome in the SL AGI would be the same as in the RL AGI."
By analogy, consider the more familiar 'alignment problem' of training dogs (i.e., getting the goals of dogs to align with the goals of their owners). Surely there are 'breed-independent' strategies for doing this, but it is not obvious that these strategies will be sufficient for every breed: Afghan Hounds, for example, are apparently way harder to train than, say, Golden Retrievers. So in addition to the generic-dog-alignment-regime, Afghan Hounds require some additional special training to ensure they're aligned. I don't yet understand why you are confident that different possible AGIs could not follow this same pattern.
I think that you think that I mean something far more specific than I actually do when I say "particular architecture," so I don't think this accurately characterizes what I believe. I describe my view in the next post.
I think this is a very interesting point (and I have not read Eliezer's post yet, so I am relying on your summary), but I don't see what the point of AGI safety research is if we take this seriously. If the unknown unknowns will kill us, how are we to avoid them even in theory? If we can articulate some strategy for addressing them, they are not unknown unknowns; they are "increasingly-known unknowns!"
I devoted the entire first post of this sequence to "figuring out what we want" (we = AGI safety researchers). It seems like what we want is to avoid AGI-induced existential risks. (I am curious if you think this is wrong?) If so, I claim, here is a "strategy that might systematically achieve this:" we need to understand what we mean when we say AGI (Q1), figure out what risks are likely to emerge from AGI (Q2), mitigate these risks (Q3), and implement these mitigation strategies (Q4).
If by "figure out what we want," you mean "figure out what we want out of an AGI," I definitely agree with this (see Robert's great comment below!). If by "figure out what we want," you mean "figure out what we want out of AGI safety research," well, that is the entire point of this sequence!
I completely disagree with this. It will definitely depend on the competitiveness of the relevant proposals, the incentives of the people who have control over the AGI, and a bunch of other stuff that I discuss in Q4 (which hasn't even been published yet—I hope you'll read it!).
When you frame it this way, I completely agree. However, there is definitely a continuous space of plausible timelines between "all-the-time-in-the-world" and "hail-Mary," and I think the probabilities of success [P(success|timeline) * P(timeline)] fluctuate non-obviously across this spectrum. Again, I hope you will withhold your final judgment of my claim until you see how I defend it in Q5! (I suppose my biggest regret in posting this sequence is that I didn't just do it all at once.)
I think this is a bit uncharitable. I have worked with and/or talked to lots of different AGI safety researchers over the past few months, and this framework is the product of my having "consider[ed] a wide variety of approaches, and look[ed] for subquestions which are clearly crucial to all of them." Take, for instance, this chart in Q1: I am proposing a single framework for talking about AGI that potentially unifies brain-based vs. prosaic approaches. That seems like a useful and productive thing to be doing at the paradigm-level.
I definitely agree that things like how we define 'control' and 'bad outcomes' might differ between approaches, but I do claim that every approach I have encountered thus far operates using the questions I pose here (e.g., every safety approach cares about AGI architectures, bad outcomes, control, etc. of some sort). To test this claim, I would very much appreciate the presentation of a counterexample if you think you have one!
Thanks again for your comment, and I definitely want to flag that, in spite of disagreeing with it in the ways I've tried to describe above, I really do appreciate your skepticism and engagement with this sequence (I cite your preparadigmatic claim a number of times in it).
As I said to Robert, I hope this sequence is read as something much more like a dynamic draft of a theoretical framework than my Permanent Thoughts on Paradigms for AGI Safety™.