The framework
Here, I will briefly introduce what I hope is a fundamental and potentially comprehensive set of questions that an AGI safety research agenda would need to answer correctly in order to be successful. In other words, I am claiming that a research agenda that neglects these questions would probably not be viable for achieving the goal of AGI safety work arrived at in the previous post: to minimize the risk of AGI-induced existential threat.
I have tried to make this set of questions hierarchical, by which I simply mean that particular questions make sense to ask—and attempt to answer—before other questions; that there is something like a natural progression to building an AGI safety research agenda. As such, each question in this framework basically accepts the (hypothesized) answer from the previous question as input. Here are the questions:
- What is the predicted architecture of the learning algorithm(s) used by AGI?
- What are the most likely bad outcomes of this learning architecture?
- What are the control proposals for minimizing these bad outcomes?
- What are the implementation proposals for these control proposals?
- What is the predicted timeline for the development of AGI?
Some immediate notes and qualifications:
- As stated above, each question Q directly builds on whatever one’s hypothesized answer to question Q-1 happens to be. This is why I am calling this question-framework hierarchical.
- Question 5 is not strictly hierarchical in the same sense as questions 1-4. Rather, I consider one’s hypothesized AGI development timeline to serve as an important ‘hyperparameter’ that calibrates the search strategies researchers adopt in answering questions 1-4.
- I do not intend to rigidly claim that it is impossible to say anything useful about bad outcomes, for example, without first knowing an AGI’s learning algorithm architecture. In fact, most of the outcomes I will actually discuss in this sequence will be architecture-independent (I discuss them for that very reason). I do claim, however, that it is probably impossible to exhaustively mitigate bad outcomes without knowing the AGI’s learning algorithm architecture. Surely, the devil will be at least partly in the details.
- I also do not intend to claim that AGI must consist entirely of learning algorithms (as opposed to learning algorithms being just one component of AGI). Rather, I claim that what makes the AGI safety control problem hard is that the AGI will presumably build many of its own internal algorithms through whatever learning architecture is instantiated. If there are other static or ‘hardcoded’ algorithms present in the AGI, these probably will not meaningfully contribute to what makes the control problem hard (largely because we will know about them in advance).
- If we interpret the aforementioned goal of AGI safety research (minimize existential risk) as narrowly as possible, then we should consider “bad outcomes” in question 2 to be shorthand for “any outcomes that increase the likelihood of existential risk.” However, it seems entirely conceivable that some researchers might wish to expand the scope of “bad outcomes” such that existential risk avoidance remains the priority, but clearly-suboptimal-yet-non-existential outcomes are also treated as worth figuring out how to avoid.
- Control proposals ≠ implementation proposals. I will use the former to refer to things like imitative amplification, safety via debate, etc., and the latter to refer to the distinct problem of getting the people who build AGI to actually adopt these control proposals (i.e., to implement them).
Prescriptive vs. descriptive interpretations
I have ordered the proposed question progression so that it is both logically necessary and methodologically useful. Because of this, I think this framework can be read in two different ways. The first is in line with its intended purpose: to sharpen how the goals of AGI safety research constrain the space of plausible research frameworks from which technical work can subsequently emerge (i.e., it can be read as prescriptive). The second is as a kind of low-resolution prediction about what the holistic progression of AGI safety research will ultimately end up looking like (i.e., it can be read as descriptive). Because each step in the question-hierarchy is logically predicated on the previous step, I believe this framework could serve as a plausible end-to-end story for how AGI safety research will move all the way from its current preparadigmatic state to achieving its goal of successfully implementing control proposals that mitigate AGI-induced existential risks. From this prediction-oriented perspective, these questions might also be thought of as the relevant anticipated ‘checkpoints’ for actualizing the goal of AGI safety research.
Let’s now consider each of the five questions in turn. Because each question builds on the one before it, it makes sense to begin with the first question and work down the list.
I believe that it is very sensible to bring this sort of structure into our approach to AGI safety research, but at the same time it seems very clear that we should update that structure to the best of our ability as we make progress in understanding the challenges and potentials of different approaches.
It is a feedback loop where we make each step according to our best theory of where to make it, and use the understanding gleaned from that step to update the theory (when necessary), which could well mean that we retrace some steps and recalibrate (this can be the case within and across questions). I think this connects to what both Charlie and Tekhne have said, though I believe Tekhne could have been more charitable.
In this light, it makes sense to emphasize the openness of the theory to being updated in this way, which also qualifies the ways in which the theory is allowed to remain incomplete for now. Putting more effort into clarifying what this update process should look like seems like a promising addition to the framework that you propose.
On a more specific note, I felt that Q5 could just be in position 2, and maybe a sixth question could be "What is the predicted timeline for stable safety/control implementations?" or something of the sort.
I also think that phrasing our research in terms of "avoiding bad outcomes" and "controlling the AGI" biases the way in which we pay attention to these problems. I am sure that you will also touch on this in the more detailed presentation of these questions, but at the resolution presented here, I would prefer the phrasing to be more open.
"Aiming at good outcomes while/and avoiding bad outcomes" captures more conceptual territory, while still allowing for the investigation to turn out that avoiding bad outcomes is more difficult and should be prioritised. This extends to the meta-question of whether existential risk can be best adressed by focusing on avoiding bad outcomes, rather than developing a strategy to get to good outcomes (which are often characterised by a better abilitiy to deal with future risks) and avoid bad outcomes on the way there. It might rightfully appear that this is a more ambitious aim, but it is the less predisposed outlook! Many strategy games are based on the idea that you have to accumulate resources and avoid losses while at the same time improving your ability to accumulate resources and avoid losses in the future. Only focusing on the first aspect is a specific strategy in the space of possible ones, and often employed when one is close to losing. This isn't a perfect analogy in a number of ways, but serves to point out the more general outlook.
Similarly, we expect a superintelligent AGI to be beyond our ability to control at some point, which invokes notions of "self-control" on the part of the AGI or "justified trust" on our part - therefore, perhaps "influencing the development of the AGI" would be better, as, again, "influence" can cover more conceptual ground but can still be hardened into the more specific notion of "control" when appropriate.
Hey Robert—thanks for your comment!
Definitely agree—I hope this sequence is read as something much more like a dynamic draft of a theoretical framework than my Permanent Thoughts on Paradigms for AGI Safety™.