The framework
Here, I will briefly introduce what I hope is a fundamental and potentially comprehensive set of questions that an AGI safety research agenda would need to answer correctly in order to be successful. In other words, I am claiming that a research agenda that neglects these questions would probably not be viable for achieving the goal of AGI safety work arrived at in the previous post: to minimize AGI-induced existential risk.
I have tried to make this set of questions hierarchical, by which I simply mean that particular questions make sense to ask—and attempt to answer—before other questions; that there is something like a natural progression to building an AGI safety research agenda. As such, each question in this framework basically accepts the (hypothesized) answer from the previous question as input. Here are the questions:
- What is the predicted architecture of the learning algorithm(s) used by AGI?
- What are the most likely bad outcomes of this learning architecture?
- What are the control proposals for minimizing these bad outcomes?
- What are the implementation proposals for these control proposals?
- What is the predicted timeline for the development of AGI?
Some immediate notes and qualifications:
- As stated above, each question Q directly builds on one’s hypothesized answer to question Q-1. This is why I am calling this question-framework hierarchical.
- Question 5 is not strictly hierarchical in this sense, unlike questions 1-4. Rather, I consider one’s hypothesized AGI development timeline to serve as an important ‘hyperparameter’ that calibrates the search strategies that researchers adopt to answer questions 1-4.
- I do not intend to rigidly claim that it is impossible to say anything useful about bad outcomes, for example, without first knowing an AGI’s learning algorithm architecture. In fact, most of the outcomes I will actually discuss in this sequence will be architecture-independent (I discuss them for that very reason). I do claim, however, that it is probably impossible to exhaustively mitigate bad outcomes without knowing the AGI’s learning algorithm architecture. Surely, the devil will be at least partly in the details.
- I also do not intend to claim that AGI must consist entirely of learning algorithms (as opposed to learning algorithms being just one component of AGI). Rather, I claim that what makes the AGI safety control problem hard is that the AGI will presumably build many of its own internal algorithms through whatever learning architecture is instantiated. If there are other static or ‘hardcoded’ algorithms present in the AGI, these probably will not meaningfully contribute to what makes the control problem hard (largely because we will know about them in advance).
- If we interpret the aforementioned goal of AGI safety research (minimize existential risk) as narrowly as possible, then we should consider “bad outcomes” in question 2 to be shorthand for “any outcomes that increase the likelihood of existential risk.” However, it seems totally conceivable that some researchers might wish to expand the scope of “bad outcomes” so that existential risk avoidance remains the priority, but clearly-suboptimal-yet-non-existential outcomes are also considered worth avoiding.
- Control proposals ≠ implementation proposals. I will be using the former to refer to things like imitative amplification, safety via debate, etc., while I’m using the latter to refer to the distinct problem of getting the people who build AGI to actually adopt these control proposals (i.e., to implement them).
Prescriptive vs. descriptive interpretations
I have ordered the proposed question progression to be both logically necessary and methodologically useful. Because of this, I think this framework can be read in two different ways. The first reading takes it at its intended purpose: to sharpen how the goals of AGI safety research constrain the space of plausible research frameworks from which technical work can subsequently emerge (i.e., it can be read as prescriptive). The second reading treats the framework as a kind of low-resolution prediction about what the holistic progression of AGI safety research will ultimately end up looking like (i.e., it can be read as descriptive). Because each step in the question-hierarchy is logically predicated on the previous step, I believe this framework could serve as a plausible end-to-end story for how AGI safety research will move all the way from its current preparadigmatic state to achieving its goal of successfully implementing control proposals that mitigate AGI-induced existential risks. From this prediction-oriented perspective, then, these questions might also be thought of as the relevant anticipated ‘checkpoints’ for actualizing the goal of AGI safety research.
Let’s now consider each of the five questions in turn. Because each question builds on the one before it, it makes sense to begin with the first question and work down the list.
Your core claim is that all five of these questions need to be answered to minimize AI X-risk. Not only do I disagree with this; I claim that zero of these questions need to be answered to minimize AI X-risk.
Let's go through them in order...
On question 1, the predicted architecture of the learning algorithm(s) used by AGI: my mainline vision for a theory of alignment and agency would be sort of analogous to thermodynamics. Thermodynamics does not care about what architecture we use for our heat engines. Rather, it establishes the universal constraints which apply to all possible heat engines. (... or at least all heat engines which work with more-than-exponentially-tiny-probability.) Likewise, I want a theory of alignment and agency which establishes the universal constraints which apply to all agents (or at least all agents which "work" with more-than-exponentially-tiny-probability).
Why would we expect to be able to find such a theory? One argument: we don't expect that the alignment problem itself is highly architecture-dependent; it's a fairly generic property of strong optimization. So, "generic strong optimization" looks like roughly the right level of generality at which to understand alignment. (This is not the only argument for our ability to find such a theory, but it's a relatively simple one which doesn't need a lot of foundations.) Trying to zoom in on something narrower than that would add a bunch of extra constraints which are effectively "noise", for purposes of understanding alignment.
On top of that, there's the obvious problem that if we try to solve alignment for a particular architecture, it's quite probable that some other architecture will come along and all our work will be obsolete. (At the current pace of ML progress, this seems to happen roughly every 5 years.)
Put all that together, and I think this question is not only unnecessary, but plausibly actively harmful as a guide for alignment research.
(I also note that you have a whole section in your post on question 2 which correctly identifies most of the points I just made; all it's missing is the step of "oh, maybe we just don't actually need to know about the details of the architecture at all".)
On questions 2 and 3, the most likely bad outcomes and the control proposals for minimizing them: I think these two together are also potentially actively harmful. I think the best explanation of this view is currently Yudkowsky's piece on Security Mindset; "figure out the most likely bad outcomes and then propose solutions to minimize these bad outcomes" is exactly what he's arguing against. One sentence summary: it's the unknown unknowns that kill us. The move we want is not "brainstorm failure modes and then avoid the things we brainstormed", it's "figure out what we want and then come up with a strategy which systematically achieves it (automatically ruling out huge swaths of failure modes simultaneously)".
On question 4, the implementation proposals: setting aside that I don't agree with the "control proposals" framing, this question comes the closest to being actually necessary. Certainly we'll need implementations of something at some point.
On the other hand, starting from where we are now, I expect implementation to be relatively easy once we have any clue at all what to implement. So even if it's technically necessary to answer at some point, this question might not be very useful to think about ahead of time. We could solve the problem to a point where AI risk is minimized without necessarily putting significant thought into implementation proposals, especially if the core math ends up being obviously-tractable. (Though, to be clear, I don't think that's a good idea; trying to build a great edifice of theory without empirical feedback of some kind is rarely useful in practice.)
On question 5, the predicted AGI timeline: personally, I consider timelines approximately-irrelevant for my research plans. Whatever the probable-shortest-path is to aligned AI, that's the path to follow, regardless of how long we have.
The case for timeline-relevance is usually "well, if we don't have any hope of properly solving the problem in time, then maybe we need a Hail Mary". That's a valid argument in principle, but in practice, when we multiply probability-of-the-Hail-Mary-actually-working by probability-that-AI-is-coming-that-soon, I expect that number to basically-never favor the Hail Mary. Favoring it would require both an implausibly high probability of the Hail Mary working and an implausibly high degree of certainty that AGI is right around the corner.
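To make the shape of that multiplication concrete, here is a minimal back-of-the-envelope sketch in Python. Every number in it is invented purely for illustration (none are estimates from the post or this comment); the point is only that the Hail Mary term is a product of two probabilities and therefore shrinks quickly.

```python
# Back-of-the-envelope comparison of the two strategies.
# All probabilities below are made up purely for illustration; they are
# not estimates from the post or the comment.

p_agi_that_soon = 0.2          # chance AGI arrives before a proper solution could be finished
p_hail_mary_works = 0.05       # chance a rushed, last-ditch approach actually works
p_proper_solution_works = 0.3  # chance the probable-shortest-path approach works, given enough time

# The Hail Mary only pays off in short-timeline worlds, and even then it has to work.
ev_hail_mary = p_agi_that_soon * p_hail_mary_works

# The standard path pays off in the worlds where there is enough time to carry it out.
ev_standard = (1 - p_agi_that_soon) * p_proper_solution_works

print(f"Hail Mary: {ev_hail_mary:.3f} vs. standard path: {ev_standard:.3f}")
# With these made-up numbers the standard path wins by more than an order of
# magnitude; the Hail Mary only wins if both of its probabilities are pushed
# implausibly high, which is exactly the claim in the paragraph above.
```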
Now, I do expect other people to disagree with that argument (mainly because they have less hope about solving alignment anytime soon without a Hail Mary). But remember that the post's original claim is that timeline estimates are necessary for alignment, which seems like far too strong a claim when I'm sitting here with an at-least-internally-coherent view in which timelines are mostly irrelevant.
More Generally...
Zooming out a level, I think the methodology used to generate these questions is flawed. If you want to identify necessary subquestions, then the main way I know how to do that is to consider a wide variety of approaches, and look for subquestions which are clearly crucial to all of them. Then, try to generate further approaches which circumvent those subquestions; if the subquestions really are necessary, that counterexample-search will fail, and the failure will make clear why they are necessary.
When I imagine what process would generate the questions in this post, I imagine starting with one single approach, looking for subquestions which are clearly crucial to that one approach, and then trying to come up with arguments that those subquestions are necessary (without really searching for necessity-counterexamples to stress-test those arguments).
If I've mischaracterized your process, then I apologize in advance, but currently this hypothesis seems pretty likely.
My recommendation is to go find some entirely different approaches, look for patterns which hold up across approaches, and consider what underlying features of the problem generate those patterns.
On The Bright Side
Complaining aside, you've clearly correctly understood that the subquestions need to be necessary subquestions in order to form a paradigm; that necessity is what allows the paradigm to generalize across the work done by many different people.
I do think that insight is the rate-limiting factor for most people explicitly trying to come up with paradigms. So well done there! I think you're already past the biggest barrier. The next few barriers will involve a lot of frustrating work, a lot of coming up with frameworks which seem good to you only to have other people shoot holes in them, but I think you are probably capable of doing it if you decide to pursue it for a while.
I am not 'so sure'. As I said in the previous comment, I have only claimed that it is probably necessary to, for instance, know more about AGI than just whether it is a 'generic strong optimizer.' I would only be comfortable making non-probabilistic claims about the necessity of pa...