I strongly upvoted because this post seemed comprehensive (based on what I've read at LW on these topics) and was written in a very approachable way with very little of the community's typical jargon.
Further, it clearly represents a large amount of work.
If you're trying to make it more legible to outsiders, you should consider defining AGI at the top.
Thanks for the appreciation!
If you're trying to make it more legible to outsiders, you should consider defining AGI at the top.
Good idea, I just added this note to the top:
Terminology note: There is a lot of disagreement about what “intelligence”, “human-level”, “transformative” or AGI even means. For simplicity, I will use AGI as a catch-all term for ‘the kind of powerful AI that we care about’. If you find this unsatisfyingly vague, OpenPhil’s definition of Transformative AI is my favourite precise definition.
Thanks for posting this writeup; overall it reads very well, and it should be useful to newcomers. The threat models section is both compact and fairly comprehensive.
I have a comment on the agendas to build safe AGI section, however. In that section you write:
I focus on three agendas I consider most prominent
When I finished reading the list of three agendas in it, my first thought was 'Why does this not mention other prominent agendas like corrigibility? This list is hardly a birds-eye overview mentioning all prominent agendas to build safe AI.'
Digging deeper, what you write about the 3 proposals in your long Google document is that:
The three proposals I discuss here are just the three I know the most about, have seen the most work on and, in my subjective judgement, the ones it is most worth newcomers to the field learning about.
which is quite different. My feeling is that your Google document's description of your scoping here is much more accurate and helpful to the reader than the 'most prominent' you use above.
Thanks for the feedback! That makes sense, I've updated the intro paragraph to that section to:
There are a range of agendas proposed for how we might build safe AGI, though note that each agenda is far from a complete and concrete plan. I think of them more as a series of confusions to explore and assumptions to test, with the eventual goal of making a concrete plan. I focus on three agendas here; these are just the three I know the most about, have seen the most work on, and, in my subjective judgement, the ones most worth newcomers to the field learning about. This is not intended to be comprehensive; see eg Evan Hubinger’s Overview of 11 proposals for building safe advanced AI for more.
Does that seem better?
For what it's worth, my main bar was a combination of 'do I understand this agenda well enough to write a summary' and 'do I associate at least one researcher and some concrete work with this agenda'. I wouldn't think of corrigibility as passing the second bar, since I've only seen it come up as a term to reason about or aim for, rather than as a fully-fledged plan for how to produce corrigible systems. It's very possible I've missed out on some important work though, and I'd love to hear pushback on this.
Thanks, yes that new phrasing is better.
Bit surprised that you can think of no researchers to associate with corrigibility. MIRI have written concrete work about it, and so has Christiano. It is a major theme in Bostrom's Superintelligence, and it also appears under the phrasing 'problem of control' in Russell's Human Compatible.
In terms of the history of ideas of the field, I think corrigibility is a key motivating concept for newcomers to be aware of. See this writeup on corrigibility, which I wrote in part for newcomers, for links to broader work on it.
I've only seen it come up as a term to reason about or aim for, rather than as a fully-fledged plan for how to produce corrigible systems.
My current reading of the field is that Christiano believes corrigibility will appear as an emergent property of building an aligned AGI according to his agenda, while MIRI (or at least 2021 Yudkowsky) have abandoned their 2015 plans/agenda to produce corrigibility, and now despair of anybody else ever producing it either. The CIRL method discussed by Russell produces a type of corrigibility, but as Russell and Hadfield-Menell point out, this type decays as the agent learns more, so it is not a full solution.
I have written a few papers which have the most fully fledged plans that I am aware of, when it comes to producing (a pretty useful and stable version of) AGI corrigibility. This sequence is probably the most accessible introduction to these papers.
This is great. I'd love to see more stuff like this.
Is anyone aware of articles like Chris Olah's views on AI Safety but for other prominent researchers?
Also, it would be great to see a breakdown of how the approaches of ARC, MIRI, Anthropic, Redwood, etc. differ. Does this exist somewhere?
Thanks for this! I found it interesting and useful.
I don't have much specific feedback, partly because I listened to this via Nonlinear Library while doing other things rather than reading it, but I'll share some thoughts anyway since you indicated being very keen for feedback.
Thanks a lot for the feedback, and the Anki cards! Appreciated. I definitely find that level of feedback motivating :)
These categories were formed by a vague combination of "what things do I hear people talking about/researching" and "what do I understand well enough that I can write intelligent summaries of it" - this is heavily constrained by what I have and have not read! (I am much less good than Rohin Shah at reading everything in Alignment :'( )
Eg, Steve Byrnes does a bunch of research that seems potentially cool, but I haven't read much of it and don't have a good sense of what it's actually about, so I didn't talk about it. And this is not expressing an opinion that, eg, his research is bad.
I've updated towards including a section at the end of each post/section with "stuff that seems maybe relevant that I haven't read enough to feel comfortable summarising".
My Anki cards
Nanda broadly sees there as being 5 main types of approach to alignment research.
Addressing threat models: We keep a specific threat model in mind for how AGI causes an existential catastrophe, and focus our work on things that we expect will help address the threat model.
Agendas to build safe AGI: Let’s make specific plans for how to actually build safe AGI, and then try to test, implement, and understand the limitations of these plans. With an emphasis on understanding how to build AGI safely, rather than trying to do it as fast as possible.
Robustly good approaches: In the long-run AGI will clearly be important, but we're highly uncertain about how we'll get there and what, exactly, could go wrong. So let's do work that seems good in many possible scenarios, and doesn’t rely on having a specific story in mind. Interpretability work is a good example of this.
De-confusion: Reasoning about how to align AGI involves reasoning about complex concepts, such as intelligence, alignment and values, and we’re pretty confused about what these even mean. This means any work we do now is plausibly not helpful and definitely not reliable. As such, our priority should be to do some conceptual work on how to think about these concepts and what we’re aiming for, and trying to become less confused.
I consider the process of coming up with each of the research motivations outlined in this post to be an example of good de-confusion work.
Field-building: One of the biggest factors in how much Alignment work gets done is how many researchers are working on it, so a major priority is building the field. This is especially valuable if you think we’re confused about what work needs to be done now, but will eventually have a clearer idea once we’re within a few years of AGI. When this happens, we want a large community of capable, influential and thoughtful people doing Alignment work.
Nanda focuses on three threat models that he thinks are most prominent and are addressed by most current research:
Power-Seeking AI
You get what you measure
[The case given by Paul Christiano in What Failure Looks Like (Part 1)]
AI-Influenced Coordination Failures
[The case put forward by Andrew Critch, eg in What multipolar failure looks like. Many players get AGI around the same time. They now need to coordinate and cooperate with each other and the AGIs, but coordination is an extremely hard problem. We currently deal with this with a range of existing international norms and institutions, but a world with AGI will be sufficiently different that many of these will no longer apply, and we will leave our current stable equilibrium. This is such a different and complex world that things go wrong, and humans are caught in the cross-fire.]
Nanda considers three agendas to build safe AGI to be most prominent:
Iterated Distillation and Amplification (IDA)
AI Safety via Debate
Solving Assistance Games
[This is Stuart Russell’s agenda, which argues for a perspective shift in AI towards a more human-centric approach.]
Nanda highlights 3 "robustly good approaches" (in the context of AGI risk):
Interpretability
Robustness
Forecasting
[I doubt he sees these as exhaustive - though that's possible - and I'm not sure if he sees them as the most important/prominent/most central examples.]
Disclaimer: I recently started as an interpretability researcher at Anthropic, but I wrote this doc before starting, and it entirely represents my personal views, not those of my employer.
Intended audience: People who understand why you might think that AI Alignment is important, but want to understand what AI researchers actually do and why.
Epistemic status: My best guess.
Epistemic effort: About 70 hours of work on the full sequence, and feedback from over 30 people.
Special thanks to Sydney von Arx and Ben Laurense for getting me to actually finish this, and to all of the many, many people who gave me feedback. This began as my capstone project in the first run of the AGI Safety Fellowship, organised by Richard Ngo and facilitated by Evan Hubinger - thanks a lot to them both!
Meta: This is a heavily abridged overview (4K words) of a longer doc (25K words) I’m writing, giving my birds-eye conceptualisation of the field of Alignment. UPDATE: This was intended to be a full sequence, but will likely be eternally incomplete - you can read a draft of the full sequence here.
Terminology note: There is a lot of disagreement about what “intelligence”, “human-level”, “transformative” or AGI even means. For simplicity, I will use AGI as a catch-all term for ‘the kind of powerful AI that we care about’. If you find this unsatisfyingly vague, OpenPhil’s definition of Transformative AI is my favourite precise definition.
Introduction
What needs to be done to make the development of AGI safe? This is the fundamental question of AI Alignment research, and there are many possible answers.
I've spent the past year trying to get into AI Alignment work, and at first I found it pretty confusing to get my head around what's going on. Anecdotally, this is a common experience. The best way I've found to understand the field is to understand the different approaches to this question. In this post, I try to write up the most common schools of thought on this question, and break down the research that goes on according to which perspective it best fits.
There are already some excellent overviews of the field: I particularly like Paul Christiano’s Breakdown and Rohin Shah’s literature review and interview. The thing I’m trying to do differently here is focus on the motivations behind the work. AI Alignment work is challenging and confusing because it involves reasoning about future risks from a technology we haven’t invented yet. Different researchers have a range of views on how to motivate their work, and this results in a wide range of work, from writing papers on decision theory to training large language models to summarise text. I find it easiest to understand this range of work by framing it as different ways to answer the same fundamental question.
My goal is for this post to be a good introductory resource for people who want to understand what Alignment researchers are actually doing today. I assume familiarity with a good introductory resource, eg Superintelligence, Human Compatible or Richard Ngo’s AGI Safety from First Principles, and that readers have a sense for what the problem is and why you might care about it. I begin with an overview of the most prominent research motivations and agendas. I then dig into each approach, and the work that stems from that view. I especially focus on the different threat models for how AGI leads to existential risk, and the different agendas for actually building safe AGI. In each section, I link to my favourite examples of work in each area, and the best places to read more. Finally, as another way to understand the high-level differences in research motivations, I discuss the different underlying beliefs about how AGI will go, which I’ll refer to as crucial considerations.
Overview
I broadly see there as being five main types of approach to Alignment research, and I break this piece into five main sections, one analysing each approach.
Note: The space of Alignment research is quite messy, and it's hard to find a categorisation that carves reality at the joints. As such, lots of work will fit into multiple parts of my categorisation.
Within this framework, I find the addressing threat models and agendas to build safe AGI sections the most interesting and think they contain the most diversity of views, so I expand these into several specific models and agendas.
Addressing threat models
There are a range of different concrete threat models. Within this section, I focus on three threat models that I consider most prominent, and which most current research addresses.
Note that this decomposition is entirely my personal take, and one I find useful for understanding existing research. For an alternate perspective and decomposition, see this recent survey of AI researcher threat models. They asked about five threat models (only half of which I cover here), and found that while opinions were often polarised, on average, the five models were rated as equally plausible.
Agendas to build safe AGI
There are a range of agendas proposed for how we might build safe AGI, though note that each agenda is far from a complete and concrete plan. I think of them more as a series of confusions to explore and assumptions to test, with the eventual goal of making a concrete plan. I focus on three agendas here; these are just the three I know the most about, have seen the most work on, and, in my subjective judgement, the ones most worth newcomers to the field learning about. This is not intended to be comprehensive; see eg Evan Hubinger’s Overview of 11 proposals for building safe advanced AI for more.
Robustly good approaches
Rather than the careful sequence of logical thought underlying the two categories above, robustly good approaches are backed more by a deep and robust-feeling intuition. They are the cluster thinking to the earlier motivations' sequence thinking. This means that the motivations tend to be less rigorous and harder to analyse clearly, but also less vulnerable to a single weak point being identified in a crucial underlying belief. Instead, there are lots of rough arguments all pointing in the direction of the area being useful. Often multiple researchers may agree on how to push forwards on these approaches while having wildly different motivations. I focus on the three key areas of interpretability, robustness and forecasting.
Note that 'robustly good' does not mean 'there is no way this agenda is unhelpful'; it's just a rough heuristic that there are lots of arguments for the approach being net good. It's entirely possible that the downsides in fact outweigh the upsides.
(Conflict of interest: Note that I recently started work on interpretability under Chris Olah, and many of the researchers behind scaling laws are now at Anthropic. I formed the views in this section before I started work there, and they entirely represent my personal opinions, not those of my employer or colleagues.)
Crucial considerations
The point of this post is to help you gain traction on what different Alignment researchers are doing and what they believe. Beyond focusing on research motivations, another valuable way I've found to gain insight is to focus on crucial considerations: underlying beliefs about AI that often generate the high-level differences in motivations and agendas. So in the sixth and final section I focus on these. There are many possible crucial considerations, but I discuss four that seem to be the biggest generators of action-relevant disagreement:
---
NOTE: This was intended to be a full sequence, but will likely be eternally incomplete - you can read a draft of the full sequence here.