The slow takeoff hypothesis predicts that AGI emerges in a world where powerful but non-AGI AI is already a really big deal. Whether AI is a big deal right before the emergence of AGI determines many super basic things about what we should think our current job is. I hadn’t fully appreciated the size of this effect until a few days ago.
In particular, in a fast takeoff world, AI takeover risk never looks much more obvious than it does now, and so we should assume that x-risk-motivated people will do the majority of the alignment research that happens. In contrast, in a slow takeoff world, many aspects of the AI alignment problem will already have shown up as alignment problems in non-AGI, non-x-risk-causing systems; in that world, there will be lots of industrial work on various aspects of the alignment problem, and so EAs now should think of themselves as trying to look ahead and figure out which margins of the alignment problem aren’t going to be taken care of by default, and trying to figure out how to help out there.
In the fast takeoff world, we’re much more like a normal research field: we want some technical problem to eventually get solved, so we try to solve it. But in the slow takeoff world, we’re basically in a weird collaboration across time with the more numerous, non-longtermist AI researchers who will be in charge of aligning their powerful AI systems but who we fear won’t be cautious enough in some ways or won’t plan ahead in some other ways. Doing technical research in the fast takeoff world basically just requires answering technical questions, while in the slow takeoff world your choices about research projects are closely related to your sociological predictions about what things will be obvious to whom, and when.
I think that these two perspectives are extremely different, and I think I’ve historically sometimes had trouble communicating with people who held the slow takeoff perspective because I didn’t realize we disagreed on basic questions about the conceptualization of the question. (These miscommunications persisted even after I was mostly persuaded of slow takeoffs, because I hadn’t realized the extent to which I was implicitly assuming fast takeoffs in my picture of how AGI was going to happen.)
As an example of this, I think I was quite confused about what genre of work various prosaic alignment researchers think they’re doing when they talk about alignment schemes. To quote a recent AF shortform post of mine:
Something I think I’ve been historically wrong about:
A bunch of the prosaic alignment ideas (eg adversarial training, IDA, debate) now feel to me like things that people will obviously do the simple versions of by default. Like, when we’re training systems to answer questions, of course we’ll use our current versions of systems to help us evaluate, why would we not do that? We’ll be used to using these systems to answer questions that we have, and so it will be totally obvious that we should use them to help us evaluate our new system.
Similarly with debate--adversarial setups are pretty obvious and easy.
In this frame, the contributions from Paul and Geoffrey feel more like “they tried to systematically think through the natural limits of the things people will do” than “they thought of an approach that non-alignment-obsessed people would never have thought of or used”.
It’s still not obvious whether people will actually use these techniques to their limits, but it would be surprising if they weren’t used at all.
I think the slightly exaggerated slogan for this update of mine is “IDA is futurism, not a proposal”.
My current favorite example of the thinking-on-the-margin version of alignment research strategy is in this comment by Paul Christiano.
I agree with the basic difference you point to between fast- and slow-takeoff worlds, but disagree that it has important strategic implications for the obviousness of takeover risk.
In slow takeoff worlds, many aspects of the alignment problem show up well before AGI goes critical. However, people will by default train systems to conceal those problems. (This is already happening: RL from human feedback is exactly the sort of strategy which trains systems to conceal problems, and we've seen multiple major orgs embracing it within the past few months.) As a result, AI takeover risk never looks much more obvious than it does now.
Concealed problems look like no problems, so there will in general be economic incentives to train in ways which conceal problems. The most-successful-looking systems, at any given time, will be systems trained in ways which incentivize hidden problems over visible problems.
You can use an LLM to ask what actions to take, or you can use an LLM to ask “hey is this a good world state?” The latter seems like it might capture a lot of human semantics about value, given RLHF.