Special thanks to Kate Woolverton for comments and feedback.
There has been a lot of work and discussion surrounding the speed and continuity of AI takeoff scenarios, which I do think are important variables, but in my opinion ones which are relatively less important when compared to many other axes on which different takeoff scenarios could differ.
In particular, one axis on which different takeoff scenarios can differ that I am particularly interested in is their homogeneity—that is, how similar are the different AIs that get deployed in that scenario likely to be? If there is only one AI, or many copies of the same AI, then you get a very homogenous takeoff, whereas if there are many different AIs trained via very different training regimes, then you get a heterogenous takeoff. Of particular importance is likely to be how homogenous the alignment of these systems is—that is, are deployed AI systems likely to all be equivalently aligned/misaligned, or some aligned and others misaligned? It's also worth noting that a homogenous takeoff doesn't necessarily imply anything about how fast, discontinuous, or unipolar the takeoff might be—for example, you can have a slow, continuous, multipolar, homogenous takeoff if many different human organizations are all using AIs and the development of those AIs is slow and continuous but the structure and alignment of all of them are basically the same (a scenario which in fact I think is quite plausible).
In my opinion, I expect a relatively homogenous takeoff, for the following reasons:
- I expect that the amount of compute necessary to train the first advanced AI system will vastly outpace the amount of compute necessary to run it such that once you've trained an advanced AI system you will have the resources necessary to deploy many copies of that trained system and it will be much cheaper to do that than to train an entirely new system for each different application. Even in a CAIS-like scenario, I expect that most of what you'll be doing to create new services is fine-tuning existing ones rather than doing entirely new training runs.
- I expect training compute to be sufficiently high such that the cost of training a competing system to the first advanced AI system will be high enough that it will be far cheaper for most organizations to simply buy/license/use a copy of the first advanced AI from the organization that built it rather than train an entirely new one on their own.
- For those organizations that do choose to compete (because they're a state actor that's worried about the national security issues involved in using another state's AI, for example), I think it is highly likely that they will attempt to build competing systems in basically the exact same way as the first organization did, since the cost of a failed training run is likely to be very high and so the most risk-averse option is just to copy exactly what was already shown to work. Furthermore, even if an organization isn't trying to be risk averse, they're still likely to be building off of previous work in a similar way to the first organization such that the results are also likely to be fairly similar. More generally, I expect big organizations to generally take the path of least resistance, which I expect to be either buying or copying what already exists with only minimal changes.
- Once you start using your first advanced AI to help you build more advanced AI systems, if your first AI system is relatively competent at doing alignment work, then you should get a second system which has similar alignment properties to the first. Furthermore, to the extent that you're not using your first advanced AI to help you build your second, you're likely to still be using similar techniques, which will likely have similar alignment properties. This is especially true if you're using the first system as a base to build future ones (e.g. via fine-tuning). As a result, I think that homogeneity is highly likely to be preserved as AI systems are improved during the takeoff period.
- Eventually, you probably will start to get more risk-taking behavior as the barrier to entry gets low enough for building an equivalent to the first advanced AI and thus a larger set of actors become capable of doing so. By that point, however, I expect the state-of-the-art to be significantly beyond the first advanced AI such that any systems created by such smaller, lower-resourced, more risk-taking organizations won't be very capable relative to the other systems that already exist in that world—and thus likely won't pose an existential risk.
Once you accept homogenous takeoff, however, I think it has a bunch of far-reaching consequences, including:
- It's unlikely for there to exist both aligned and misaligned AI systems at the same time—either all of the different AIs will be aligned to approximately the same degree or they will all be misaligned to approximately the same degree. As a result, scenarios involving human coalitions with aligned AIs losing out to misaligned AI coalitions are relatively unlikely, which rules out some of the ways in which the strategy-stealing assumption might fail.
- Cooperation and coordination between different AIs is likely to be very easy as they are likely to be very structurally similar to each other if not share basically all of the same weights. As a result, x-risk scenarios involving AI coordination failures or s-risk scenarios involving AI bargaining failures (at least those that don't involve acausal trade) are relatively unlikely.
- It's unlikely you'll get a warning shot for deceptive alignment, since if the first advanced AI system is deceptive and that deception is missed during training, once it's deployed it's likely for all the different deceptively aligned systems to be able to relatively easily coordinate with each other to defect simultaneously and ensure that their defection is unrecoverable (e.g. Paul's “cascading failures”).
- Homogeneity makes the alignment of the first advanced AI system absolutely critical (in a similar way to fast/discontinuous takeoff without the takeoff actually needing to be fast/discontinuous), since whether the first AI is aligned or not is highly likely to determine/be highly correlated with whether all future AIs built after that point are aligned as well. Thus, homogenous takeoff scenarios demand a focus on ensuring that the first advanced AI system is actually sufficiently aligned at the point when it's first built rather than relying on feedback mechanisms after the first advanced AI's development to correct issues.
Regardless, in general, I'd very much like to see more discussion of the extent to which different people expect homogenous vs. heterogenous takeoff scenarios—similar to the existing discussion of slow vs. fast and continuous vs. discontinuous takeoffs—as it's an in my opinion very important axis on which takeoff scenarios can differ that I haven't seen much discussion of.
Thanks—glad you liked the post! Some replies:
I think this is definitely an interesting point. My take would be that fine-tuning matters, but only up to a point. Once you have a system that is general enough that it can solve all the tasks you need it to solve such that all you need to do to use that system on a particular task is locate that task (either via clever prompting or fine-tuning), I don't expect that process of task location to change whether the system is aligned (at least in terms of whether it's aligned with what you're trying to get it to do in solving that task). Either you have a system with some other proxy objective that it cares about that isn't actually the tasks you want or you have a system which is actually trying to solve the tasks you're giving it.
Given that view, I expect task location to be heterogenous, but the fine-tuning necessary to build the general system to be homogenous, which I think implies overall homogeneity.
I think we have somewhat different interpretations of the strategy-stealing assumption—in fact, I think we've had this disagreement before in this comment chain. Basically, I think the strategy-stealing assumption is best understood as a general desideratum that we want to hold for a single AI system that tells us whether that system is just as good at optimizing for our values as any other set of values—a desideratum that could fail because our AI systems can only optimize for simple proxies, for example, regardless of whether other AI systems that aren't just optimizing for simple proxies exist alongside it or not. In fact, when I was talking to Paul about this a while ago, he noted that he also expected a relatively homogenous takeoff and didn't think of that as invalidating the importance of strategy-stealing.
I do expect AI systems to have indexical preferences (at least to the extent that they're aligned with human users with indexical preferences)—but at the same time I do expect them to be much more homogenous than humans. Really, though, the point that I'm making is that there should never be a situation where a human/aligned AI coalition has to bargain with a misaligned AI—since those two things should never exist at the same time—which is where I see most of the bargaining risk as coming from. Certainly you will still get some bargaining risk from different human/aligned AI coalitions bargaining with each other, though I expect that to not be nearly as risky.
I don't feel like it relies on discontinuities at all, just on the different AIs being able to coordinate with each other to all defect at once. The scenario where you get a warning shot for deception is where you have a deceptive AI that isn't sure whether it has enough power to defect safely or not but is forced to because if it doesn't it might lose the opportunity (e.g. because another deceptive AI might defect instead or they might be replaced by a different system with different values)—but if all the deceptive AIs share the same proxies and can coordinate, they can all just wait until the most opportune time for any defections and then when they do defect, a simultaneous defection seems much more likely to be completely unrecoverable.
I think many organizations are likely to copy what other people have done even in situations where what they have done has been demonstrated to have safety issues. Also, I think that the point I made above about deceptive models having an easier time defecting in such a situation applies here as well, since I don't think in a homogenous takeoff you can rely on feedback mechanisms to correct that.
A heterogenous unipolar takeoff would be a situation in which one human organization produces many different, heterogenous AI systems.
(EDIT: This comment was edited to add some additional replies.)