I think the DSA framing is in keeping with the spirit of "first critical try" discourse.
(With that in mind, the below is more "this too seems very important", rather than "omitting this is an error".)
However, I think it's important to consider scenarios where humans lose meaningful control without any AI or group of AIs necessarily gaining a DSA. I think "loss of control" is the threat to think about, not "AI(s) take(s) control". Admittedly this gets into Moloch-related grey areas - but this may indicate that [humans do/don't have control] is too coarse-grained a framing.
I'd say that the key properties of "first critical try" are:
My guess is that the most likely near-term failure mode doesn't start out as [some set of AIs gets a DSA], but rather [AI capability increase selects against meaningful human control] - and the DSA stuff is downstream of that.
This is a possibility with the [individually controllable powerful AI assistants] approach - whether or not this immediately takes things to transformational AI territory. Suppose we get the hoped-for >10x research speedup. Do we have a principled strategy for controlling the collective system this produces? I haven't heard one. I wouldn't say we're doing a good job of controlling the current collective system.
I've heard cases for [this will speed things up], and [here are some good things this would make easier] but not for [overall, such a process should be expected to take things in a less doomy direction].
For such cases "you can’t learn enough from analogous but lower-stakes contexts" ought not to apply. However, I'd certainly expect "we won’t learn enough from analogous but lower-stakes contexts" (without huge efforts to avoid this).
Does any specific human or group of humans currently have "control" in the sense of "that which is lost in a loss-of-control scenario"? If not, that indicates to me that it may be useful to frame the risk as "failure to gain control".
It may be better to think about it that way, yes - in some cases, at least.
Probably it makes sense to throw in some more variables.
Something like:
In these terms, [loss of control] is something like [ensuring important properties becomes much more expensive (or impossible)].
I think the most important part of your "To stand x chance of property p applying to system s, we'd need to apply resources r" model is the word "we".
Currently, there exists no "we" in the world that can ensure that nobody in the world does some form of research, or at least no "we" that can do that in a non-cataclysmic way. The International Atomic Energy Agency comes the closest of any group I'm aware of, but the scope is limited and also it does its thing mainly by controlling access to specific physical resources rather than by trying to prevent a bunch of people from doing a thing with resources they already possess.
If "gain a DSA (or cause some trusted other group to gain a DSA) over everyone who could plausibly gain a DSA in the future" is a required part of your threat mitigation strategy, I am not optimistic about the chances for success but I'm even less optimistic about the chances of that working if you don't realize that's the game you're trying to play.
I don't think [gain a DSA] is the central path here.
It's much closer to [persuade some broad group that already has a lot of power collectively].
I.e. the likely mechanism is not: [add the property [has DSA] to [group that will do the right thing]].
But closer to: [add the property [will do the right thing] to [group that has DSA]].
3. At some point, some set of AI agents will be such that:
- they will all be able to coordinate with each other to try to kill all humans and take over the world; and
- if they choose to do this, their takeover attempt will succeed.[13]
There are way too many assumptions about what "AI" is baked into this. Suppose you went back 50 years and told people "in the year 2024, everyone will have an AI agent built into their phone that they rely on for critical-to-life tasks they do (such as finding directions to the grocery store)."
The 1970s observer would probably say something like "that sounds like a dangerous AI system that could easily take control of the world". But in fact, no one worries about Siri "coordinating" to suddenly give us all wrong directions to the grocery store, because that's not remotely how assistants work.
Trying to reason about what future AI agents will look like is basically equally fraught.
Second: for any failure you don't want to ever happen, you always need to avoid that failure on the first try (and the second, the third, etc).
I think this is the crux of my concern. Obviously if AI kills us all, there will be some moment when that was inevitable, but merely stating that fact doesn't add any additional information. I think any attempt to predict what AI agents will do from "pure reasoning" as opposed to careful empirical study of the capabilities of existing AI models is basically doomed to failure.
in fact, no one worries about Siri "coordinating" to suddenly give us all wrong directions to the grocery store, because that's not remotely how assistants work.
Note that Siri is not capable of threatening types of coordination. But I do think that by the time we actually face a situation where AIs are capable of coordinating to successfully disempower humanity, we may well indeed know enough about "how they work" that we aren't worried about it.
The degree to which 6-short is additionally worrying (once we’ve taken into account (1) and (3)) depends on the probability that the relevant agents will all choose to seek power in problematic ways within the relevant short period of time, without coordinating. If the “short period” is “the exact same moment,” the relevant sort of correlation seems unlikely.
Is this really true? It seems likely that some external event (which could be practically anything) plausibly could alert a sufficient subset of agents to all start trying to seek power as soon as they notice that event, and not before.
People sometimes say that AI alignment is scary partly (or perhaps: centrally) because you have to get it right on the “first critical try,” and can’t learn from failures.[1] What does this mean? Is it true? Does there need to be a “first critical try” in the relevant sense? I’ve sometimes felt confused about this, so I wrote up a few thoughts to clarify.
I start with a few miscellaneous conceptual points. I then focus in on a notion of “first critical try” tied to the first point (if there is one) when AIs get a “decisive strategic advantage” (DSA) over humanity – that is, roughly, the ability to kill/disempower all humans if they try.[2] I further distinguish between four different types of DSA: unilateral, coordination, short-term correlation, and long-term correlation.
I also offer some takes on our prospects for just not ever having “first critical tries” from each type of DSA (via routes other than just not building superhuman AI systems at all). In some cases, just not having a “first critical try” in the relevant sense seems to me both plausible and worth working towards. In particular, I think we should try to make it the case that no single AI system is ever in a position to kill all humans and take over the world. In other cases, I think avoiding “first critical tries,” while still deploying superhuman AI agents throughout the economy, is more difficult (though the difficulty of avoiding failure is another story).
Here’s a chart summarizing my takes in more detail.
Some conceptual points
The notion of “needing to get things right on the first critical try” can be a bit slippery in its meaning and scope. For example: does it apply uniquely to AI risk, or is it a much more common problem? Let's start with a few points of conceptual clarification:
Unilateral DSAs
OK, with those conceptual clarifications out of the way, let’s ask more directly: in what sense, if any, will there be a “first critical try” with respect to AI alignment?
I think the most standard version of the thought goes roughly like this:[9]
So the first point where (1) is true, here, is the “first critical try.” And (2), roughly, is the alignment problem. That is, if (1) is true, then whether or not this AI kills everyone depends on how it makes choices, rather than on what it can choose to do. And alignment is about getting the “how the AI makes choices” part sufficiently right.
I think that focusing on the notion of a decisive strategic advantage usefully zeroes in on the first point where we start banking on AI motivations, in particular, for avoiding doom – rather than, e.g., AIs not being able to cause doom if they tried. So I’ll generally follow that model here.
If (1) is true, then I think it is indeed appropriate to say that there will be a “first critical try” that we need to get right in some sense (though note that we haven’t yet said anything about how hard this will be; and it could be that the default path is objectively safe, even if subjectively risky). What’s more: we won’t necessarily know when this “first critical try” is occurring. And even if we get the first one right, there might be others to follow. For example: you might then build an even more powerful AI, which also has (or can get) a decisive strategic advantage.
Is (1) true? I won’t dive in deep here. But I think it’s not obvious, and that we should try to make it false. That is: I think we should try to make it the case that no AI system is ever in a position to kill everyone and take over the world.[11]
How? Well, roughly speaking, by trying to make sure that “the world” stays sufficiently empowered relative to any AI agent that might try to take it over. Of course, if single AI agents can gain sufficiently large amounts of relative power sufficiently fast (including: by copying themselves, modifying/improving themselves, etc), or if we should expect some such agent to start out sufficiently “ahead,” this could be challenging. Indeed, this is a core reason why certain types of “intelligence explosions” are so scary. But in principle, at least, you can imagine AI “take-offs” in which power (including: AI-driven power) remains sufficiently distributed, and defensive technology sufficiently robust and continually-improved, that no single AI agent would ever succeed in “taking over the world” if it tried. And we can work to make things more like that.[12]
Coordination DSAs
I think that in practice, a lot of the “first critical try” discourse comes down to (1) – i.e., the idea that some AI agent will at some point be in a position to kill everyone and take over the world. However, suppose that we don’t assume this. Is there still a sense in which there will be a “first critical try” on alignment?
Consider the following variant of the reasoning above:
Let’s say that an AI has a “unilateral DSA” if it’s in a position to take over without the cooperation of any other AI agents. Various AI doom stories feature systems with this sort of DSA,[14] and it's the central reading I have in mind for (1) above. But the sort of DSA at stake in (3) is broader, and includes cases where AI systems need to coordinate in order for takeover to succeed. Let’s call the sort of DSA at stake in (3) a “coordination DSA.”
Coordination DSAs, on the part of AI agents, are harder to avoid than unilateral DSAs. In particular: in a world with many different superintelligent AI agents – and especially, in worlds where such agents have been broadly integrated into crucial economic and military functions – it seems plausible that an increasing share of power will in some sense run “via” such agents. For example:
So even if no single AI agent ever gets a decisive strategic advantage, the power held by superintelligent AI agents collectively can easily grow to dominate the power that would oppose them if they all coordinated. And we might worry, on grounds of their superintelligence, that they will be able to coordinate if they want to.
Indeed, we can try to argue that the only plausible scenarios in which (1) is false – i.e., no superintelligence ever gets a unilateral DSA – are scenarios where (3) is true. In particular, we can argue that:
And we can try to argue, from (5), that at that point, (3) will be true. In particular: if, per 5, you need to rely on Agents B, C, D etc to oppose/constrain Agent A, then the collection of all those agents might well satisfy (3).
If AI capability development and deployment continues unabated, will (5) be true?[15] I think it’s more likely than (1), and likely overall. Still, it’s not totally obvious. For example:
But overall, (5) seems to me worryingly hard to avoid.
Note, though, that even if we grant (5), (3) doesn’t strictly follow. In particular: (3) specifically says that the AIs in question are able to coordinate – that is, that coordination is an option for them. And the fact that Agents B, C, D etc are functioning to oppose/constrain Agent A doesn’t imply this. For example, maybe adequate coordination between all these agents would require suitably unmonitored/opaque channels of interaction/communication, and they don’t have access to such channels.
So one option, for preventing the existence of a set of AI systems with a coordination DSA, is to try to prevent AI systems from being in a position to coordinate. Indeed, I generally think research into the dynamics of AI coordination is a neglected area, and that preventing coordination in only-somewhat-superhuman AIs may be an important line of defense.[17] For highly superintelligent agents, though – especially ones that are operating and interacting in contexts that humans can’t understand – it seems difficult.
So overall, if AI development and deployment continues unabated, it seems likely to me that some set of AI agents will eventually have a coordination DSA in the sense at stake in (3). And so we can view the first such point as a different type of “first critical try.”
Of course, as with unilateral DSAs, there’s still a question of how hard it will be, by the time (3) is true, to be confident that the relevant AIs won’t try to coordinate to kill all humans and take over the world, even though they could. I won’t try to assess this here.
Correlation DSAs
So far, I’ve talked about scenarios where a single AI takes over the world, and scenarios where a set of AIs all coordinate to take over the world. But these don’t exhaust the scenarios in which misaligned AI systems disempower all humans. In particular: humans could end up disempowered via the collective activity of many different AI systems seeking power in an uncoordinated way.
Is there a “first critical try” for this sort of scenario? One can imagine a line of reasoning like:
In a sense, (6) is its own type of “first critical try.” But setting aside unilateral DSAs and coordination DSAs, is the point at stake in (6) additionally worrying? Not necessarily. In particular, it depends on how much correlation between the AI agents’ choices to power-seek is necessary for doom to occur, over what timescales, without humans being able to correct the situation.
So to better tease out possible worries, here, perhaps we should distinguish between two versions of (6):
I’ll call the sort of DSA at stake in 6-short a “short-term correlation DSA,” and the sort at stake in 6-long a “long-term correlation DSA.”[19]
The degree to which 6-short is additionally worrying (once we’ve taken into account (1) and (3)) depends on the probability that the relevant agents will all choose to seek power in problematic ways within the relevant short period of time, without coordinating. If the “short period” is “the exact same moment,” the relevant sort of correlation seems unlikely. But as it gets chunkier, one should get more worried.
In particular, there are sources of correlation here other than AI agents actively “working together.” For example:
So I do think that 6-short creates its own kind of additional concern, on top of (3). That is, if we get to the point where we couldn’t recover from some set of AI systems all choosing to seek power in a fairly correlated way over a fairly short time-period, this does seem like a notable type of vulnerability, even if we are confident that the AI systems in question aren’t coordinating to engage in this type of behavior. And for reasons similar to those at stake with (3) and (5), I do think that 6-short is plausibly hard to avoid hitting eventually. And indeed, it’s actively harder to avoid than (3), because (3) can be avoided by cutting off the possibility of coordination between AI agents; whereas this doesn’t suffice to avoid 6-short.
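As a rough illustration of why the width of the “short period” matters, here is a toy sketch. It is not a model from the post: it assumes, purely for illustration, that each of n agents would start power-seeking at a time drawn independently and uniformly from some horizon, and asks how likely it is that all of those start times land within a window of a given length. The function names, the uniform-timing assumption, and the `shared_trigger_prob` parameter (a crude stand-in for a common external event that several agents react to) are all hypothetical.

```python
import random


def prob_all_within_window(n_agents: int, window: float, horizon: float) -> float:
    """Chance that n independent, uniformly timed starts all fall within `window`.

    Toy assumption: start times are i.i.d. uniform on [0, horizon]. The spread
    (max minus min) of n such draws is at most r * horizon with probability
    n * r**(n - 1) - (n - 1) * r**n, where r = window / horizon.
    """
    if n_agents < 1 or horizon <= 0 or window < 0:
        raise ValueError("need n_agents >= 1, horizon > 0, window >= 0")
    r = min(window / horizon, 1.0)
    return n_agents * r ** (n_agents - 1) - (n_agents - 1) * r ** n_agents


def simulate(n_agents: int, window: float, horizon: float,
             shared_trigger_prob: float = 0.0,
             trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same chance, with an optional shared trigger.

    With probability `shared_trigger_prob`, an agent starts at the time of a
    common external event instead of at its own independent time. This is a
    hypothetical way to represent correlation without coordination.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        trigger = rng.uniform(0, horizon)
        times = [
            trigger if rng.random() < shared_trigger_prob else rng.uniform(0, horizon)
            for _ in range(n_agents)
        ]
        if max(times) - min(times) <= window:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for window in (0.01, 0.1, 0.5):
        independent = prob_all_within_window(5, window, 1.0)
        correlated = simulate(5, window, 1.0, shared_trigger_prob=0.8)
        print(f"window={window}: independent={independent:.5f}, "
              f"with shared trigger={correlated:.5f}")
```

Under these assumptions, the independent case is very unlikely for narrow windows and becomes more likely as the window widens, while a shared trigger raises the probability even though no agent is coordinating with any other. The numbers carry no empirical weight; the sketch only makes the shape of the argument concrete.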
What about 6-long? 6-long, if true, is more worrying than 6-short, because it provides a longer time period for a correlated alignment failure to occur, thereby allowing looser forms of correlated power-seeking to cause doom. But 6-long is also easier to make false than 6-short. In particular: the longer time window allows for more time to notice and correct any given instance of power-seeking. Thus, for example, if the actions of Agent A and Agent B take place six months apart, in the example above, vs. a few days, this gives the humans more time to deal with the Agent A situation, and to have recovered full control, by the time the Agent B situation gets going.
A few final thoughts
Ok, those were four different types of “first critical tries,” corresponding to four different types of DSAs, plus a few takes on each. I’ll close with a few other notes:
I work at Open Philanthropy but I’m here speaking only for myself and not for my employer.
See e.g. Yudkowsky’s 3 here:
And see also Soares here.
This reflects how the term is already used by Yudkowsky and Soares.
I haven't pinned this down in detail, but roughly, I tend to think of a set of AI instances as a "single agent" if they are (a) working towards the same impartially-specified consequences in the world and (b) if they are part of the same "lineage"/causal history. So this would include copies of the same weights (with similar impartial goals), updates to those weights that preserve those goals, and new agents trained by old agents to have the same goals. But it wouldn't include AIs trained by different AI labs that happen to have similar goals; or different copies of an AI where the fact that they're different copies puts their goals at cross-purposes (e.g., they each care about what happens to their specific instance).
As an analogy: if you're selfish, then your clones aren't "you" on this story. But if you're altruistic, they are. But even if you and your friend Bob both have the same altruistic values, you're still different people.
That said, the discussion in the post will generally apply to many different ways of individuating AI agents.
Obviously AI risk is vastly higher stakes. But I'm here making the conceptual point that needing to get the first try (and all the other tries) right comes definitionally from having to avoid ever failing.
See Christiano here. Yudkowsky also acknowledges this.
See, for example, the discourse about “warning shots,” and about catching AIs red-handed.
See e.g. Karnofsky here, Soares here, and Yudkowsky here. The reason I’m most worried about is “scheming.”
Sixth: “Needing to get things right” can imply that if you don’t do the relevant “try” in some particular way (e.g., with the right level of technical competence), then doom will ensue. But even in contexts where you have significant subjective uncertainty about whether the relevant “try” will cause doom, you don’t necessarily need to “get things right” in the sense of “execute with a specific level of competence” in order to avoid doom. In particular: your uncertainty may be coming from uncertainty about some underlying objective parameter your execution doesn’t influence.
Thus: suppose that the evidence were more ambiguous about whether your volcano science experiment was going to cause doom, so you assign it a 10% subjective probability. This doesn’t mean that you have to do the experiment in a particular way – e.g., “get the experiment right” – otherwise doom will ensue. Rather, the objective facts might just be that any way of proceeding is safe; even if subjectively, some/all ways are unacceptably risky.
I think some AI alignment “tries” might be like this. Thus, suppose that you’re faced with a decision about whether to deploy an AI system that seems aligned, and you’re unsure whether or not it’s “scheming” – i.e., faking alignment in order to get power later. It’s not necessarily the case that at that point, you need to have “figured out how to eliminate scheming,” else doom. Rather, it could be that scheming just doesn’t show up by default – for example, because SGD’s inductive biases don’t favor it.
That said, of course, proceeding with a “try” that involves a significant subjective risk of doom is itself extremely scary. And insofar as you are banking on some assumption X holding in order to avoid doom, you do need to “get things right” with respect to whether or not assumption X is true.
Here I’m mostly thinking of Yudkowsky’s usage, which focuses on the first point where an AI is “operating at a ‘dangerous’ level of intelligence, where unaligned operation at a dangerous level of intelligence kills everybody on Earth and then we don't get to try again.” The usage in Soares here is similar, but the notion of “most theories don’t work on the first real try” could also apply more broadly, to scenarios where you’re using your scientific theory to assess an AI’s capabilities in addition to its alignment.
Really, whether or not an agent “can” do something like take over the world isn’t a binary, at least from that agent’s subjective perspective. Rather, a given attempt will succeed with a given probability. I’m skipping over this for now, but in practice, the likelihood of success, for a given AI system, is indeed relevant to whether attempting a takeover is worth it. And it means that there might not be a specific point at which some AI system “gets a DSA.” Rather, there might be a succession of AI systems, each increasingly likely to succeed at takeover if they went for it.
I also think we should do this with human agents – but I’ll focus on AI agents here.
We can also try to avoid building “agents” of the relevant kind at all, and focus on getting the benefits of AI in other ways. But for the reasons I describe in section 3 here, I do expect humans to build lots of AI agents, so I won’t focus on this.
We can think of (1) as a special instance of (3) – e.g., a case where the set in question has only a single agent.
See e.g. here.
As ever, you could just not build superintelligent AI agents like agent A at all, and try to get most of the benefits of AI some other way.
I’m counting high-fidelity human brain emulations as “human” for present purposes.
I wrote a bit more about this here.
There’s a case for expecting sufficiently superintelligent agents to succeed in coordinating to avoid zero-sum forms of conflict like actual war; but this doesn’t mean that the relevant agents, in this sort of scenario, will be smart enough and in a position to do this.
This is stretching the notion of a “DSA” somewhat, because the uncoordinated AIs in question won’t necessarily be executing a coherent “strategy,” but so it goes.
See related discussion from Christiano here:
Another example might be: a version of the Trinity Test where Bethe was more uncertain about his calculations re: igniting the atmosphere.
I haven't pinned this down in detail, but roughly, I tend to think of it as a single AI if it's working towards the same impartially-specified consequences in the world and if it has a unified causal history. So this would include copies of the same weights (with similar impartial goals), updates to those weights that preserve those goals, and new agents trained by old agents to have the same goals. But it wouldn't include AIs trained by different AI labs that happen to have similar goals; or different copies of an AI where the fact that they're different copies puts their goals at cross-purposes (e.g., they each care about what happens to their specific instance).
Though standard discussions of DSAs don't t