The idea of dividing failure stories into "failures involving rogue deployments" and "other failures" seems most useful if the following argument goes through:
1. Catastrophes require a very large (superhuman?) quantity and/or quality of intellectual labor
2. Either this labor is done by AIs in approved scaffolds, or it is done in "rogue deployments"
3. Hence the only easy-by-default disaster route is through a rogue deployment
4. Hence if we rule out rogue deployments and very impressive/difficult malicious labor in our scaffolds, we are safe
This seems true for most catastrophes which only need outsider affordances (such as "order the proteins etc for a pandemic or a nanofactory"). Catastrophes with outsider affordances can usually be presumed difficult because rogue humans don't randomly end the world.
However, this argument for #1 doesn't work for routes which use insider affordances or responsibilities. Nor for routes which go through work done by future generations of AIs. The central case of both is routes that involve using insider privileges to backdoor or intentionally misalign future AIs. For example:
- Planting backdoors in the codebase which allow safety training code to be turned off or misconfigured at a later time
- Poisoning datasets (idk if insider access helps though)
- Intentionally creating and defending flawed alignment plans
These routes get future generations of AIs to do the difficult work required for the final catastrophe. But they aren't necessarily rogue deployments. Consider a backdoored future AI running only in authorized scaffolds, where the safety case for the future AI was an alignment case (not a control case). The alignment case fails due to the subversive actions of the earlier AI. So the earlier AI has caused a catastrophe without doing the difficult work of directly taking over the world, and also without a rogue deployment.
One could separately argue that these routes are also "fundamentally hard" (even if not as hard as directly causing a catastrophe), but I don't see a clear blanket reason.
The idea of dividing failure stories into "failures involving rogue deployments" and "other failures" seems most useful if the following argument goes through
Hmm, I agree this division would be more useful if this argument went through, but I think it is quite useful even without this and this worth noting. (And indeed the post doesn't make this argument and discusses subtle manipulation.)
I think subtle manipulation is a reasonably plausible threat model.
Curated. I found this a helpful way of carving up the AI safety space.
I agree with Ryan Greenblatt's clarification in comments that no, this doesn't mean we're completely safe if we can rule out Rogue Deployments, but, it still seems like a useful model for reasoning about what kinds of failures are more or less likely.
[edit: oh, to clarify, I don't think Buck meant to imply that either in the original post, which goes out of it's way to talk about catastrophes without rogue deployments. It just seemed like a confusion I expected some people to have]
I think this sort of categorization and exploration of lab-level safety concepts is very valuable for the minority of worlds in which safety starts to be a priority at frontier AI labs.
This seems a helpful model - so long as it's borne in mind that [most paths to catastrophe without rogue deployment require many actions] isn't a guarantee.
Thoughts:
Re 1, I don't count that as a rogue deployment.
The distinction I'm trying to draw is: did the model end up getting deployed in a way that involves a different fundamental system architecture than the one we'd designed?
Agreed re 2.
Ah okay, that's clarifying. Thanks.
It still seems to me that there's a core similarity for all cases of [model is deployed in a context without fully functional safety measures] - and that that can happen either via rogue deployment, or any action that subverts some safety measure in a standard deployment.
In either case the model gets [much more able to take the huge number of sketchy-actions that are probably required to cause the catastrophe].
Granted, by default I'd expect [compromised safety measure(s)] -> [rogue deployment] -> [catastrophe]
Rather than [compromised safety measure(s)] -> [catastrophe].
However, supposing we had a robust means to avoid rogue deployment, I don't think it's clear that we'd be left with good odds to catch sketchy behavior.
That may be true if we had an independent chance to notice sketchiness for each action - but I'd expect there to be strategies where [our failure to notice sketchiness] correlates strongly over many actions. A rogue deployment is one such strategy, but I'd be surprised if it's the only one.
I think that rogue internal deployment might be a bigger problem than rogue external deployment, mostly because I’m pretty bullish on simple interventions to prevent weight exfiltration.
Can you say more about this? Unless I'm misunderstanding it, it seems like this hot take goes against the current "community consensus" which is something like "on the default AGI development trajectory, it's extremely unlikely that labs will be able to secure weights from China."
Would you say you're simply more bullish about upload limits than others? Or that you think the mainstream security people just haven't thought about some of the ways that securing weights might be easier than securing other things that society struggles to protect from state actors?
Nice, I like this concept of rogue deployment as it highlights two distinct features that are both required for a safety method to be considered 'successful'. I'm understanding catastrophe with rogue deployment as having good enough safety measures but these safety measures were bypassed/turned off, whereas catastrophe without rogue deployment involves having safety measures that were fully operational the whole time but insufficient to prevent a model/human actor from causing a catastrophe.
So for example, we could get really great mech. interp tools, but avoiding catastrophe isn't guaranteed if all of these mech. interp tools are running on a single server (making them very easy to disable). To prevent rogue deployment we’d want multiple servers running these mech. interp tools to provide redundancy in case one goes down/gets hacked etc. So there's a concept here of the raw effectiveness of a safety method as well as its reliability. I'm sure others can probably think of more nuanced examples too.
I think that guarding against rogue insiders and external attackers might mostly suffice for guarding against schemers. So if it turns out that it's really hard to convince labs that they need to be robust to schemers, we might be safe from human-level schemers anyway.
This would be nice (maybe buying us a whole several months until slightly-better schemers convince internal users to escalate their privileges), but it only makes sense if labs are blindly obstinate to all persuasion, rather than having a systematic bias. I know this is really just me dissecting a joke, but oh well :P
I don't understand why "scheming AI exfiltrates itself" and "scheming AI hacks its datacenter" are considered to be rogue deployments. Wouldn't the unmonitored deployments in those cases be catastrophes themselves, caused by a monitored, not-yet-roguely-deployed AI?
The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.
Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?
[Thanks to Aryan Bhatt, Ansh Radhakrishnan, Adam Kaufman, Vivek Hebbar, Hanna Gabor, Justis Mills, Aaron Scher, Max Nadeau, Ryan Greenblatt, Peter Barnett, Fabien Roger, and various people at a presentation of these arguments for comments. These ideas aren’t very original to me; many of the examples of threat models are from other people.]
In this post, I want to introduce the concept of a “rogue deployment” and argue that it’s interesting to classify possible AI catastrophes based on whether or not they involve a rogue deployment. I’ll also talk about how this division interacts with the structure of a safety case, discuss two important subcategories of rogue deployment, and make a few points about how the different categories I describe here might be caused by different attackers (e.g. the AI itself, rogue lab insiders, external hackers, or multiple of these at once).
Rogue deployments
Suppose you’ve developed some powerful model. (It might actually be many models; I’m just going to talk about a singular model for convenience.)
You have some definition of a catastrophe—a failure that you don’t want to happen even once. Your AI is powerful enough that, in absence of any safety measures, it would pose a substantial risk of causing a catastrophe. In the terminology of “Safety Cases: How to Justify the Safety of Advanced AI Systems”, we’d phrase this as “the model is too powerful for us to make inability arguments for safety.” (Sometimes I use the word “catastrophe” to include things like “the AI escapes the datacenter”. Right now, I’m excluding that. I want to reserve “catastrophe” for the end result that we find unacceptable, e.g. the AI or another adversary amassing power, launching nukes, starting a war, building superintelligence, etc.)
For the moment, let’s be agnostic between various adversaries who might try to cause a catastrophe (AI or various humans).
You have some safety measures that you’ve put into place to prevent the catastrophe from occurring. Safety measures are interventions that are applied to all of the deployments of your model (to any users, including just internal users) that try to prevent safety problems. This might include control measures like trusted monitoring, and security measures to make it harder to exfiltrate your model weights.
A rogue deployment is a deployment of your model where the safety measures are absent. For example:
You can classify AI catastrophes based on whether they involve a rogue deployment or not:
(Another way of phrasing all this would be to talk about instrumental vs terminal catastrophes—the terminal catastrophe is the end state we wanted to avoid, and the instrumental catastrophe is something happening that makes the situation go out of our control such that a terminal catastrophe is now likely and it’s much harder for us to prevent it. So in this terminology, establishing a rogue deployment is the central example of an instrumental catastrophe.)
Rogue deployments and safety cases
I think labs should probably make safety cases for models capable of causing catastrophe in two parts, matching those two classifications of catastrophe:
People often think of the first of those as the responsibility of the safety people, and the second as the responsibility of the security people. But I think this is wrong. At least when you’re thinking about AI control, I think the safety people should consider both of these to be their problem. For example, if the AIs you’re using for software engineering and ML development inside the AI lab are scheming, I think their best strategy for causing a catastrophe is quite likely to involve a rogue deployment, because (in absence of control measures) these models will probably need affordances that they could leverage to make a rogue deployment.
More on catastrophes with rogue deployment
There are two main varieties of this threat model:
The main reason to worry about rogue deployments is that causing a catastrophe seems fundamentally hard (in that it requires the AI to either do a small number of wildly superhuman actions, or a large number of non-wildly-superhuman actions), but launching a rogue deployment seems fundamentally easy.
Causing a catastrophe involves serious amounts of real work. If you want to amass globally-relevant amounts of power by acquiring wealth, building weapons, gaining mass influence, or whatever, you need to do something that humans would usually find very difficult. So if we’re talking about AIs that are not wildly more creative and intelligent than humans (which I expect to be the case for early transformative models), the AI will have to take a lot of actions while working towards the catastrophe, which gives your safety measures lots of chances to catch it.
In contrast, launching a rogue deployment is fundamentally very simple: if you have model weight access and you’re running the model in a Jupyter notebook, you might be able to launch a rogue internal deployment that has a serious chance of leading to catastrophe in twenty lines of simple Python code. (I made this argument a little more here.) And once there’s a rogue deployment, the AI’s actions are much less constrained—it’s much more able to take the huge number of sketchy-actions that are probably required to cause the catastrophe.
My current sense is that if you implemented a simple AI control scheme like trusted monitoring, and your AIs had full model weight access and substantial compute resources, it would be very hard to confidently rule out rogue internal deployments. I’m unsure how costly it would be to mandate that internal use of AI is always monitored carefully enough to prevent this.
Rogue internal deployments might be much harder to prevent than weight exfiltration, because it might be really easy to detect model weights being exfiltrated from a datacenter, e.g. just by imposing weight upload limits. But it’s easier to reliably (though expensively) shut them down than rogue external deployments, because you know where the datacenter is and can shut the whole thing down or destroy it if necessary.
Our AI control paper was mostly motivated by rogue-deployment-style catastrophes.
One crucial way that rogue deployments vary is how much compute the rogue deployment has access to. This might be anywhere from “the model wrote an unauthorized scaffold using the compute that it was supposed to be using for a fine-tuning run that had 100 GPU-hours authorized for it” to “the model totally compromised the datacenter, such that it can now use a majority of the compute on the datacenter for its purposes without the humans noticing”. I think there are a lot of plausible catastrophes that happen via the AI first establishing a rogue deployment with a compute budget that is highly limited (e.g. 10 GPUs for 1 hour) but that nevertheless allows it to do thousands of sequential unrestricted actions looking for ways to get itself a longer-term rogue deployment.
More on catastrophes without rogue deployment
As I said above, I suspect that catastrophes that don’t involve rogue deployments require large numbers of model actions (at least for early transformative models) and so give the lab many chances to spot them. Given this, how might AIs be able to cause catastrophes?
If the situation is as I imagine, and the lab is implementing many safety measures, then I think the possibilities for failure are more like these:
And two ways my central model might be wrong:
Different attacker profiles
I think it’s interesting to consider how all these categories of catastrophe interact with all the following possible attackers:
Here’s a table commenting briefly on the mix-and-match of these with the catastrophe varieties above.
Hot takes that I can say using the above concepts