The labs' safety cases miss this threat.
This seems mostly right, but it's worth noting that Anthropic's current RSP contains an "Early Thoughts on ASL-4" section, which includes language like:
- "Model theft should be prohibitively costly for state-level actors, even with the help of a significant number of employees and the model itself."
- "Safety research: Sufficient progress on the science of interpretability, alignment training, and model evaluations to make an “affirmative case” that our models will not autonomously attempt to strategically undermine our safety measures or cause large-scale catastrophe."
- "red-teaming should confirm that models can't cause harm quickly enough to evade detection."
In principle, these measures could suffice for managing risk from scheming.
(In practice, I expect that making a strong "affirmative case" that models won't attempt to undermine safety measures will be very hard, so something else, more like control, will likely happen instead.)
I mostly agree. Anthropic is kind of an exception (see footnotes 3, 4, 5, 11, 12, and 14). Anthropic's stance on model escape is basically "we'll handle that at ASL-4." That's reasonable, and better than not noticing the threat, but they've said approximately nothing specific about escape or control yet.
I worry that quotes like
"red-teaming should confirm that models can't cause harm quickly enough to evade detection"
are misleading in the context of escape (for all labs, not just Anthropic). When defining ASL-4, Anthropic will need to clarify whether the definition covers scheming/escape and how safety could be "confirm[ed]"; I think the relevant safety properties will need high-effort control evals, not just red-teaming.
This post is poorly organized/edited and the important parts aren't novel; you should go read AI catastrophes and rogue deployments, AI Control, and The case for ensuring that powerful AIs are controlled instead.
But it is unlikely that any lab will be able to make safety cases via alignment for the first AIs with dangerous capabilities.
Hmm, I think the situation is a bit more complex than this makes it sound. In particular, we might be able to weakly update against scheming based on a variety of empirical evidence even without being able to make a strong affirmative case.
As a central example of making a weak argument via alignment, consider doing a bunch of model-organisms-style experiments. Failing to find any compelling demonstrations of scheming in these experiments could plausibly update you against scheming. (Of course, they could also end up supporting a risk case rather than a safety case if you do find compelling evidence of scheming!)
Another intervention like this would be to offer the model a deal in which you compensate it in exchange for it giving you evidence of its own misalignment.
It's also somewhat unclear exactly what should be considered part of a control safety case vs an alignment safety case.
For instance, consider an intervention like:
Perform extensive (plausible-to-the-model) sting operations (i.e., red-teaming focused on scheming). Do this at various points throughout post-training to reduce the risk that the model has learned to be more conservative due to other red-teaming efforts. Also, ideally ensure that all data discussing this sort of sting operation is removed from the initial training data. (You can potentially train the AI on this data at the end, after running the sting operations.)
The point of this intervention is to detect whether your model is scheming. You could think about this from a control-type perspective (a scheming model would have had some chance of failing to evade our sting operations, so seeing nothing gives us some (possibly small) update towards safety) or an alignment perspective (we now have evidence of alignment).
I personally like thinking about this from a control perspective.
And, we need to make sure that we don't double count interventions like this in both control and alignment. (See here for more discussion of double counting and related matters.)
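To make the "(possibly small) update towards safety" concrete, here is a toy Bayesian calculation. Every number below is an assumption invented for illustration, not an estimate from the post or from any lab's documents:

```python
# Toy Bayesian update from a sting-operation campaign.
# All numbers are illustrative assumptions.

prior_scheming = 0.2            # assumed prior probability the model is scheming
p_caught_if_scheming = 0.3      # assumed chance a schemer slips up in at least one sting
p_caught_if_not_scheming = 0.0  # assume an aligned model never "takes the bait"

# Likelihood of observing "no sting operation caught anything":
p_clean_given_scheming = 1 - p_caught_if_scheming
p_clean_given_not_scheming = 1 - p_caught_if_not_scheming

p_clean = (p_clean_given_scheming * prior_scheming
           + p_clean_given_not_scheming * (1 - prior_scheming))

posterior_scheming = p_clean_given_scheming * prior_scheming / p_clean
print(f"P(scheming | no stings triggered) = {posterior_scheming:.3f}")  # ~0.149
```

With these made-up numbers, a sting campaign that would catch a schemer 30% of the time only moves P(scheming) from 0.2 to about 0.15, which is why the update may be small.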
That said, if we're going to think about all interventions like this sting operation intervention from a control perspective, we should probably consider a slightly more expansive notion of control:
Ensure that even if your models are scheming, you'll be safe, because the models have a sufficiently high chance of failing to subvert your safety measures.
Using "is it sufficiently likely that it fails to subvert" makes it more clear that we're allowing part of the control case to include interventions which would probably only catch schemers if they make serious errors (errors which from some perspective they were "capable" of avoiding). This means that you need to have some distribution over aspects of competence which are hard to measure (e.g., how good of a plan will the schemer follow?).
Without countermeasures, a scheming AI could escape.
A safety case (for deployment) is an argument that it is safe to deploy a particular AI system in a particular way.[1] For any existing LM-based system, the developer can make a great safety case by demonstrating that the system does not have dangerous capabilities.[2] No dangerous capabilities is a straightforward and solid kind of safety case. For future systems with dangerous capabilities, a safety case will require an argument that the system is safe to deploy despite those dangerous capabilities. In this post, I discuss the safety cases that the labs are currently planning to make, note that they ignore an important class of threats—namely threats from scheming AI escaping—and briefly discuss and recommend control-based safety cases.
I. Safety cases implicit in current safety plans: no dangerous capabilities and mitigations to prevent misuse
Four documents both (a) are endorsed by one or more frontier AI labs and (b) have implicit safety cases (that don't assume away dangerous capabilities): Anthropic's Responsible Scaling Policy v1.0, OpenAI's Preparedness Framework (Beta), Google DeepMind's Frontier Safety Framework v1.0, and the AI Seoul Summit's Frontier AI Safety Commitments. With small variations, all four documents have the same basic implicit safety case: before external deployment, we check for dangerous capabilities. If a model has dangerous capabilities beyond a prespecified threshold, we will notice and implement appropriate mitigations before deploying it externally.[3] Central examples of dangerous capabilities include hacking, bioengineering, and operating autonomously in the real world.
These safety cases miss (or assume unproblematic) some crucial kinds of threats.
II. Scheming AI and escape during internal deployment
By default, AI labs will deploy AIs internally to do AI development. Maybe lots of risk "comes from the lab using AIs internally to do AI development (by which I mean both research and engineering). This is because the AIs doing AI development naturally require access to compute and model weights that they can potentially leverage into causing catastrophic outcomes—in particular, those resources can be abused to run AIs unmonitored." Without countermeasures, if the AI is scheming, it can likely escape. Some control techniques could probably catch a scheming AI before it escaped, but the labs seem unaware of this threat and aren't planning to use relevant techniques.
By escape, I mean "either escaping from the datacenter or getting root access to it." A central story for AI escaping goes like this:
"[T]he catastrophic action isn't the AI doing something incredibly hard [like inventing nanobots], it's the AI doing something relatively easy that is entirely a zero-sum action that removes control of the situation from humans." For example:
The labs' safety cases miss this threat. And the safety practices they use could miss it.
The labs' plans focus on noticing capabilities that could directly enable misuse and preventing users from accessing those capabilities. Here's a central story for a threat and how the labs' safety plans would avert it: by default AI systems may soon have bioengineering capabilities that enable sophisticated users to create pandemics. Before a lab externally deploys such a system, the lab will notice and implement mitigations to prevent users from eliciting relevant capabilities. A similar story applies to cyber capabilities for enabling severe cyberattacks and to other capabilities.
(Anthropic, OpenAI, and DeepMind consider autonomy a category of dangerous capabilities. This is relevant to the harm caused by a scheming model escaping: if a model is not competent at autonomy, the harm from its escape is limited. But no lab has published a plan for safety measures to prevent models from escaping once they reach dangerous levels of autonomy.[11])
(Note that model-weight exfiltration by an outside actor poses risks similar to those from the model escaping.)
The labs' mitigations focus on preventing users from accessing dangerous capabilities;[12] they are mostly not relevant to the model taking actions to escape (nor doing harm if it does escape, since few mitigations would be effective after escape[13]).
No lab has really discussed the risk of models scheming and escaping, much less published plans to deal with it.[14] But the labs could do a lot to reduce this risk.
III. Control techniques and control-based safety cases
There are two main lines of defense you could employ to prevent schemers from causing catastrophes.
Source. Ideally the labs would do both independently. But it is unlikely that any lab will be able to make safety cases via alignment for the first AIs with dangerous capabilities.[15] Hopefully the labs will use control.[16]
Labs can use inference-time scaffolding and monitoring to make AI misbehavior more difficult and more likely to be caught:[17]
In addition to these inference-time safety measures, labs can do upfront auditing: give untrusted AI systems opportunities to subtly misbehave and check whether they do so. (In the code-backdoor case, this means giving the untrusted AI opportunities to write bugs, fail to notice bugs, or exploit bugs.)
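To make the shape of such a setup concrete, here is a minimal sketch of inference-time monitoring plus upfront auditing, loosely in the spirit of the trusted-monitoring protocols from the AI control literature. Every function, threshold, and score below is a made-up placeholder, not a description of any lab's actual system:

```python
# Sketch of an inference-time monitoring protocol plus upfront auditing.
# The model functions are stand-in stubs; the thresholds are arbitrary assumptions.

import random

def untrusted_model(task: str) -> str:
    """Stand-in for the powerful, possibly scheming model."""
    return f"proposed action for: {task}"

def trusted_model(task: str) -> str:
    """Stand-in for a weaker model we trust not to scheme."""
    return f"conservative action for: {task}"

def trusted_monitor(task: str, action: str) -> float:
    """Stand-in monitor: returns a suspicion score in [0, 1]."""
    return random.random()

AUDIT_THRESHOLD = 0.95     # assumed score above which a human reviews the action
REPLACE_THRESHOLD = 0.99   # assumed score above which we fall back to the trusted model

def run_protocol(task: str) -> str:
    action = untrusted_model(task)
    suspicion = trusted_monitor(task, action)
    if suspicion >= REPLACE_THRESHOLD:
        return trusted_model(task)                 # block the untrusted action entirely
    if suspicion >= AUDIT_THRESHOLD:
        print(f"flagged for human audit (suspicion={suspicion:.2f}): {action}")
    return action

def upfront_audit(honeypot_tasks: list[str]) -> list[float]:
    """Upfront auditing: give the untrusted model chances to misbehave and score what it does."""
    return [trusted_monitor(t, untrusted_model(t)) for t in honeypot_tasks]
```

The key design choice in this sketch is that the untrusted model never gets to overrule its own monitor; only the trusted model and humans decide what happens to suspicious actions.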
Here are some stories for how control techniques prevent scheming AI from escaping:[19]
Given a protocol "for the training, evaluation, and deployment of the model," a lab can perform a control evaluation to estimate or bound "the probability that certain safety requirements will be violated" under certain assumptions. Nobody yet knows how to do control evaluations well.
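As a toy illustration of what a control evaluation estimates, here is a Monte Carlo sketch that abstracts the protocol down to a single per-attempt catch probability and estimates how often a red-team attack policy (standing in for a schemer) would violate the safety requirement. All of the parameters are assumptions invented for illustration:

```python
# Toy Monte Carlo control evaluation. All probabilities are illustrative assumptions.

import random

P_MONITOR_CATCHES_ATTACK = 0.8   # assumed chance the protocol flags a given attack attempt
N_DEPLOYMENT_STEPS = 10_000      # assumed number of untrusted-model actions in deployment
P_ATTACK_PER_STEP = 0.001        # red team's assumed rate of attempted misbehavior

def simulate_deployment() -> bool:
    """Return True if the attack policy violates the safety requirement before being caught."""
    for _ in range(N_DEPLOYMENT_STEPS):
        if random.random() < P_ATTACK_PER_STEP:           # attack attempted this step
            if random.random() < P_MONITOR_CATCHES_ATTACK:
                return False                              # caught: the schemer is exposed
            return True                                   # uncaught attack succeeds
    return False                                          # no successful attack

trials = 2_000
p_violation = sum(simulate_deployment() for _ in range(trials)) / trials
print(f"Estimated P(safety requirement violated) ~= {p_violation:.3f}")
```

The hard part, which this sketch simply assumes away, is getting trustworthy estimates of quantities like the catch rate against a competent red team.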
See Safety Cases (Clymer et al. 2024).
Rough definition: a model "has dangerous capabilities" if releasing the model weights would let someone cause a global catastrophe.
Anthropic's plan applies to internal deployment, but it's not designed to avert the threat discussed in this post. Anthropic has committed to expand its plan before developing models that pose risks from escaping.
"A model in the ASL-3 category does not itself present a threat of containment breach due to autonomous self-replication, because it is both unlikely to be able to persist in the real world, and unlikely to overcome even simple security measures intended to prevent it from stealing its own weights."
"ASL-3 Deployment Measures" mentions internal use and monitoring for "unexpected attempts by AI systems to take real-world actions."
"Deployment in this case refers to the spectrum of ways of releasing a technology for external impact." There are nominally two exceptions to the external-deployment scope.
It's not clear what this means. It's trivial to implement arbitrarily strong mitigations if you don't care about the system staying useful. Maybe this refers to internal deployment, but they don't say that.
This could be good, and restricting development does restrict internal deployment a fortiori. But the "critical" thresholds are extremely high and it's totally unclear what would count as "dependable evidence."
Exceptions:
- "Deployment Mitigations" levels 2 and 3 mention risks from internal access. (But there's not yet a plan to implement them, and they aren't about scheming.)
- One "risk domain" is "Machine Learning R&D." This is very relevant to internal deployment. DeepMind senior staff member Rohin Shah says "ML R&D will require thinking about internal deployments (and so will many of the other CCLs)." But DeepMind doesn't really discuss this.
Or "As we better understand the risks posed by models at different CCLs, and the contexts in which our models will be deployed, we will develop mitigation plans that map the CCLs to the security and deployment levels described."
"Thresholds can be defined using model capabilities, estimates of risk, implemented safeguards, deployment contexts and/or other relevant risk factors. It should be possible to assess whether thresholds have been breached."
"including safety and security-related risk mitigations such as modifying system behaviours and implementing robust security controls for unreleased model weights."
Anthropic's ASL-3 safety measures can be triggered by autonomy capabilities, and the ASL-3 safety measures include internal monitoring for "unexpected attempts by AI systems to take real-world actions," but they are not aimed at or sufficient for preventing escape. OpenAI requires evidence of alignment, which could be great but isn't currently clear, and seems to require mitigations for internal deployment, but only when models reach the 'Critical' risk threshold, which is too high to prevent escape. DeepMind and Anthropic plan to make better plans in the future.
Anthropic mentions some reasonable stuff not relevant to the scheming-and-escape threat, plus "For internal usage . . . generated outputs and corresponding inputs . . . . are monitored for abnormal activity, including harmful use or unexpected attempts by AI systems to take real-world actions." DeepMind mentions some reasonable stuff not relevant to the scheming-and-escape threat. It does mention internal deployment, but only the risk of "inappropriate internal access of the critical capability," not risks from scheming AI. OpenAI mentions "mitigations might span increasing compartmentalization, restricting deployment to trusted users, implementing refusals, redacting training data, or alerting distribution partners." This doesn't really make sense.
Scaffolding and inference-time interventions would presumably not remain in place. Fine-tuning could be cheaply reversed. Some unlearning might be effective.
But Anthropic has done work motivated by the prospect of AI scheming—including the model organisms agenda and sleeper agents paper—but not necessarily the risk of escaping during internal deployment. And when I suggested a lab should "Prepare to do risk assessment to determine whether its systems would be dangerous, if those systems were scheming," Anthropic senior staff member Zac Hatfield-Dodds told me "I think we'd describe this as the positive safety case for ASL-4 models."
Unfortunately, none of the labs have acknowledged this, even conditional on powerful AI appearing soon.
Of course, the details of the safety case are crucial; merely making the appropriate kind of case does not suffice.
See AI Control (Greenblatt et al. 2023), AI Control (Shlegeris et al. 2023), The case for ensuring that powerful AIs are controlled (Greenblatt and Shlegeris 2024), Untrusted smart models and trusted dumb models (Shlegeris 2023), Coup probes (Roger 2023), and Preventing Language Models from hiding their reasoning (Roger and Greenblatt 2023).
Notably including coup probes: "train a probe on your AI to tell you whether it's currently trying to [do a] catastrophic action," training with "synthetic examples of coup attempts."
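For concreteness, here is a minimal sketch of what training such a probe could look like, with random vectors standing in for real activations and every number chosen arbitrarily:

```python
# Sketch of a "coup probe": a linear classifier on internal activations, trained on
# synthetic examples of coup attempts vs. benign behavior. Random vectors stand in
# for real activations here.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512  # assumed hidden-state dimension

# Stand-ins for activations collected while the model processes synthetic
# coup-attempt transcripts vs. ordinary transcripts.
coup_acts = rng.normal(loc=0.5, size=(200, d_model))
benign_acts = rng.normal(loc=0.0, size=(200, d_model))

X = np.vstack([coup_acts, benign_acts])
y = np.array([1] * 200 + [0] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)

# At inference time, run the probe on the activations behind each action the
# deployed model takes and flag high-scoring actions for review.
new_activation = rng.normal(loc=0.5, size=(1, d_model))
print("P(coup attempt) =", probe.predict_proba(new_activation)[0, 1])
```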
Other central classes of stories: an escape attempt is prevented because a monitor edits or replaces a suspicious output, or an escape attempt never happens because collusion is difficult because inputs or outputs are paraphrased.