The labs' safety cases miss this threat.
This seems mostly right, but it's worth noting that Anthropic's current RSP contains an "Early Thoughts on ASL-4" section, which includes language like:
- "Model theft should be prohibitively costly for state-level actors, even with the help of a significant number of employees and the model itself."
- "Safety research: Sufficient progress on the science of interpretability, alignment training, and model evaluations to make an “affirmative case” that our models will not autonomously attempt to strategically undermine our safety measures or cause large-scale catastrophe."
- "red-teaming should confirm that models can't cause harm quickly enough to evade detection."
In principle, these measures could suffice for managing risk from scheming.
(In practice, I expect that making a strong "affirmative case" that models won't attempt to undermine safety measures will be very hard, so something else, more like control, will likely happen instead.)
I mostly agree. Anthropic is kind of an exception (see footnotes 3, 4, 5, 11, 12, and 14). Anthropic's stance on model escape is basically "we'll handle that at ASL-4." That's reasonable, and better than not noticing the threat, but they've said approximately nothing specific about escape or control yet.
I worry that quotes like
"red-teaming should confirm that models can't cause harm quickly enough to evade detection"
are misleading in the context of escape (for all labs, not just Anthropic). When defining ASL-4, Anthropic will need to clarify whether the definition covers scheming/escape and how safety could be "confirm[ed]"; I think the relevant safety properties will need high-effort control evals, not just red-teaming.
This post is poorly organized/edited and the important parts aren't novel; you should go read AI catastrophes and rogue deployments, AI Control, and The case for ensuring that powerful AIs are controlled instead.
But it is unlikely that any lab will be able to make safety cases via alignment for the first AIs with dangerous capabilities.
Hmm, I think the situation is a bit more complex than this makes it sound. In particular, we might be able to weakly update against scheming based on a variety of empirical evidence even without being able to make a strong affirmative case.
As a central example of making a weak argument via alignment, consider doing a bunch of model-organisms-style experiments. Failing to find any compelling demonstrations of scheming in these experiments could plausibly update you against scheming. (Of course, they could also end up supporting a risk case rather than a safety case if you do find compelling evidence of scheming!)
Another intervention like this would be to offer the model a deal in which you compensate it in exchange for it giving you evidence of its own misalignment.
It's also somewhat unclear exactly what should be considered part of a control safety case vs an alignment safety case.
For instance, consider an intervention like:
Perform extensive (plausible-to-the-model) sting operations (i.e., red-teaming focused on scheming). Do this at various points throughout post-training to reduce the risk that the model has learned to be more conservative due to other red-teaming efforts. Also, ideally ensure that all data discussing this sort of sting operation is removed from the initial training data. (You can potentially train the AI on this data at the end, after running the sting operations.)
The point of this intervention is to detect whether your model is scheming. You could think about this from a control-type perspective (a scheming model would have had some chance of failing to evade our sting operations, so seeing nothing gives us some (possibly small) update towards safety) or an alignment perspective (we now have evidence of alignment).
I personally like thinking about this from a control perspective.
And, we need to make sure that we don't double count interventions like this in both control and alignment. (See here for more discussion of double counting and related matters.)
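To make the "(possibly small) update towards safety" concrete, here is a toy Bayesian calculation. Every number below is an assumption invented for illustration, not an estimate from the post or from any lab's documents:

```python
# Toy Bayesian update from a sting-operation campaign.
# All numbers are illustrative assumptions.

prior_scheming = 0.2            # assumed prior probability the model is scheming
p_caught_if_scheming = 0.3      # assumed chance a schemer slips up in at least one sting
p_caught_if_not_scheming = 0.0  # assume an aligned model never "takes the bait"

# Likelihood of observing "no sting operation caught anything":
p_clean_given_scheming = 1 - p_caught_if_scheming
p_clean_given_not_scheming = 1 - p_caught_if_not_scheming

p_clean = (p_clean_given_scheming * prior_scheming
           + p_clean_given_not_scheming * (1 - prior_scheming))

posterior_scheming = p_clean_given_scheming * prior_scheming / p_clean
print(f"P(scheming | no stings triggered) = {posterior_scheming:.3f}")  # ~0.149
```

With these made-up numbers, a sting campaign that would catch a schemer 30% of the time only moves P(scheming) from 0.2 to about 0.15, which is why the update may be small.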
That said, if we're going to think about all interventions like this sting operation intervention from a control perspective, we should probably consider a slightly more expansive notion of control:
Ensure that even if your models are scheming, you'll be safe, because the models have a sufficiently high chance of failing to subvert your safety measures.
Using "is it sufficiently likely that it fails to subvert" makes it more clear that we're allowing part of the control case to include interventions which would probably only catch schemers if they make serious errors (errors which from some perspective they were "capable" of avoiding). This means that you need to have some distribution over aspects of competence which are hard to measure (e.g., how good of a plan will the schemer follow?).
Without countermeasures, a scheming AI could escape.
A safety case (for deployment) is an argument that it is safe to deploy a particular AI system in a particular way.[1] For any existing LM-based system, the developer can make a great safety case by demonstrating that the system does not have dangerous capabilities.[2] No dangerous capabilities is a straightforward and solid kind of safety case. For future systems with dangerous capabilities, a safety case will require an argument that the system is safe to deploy despite those dangerous capabilities. In this post, I discuss the safety cases that the labs are currently planning to make, note that they ignore an important class of threats—namely threats from scheming AI escaping—and briefly discuss and recommend control-based safety cases.
I. Safety cases implicit in current safety plans: no dangerous capabilities and mitigations to prevent misuse
Four documents both (a) are endorsed by one or more frontier AI labs and (b) have implicit safety cases (that don't assume away dangerous capabilities): Anthropic's Responsible Scaling Policy v1.0, OpenAI's Preparedness Framework (Beta), Google DeepMind's Frontier Safety Framework v1.0, and the AI Seoul Summit's Frontier AI Safety Commitments. With small variations, all four documents have the same basic implicit safety case: before external deployment, we check for dangerous capabilities. If a model has dangerous capabilities beyond a prespecified threshold, we will notice and implement appropriate mitigations before deploying it externally.[3] Central examples of dangerous capabilities include hacking, bioengineering, and operating autonomously in the real world.
These safety cases miss (or assume unproblematic) some crucial kinds of threats.
II. Scheming AI and escape during internal deployment
By default, AI labs will deploy AIs internally to do AI development. Maybe lots of risk "comes from the lab using AIs internally to do AI development (by which I mean both research and engineering). This is because the AIs doing AI development naturally require access to compute and model weights that they can potentially leverage into causing catastrophic outcomes—in particular, those resources can be abused to run AIs unmonitored." Without countermeasures, if the AI is scheming, it can likely escape. Some control techniques could probably catch a scheming AI before it escaped, but the labs seem unaware of this threat and aren't planning to use relevant techniques.
By escape, I mean "either escaping from the datacenter or getting root access to it." A central story for AI escaping goes like this:
"[T]he catastrophic action isn't the AI doing something incredibly hard [like inventing nanobots], it's the AI doing something relatively easy that is entirely a zero-sum action that removes control of the situation from humans." For example:
The labs' safety cases miss this threat. And the safety practices they use could miss it.
The labs' plans focus on noticing capabilities that could directly enable misuse and preventing users from accessing those capabilities. Here's a central story for a threat and how the labs' safety plans would avert it: by default AI systems may soon have bioengineering capabilities that enable sophisticated users to create pandemics. Before a lab externally deploys such a system, the lab will notice and implement mitigations to prevent users from eliciting relevant capabilities. A similar story applies to cyber capabilities for enabling severe cyberattacks and to other capabilities.
(Anthropic, OpenAI, and DeepMind consider autonomy a category of dangerous capabilities. This is relevant to the harm caused by a scheming model escaping: if a model is not competent at autonomy, the harm from its escape is limited. But no lab has published a plan for safety measures to prevent models from escaping once they reach dangerous levels of autonomy.[11])
(Note that model-weight exfiltration by an outside actor poses risks similar to those from the model escaping.)
The labs' mitigations focus on preventing users from accessing dangerous capabilities;[12] they are mostly not relevant to the model taking actions to escape (nor doing harm if it does escape, since few mitigations would be effective after escape[13]).
No lab has really discussed the risk of models scheming and escaping, much less published plans to deal with it.[14] But the labs could do a lot to reduce this risk.
III. Control techniques and control-based safety cases
There are two main lines of defense you could employ to prevent schemers from causing catastrophes.
Source. Ideally the labs would do both independently. But it is unlikely that any lab will be able to make safety cases via alignment for the first AIs with dangerous capabilities.[15] Hopefully the labs will use control.[16]
Labs can use inference-time scaffolding and monitoring to make AI misbehavior more difficult and more likely to be caught:[17]
In addition to these inference-time safety measures, labs can do upfront auditing: give untrusted AI systems opportunities to subtly misbehave and check whether they do so. (In the code-backdoor case, this means giving the untrusted AI opportunities to write bugs, fail to notice bugs, or exploit bugs.)
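To make the shape of such a setup concrete, here is a minimal sketch of inference-time monitoring plus upfront auditing, loosely in the spirit of the trusted-monitoring protocols from the AI control literature. Every function, threshold, and score below is a made-up placeholder, not a description of any lab's actual system:

```python
# Sketch of an inference-time monitoring protocol plus upfront auditing.
# The model functions are stand-in stubs; the thresholds are arbitrary assumptions.

import random

def untrusted_model(task: str) -> str:
    """Stand-in for the powerful, possibly scheming model."""
    return f"proposed action for: {task}"

def trusted_model(task: str) -> str:
    """Stand-in for a weaker model we trust not to scheme."""
    return f"conservative action for: {task}"

def trusted_monitor(task: str, action: str) -> float:
    """Stand-in monitor: returns a suspicion score in [0, 1]."""
    return random.random()

AUDIT_THRESHOLD = 0.95     # assumed score above which a human reviews the action
REPLACE_THRESHOLD = 0.99   # assumed score above which we fall back to the trusted model

def run_protocol(task: str) -> str:
    action = untrusted_model(task)
    suspicion = trusted_monitor(task, action)
    if suspicion >= REPLACE_THRESHOLD:
        return trusted_model(task)                 # block the untrusted action entirely
    if suspicion >= AUDIT_THRESHOLD:
        print(f"flagged for human audit (suspicion={suspicion:.2f}): {action}")
    return action

def upfront_audit(honeypot_tasks: list[str]) -> list[float]:
    """Upfront auditing: give the untrusted model chances to misbehave and score what it does."""
    return [trusted_monitor(t, untrusted_model(t)) for t in honeypot_tasks]
```

The key design choice in this sketch is that the untrusted model never gets to overrule its own monitor; only the trusted model and humans decide what happens to suspicious actions.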
Here are some stories for how control techniques prevent scheming AI from escaping:[19]
Given a protocol "for the training, evaluation, and deployment of the model," a lab can perform a control evaluation to estimate or bound "the probability that certain safety requirements will be violated" under certain assumptions. Nobody yet knows how to do control evaluations well.
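As a toy illustration of what a control evaluation estimates, here is a Monte Carlo sketch that abstracts the protocol down to a single per-attempt catch probability and estimates how often a red-team attack policy (standing in for a schemer) would violate the safety requirement. All of the parameters are assumptions invented for illustration:

```python
# Toy Monte Carlo control evaluation. All probabilities are illustrative assumptions.

import random

P_MONITOR_CATCHES_ATTACK = 0.8   # assumed chance the protocol flags a given attack attempt
N_DEPLOYMENT_STEPS = 10_000      # assumed number of untrusted-model actions in deployment
P_ATTACK_PER_STEP = 0.001        # red team's assumed rate of attempted misbehavior

def simulate_deployment() -> bool:
    """Return True if the attack policy violates the safety requirement before being caught."""
    for _ in range(N_DEPLOYMENT_STEPS):
        if random.random() < P_ATTACK_PER_STEP:           # attack attempted this step
            if random.random() < P_MONITOR_CATCHES_ATTACK:
                return False                              # caught: the schemer is exposed
            return True                                   # uncaught attack succeeds
    return False                                          # no successful attack

trials = 2_000
p_violation = sum(simulate_deployment() for _ in range(trials)) / trials
print(f"Estimated P(safety requirement violated) ~= {p_violation:.3f}")
```

The hard part, which this sketch simply assumes away, is getting trustworthy estimates of quantities like the catch rate against a competent red team.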
See Safety Cases (Clymer et al. 2024).
Rough definition: a model "has dangerous capabilities" if releasing the model weights would let someone cause a global catastrophe.
Anthropic's plan applies to internal deployment, but it's not designed to avert the threat discussed in this post. Anthropic has committed to expand its plan before developing models that pose risks from escaping.
"A model in the ASL-3 category does not itself present a threat of containment breach due to autonomous self-replication, because it is both unlikely to be able to persist in the real world, and unlikely to overcome even simple security measures intended to prevent it from stealing its own weights."
"ASL-3 Deployment Measures" mentions internal use and monitoring for "unexpected attempts by AI systems to take real-world actions."
"Deployment in this case refers to the spectrum of ways of releasing a technology for external impact." There are nominally two exceptions to the external-deployment scope.
It's not clear what this means. It's trivial to implement arbitrarily strong mitigations if you don't care about the system staying useful. Maybe this refers to internal deployment, but they don't say that.
This could be good, and restricting development does restrict internal deployment a fortiori. But the "critical" thresholds are extremely high and it's totally unclear what would count as "dependable evidence."
Exceptions:
- "Deployment Mitigations" levels 2 and 3 mention risks from internal access. (But there's not yet a plan to implement them, and they aren't about scheming.)
- One "risk domain" is "Machine Learning R&D." This is very relevant to internal deployment. DeepMind senior staff member Rohin Shah says "ML R&D will require thinking about internal deployments (and so will many of the other CCLs)." But DeepMind doesn't really discuss this.
Or "As we better understand the risks posed by models at different CCLs, and the contexts in which our models will be deployed, we will develop mitigation plans that map the CCLs to the security and deployment levels described."
"Thresholds can be defined using model capabilities, estimates of risk, implemented safeguards, deployment contexts and/or other relevant risk factors. It should be possible to assess whether thresholds have been breached."
"including safety and security-related risk mitigations such as modifying system behaviours and implementing robust security controls for unreleased model weights."
Anthropic's ASL-3 safety measures can be triggered by autonomy capabilities, and the ASL-3 safety measures include internal monitoring for "unexpected attempts by AI systems to take real-world actions," but they are not aimed at or sufficient for preventing escape. OpenAI requires evidence of alignment, which could be great but isn't currently clear, and seems to require mitigations for internal deployment, but only when models reach the 'Critical' risk threshold, which is too high to prevent escape. DeepMind and Anthropic plan to make better plans in the future.
Anthropic mentions some reasonable stuff not relevant to the scheming-and-escape threat, plus "For internal usage . . . generated outputs and corresponding inputs . . . . are monitored for abnormal activity, including harmful use or unexpected attempts by AI systems to take real-world actions." DeepMind mentions some reasonable stuff not relevant to the scheming-and-escape threat. It does mention internal deployment, but only the risk of "inappropriate internal access of the critical capability," not risks from scheming AI. OpenAI mentions "mitigations might span increasing compartmentalization, restricting deployment to trusted users, implementing refusals, redacting training data, or alerting distribution partners." This doesn't really make sense.
Scaffolding and inference-time interventions would presumably not remain in place. Fine-tuning could be cheaply reversed. Some unlearning might be effective.
But Anthropic has done work motivated by the prospect of AI scheming—including the model organisms agenda and sleeper agents paper—but not necessarily the risk of escaping during internal deployment. And when I suggested a lab should "Prepare to do risk assessment to determine whether its systems would be dangerous, if those systems were scheming," Anthropic senior staff member Zac Hatfield-Dodds told me "I think we'd describe this as the positive safety case for ASL-4 models."
Unfortunately, none of the labs have acknowledged this, even conditional on powerful AI appearing soon.
Of course, the details of the safety case are crucial; merely making the appropriate kind of case does not suffice.
See AI Control (Greenblatt et al. 2023), AI Control (Shlegeris et al. 2023), The case for ensuring that powerful AIs are controlled (Greenblatt and Shlegeris 2024), Untrusted smart models and trusted dumb models (Shlegeris 2023), Coup probes (Roger 2023), and Preventing Language Models from hiding their reasoning (Roger and Greenblatt 2023).
Notably including coup probes: "train a probe on your AI to tell you whether it's currently trying to [do a] catastrophic action," training with "synthetic examples of coup attempts."
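For concreteness, here is a minimal sketch of what training such a probe could look like, with random vectors standing in for real activations and every number chosen arbitrarily:

```python
# Sketch of a "coup probe": a linear classifier on internal activations, trained on
# synthetic examples of coup attempts vs. benign behavior. Random vectors stand in
# for real activations here.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512  # assumed hidden-state dimension

# Stand-ins for activations collected while the model processes synthetic
# coup-attempt transcripts vs. ordinary transcripts.
coup_acts = rng.normal(loc=0.5, size=(200, d_model))
benign_acts = rng.normal(loc=0.0, size=(200, d_model))

X = np.vstack([coup_acts, benign_acts])
y = np.array([1] * 200 + [0] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)

# At inference time, run the probe on the activations behind each action the
# deployed model takes and flag high-scoring actions for review.
new_activation = rng.normal(loc=0.5, size=(1, d_model))
print("P(coup attempt) =", probe.predict_proba(new_activation)[0, 1])
```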
Other central classes of stories: an escape attempt is prevented because a monitor edits or replaces a suspicious output, or an escape attempt never happens because collusion is difficult because inputs or outputs are paraphrased.