In this post, I’ll make a technical point that comes up when thinking about risks from scheming AIs from a control perspective.

In brief: Consider a deployment of an AI in a setting where it’s going to be given a sequence of tasks, and you’re worried about a safety failure that can happen suddenly. Every time the AI tries to attack, one of three things happens: we win, we lose, or the deployment continues (and the AI probably eventually attempts to attack again). So there’s two importantly different notions of attacks “failing”: we can catch the attack (in which case we win), or the attack can fail silently, either because it’s defective or because it’s defused by safety techniques, such that the attack doesn’t end up causing an unacceptable outcome but the AI also doesn’t get caught (in which case the deployment continues). When reasoning about how safe a protocol is, it’s crucial to distinguish between these two ways an attack can fail. And when designing protocols, it’s often worth thinking separately about the techniques that make the AI lose and the techniques that let the deployment continue without a win or loss.

This point has been made in passing in many places (our control paper, our toy models post, Charlie Griffin’s control paper, Towards Safety Cases For AI Scheming, various upcoming papers) but I wanted to have a single place that makes the point clearly. Thanks to Alex Mallen, Josh Clymer, Joe Benton, and Justis Mills for comments, and to Monte MacDiarmid for providing experimental data that I analyze.

Here’s a more precise description of the scenario:

  • You’re deploying your AI to do a large number of discrete tasks (e.g. implementing many features across the codebase inside a company, or doing many ML research projects).
    • Let’s assume that each task will be completed fairly quickly compared to the timescale of the whole deployment: for example, you’re trying to analyze the safety of deploying this AI for six months, and the tasks will take an average of an hour; each task lasts about 1/4000 of the whole deployment. The system will be doing many tasks in parallel.
  • You’re worried about a concentrated failure (a.k.a. “high-stakes failure”): in absence of countermeasures, the AI would be able to cause an unacceptable outcome just by behaving badly on a small number of tasks. Examples of concentrated failures include:
    • Causing a rogue deployment (see examples here).
    • Sabotaging an ML research task that is particularly crucial, where a single false result suffices for the lab to make a catastrophically bad decision. (I don’t think most ML research tasks are crucial enough that the AIs could cause an unacceptable outcome by just sabotaging a small number of experiments.)
    • Leaking sensitive information (e.g. describing key algorithmic secrets in a document that will be shared with people who aren’t cleared to know those secrets; one extreme example is the model leaking algorithmic secrets to someone outside the AI lab, which it is able to do if it has access to both algorithmic secrets and the internet).

Our first control paper (and various follow-up papers and related posts) analyzes a setting with these properties.

So you deploy the AI with some control measures. In many cases, the optimal strategy for the AI is to normally act aligned, and on very rare occasions, try to cause the unacceptable outcome—we’ll call those attempts “attacks”.

Execute/replace/audit protocols

For concreteness, let’s talk about a particularly important simple class of control protocols, which I’ll call execute/replace/audit protocols.

An execute/audit/replace protocol works as follows: We have an agent loop that repeatedly generates inputs to a LLM and then calls an inference service on those inputs, getting a proposed action in response (see here for more on what I’m imagining). Then we produce an estimate of how dangerous or suspicious this action is. There are many techniques you could use for this: asking trusted or untrusted models to evaluate the danger of proposed actions, using probes on the policy trained on off-policy examples or produced other ways, or using interpretability techniques like sparse autoencoders to detect deceptive intent. All of these methods take a proposed action (together with some surrounding context) and produce a score describing how dangerous and/or suspicious the action is. I’ll call that the monitor and call its response the monitor score.

Then, based on the monitor score, we need to make two decisions.

  • Do we execute the action or replace it?
    • If the action looks fine, we want to execute it.
    • If the action looks dangerous, we don’t want to execute it. So we want to replace it with a new action. In many current systems, this is implemented by replacing the model’s response with a refusal. In the context of an agent, there are many different options for how to replace the action (which will then be monitored itself):
      • Resample from the original, untrusted policy.
      • Defer to trusted: get an action from a weaker, trusted model.
      • Defer to trusted edit: get a trusted model to edit the action to remove parts that seemed suspicious.
      • Abort the trajectory, reporting that you failed to solve the task. This is similar to refusals in current language models, except it aborts the whole task instead of just the one message.
      • Block the trajectory, waiting until a human audit.
      • Shut down the whole deployment. I personally think that this is so costly that it’s very unlikely you’d want to do this without many people from the AI company first carefully investigating the action.
  • Do we want to audit the action? Auditing is when you review the action to determine whether it’s evidence that your model is intentionally trying to cause problems. The simplest strategy here is to put the action in a queue to eventually be reviewed by a human, who can escalate the investigation if they’re suspicious (perhaps using techniques described here); this escalation might end with you concluding that the model is a misaligned power-seeker.

So in general we want to get two scores from our monitor, which we use to make the two choices of whether to replace and whether to audit. You could use the same monitor with different thresholds for these two choices, which is how we handled it in our original control paper, but it’s probably better to make the decisions separately.

The two defensive actions of auditing and replacing have quite different safety properties.

  • Auditing is probably drastically more expensive, so you’ll probably do it much less. In particular, if you need to use human auditors and you’re able to replace actions by resampling, it’s totally affordable to resample your model’s actions 10% of the time, but the proportion of the model actions you can afford to have a human look at is drastically lower than that. E.g. OpenAI obviously can’t afford to have a human look at even a tiny fraction of the (very roughly estimated[1]) 100 trillion tokens that their API handles per year; with a staff of 1000 fast readers, they could read 0.1% of it.

    • One consequence is that you should generally replace whenever you audit.

  • False negatives are catastrophic when choosing whether to replace, but not when choosing whether to audit. If you execute a catastrophically dangerous action, you might immediately lose. Whereas if you choose not to audit a catastrophically dangerous action, you’re just missing out on the opportunity to get evidence that the model is scheming.

Execute/replace/audit protocols lead to win/continue/lose scenarios

Let’s reason through what happens when the AI attacks. First, let’s think about a simple version. In this version, auditing is perfect: the auditing is able to incriminate the AI so convincingly that it can shut down after observing any attack. And replacing is perfect: every attack is neutralized by replacing (which might not be true in reality, e.g. for a replace strategy like trusted editing).

So the ways the deployment can end here are:

  • The attack is above the audit threshold, and we audit the attack; we then win.
  • The attack is below the replace threshold, and also succeeds; we lose. (Note that attacks can fail. For example, in our code backdooring paper, a large fraction of attempted backdoors didn’t actually fit the requirements of a backdoor and so didn’t lead to an AI victory. One of the reasons that trusted monitoring is safer than upfront auditing in that paper is that trusted monitoring forces the attack policy to submit backdoors that, as a result of needing to be sneaky, are less likely to work.)

To calculate safety here (as described in Toy models of AI control for concentrated catastrophe prevention), you can make the following assumptions:

  • The model rarely diverges from the behavior of an aligned model, attacking occasionally (but often enough that the deployment will end early—this is usually a good assumption, because the model loses if the deployment completes, as described here).
  • Successful attacks immediately cause the scheming AI to win.

Then the odds of a blue team victory is the ratio between the probability of the following two events occurring on a particular step:

  • The AI developer wins because they audited an attempted attack and then decided to shut down based on the result of the audit.
  • The scheming AI wins because a successful attack was executed (rather than replaced).

For example, if on the average attack attempt, there’s a 10% chance that the attack will be executed and be successful, and there’s a 20% chance that the attack will be audited and then the deployment will be shut down, then the overall P(blue team victory) is ⅔. It’s easy to modify this rule of thumb to relax some of those assumptions.

Example: analyzing the safety of defer-to-trusted using the probes from “Simple probes can catch sleeper agents”

In Simple probes can catch sleeper agents, Monte MacDiarmid makes a linear probe monitor with a simple contrast pair technique on the model from Sleeper Agents. He reports the following ROC curve, graphing true positive rate (TPR) against false positive rate (FPR):

An obvious question is: if this was the actual ROC curve that we got with the actual monitor we were using on an actual scheming model, would we be safe?

Let’s pretend we’re using a perfect auditor that we can afford to use 0.1% of the time. We need to use the monitor as a classifier with a 0.1% FPR in order to keep our auditor fully utilized. According to the data that Monte generously gave me, at FPR=0.001, TPR=0.26. So the probability of an attack getting audited is 26%.

Now, let’s assume we’re also using this probe’s output for a replace strategy with a 5% replacement rate. At FPR=0.05, FNR=0.03. So when the model attacks, the chance that the attack is executed is 3%. (Let’s assume that every attack succeeds.)

Thus when the model attacks, there’s a 26% chance it gets audited, a 3% chance the attack gets executed, and the rest of the probability is that the attack is replaced. So there’s a 0.26/(0.26 + 0.03) = 90% chance that the model is caught before an attack is executed, and in expectation the model will have a few attacks replaced before it either wins or loses.

Some things to observe here:

  • If we don't use a replace protocol, the safety is atrocious: with a 0% replace rate, there’s a 74% chance that the model successfully attacks (that is, an attack is executed before an attack is audited and caught).
  • The safety is very sensitive to the audit rate.
  • A lot of people intuitively think that monitoring (for auditing) needs a super high TPR to be effective (probably because of a cached intuition about intelligent adversaries being able to break unrobust measurements), but as we see here even a fairly low TPR of 26% gets you a 10x risk reduction in the simplified setting, when paired with replacement.

So whether this probe is good enough for safety crucially depends on the details of the protocol in which the probe is being used as a monitor.

  1. ^

     I’m going off the $510 million annualized API revenue mentioned here, and arbitrarily using the GPT-4o price of $5 per million input tokens, which gets you 100 trillion tokens.

New Comment
1 comment, sorted by Click to highlight new comments since:
[-]jsd60

This distinction reminds me of Evading Black-box Classifiers Without Breaking Eggs, in the black box adversarial examples setting.