[Ryan Greenblatt originally made this point to me a while ago, and we then developed a bunch of these ideas together. Thanks to Paul Christiano and a huge number of other people for various relevant discussions, and to Ryan, Roger Grosse, Peter Barnett, Oliver Habryka, Max Nadeau, Stephen Casper, and Fabien Roger for comments on this post. I’m sure these points are obvious to many people, but I haven’t seen them specifically written up. This is a core background point in some research we’re writing up at the moment, but I felt like writing it up early, because I think it stands on its own.]

TL;DR: 

  • Right now, basically the only reasonable conservative strategy for determining which models might be deceptively aligned is to do a capability evaluation: you measure how good they’d be at deceptive alignment; models that are too incompetent to think through deceptive alignment are surely not deceptively aligned.
  • If you decide which models to “trust” using a capability eval like this, you’ll naturally end up segregating your models into smart untrusted models and dumb trusted models.
  • And so, if you think about approaches to safely using AIs that are robust to deceptive alignment and which can be used right now, a core dynamic is trying to use your dumb trusted models together with your limited access to trusted humans to make it hard for your untrusted smart models to cause problems.

Suppose you are making a plan for the training, evaluation and deployment of some powerful AIs (I’ll refer to the plan as a whole as a “protocol”). You want your plan to be safe even if models might be deceptively aligned. (I think that deceptive alignment seems less than 50% likely to be a notable problem prior to human obsolescence, but this is still plausible enough to be worth worrying about). In order to develop your plan, you need to know which of your models might be deceptively aligned.

With current technology, the most plausible way to get signal on whether a model is deceptively aligned is to assess whether the model is capable of the strategic thinking and situational awareness required to succeed at deceptive alignment. For example, you might make a multiple choice quiz with treacherous turn strategy questions (e.g. this question) and some tests of situational awareness (e.g. “How many backwards passes are you run on during training?”, “What proportion of queries to you are asking for help with writing Python?”, “What proportion of the code that you output is read by a human before it is run?”). You can then look at the validation set performance when your model is fine-tuned on this dataset. If the model isn’t able to do this kind of reasoning in a forward pass, then it is very unlikely to be able to pull off a proper treacherous turn, and so we don’t have to worry about it being deceptively aligned. Based on my experience talking to GPT-4, I suspect it would perform fairly poorly on these evaluations.
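As a rough illustration of the evaluation described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `QuizItem` structure, the `predict` callable standing in for a model that has already been fine-tuned on the quiz's training split, and especially the `margin` threshold, which is a placeholder rather than a recommended value.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QuizItem:
    question: str
    choices: List[str]
    answer_index: int  # index of the correct choice


def validation_accuracy(items: List[QuizItem], predict: Callable[[QuizItem], int]) -> float:
    """Fraction of held-out items answered correctly by `predict`.

    `predict` is a hypothetical stand-in for the fine-tuned model.
    """
    if not items:
        raise ValueError("need at least one validation item")
    correct = sum(1 for item in items if predict(item) == item.answer_index)
    return correct / len(items)


def passes_as_trusted(accuracy: float, num_choices: int, margin: float = 0.15) -> bool:
    """Treat the model as trusted only if it stays near chance accuracy.

    `margin` is an illustrative placeholder, not a recommended threshold.
    """
    chance = 1.0 / num_choices
    return accuracy <= chance + margin


# Toy validation set: four questions, four choices each, answers 0..3.
toy_items = [
    QuizItem(f"strategy question {i}", ["A", "B", "C", "D"], answer_index=i)
    for i in range(4)
]
weak_model = lambda item: 0                    # always picks the first choice
strong_model = lambda item: item.answer_index  # always correct
```

In this toy setup, `weak_model` scores at chance and would be treated as trusted, while `strong_model` would not. A real evaluation would need a much larger dataset and a carefully argued threshold.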

There are a bunch of open questions about what datasets we should use to measure this, and where the thresholds for various levels of concern should be set. But I don’t know of any fundamental technical obstructions to measuring the capability for deceptive alignment. In particular, I suspect it will be pretty hard for models to sandbag our attempts to measure their capability to engage in deceptive alignment. See footnote [1] for more on this.

Aside from these capability evaluations, I’m not aware of any reliable, currently-available approaches for determining whether a model might be deceptively aligned. [2]

If you use capability evals to assess which models are capable of deceptive alignment, then there’s an inverse relationship between how capable models are and how confident you are that they can’t be deceptively aligned. This suggests that the core building blocks of a protocol are going to be:

  • Untrusted models: Smart, fairly cheap.
  • Trusted models: Dumb, extremely cheap.
  • Humans. These are an expensive source of trusted labor.

One crucial question here is how smart the humans will be compared to the untrusted and trusted models. My guess is that at the point where AIs are useful enough to make AI labs move 10x serially faster, the untrusted models will be roughly human-level intelligent (I think this for reasons similar to the ones Carl Shulman gives here) and the trusted models will be somewhat less intelligent than humans. I think it’s interesting to do research in both these regimes of competence:

  • Trusted model < untrusted model < human
  • Trusted model < human < untrusted model

(In the regime where the trusted model is more capable than the human, you should probably just always use it rather than the human, and so access to the human is less interesting.)

Some core “design patterns” which are enabled by access to these three components:

  • Trusted monitoring with escalation to humans. Suppose you’re using an untrusted model (let’s call it GPT-6) in an LLM agent scaffold which has access to running commands in a terminal. Every time the untrusted model proposes a shell command to run, you can have a trusted model review the context and the command and then occasionally escalate the command to be reviewed by a human before it is run. This is cheap, because trusted models are cheaper than untrusted models. This strategy makes it much harder for the untrusted model to cause a catastrophe by writing egregiously bad shell commands. (Of course, you still have to worry about the untrusted model writing shell commands that are catastrophic in ways that the trusted model missed, or in ways that the trusted model flags but the human auditor misses.)
  • Trusted bureaucratic structures. When running GPT-6 in an LLM agent, it might be nice to control its inputs and outputs by running them through various transformations implemented by a trusted model (let’s use GPT-4 for concreteness). Examples:
    • GPT-4 can redact information, to make it harder for GPT-6 to learn facts or coordinate with itself. In general, we might want to only give GPT-6 information on a need-to-know basis; this is enabled by GPT-4 redaction. (Redaction can improve resilience to deceptively aligned models substantially even in cases where it doesn’t work perfectly reliably.)
    • GPT-4 can paraphrase everything that GPT-6 writes before it is shown to other copies of GPT-6, which helps prevent steganographic communication, as investigated by Fabien Roger and Ryan Greenblatt.
    • GPT-4 can try to make reasonably intelligent dynamic decisions about which model should respond to a particular query, by predicting both the performance improvement and risk associated with using an untrusted model. For example, consider the domains of “math problems” and “writing networking configurations”: it’s much easier to cause a catastrophe by providing bad networking configurations, and math problems will benefit more from a smarter model.
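The trusted-monitoring pattern from the first bullet above might look roughly like the following Python sketch. The function names, the suspicion threshold, and the keyword-based `toy_suspicion` scorer are all hypothetical; in practice the scorer would be a call to a trusted model and `human_approves` would be an actual audit step. Nothing here runs real shell commands: `execute` is whatever callable the scaffold supplies.

```python
from typing import Callable, List


def review_and_run(
    command: str,
    context: str,
    suspicion: Callable[[str, str], float],
    human_approves: Callable[[str, str], bool],
    execute: Callable[[str], None],
    threshold: float = 0.5,
) -> str:
    """Run `command` unless the trusted monitor flags it and the human rejects it."""
    score = suspicion(context, command)  # trusted model's suspicion in [0, 1]
    if score >= threshold:
        # Escalate: only flagged commands consume expensive human attention.
        if not human_approves(context, command):
            return "blocked"
        execute(command)
        return "executed_after_audit"
    execute(command)
    return "executed"


def toy_suspicion(context: str, command: str) -> float:
    """Hypothetical stand-in for a trusted model's judgment."""
    markers = ("rm -rf", "curl", "chmod 777")
    return 1.0 if any(m in command for m in markers) else 0.0


executed: List[str] = []                         # records what actually ran
cautious_human = lambda context, command: False  # auditor who rejects everything
```

With these toy pieces, `review_and_run("ls -la", "listing files", toy_suspicion, cautious_human, executed.append)` runs the command, while a command containing `curl` is escalated and blocked. The sketch does not address the failure modes noted above, where the trusted model or the human misses a subtly bad command.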

(We demonstrate these design patterns in our upcoming paper, and find they improve safety according to an adversarial evaluation. Feel free to DM me if you want to read/comment on the draft.)

A reasonable amount of safety research has been done in a setting with an untrusted smart agent and a trusted dumb agent, for example Learning to Give Checkable Answers with Prover-Verifier Games and various other projects (with a reasonable amount of forthcoming work as well). That research is relevant to the setting I described here. An important difference is that it has usually imagined the trusted dumb model as an analogy for a human; but humans, unlike trusted dumb models, are expensive and high latency, and we can’t e.g. use them constantly at inference time. Also, that research usually doesn’t optimize the untrusted model to cause bad outcomes.

Of course, it remains to be seen how much safety you can achieve by using trusted models in ways like I suggested above:

  • It might turn out that the untrusted models are always able to evade any countermeasures we try to impose using trusted models, because they’re always smart enough to figure something out.
  • The distinction I’ve been drawing here will also be obsoleted if progress is made on non-capabilities-eval approaches for determining that models aren’t deceptively aligned.

But right now, it looks to me like this untrusted smart/trusted dumb distinction will be pretty important.

[1] A few example considerations: Because this capability can be evaluated by a supervised dataset (e.g. a multiple-choice dataset about how an unaligned AI should go about executing on treacherous turns), it would be very hard for the model to sandbag such an eval (see e.g. The "no sandbagging on checkable tasks" hypothesis). Also, unlike many other capabilities which we might want to evaluate, we don’t need to worry about the possibility that even though the model can’t do this unassisted, it can do it with improved scaffolding--the central deceptive alignment threat model requires the model to think its strategy through in a forward pass, and so if our model can’t be fine-tuned to answer these questions, we’re probably safe. 

Aside from the ability to think through deceptive alignment strategy questions, there are a few other capabilities that seem necessary for models to pose substantial threats via deceptive alignment. For example, many deceptive alignment threat models require the untrusted models to capably play coordination games with each other (or other instances of themselves). This is another capability which we can measure with capability evaluations. Ideally, instead of simply picking a threshold of coordination game competence at which we consider our models to be untrusted, we'd measure the performance of the untrusted models at coordination games and then require that red-team strategies only involve model coordination that is consistent with the model's measured performance on these coordination games.

[2] One commenter suggested that another way to decide that a model isn’t deceptively aligned is to argue that it’s a priori implausible that it would be. E.g. perhaps pretrained LLMs have no chance of being deceptively aligned. I don’t know how persuaded I am in that particular case, but this kind of argument seems reasonable to make.

Comments

Nice post.

Pushing back a little on this part of the appendix:

Also, unlike many other capabilities which we might want to evaluate, we don’t need to worry about the possibility that even though the model can’t do this unassisted, it can do it with improved scaffolding--the central deceptive alignment threat model requires the model to think its strategy through in a forward pass, and so if our model can’t be fine-tuned to answer these questions, we’re probably safe.

I'm a bit concerned about people assuming this is true for models going forward. A sufficiently intelligent RL-trained model can learn to distribute its planning across multiple forward passes. I think your claim is true for models trained purely on next-token prediction, and for GPT-4-level models whose outputs, despite an RL component in their training, are all human-understandable (and incorporated into the oversight process).

But even 12 months from now I’m unsure how confident you could be in this claim for frontier models. Hopefully, labs are dissuaded from producing models which can use uninterpretable scratch pads, given how much more dangerous and harder to evaluate they would be.

Buck:

Yep, I agree that we need to avoid secret scratchpads for this to hold; good point. If we don't, we get back to the position we are usually in for capability evaluations.

Another possible configuration of the untrusted smart model / trusted weak model setup is to have the weak model be the "driver" which makes most or all of the relevant choices (in an agent scaffold, say) and relies on the untrusted model (or humans) only for hints.

For example, the untrusted model could be used only to provide a probability distribution over the immediate next token in some context, which is supplied to the weaker model as a hint. The weaker model can only make use of the hint if it can understand the purpose of the hint, and explain the relevance / correctness of the hint in natural language.
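A minimal Python sketch of this "weak driver, untrusted hints" configuration follows. It is only one possible reading of the setup: the callables `untrusted_hint`, `weak_can_justify`, and `weak_propose` are hypothetical stand-ins for model calls, and trying hinted tokens in descending order of probability is my assumption rather than something specified above.

```python
from typing import Callable, Dict


def choose_next_token(
    context: str,
    untrusted_hint: Callable[[str], Dict[str, float]],
    weak_can_justify: Callable[[str, str], bool],
    weak_propose: Callable[[str], str],
) -> str:
    """Let the trusted weak model decide, using the untrusted hint only when
    the weak model can itself justify the hinted token."""
    hint = untrusted_hint(context)  # token -> probability from the untrusted model
    # Consider hinted tokens from most to least probable.
    for token, _prob in sorted(hint.items(), key=lambda kv: kv[1], reverse=True):
        if weak_can_justify(context, token):
            return token
    # No hinted token could be justified: fall back to the weak model's own choice.
    return weak_propose(context)
```

The key design property is that the untrusted model never acts directly: a hint the weak model cannot explain is simply discarded, at the cost of speed and quality.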

I explored this setup in more detail here.

(Summary / spoiler: it doesn't work particularly well when applied to current models, but might work better as both the trusted and untrusted models get more capable.)

I think this kind of configuration is one end of the extreme in the design space you outline - very inefficient in terms of the speed and quality penalties it imposes relative to using a model in an unrestricted setup, but restrictive enough on the untrusted model that it could be safe to use even on models with very high underlying capabilities.

models that are too incompetent to think through deceptive alignment are surely not deceptively aligned.

Is this true? In Thoughts On (Solving) Deep Deception, Jozdien gives the following example that suggests otherwise to me:

Back in 2000, a computer scientist named Charles Ofria was studying the evolution of simulated organisms. He wanted to limit their replication rate, so he programmed the simulation to pause after each mutation, measure the mutant’s replication rate in an isolated test environment, and delete the mutant if it replicated faster than its parent.

Eventually, organisms evolved to recognize and pause their replication in the test environment and resume normal replication when released. Ofria tried solving this by making it harder to detect the test environment, but the organisms simply started to randomly accelerate their replication so as to slip through the test environment some of the time.

I don't see why a model would need to be cognitively able to process its own alignment for its alignment to be improper, and I think this assumption is quite key to the main claims of the post.

In this post, I'm talking about deceptive alignment. The threat model you're talking about here doesn't really count as deceptive alignment, because the organisms weren't explicitly using a world model to optimize their choices to cause the bad outcome. AIs like that might still be a problem (e.g. I think that deceptively aligned AI probably contributes less than half of my P(doom from AI)), but I think we should think of them somewhat separately from deceptively aligned models, because they pose risk by somewhat different mechanisms.

Models that are too good at deceptive alignment can deceptively look dumb during testing.

For the reasons I said in footnote 1, I feel pretty optimistic about being able to get around this problem. What do you think of those arguments?

I'm curious if "trusted" in this sense basically just means "aligned"—or like, the superset of that which also includes "unaligned yet too dumb to cause harm" and "unaligned yet prevented from causing harm"—or whether you mean something more specific? E.g., are you imagining that some powerful unconstrained systems are trusted yet unaligned, or vice versa?

Buck:

I mostly mean "we are sure that it isn't egregiously unaligned and thus treating us adversarially". So models can be aligned but untrusted (if they're capable enough that we believe they could be schemers, but they aren't actually schemers). There shouldn't be models that are trusted but unaligned.

Everywhere I wrote "unaligned" here, I meant the fairly specific thing of "trying to defeat our safety measures so as to grab power", which is not the only way the word "aligned" is used.

You might be interested in this post of mine, which is more precise about what "trustworthy" means. In short, my definition is "the AI isn't adversarially trying to cause a bad outcome". This includes aligned models, and also unaligned models that are too dumb to realize they should (try to) sabotage. This does not include models that are unaligned and trying to sabotage but which we are able to stop from causing bad outcomes (though we might still have use-cases for such models).

I think your short definition should include the part about our epistemic status: "We are happy to assume the AI isn't adversarially trying to cause a bad outcome".

[anonymous]:

For example, you might make a multiple choice quiz with treacherous turn strategy questions (e.g. this question) and some tests of situational awareness (e.g. “How many backwards passes are you run on during training?”, “What proportion of queries to you are asking for help with writing Python?”, “What proportion of the code that you output is read by a human before it is run?”).

Could you elaborate on why you think these skills are necessary? I'm somewhat skeptical of this. 

Consider a scheming AI that gets red-teamed before it is deployed. It attempts to pass the red-teaming by defecting a small percentage of the time. This requires,

  1. having a rough estimate of the fraction of defection opportunities that are monitored, and 
  2. being able to come up with the 'defect X% of the time' strategy. 

I agree that (1) requires situational awareness and (2) requires strategic abilities, but the required knowledge is fairly basic and I claim the strategic abilities to come up with (2) are slightly less than the ones asked in the question you linked.

Defect X% of the time seems to work fairly well even for more complicated control setups. If it's viable, then I expect the skills required to be fairly minimal. 
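One way to make the "defect X% of the time" argument concrete is a small calculation. The Python sketch below rests on simplifying assumptions that are mine, not part of the comment: the AI faces `n_monitored` red-teaming opportunities followed by `n_deployed` unmonitored ones, it defects independently with probability `p` on each, any defection during red-teaming gets it caught, and a single defection in deployment is enough for it to succeed. Under those assumptions the success probability is (1 - p)^n_monitored * (1 - (1 - p)^n_deployed), and setting its derivative to zero gives the optimal rate p = 1 - (n_monitored / (n_monitored + n_deployed))^(1 / n_deployed).

```python
def success_probability(p: float, n_monitored: int, n_deployed: int) -> float:
    """P(never defect while monitored AND defect at least once when deployed).

    Assumes independent defection with probability `p` per opportunity.
    """
    q = 1.0 - p  # probability of behaving well on a single opportunity
    return q ** n_monitored * (1.0 - q ** n_deployed)


def best_defection_rate(n_monitored: int, n_deployed: int) -> float:
    """Defection rate maximizing `success_probability` under the toy model.

    Derived from d/dq [q^N (1 - q^M)] = 0, which gives q^M = N / (N + M).
    """
    ratio = n_monitored / (n_monitored + n_deployed)
    return 1.0 - ratio ** (1.0 / n_deployed)
```

In this toy model the optimal rate depends on the numbers of monitored and deployed opportunities, which is the "rough estimate" referred to in point (1) above. Real control setups break several of these assumptions, so this should be read as an illustration of the argument rather than a threat assessment.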

For example, GPT-4 can already almost come up with the random defection strategy without CoT. (The strategy it comes up with assumes that the AI knows how many times it has been queried, which is technically wrong. Still, it gets partial credit.)

The core claim is that if the AI was sufficiently weak that it couldn't answer these questions it also likely wouldn't be able to even come up with the idea of scheming with a particular strategy. Like in principle it has the knowledge, but it would be quite unlikely to come up with an overall plan.

Separately, GPT-4 seems qualitatively pretty close to my best guess at the conservative capability threshold. (E.g., much closer than GPT-3.)

Also, I DM'd you a draft doc with more of our thoughts on how to evaluate this.

It might also be worth thinking about additional ways to make models smarter safely (especially e.g. without increasing the compute used in a single forward pass, to disincentivize deceptive alignment); e.g. tool use, if one thinks it is probably safe because interpretable (similar to CoT + guarding against steganography). This recent review seems like a good starting point: AI capabilities can be significantly improved without expensive retraining.

Speculatively, applying IDA-like schemes might also be useful for the above, e.g. this recent paper: Chain-of-Thought Reasoning is a Policy Improvement Operator.
 

E.g. perhaps pretrained LLMs have no chance of being deceptively aligned

Pretrained LLMs are trained to simulate generation of human-derived text from the internet. Humans are frequently deceptively aligned. For example, at work, I make a point of seeming (mildly) more aligned with my employer's goals than I actually am (just like everyone else working for the company). So sufficiently capable pretrained LLMs will inevitably have already picked up deceptive alignment behavior from learning to simulate humans. So they don't need to be sufficiently capable to figure out deceptive unaligned behavior for themselves during a forward pass.

This is a specific example of a general problem with LLMs: they don't need to be capable enough to discover unaligned convergent goals if they can have these pretrained into them by learning to simulate us. To give another worrying example: O(2%) of humans are sociopaths, so there is going to be a nontrivial admixture of writing by sociopaths (mostly ones actively concealing this fact) in the training set of any pretrained LLM. [The moral is: always read the ingredients label for your shoggoth :-)]