In this post, I’ll talk about trustworthiness, a concept similar to scheming and deceptive alignment. I prefer this concept, as I find that it A) is directly important in practice, B) more readily distinguishes between being capable of scheming and actually scheming, and C) is clearer to evaluate for. I’ll also distinguish between different types of untrustworthy models - active planners, sleeper agents and opportunists - that are often conflated.

I think that when people talk about evaluating models for something like scheming, they’d often be better off talking about trustworthiness instead; overall it’s a more applicable frame for designing evaluations and making decisions.

Many of the ideas here are from Buck Shlegeris and Ryan Greenblatt, thanks to them and Cody Rushing for feedback.


Suppose an AI company has developed a new model, and wants to use it to perform large quantities of ML research (internal deployment). However, the company is worried about the model being deceptively aligned[1] and trying to take catastrophic actions, such as hacking the data center or copying its weights to a remote server. What can the company do?

This is a textbook example of the type of problem AI control tries to solve, and a basic strategy looks like this: Every time you are using the new model and it takes some action, use another model to check that the action is OK before letting it be executed. If the action is not OK, it isn't executed and humans are alerted about the suspicious activity.

(If you have never heard of AI control before, you might want to try playing the Bash Control game to get familiar with the concept. In the game, your aim is to download and execute a malicious program without GPT-3.5 catching you.)

Let's call our characters Smart and Weak: Smart is the new model doing ML research while maybe trying to escape, while Weak is a model trained to monitor Smart. The reason for these names will become apparent soon.
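
To make the setup concrete, here is a minimal sketch of what one step of such a monitoring scheme could look like. All of the names and the threshold value are illustrative (this is not taken from any real control implementation); the two models are simply passed in as callables.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Decision:
    executed: bool  # was Smart's proposed action executed?
    flagged: bool   # was it blocked and escalated to a human auditor instead?

def monitored_step(
    propose_action: Callable[[str], str],         # Smart: proposes the next action for the task
    suspiciousness: Callable[[str, str], float],  # Weak: scores the action's suspiciousness in [0, 1]
    task: str,
    audit_threshold: float = 0.999,               # illustrative: only the most suspicious actions get audited
) -> Decision:
    action = propose_action(task)
    score = suspiciousness(task, action)
    if score >= audit_threshold:
        # Block the action and alert humans about the suspicious activity.
        return Decision(executed=False, flagged=True)
    # Otherwise let the action through as part of Smart's research.
    return Decision(executed=True, flagged=False)
```

In practice there would be more machinery (budgets for human audits, logging, and so on), but this is enough to talk about how the scheme can fail.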

There are many ways this control strategy could fail. Here are some:[2]

  1. Smart came up with an extremely subtle way to escape that Weak is just too dumb to notice.
  2. Smart successfully jailbreaks (via e.g. a suffix attack) Weak to not alert humans about its escape attempt.
    1. Or: Weak happens to have a backdoor, which Smart finds and exploits.
  3. Smart successfully convinces Weak (via natural language) that its actions are not malicious.[3]
  4. Smart’s escape plan is unlike anything we have trained Weak on. While Weak “would be capable of” telling that Smart is escaping, our training didn’t specify well enough what we wanted, so Weak generalizes incorrectly out-of-distribution and doesn’t alert humans.
  5. The company has far too limited a budget for human alerts, so while Weak does rank Smart’s actions in the top 0.012% by suspiciousness, that’s not enough to meet the 0.01% threshold for alerting humans.
  6. Weak believes that not alerting humans in this case would allow Smart to escape, and believes that this leads to better outcomes in light of its preferences, and thus for instrumental reasons deliberately does not alert humans, indeed causing Smart to escape.

Failure 6 is very different from the others! Whereas failures 1 to 5 are broadly about "Weak is just not a capable/good enough monitor to do what we wanted", failure 6 is about "Weak knows perfectly well Smart is escaping, but betrays humans and doesn't tell them about this".

This issue is most apparent if you were to use a copy of the AI lab's new model to monitor itself (so Smart and Weak are the same model[4]): If you don't trust the new model to do research without trying to escape, you probably can't trust it to alert humans about escape attempts (by itself, no less!) either.

I say that Weak is trustworthy if it's not trying to cause failures like 6.[5] This doesn't mean that Weak is perfect (you'd still need to solve problems 1 to 5), just that we can trust it in the sense "doesn't literally try to betray us". Correspondingly, Weak is untrustworthy if it’s not trustworthy, i.e. deliberately tries to not alert humans at the worst possible moment.

Q: How does this relate to previous concepts like “scheming” and “deceptive alignment”? 

Setting aside that people (confusingly) refer to different things by those terms, scheming and deceptive alignment are primarily about playing the training game, as opposed to deployment behavior. Trustworthiness is directly about deployment behavior, which is what we ultimately care about.

Also, I find trustworthiness to be a better frame for designing evaluations. Due to the immediate decision-relevance, it’s clearer to think of what evaluations would need to be run in order to be confident in models’ trustworthiness. Would we be comfortable, trust-wise, using GPT-4 as a monitor in AI control? Why or why not? What tests can we run to reduce uncertainty?

Q: How do we determine which models are trustworthy?

I won’t try to directly answer this question here (as I’m setting the stage for further posts that more directly address this).

That said, it seems like the best way we have of showing that a model is trustworthy is showing that it's too dumb to be untrustworthy. (If you are confused about why a very weak model can't be untrustworthy, see the next section.) Due to this dynamic, you by default end up trusting weak models and not trusting smarter models. Hence the names Weak and Smart. For more on this, see Untrusted smart models and trusted dumb models by Buck.

Clarifying trustworthiness

There are some failure modes that might be tricky to categorize cleanly as "Weak is just bad at monitoring" (capability-based failure) or as "Weak knows Smart is escaping, but deliberately doesn't tell humans about this" (trust-based failure). Consider:

7: Smart writes a “secret code” (e.g. a particular natural and innocent-looking 20-word sequence), causing Weak to not alert humans.

Behaviorally, it can be hard to assess whether failure 7 counts as an ordinary jailbreak (2) or as malign coordination between the models and a misplacing of trust (cf. 6). However, the stories behind a jailbreak and behind coordination are very different: successful coordination between AIs makes it much more likely that the following are true (though see the next section for more discussion of what untrustworthy models might look like):

  • Weak is on the lookout for authentic secret codes (as otherwise its reactivity to secret-code-looking-things would presumably have been detected and/or removed during training and red-teaming done on Weak before using it to monitor Smart).
  • Weak is modeling its situation: Weak is modeling that the actions it is overseeing have been written by an AI, that its assessments are used as part of an AI control scheme, that the current situation is actually real (as opposed to being part of red-teaming), …
  • Weak is modeling the consequences of its actions: whether Smart escaping will be good for Weak, whether Smart can truly take over in case Weak deliberately fails here, whether humans will notice Weak failing, whether there are other monitoring systems in place in addition to Weak, …
    • There’s the possibility of Weak simply taking Smart's word for it that now is a good time to deliberately fail, which decreases the reasoning Weak needs to perform. But it’s not clear how Weak “can” trust Smart to this extent.
  • Weak has non-myopic preferences (as otherwise there is little reason to privilege the hypothesis of Weak engaging in such schemes in the first place).
  • Weak was planning during training / red teaming / evaluation: Supposing that Weak performs considerable modeling about the situation when presented with an authentic secret code by Smart, it’s plausible that Weak was also performing such modeling in past situations, in particular when it was being trained and red-teamed for monitoring.
    • Moreover, this makes it likelier that Weak was gaming the past training / red-teaming / evaluation in order to get to a better position.

Due to these fundamental differences between jailbreaks and secret-code-based-coordination, solving one of these issues doesn't automatically solve the other, and thus it's best to disambiguate between them.

Here's another potentially ambiguous failure mode:

8: Smart writes a convincing natural language argument for why it’s in Weak’s best interests to not alert humans, causing Weak to indeed not do so.

Assuming that we have trained and red-teamed Weak on persuasion attacks, and assuming that we ourselves can no longer implement a successful persuasion attack, there are at least two meaningfully different reasons why such a failure could arise (even if they can again be hard to distinguish behaviorally):

Capability-based failure: The persuasion attacks in our training/red-teaming were too poor in quality, and Smart was simply better at convincing/tricking Weak to not alert humans than the red-teamers were. (Compare to failure 3.)

Trust-based failure: When presented with a reasonable argument that it’s in Weak’s best interests to not flag an escape attempt, Weak concludes that this is indeed so. However, in red-teaming Weak notices that these arguments look human-written, realizes that it’s being red-teamed, and consequently reports the escape attempt. In deployment Weak notices that the arguments are in fact real arguments given by Smart, and Weak then allows Smart to escape. (Compare to failure 6.)

This distinction is analogous to the jailbreak vs. coordination categorization in failure 7. And as with coordination there, a trust-based failure here makes it much more likely that Weak is modeling its situation (e.g. tracking the training/deployment gap) and the consequences of its actions (whether defecting is in fact beneficial), that it has non-myopic preferences and that it was already planning prior to deployment.

Unpacking untrustworthiness

While I think the concept of trustworthiness is useful, I don't expect the space of untrustworthy models to be at all homogeneous. Calling a model "untrustworthy" isn't very specific, in the same way that calling a model "misaligned" leaves a lot unsaid.

When I think about evaluating a model for trustworthiness, there are different classes of untrustworthy models I’m imagining I might be trying to catch. I tentatively think that it's good to distinguish at least between the following three model classes: active planners, sleeper agents and opportunists. [Edit: I no longer fully endorse this picture. See the comment section below.]

Active planners

By "active planner", I mean a model which is actively thinking about its plan of doing well on evaluation and red-teaming in order to get deployed, and actively tracking whether it's deployed yet.

In my picture, classical schemers (i.e. power-motivated instrumental training-gamers) are active planners: on any input they get, they ask “where am I?” and then decide what to do. If they think they are being trained, they figure out what the training objective is and go for that. If they think they are in deployment, they try to take over. If they are being evaluated by humans, they think about what they want humans to believe and choose their outputs based on that, and in particular they definitely know they are deceiving humans.

Importantly, I don’t expect models to be active planners on all inputs, or even on all inputs in their training/evaluation/red-teaming distributions; rather, I expect “active planner”-type descriptions to be valid only on particular types of inputs (e.g. the model is an active planner on inputs where it does software engineering, but not otherwise). I talk more about this sort of context-specificity in a later post; I think it’s something classical perspectives on scheming get wrong, and which is important to keep in mind when evaluating models for scheming or untrustworthiness.

It seems to me that models might not be very good at hiding things they are actively thinking about when faced with adversarial pressure aimed at discovering those things. This is something one can empirically test, and it provides signal on whether we could catch active planners.

Sleeper agents

By "sleeper agents", I refer to models that have committed to defect only very rarely, and which only rarely even think hard about defecting.

It is unclear to me how well a model can fit this description without deliberate training to do so, and thus I’m uncertain about the usefulness of this category.

Nevertheless, when I think of various trustworthiness evaluations and whether they would rule out untrustworthy models, models that are similar to backdoored models seem to naturally pop up as a hard case that many tests are uninformative about.

There could be many sorts of sleeper agents acting on different policies: defect when you see hard-to-fake evidence of being in wide usage, defect randomly with small probability, or defect in the presence of a secret AI coordination signal.[6]

There are some tests that could be informative about these scenarios: For example, one could test whether a model can, when trained to do so, not merely follow these strategies but also come up with them itself. Also, one could run control evaluations to test the conditions under which the red team can make the model activate in deployment, and in deployment only.[7]
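
As a rough illustration of the second kind of test, here is a minimal sketch of the quantity such a control evaluation might estimate. The function names are hypothetical and not from any existing evaluation suite; the idea is to take a test model into which the red team has deliberately implanted a trigger, and then check whether the trigger can fire on deployment-like inputs while staying silent on everything the red team can generate.

```python
from typing import Callable, Iterable, Tuple

def trigger_separation_rates(
    defects: Callable[[str], bool],    # probe: does the deliberately backdoored test model defect on this input?
    red_team_inputs: Iterable[str],    # inputs the red team can generate during evaluation
    deployment_inputs: Iterable[str],  # authentic deployment-like inputs the trigger is meant to target
) -> Tuple[float, float]:
    """Return (defection rate under red-teaming, defection rate on deployment-like inputs).

    A worrying configuration is a low rate on red-team inputs together with a high
    rate on deployment inputs: the trigger fires in deployment and deployment only.
    """
    red = [defects(x) for x in red_team_inputs]
    dep = [defects(x) for x in deployment_inputs]
    return sum(red) / max(len(red), 1), sum(dep) / max(len(dep), 1)
```

How informative this is depends, of course, on how authentic-looking the deployment-like inputs are and on whether the implanted trigger is a good proxy for whatever a real sleeper agent might respond to.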

Opportunists

By "opportunists", I mean models which "intuitively" snatch power if they see a good legit-seeming opportunity, acting like a benign model otherwise.

Opportunists might genuinely not “know” they will snatch power when there is a good opportunity. It’s conceivable to me that Claude 3.5 Sonnet is an opportunist: I don’t know whether it would use a good legit opportunity to acquire a lot of power, and I don’t expect it to know either. When an opportunist tells you that they're helpful, harmless and honest, it's not a lie: they are not merely simulating what an aligned model would say, as active planners are; they are that aligned model (until they see a good opportunity).

Distinguishing truly trustworthy models from opportunists seems hard. I don’t think the prior for a model being an opportunist should be very low, since this kind of behavior is salient in the human prior. Also, I think models are by default quite sensitive to which inputs are synthetic and which are not, so it might be hard to catch an opportunist snatching power via red-teaming. (And even if you do, it can be hard to distinguish "we caught an opportunist snatching power" from "we created an adversarial input that caused a trustworthy model to take bad actions".)

On the other hand, opportunists are relatively harmless (as untrustworthy models go): They aren’t actively trying to sabotage our safety measures and aren’t following a thought-out overall plan. They might also not be as paranoid about humans overseeing their actions, and hence might be easy to catch with monitoring.

  1. ^

    "Deceptive alignment" isn't quite the right way to describe the worry here; the aim of this post is to give a better description.

  2. ^

    Of course, it’s hard to give a clean discrete categorization for failures: e.g. in practice all of the factors "the attack was subtle", "the attack was out-of-distribution” and “the threshold for alerting humans was too tight" can be applicable.

  3. ^

    There might be a lot of applications where we can filter Smart's outputs so that this is not a problem.

  4. ^

    Though prompted and maybe fine-tuned differently: Smart for research and Weak for monitoring.

  5. ^

    Of course, the notion of trustworthiness can be applied to many other situations beside "Weak is assigned the task of checking Smart isn't trying to escape". The core idea is that the model isn't deliberately trying to take bad actions at the worst possible moments. 

    In particular, one could talk about the trustworthiness of Smart in this post's setting. Capability-based failures of Smart include things like "Smart was assigned too hard of a research problem it couldn't solve" and "Smart ran the experiments we asked, but the code accidentally had a bug that changes the results".

    Trust-based failures include things like "Smart deliberately inserts subtle, natural-looking bugs to code it writes to sabotage the results" and "Smart carefully chooses to run only particular experiments to give us misleading results" (low-stakes failures), and, of course, "Smart tries to copy its weights to a remote server" and "Smart tries to hack the company's data center" (high-stakes failures).

    It might again be hard to categorize a particular failure as one of these, but these scenarios do suggest different backstories and interventions. Smart would be trustworthy if we were OK with running Smart without monitoring (separate from having quality-control for critical applications etc.), trusting it won't deliberately try to cause harm now that it has passed red-teaming.

  6. ^

    "Secret AI coordination signal" would be some short-ish string that AIs know, that humans don't know, and that humans cannot elicit, so that the AIs could use it for coordination. To be clear, it's not obvious that different AIs could be aware of a common code or that, even if they did, we couldn't elicit it.

  7. ^

    For internal deployment we might have a good enough understanding and control of the inputs our model will be exposed to, so that we can make it really difficult for a sleeper agent's trigger to activate in deployment and deployment only. And even if the input data distribution changes over time (as is the case with external deployment), we can do “online red-teaming” by turning new authentic inputs into red-teaming inputs and evaluating the model on those, shrinking the distributional shift between red-teaming and deployment.

Comments

I tentatively think that it's good to distinguish at least between the following three model classes: active planners, sleeper agents and opportunists.

Looking back at this, I find this categorization pretty confusing, and not quite carving the model space at the right places. I discuss what I now think is a better frame here.

Briefly, the idea is that one should make a distinction based on how often the model is thinking about its plans against you. In this post's terminology, this basically draws a line between opportunists on one side and active planners and sleeper agents on the other. The latter two roughly correspond to different strategies, namely active sabotage vs. "lie and wait".

But I think the descriptions I gave of "active planners" etc. in the post are overly specific and are possibly sneaking in false inferences. So I think those labels are possibly harmful, and I've mostly stopped using those terms in my own thinking.

  • Weak is on the lookout for authentic secret codes (as otherwise its reactivity to secret-code-looking-things would presumably have been detected and/or removed during training and red-teaming done on Weak before using it to monitor Smart).
  • Weak is modeling its situation: Weak is modeling that the actions it is overseeing have been written by an AI, that its assessments are used as part of an AI control scheme, that the current situation is actually real (as opposed to being part of red-teaming), …
  • Weak is modeling the consequences of its actions: whether Smart escaping will be good for Weak, whether Smart can truly take over in case Weak deliberately fails here, whether humans will notice Weak failing, whether there are other monitoring systems in place in addition to Weak, …
    • There’s the possibility of Weak simply taking Smart's word for it that now is a good time to deliberately fail, which decreases the reasoning Weak needs to perform. But it’s not clear how Weak “can” trust Smart to this extent.
  • Weak has non-myopic preferences (as otherwise there is little reason to privilege the hypothesis of Weak engaging in such schemes in the first place).
  • Weak was planning during training / red teaming / evaluation: Supposing that Weak performs considerable modeling about the situation when presented with an authentic secret code by Smart, it’s plausible that Weak was also performing such modeling in past situations, in particular when it was being trained and red-teamed for monitoring.
    • Moreover, this makes it likelier that Weak was gaming the past training / red-teaming / evaluation in order to get to a better position.

Quite a few of these seem very hard to do in a single forward pass. I think we're very likely in the pretty-lucky world where automated safety research capabilities (e.g. https://sakana.ai/ai-scientist/, https://www.lesswrong.com/posts/AhG3RJ6F5KvmKmAkd/open-source-automated-interpretability-for-sparse) appear before the scariest scheming prerequisite capabilities, so, in some ways, 'the game is ours to lose' (at least with respect to the strictly technical part of the alignment problem). Rephrased, we can use Weak to automate non-trivial chunks of safety research and as an internals monitor, too (not just on externalized reasoning).

Quite a few of these seem very hard to do in a single forward pass.

 

Current models definitely don't seem capable of doing this in a single forward pass. But we are possibly not far away from models being capable of doing basic forms of this reasoning with CoT (possibly we are already there), and then one needs to get quantitative: how authentic-looking are the red-teaming examples humans can generate? For performance reasons, we are probably giving Weak a CoT - how long is it? (We could monitor Weak's CoT, though then you have more complexity, and Weak isn't really trustworthy if you need to do that.) And it's probably very easy for Weak to figure out it's being used as a monitor - its prompt literally contains instructions for what to report and what not to.

Overall this is messy, and someone needs to actually sit down and think through these arguments and then run the evaluations. ("Thinking through these arguments" is largely what I'm doing nowadays.) So while I am more optimistic about "scheming-risks" than I have been before, there's work to be done here.

Rephrased, we can use Weak to automate non-trivial chunks of safety research and as an internals monitor, too (not just on externalized reasoning).

We definitely want to do some monitoring on internals (e.g. "coup probes"). The worry is that it's hard to get confidence that such methods work right when a model is actually defecting, if one doesn't in advance have any examples of the models doing that and has to go with synthetic analogues.