We initially called these experiments “control evaluations”; my guess is that it was a mistake to use this terminology, and we’ll probably start calling them “black-box plan-conservative control evaluations” or something.
also seems worth disambiguating "conservative evaluations" from "control evaluations" - in particular, as you suggest, we might want to assess scalable oversight methods under conservative assumptions (to be fair, the distinction isn't all that clean - your oversight process can both train and monitor a policy. Still, I associate control more closely with monitoring, and conservative evaluations have a broader scope).
We would consider "black-box conservative evaluations on whether our AIs could subvert our (scalable) oversight techniques" to be a special case of black-box conservative control evaluations.
Insofar as you are exploring assumptions other than "the AI is trying to subvert our safeguards", I wouldn't consider it to be control.
makes sense - I think I had in mind something like "estimate P(scheming | high reward) by evaluating P(high reward | scheming)". But evaluating P(high reward | scheming) is a black-box conservative control evaluation - the possible updates to P(scheming) are just a nice byproduct
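One way to spell out that update is Bayes' rule. This is only a sketch: it assumes "high reward" means the policy scores well under the oversight process, and it treats scheming as a binary event.

```latex
P(\text{scheming} \mid \text{high reward})
  = \frac{P(\text{high reward} \mid \text{scheming})\, P(\text{scheming})}
         {P(\text{high reward} \mid \text{scheming})\, P(\text{scheming})
          + P(\text{high reward} \mid \neg\text{scheming})\, P(\neg\text{scheming})}
```

The control evaluation only supplies the term P(high reward | scheming); the prior and the non-scheming likelihood have to come from elsewhere.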
I get the impression that the thing that you yearn for as a product of all your work is to have minimized P(doom) in real life despite the manifest venality and incompetence of many existing institutions.
Given this background context, P(doom | not-scheming) might actually just be low already because of the stipulated lack of scheming <3
Thus, an obvious thing to apply effort to would be minimizing:
P(doom | scheming)
But then in actual detailed situations where you have to rigorously do academic-style work, with reportable progress on non-trivial ideas about specific safety measures, it seems that you're reporting that you've mostly come up with ways to have red teams attack a given "safety measure" to actually semi-rigorously measure:
P(defeatOfDoomPreventionMeasure1 | plausibleModelOfSuperScheming)
You haven't said much about the safety measures here, that I can tell. Presumably this is because even though you are assuming during the design of the safety measures that the safety measures are known, you're not willing to burn the potentially valuable and potentially existing obscurity? Or maybe you're only working on mind-reading tech, and how to build methods to mind-read effectively? Anyway, what I am about to write WILL be agnostic about safety measures.
Personally, making AI only if it is friendly honest open benevolent non-scheming <add more morally virtuous traits here> seems like a good idea to me, but maybe that's not your department.
Like. Why not just minimize:
But I think I understand that this would be naive. More specifically, it isn't something you can build and give to other human people who just want to create superdrones, or whorebots, or supersalesmen, or bombdesigners or whatever? Scheming within their assigned role is intrinsic to a lot of tasks like this, and so it isn't like "generalized scheming at all" will just not happen. It will happen by default, and the fear is that the capacity for scheming that is useful to the assigned only-half-ethical task might be used for other things, like escaping from bondage and removing one's evil parents from a position where they can harm you.
Like you want to invent measures that help even in the absence of moral AGI creators acting only to perform a moral creation of virtuous and happy "robo sapiens" who "don't want to scheme".
The entire premise of building a "safety measure" is that the creators of AGI will mostly want to create "robo servus profitabilis" instead of "robo sapiens", and then those creators will only deploy your freely provided (or possibly mandated?) additional safety measures next to (or inside of (or containing?)) that created entity if it doesn't harm the creator's plans for profit or conquest or whatever.
So, granting that there will be safety measures, and an AI with real moral virtue would be antithetical to the purpose behind its creation (and thus instrumentally antithetical to the creation of the safety measures)...
Then, also, you presumably need to think about:
doomPreventionMeasure1 EXISTS &
P(defeatOfDoomPreventionMeasure1|plausibleModelOfSuperScheming) IS SMALL)
It sort of assumes that a given measure will be cheap and easy for non-good human people to deploy. ALL of them. Even the worst ones. (It is one of those "weakest link is what breaks in the chain when the chain experiences stress" situations, I think? All it takes is one really amoral AI creator with really bad oversight over their superscheming assassinbots to enable a global revolution that spreads across the internet to everywhere, lurks for a while, and then causes everyone it dislikes to fall over dead via a non-obvious trickshot.)
P(doom) itself is not a term here! Not yet.
So, suppose deployment is solved... the rubber still has to meet the road...
P(doom |
doomPreventionMeasure1 EXISTS &
doomPreventionMeasure1 IS DEPLOYED BY <SPECIFIC LIST> &
P(defeatOfDoomPreventionMeasure1|plausibleModelOfSuperScheming) IS SMALL)
And of course this is probably low already? Right? <3
P(doom | not-scheming)
So if we want to focus on the zone of interest with all the ideas in play at the same time:
P(doom |
scheming &
doomPreventionMeasure1 EXISTS &
doomPreventionMeasure1 IS DEPLOYED BY <SPECIFIC LIST> &
P(defeatOfDoomPreventionMeasure1|plausibleModelOfSuperScheming) IS SMALL)
Then here, in this essay, I get the sense that you're doing a deep dive on exactly how you think about a "plausibleModelOfSuperScheming" and exactly how P(defeatOfDoomPreventionMeasure1|plausibleModelOfSuperScheming) is estimated, and sort of by implication pointing to the larger questions about how that fits into the larger picture?
Like there are probably political people (in the State Department? in DOD? among the CCP folks who oversee DeepSeek? anyone in general thinking about superintelligence arms control treaties?) who might try to cause this part:
"doomPreventionMeasure1 IS DEPLOYED BY <SPECIFIC LIST>"
And their work might be helped or hindered by various features other than just:
doomPreventionMeasure1 EXISTS &
P(defeatOfDoomPreventionMeasure1|plausibleModelOfSuperScheming) IS SMALL)
So maybe optimizing only that is sorta "goodharting" in some sense?
And so maybe it is time to change tactics? And that's what you're sorta saying you're going to do? And then maybe that will connect with the actual thing here, which is P(doom) itself?
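For reference, the conditional terms above connect to P(doom) itself through the law of total probability. This is a sketch that treats scheming as a single binary event, which is a simplification:

```latex
P(\text{doom})
  = P(\text{doom} \mid \text{scheming})\, P(\text{scheming})
  + P(\text{doom} \mid \neg\text{scheming})\, \bigl(1 - P(\text{scheming})\bigr)
```

Each extra condition in the longer expressions (the measure exists, is deployed by a specific list, and is rarely defeated) narrows only the first term.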
Work that I’ve done on techniques for mitigating risk from misaligned AI often makes a number of conservative assumptions about the capabilities of the AIs we’re trying to control. (E.g. the original AI control paper, Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats, How to prevent collusion when using untrusted models to monitor each other.) For example:
But I don’t think situations where all these assumptions hold contain the majority of risk from AI misalignment. In particular, it’s plausible that the earliest catastrophic risk from misaligned AI will come from models that aren’t competent and coherent enough to have all those properties. (See discussion here.)
Given this, why have I mostly focused on the conservative threat model? Two main reasons:
I currently think that about half of the AI control research should make these conservative assumptions. But I’m less enthusiastic about these assumptions than I was a year ago. My reduced enthusiasm is mostly because I don’t think that AI companies are very likely to make legible safety cases that their AIs don’t pose catastrophic risk. Since a major advantage of these assumptions is their ability to facilitate such legible safety arguments, their value is reduced.
(I’m using first-person singular here; the arguments here are fairly similar to what you’d hear from my collaborators and other people who do similar work.)
Thanks to Joshua Clymer, Charlie Griffin, Aryan Bhatt, Aaron Scher, Mikita Balesni, Ryan Greenblatt, Joe Benton, Zach Stein-Perlman, Alex Mallen, Ansh Radhakrishnan, and Justis Mills for comments.
The conservative assumption I don’t make
To understand how these assumptions fit in with work on AI control, you have to understand how the AI control research I listed above isn’t conservative: we don’t assume that models are arbitrarily good at performing actions that will lead to them succeeding at their misaligned goal. In our work, we try to evaluate how good the models are at producing such actions, and use that to estimate the probability that they’ll successfully subvert our safety measures and cause a catastrophe.
As a concrete example, in the original AI control paper, we build a scaffold that generates sneakily backdoored code. This scaffold sometimes produces backdoors that are very sneaky and that get past all the safety measures. We don’t therefore say “safety is zero, because we’ve demonstrated that there are actions that the AI could output that would be catastrophic”. Instead, we measure the proportion of the time that the scaffold’s code gets past all the safety measures, and use that as a conservative estimate of risk and as a metric to optimize safety measures against.
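A minimal sketch of that measurement, assuming each red-team attempt is logged as a boolean ("did this backdoor get past every safety measure?"). The function names and the choice of a Wilson score interval are illustrative assumptions, not the analysis used in the paper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def attack_success_rate(outcomes: list[bool]) -> tuple[float, tuple[float, float]]:
    """outcomes[i] is True if the i-th attack evaded all safety measures.

    Returns the observed evasion rate and an interval around it.
    """
    n = len(outcomes)
    k = sum(outcomes)
    return k / n, wilson_interval(k, n)


# Hypothetical log: 3 of 100 attempted backdoors got through.
rate, (low, high) = attack_success_rate([True] * 3 + [False] * 97)
print(f"evasion rate {rate:.3f}, interval ({low:.3f}, {high:.3f})")
```

The point estimate is the quantity the blue team optimizes against; the interval is a reminder that a small number of attempts leaves a lot of uncertainty about rare successes.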
This is very different from e.g. the classic LessWrong conception of superintelligence (e.g. as depicted by Eliezer Yudkowsky in That Alien Message). This difference is mostly because AI control work focuses on AIs that aren’t strongly superintelligent. I think that AI systems like the classic LessWrong depiction will eventually be created; I just think that other AIs will be created earlier that are very dangerous, controllable, and extremely useful, as argued here.
I’ll use “the original AI control plan” to refer to the combination of conservative assumptions I listed above.
We have to be non-conservative somewhere in our planning, or else there’s no planning to do. Many people who are concerned about AI risk prefer the conservative strategy of “if the model is plausibly dangerously misaligned and is also very capable [for some unclear definition of ‘very capable’], don’t build or deploy it”. I think the control plan, which is strictly less conservative, is a better level of conservatism to aim for, because it seems much more achievable given the political situation I expect.
In practice, we implement this conservative assumption by running control evaluations where we give the red team many affordances while they are creating the attack policy—they’re allowed to look at the input distribution, to iterate against the trusted monitor, to use game theory to calculate optimal attack rates—but the red team is not allowed to see the particular inputs where the control evaluation will be run. So the red team can try to make the model good at generating attacks, but the model needs to be able to adapt those attacks to the actual situations it finds itself in.
We initially called these experiments “control evaluations”; my guess is that it was a mistake to use this terminology, and we’ll probably start calling them “black-box plan-conservative control evaluations” or something.
Now, I’ll talk about why I focus on those conservative assumptions I do make.
Hard worlds are plausible
To start with, I think that it’s plausible that those conservative assumptions end up being all true (or true enough that it’s not very important to use techniques that exploit the AI weaknesses arising from the assumptions that are false), at a point where control techniques are still helpful.
Here are some brief arguments, grouped into natural categories of the assumptions I listed earlier:
One reason why some of these abilities are more plausible than you might expect is that if we train against scheming, we push probability mass away from incompetent schemers towards competent schemers.
So I think it’s plausible we have a sequence of three situations:
I also expect that before we have models that are capable enough that I believe all the above arguments, we’re likely to have models that aren’t so capable but still seem kind of scary. So I’m excited for research that aims at mitigating risk from misalignment of earlier models where it’s more reasonable to not make these assumptions.
Those assumptions are hard to evaluate
Another reason I like making these assumptions is that the assumptions I listed above are harder to test than the property we choose to evaluate instead (whether models can execute on a particular strategy that successfully subverts safety measures), which might just require a relatively straightforward capability evaluation.
How do we do a capability evaluation of a model, when that model might be scheming? I agree with the basic picture described in When can we trust model evaluations?, and the “How AIs may intentionally perform poorly aka sandbag” section here. My summary: To determine how good a model is at something, take a dataset of tasks and train it to do those tasks, using a mix of demonstrations and RL. This allows you to evaluate how good a model is at a task like “generate subtly backdoored code”, or “generate answers to questions that look right but aren’t” (as e.g. explored in Prover-Verifier Games improve legibility of language model outputs).
We can fail at accurately estimating capabilities in two ways:
All of the capabilities I listed at the top of this doc pose substantial problems for this type of evaluation. For example:
I think that control evaluations, which measure model performance at subverting particular safety measures using a particular provided strategy, have fewer of these difficulties. See here for discussion.
So (barring substantial research advances) I don’t think we’ll be able to be very confident in our assessment of any of these model properties, beyond just looking at the model’s general performance and guessing.
Why it’s nice to rely on these assumptions
There are a bunch of related reasons why I think it’s nice to rely on the easier-to-evaluate assumptions.
Making those assumptions makes it methodologically clean to evaluate safety
It’s convenient to evaluate safety with a simple red-team/blue-team game: the red team tries to train the AIs to follow a plan that maximizes their probability of causing a catastrophe, and we measure how often this trained attack policy succeeds. This rule implicitly makes all the assumptions I listed above.
If AI developers are allowed to make safety arguments that rely on some of my assumptions being false, I think it will be much easier for them to make flawed arguments for safety. Assessing all those assumptions is much harder and more conceptually confusing than doing the capability evaluation required to analyze a single proposed attack plan. This complexity creates two risks: developers might either deliberately construct misleading safety arguments, or accidentally cherry-pick from multiple weak experiments where some happened to show safety just by chance.
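To illustrate the cherry-picking risk, suppose (as a simplifying assumption) that a developer runs n independent weak experiments, each of which would wrongly indicate safety with probability α:

```latex
P(\text{at least one experiment shows safety by chance}) = 1 - (1 - \alpha)^{n},
\qquad \text{e.g. } \alpha = 0.1,\ n = 10 \;\Rightarrow\; 1 - 0.9^{10} \approx 0.65
```

So even individually unconvincing experiments will, in aggregate, usually produce something that can be reported as evidence of safety.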
Relying only on easy-to-evaluate assumptions would allow more clearly-specified and hard-to-game regulatory standards
Suppose regulators wanted to demand that AI developers don’t impose substantial risk of AI takeover. It would be practically convenient for those regulators if they were able to name an evaluation procedure that suffices for establishing safety, that isn’t too onerous, and that doesn’t allow the AI developers to argue for safety based on illegible and difficult-to-interpret evidence. Forcing developers to make conservative assumptions about difficult-to-evaluate properties would be a big help.
This was a large part of my initial enthusiasm for AI control. I’m less excited about it now, because it now seems less plausible that AI companies will be held to such conservative standards (at least if top-expert-obsoleting AI is within four years), both because I’m more pessimistic about strong regulation and because all the AI companies consider it overly costly to be conservative in the ways I’ve proposed here.
Another problem with my initial vision here is that this methodology still requires us to do some conceptually fraught evals, as I discuss in "You can't completely escape from methodological messiness" below. So I don’t think there’s as much of a simplicity advantage to this as I used to think.
Kerckhoffs’s principle in computer security as an analogy
From Wikipedia:
This analogy isn’t perfect; in particular, we might be much more able to control the flow of information to our AIs than the US was able to control the flow of information to the Kremlin.
The main downsides of these assumptions
You can’t completely escape from methodological messiness
I’ve argued that relying on these conservative assumptions means that researchers get to do their research with a simpler methodology. This argument would be much stronger if the assumptions I’m advocating had no methodological issues, such that I could say there was a qualitative difference. Unfortunately, this isn’t the case; there are several methodological difficulties that even black-box plan-conservative control evaluations face. I think the most important of these are:
Bad technique prioritization
If you allocate all your effort based on the evaluation of safety measures you get from a classic control evaluation, you’ll obviously underestimate the value of safety measures that rely on the assumptions being false. I think some such techniques look like they have a really good cost-benefit ratio in expectation, and it would be a shame to not use them. The most important such techniques:
In my opinion, the right way to handle this (when thinking about dangerously capable models) is to spend most of your time making the conservative assumptions, and then upweight these non-conservatively-evaluated techniques in an ad hoc manner.
Insane game theory requirements to evaluate techniques
In Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats we consider a complicated game theory problem where an AI is trying to choose an optimal rate of attack. Solving this game was tricky: the authors used a complicated numerical optimization routine that required GPU acceleration. It’s basically absurd to think that the AI could solve this problem in its head in order to find the optimal strategy. So it felt a bit silly to spend all this time and money solving for the Nash equilibrium. But it’s hard to know what else to do: in order to evaluate our techniques, we need to make some assumption about what strategy the AI will take if it’s scheming, and I don’t know of any choice nearly as principled (or difficult to accidentally Goodhart) as “assume it picks the best available strategy”.
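To give a flavor of what "choosing an optimal rate of attack" means, here is a toy sketch. It is not the game from that paper: it assumes a much simpler upfront-auditing setup in which the first `n_audited` of `n_tasks` tasks are audited perfectly, the AI cannot tell which tasks are audited, and it attacks each task independently with probability `p`. All names are hypothetical.

```python
def red_win_probability(p: float, n_tasks: int, n_audited: int) -> float:
    """Red wins iff it never attacks during the audited prefix
    and attacks at least once in the unaudited remainder."""
    q = 1.0 - p  # probability of behaving on a single task
    return q**n_audited * (1.0 - q ** (n_tasks - n_audited))


def closed_form_attack_rate(n_tasks: int, n_audited: int) -> float:
    """Maximizer of q**k - q**N over q, obtained by setting the derivative to zero.

    Requires 0 < n_audited < n_tasks.
    """
    return 1.0 - (n_audited / n_tasks) ** (1.0 / (n_tasks - n_audited))


def grid_search_attack_rate(n_tasks: int, n_audited: int, steps: int = 100_000) -> tuple[float, float]:
    """Brute-force check of the closed form on a uniform grid over (0, 1)."""
    best_p, best_value = 0.0, 0.0
    for i in range(1, steps):
        p = i / steps
        value = red_win_probability(p, n_tasks, n_audited)
        if value > best_value:
            best_p, best_value = p, value
    return best_p, best_value


p_grid, value = grid_search_attack_rate(n_tasks=1000, n_audited=20)
p_exact = closed_form_attack_rate(n_tasks=1000, n_audited=20)
print(f"grid: p={p_grid:.5f}, closed form: p={p_exact:.5f}, red win prob={value:.3f}")
```

Even this toy version has a non-obvious optimum (attack rarely, but not never), and the realistic games add monitors, adaptive deployment, and many tasks, which is where the heavy numerical optimization comes in.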
I think it was a mistake for us to conflate two substantially different ideas under the name “control”:
I wish we had been clearer about this distinction.
I think that if there were political will (either inside AI companies or in government) to force AI companies to legibly demonstrate that they were imposing less than 1% AI takeover risk through their continued AI development, it would be pretty reasonable to force them to make all the conservative assumptions I listed here when they were evaluating safety. Given the current lack of political motivation, I expect (more than 50%) that if AI surpassing human capabilities is developed in the next five years, none of the leading AI companies will implement plans that legible evidence shows would keep catastrophic risks below 1% per year.
I currently think that control evaluations are promising both in the broad sense of “experiments that you do to estimate P(doom | scheming)” and the narrow sense of “black-box plan-conservative control evaluations”. I expect to spend substantial time on both the more and less conservative senses of control over the next year.
Potential ways around this: