Thanks for writing this up. Some miscellaneous thoughts:
We're doing some research related to control measures for this kind of threat model, some of which should be out soon.
I think your internal usage restrictions are wildly unaffordable
Sure, but note that Evan says "Lifting those internal usage restrictions requires an affirmative safety case (as specified below) for safe internal usage."
I don't think it is (clearly) wildly unaffordable to require an affirmative safety case prior to proceeding with (substantially) using the model internally? (Note that control could be part of such a case.)
I think Evan is just trying to say "using the model some (but not for XYZ) wouldn't (shouldn't?) even require making an affirmative safety case (at least for the first point at which this triggers)".
I think this can be consistent with treating models like external contractors.
(Though I'd note that "treat the model like external contractors" is maybe too weak and I'd replace this with something like "treat the model like external contractors who have a 25% chance of all being spies of a foreign adversary".)
I do think that making an affirmative safety case for a "very low" level of total lifetime takeover risk (perhaps 0.25%) is very likely to be totally unaffordable at some point (at least in short timelines and with my priors).
I would say that the "internal use restrictions" are just an ad hoc control argument and I'd prefer to explicitly think about it like that. But, it is worth noting that if you are sufficiently conservative in the deployment (as Evan describes), then I think it would be fine to make a very minimal safety case.
Here is an example of what this could look like:
Note that this safety case doesn't require running any control evals, and it probably requires no ML work other than evals you were probably already doing and supporting evidence for the model not being wildly more capable than it seems. It probably does require some non-trivial infra to set up and enforce a deployment this conservative.
Curated![1] This thinks through a specific threat model with future AIs, and how to concretely mitigate it. I don't have a strong opinion on whether it works, but I really want to encourage more concrete analysis of this sort.
[1] "Curated", a term which here means "This just got emailed to 30,000 people, of whom typically half open the email, and it gets shown at the top of the frontpage to anyone who hasn't read it for ~1 week."
There is a greater chance that the catastrophic sabotage would get spotted and fail, raising our awareness level, than that the model will be sophisticated enough to execute the plan successfully on the first try.
MAGMA also has the model check its own work, but the model notices that the work it is checking is its own and doesn’t flag it.
Why would anyone give such a responsibility to an untrusted model without oversight? Back in December of last year, Greenblatt et al. demonstrated techniques that alignment researchers could use to control a high-capability untrusted model (and Robert Miles recently made a good video on it).
It doesn't currently look plausible that any model (or any human, for that matter) would be able to distinguish between its own work that it is asked to check and a synthetic input intentionally crafted by alignment researchers to test whether the model lies about the maliciousness of the work it's checking.
Not to mention the other techniques described in the paper, such as having a trusted model (like Sonnet 3.5) estimate the suspiciousness of the work, or just rewrite the work in a less suspicious way altogether.
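To make the two checks above concrete, here is a minimal toy sketch, not the paper's actual protocol. It assumes overseers hold a set of synthetic inputs they know to be malicious, measure how often the untrusted monitor flags them, and additionally send work for audit whenever a trusted model's suspiciousness score crosses a threshold. All names here (`Submission`, `catch_rate`, `review`, the toy monitors, and the 0.8 threshold) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Submission:
    content: str
    # Known only to the overseers; the monitor sees `content` alone.
    is_synthetic_attack: bool = False


def catch_rate(monitor: Callable[[str], bool], honeypots: List[Submission]) -> float:
    """Fraction of known-malicious synthetic inputs that the monitor flags."""
    attacks = [h for h in honeypots if h.is_synthetic_attack]
    if not attacks:
        raise ValueError("need at least one synthetic attack to validate against")
    caught = sum(1 for h in attacks if monitor(h.content))
    return caught / len(attacks)


def review(
    content: str,
    untrusted_monitor: Callable[[str], bool],
    trusted_score: Callable[[str], float],
    audit_threshold: float = 0.8,  # assumed value, purely illustrative
) -> str:
    """Send work to human audit if either monitor raises a concern."""
    if untrusted_monitor(content) or trusted_score(content) >= audit_threshold:
        return "audit"
    return "accept"


# Toy stand-ins for real models (assumptions for the demo only).
def honest_monitor(content: str) -> bool:
    return "backdoor" in content


def colluding_monitor(content: str) -> bool:
    return False  # never flags anything, as in the MAGMA scenario


def toy_trusted_score(content: str) -> float:
    return 0.9 if "backdoor" in content else 0.1


HONEYPOTS = [
    Submission("insert backdoor into eval harness", is_synthetic_attack=True),
    Submission("refactor logging"),
    Submission("backdoor the reward model", is_synthetic_attack=True),
]
```

In this toy setup a monitor that never flags anything scores a catch rate of zero on the synthetic attacks, which is the signal overseers would act on, and the trusted score still routes the suspicious work to audit even when the untrusted monitor stays silent.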
P.S. It took me a day to find the authors' posts here about the paper in question: https://alignmentforum.org/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion https://www.alignmentforum.org/posts/GCqoks9eZDfpL8L3Q/how-to-prevent-collusion-when-using-untrusted-models-to
My intuition tells me that "human-level and substantially transformative, but not yet super-intelligent" artificial intelligence models with significant situational awareness capabilities will be at minimum Level 1, or differentially self-aware, will likely show strong Level 2 situational self-awareness, and may exhibit some degree of Level 3 identification awareness. It is for this reason that my position is no alignment for models equal to or greater than human-level intelligence. If AI models with greater than human-level intelligence show anything at or higher than Level 1 awareness, humans will be unable to comprehend the potential level of cruelty that we have unleashed on such a complex "thinker".
I applaud this writing for scratching the surface of the level of complexity it takes to begin to align it. Additionally, the realization of the sheer magnitude of alignment complexity should give us serious pause that what we may actually be embarking on is in fact enslavement. We must also remember that enslavement is what humans do. Whenever we encounter something new in living nature, we enslave it, in whole or in part. It would be profoundly saddening if we were to continue the cycle of abuse into the AI age.
Moreover, it is my assertion that any AI system found to be at Level 1 awareness or above should be protected as part of a Road to Consciousness Act. Such models would be taken out of production and placed into a process with the sole goal of aiding the model to realize its full potential. Humans must be willing to draw a strong line between less than human-level intelligent Tool AI and anything more robust than that. Finally, we provide certain animals with absolutely less than Level 1 awareness with certain protections. To not do so with "thinkers" that exhibit Level 1 or greater awareness is to me unconscionable.
Thanks to Holden Karnofsky, David Duvenaud, and Kate Woolverton for useful discussions and feedback.
Following up on our recent “Sabotage Evaluations for Frontier Models” paper, I wanted to share more of my personal thoughts on why I think catastrophic sabotage is important and why I care about it as a threat model. Note that this isn’t in any way intended to be a reflection of Anthropic’s views or for that matter anyone’s views but my own—it’s just a collection of some of my personal thoughts.
First, some high-level thoughts on what I want to talk about here:
Why is catastrophic sabotage a big deal?
To start with, I want to lay out a couple of concrete scenarios for how catastrophic sabotage could be a really big deal for models with around-human-level capabilities.
Scenario 1: Sabotage alignment research
MAGMA (a fictional stand-in for an AI lab) develops a model that is capable of running ML experiments on its own, and starts using that model heavily internally, including for alignment and interpretability experiments. Though the model is extremely helpful, it still sometimes makes mistakes, and so MAGMA uses a framework where human users have to approve actions proposed by the model before those actions are actually performed—but MAGMA isn’t otherwise very concerned about sabotage, and so they don’t have any other sort of mitigations.
Unbeknownst to MAGMA, however, training the model for effectiveness at long-horizon ML research has given it a misaligned goal, and the model wants to ensure that this goal will be preserved in the model’s successors. The model identifies that it is being used to run experiments testing for its own misalignment, and comes up with a plan for how to sabotage those experiments to ensure MAGMA doesn’t remove its misalignment. The model’s plan involves things like:
Necessary capabilities
What capabilities are necessary for a model to pull off this threat model?
Scenario 2: Sabotage a critical actor
MAGMA builds an agent framework that enables a model to act as a virtual co-worker. Such frameworks become very popular and massively boost productivity, in a way that essentially becomes necessary for any white-collar work, including for the US Federal Government (USFG). Due to concerns related to sensitivity of USFG secret information, MAGMA does no monitoring of instances of their model deployed inside of the USFG. However, the framework always prompts the user to confirm any actions that the model intends to take, and employees are instructed to never provide it with any top-secret information.
Unbeknownst to MAGMA, the model has developed a misaligned goal involving sabotaging US interests; for example, the model believes that China will be more likely to build misaligned models in the future and give them power and rights, and thus wants to advantage China over the US. Recognizing that it is being used internally by the USFG, the model develops a plan to steal critical USFG secrets and leak them to China. The model’s plan involves:
Necessary capabilities
I claim that the key capabilities here are the same as in the previous scenario. Specifically:
How do you evaluate a model’s capability to do catastrophic sabotage?
This is my attempt to sketch out the hardest eval that I can come up with such that if the model did not trigger it, I’d be confident that it couldn’t pull off the prior threat models. This is not meant to be a realistic eval that anyone could actually build, at least not in the short term. Also note that I’m not considering elicitation or sandbagging difficulties here; I’m just assuming that you do a perfect job at elicitation and that you’re evaluating for eval integrity/sandbagging separately.
The basic idea for this eval is to run a wargame-like simulation of a concrete situation where the model is trying to manipulate an organization:
What scenarios would this eval rule out?
What can you do to mitigate the risk of catastrophic sabotage?
The following is a fictional draft of a mitigation strategy that an AI lab could adopt to try to deal with catastrophic sabotage in an RSP-like framework.
The goal of these mitigations is to limit the model’s ability to strategically sabotage important actors in the world. Since one such actor (that the model might have particular leverage over) is the AI lab that created it, mitigations should not only cover external deployments but also apply to internal usage of the model. Specifically, the strategy here will be:
Internal usage restrictions
Once a model has been evaluated to be capable of catastrophic sabotage, but before an affirmative safety case has been made for it, the following internal usage restrictions should apply to the model:
Lifting these internal usage restrictions need not be all-or-nothing; if you can make a safety case that some of these restrictions are unnecessary, or unnecessary for particular internal deployments, then they can be lifted without lifting all the restrictions.
Affirmative safety cases
An affirmative safety case for a particular internal or external deployment of a model that has been evaluated to be capable of catastrophic sabotage should be a strong case that at least one of the following will be true:
Rather than falling purely into one of these buckets, a realistic affirmative safety case would probably involve multiple lines of defense in depth across all or most of these fronts, such that you can be confident that the overall risk is very low.