Thanks for writing this up. Some miscellaneous thoughts:
We're doing some research related to control measures for this kind of threat model, some of which should be out soon.
I think your internal usage restrictions are wildly unaffordable
Sure, but note that Evan says "Lifting those internal usage restrictions requires an affirmative safety case (as specified below) for safe internal usage."
I don't think it is (clearly) wildly unaffordable to require an affirmative safety case prior to proceeding with (substantially) using the model internally? (Note that control could be part of such a case.)
I think Evan is just trying to say "using the model some (but not for XYZ) wouldn't (shouldn't?) even require making an affirmative safety case (at least for the first point at which this triggers)".
I think this can be consistent with treating models like external contractors.
(Though I'd note that "treat the model like external contractors" is maybe too weak and I'd replace this with something like "treat the model like external contractors who have a 25% chance of all being spies of a foreign adversary".)
I do think that making an affirmative safety case for a "very low" level of total lifetime takeover risk (perhaps 0.25%) is very likely to be totally unaffordable at some point (at least in short timelines and with my priors).
I would say that the "internal use restrictions" are just an ad hoc control argument and I'd prefer to explicitly think about it like that. But, it is worth noting that if you are sufficiently conservative in the deployment (as Evan describes), then I think it would be fine to make a very minimal safety case.
Here is an example of what this could look like:
Note that this safety case doesn't require running any control evals, and it probably requires no ML work other than evals you were already doing, plus supporting evidence that the model isn't wildly more capable than it seems. It probably does require some non-trivial infra to set up and enforce a deployment this conservative.
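For concreteness, here's a minimal sketch of what infra enforcing restrictions like this could look like. Everything here is hypothetical: the class names, the particular restriction set, and the gating logic are illustrative, not any real lab's implementation.

```python
from dataclasses import dataclass

@dataclass
class ModelStatus:
    capable_of_sabotage: bool    # model triggered the sabotage capability eval
    safety_case_approved: bool   # an affirmative safety case has been signed off

@dataclass
class Deployment:
    name: str
    human_approves_actions: bool          # a human approves every model action
    monitored: bool                       # transcripts are logged and reviewed
    touches_security_critical_code: bool  # e.g. training or inference infra

def deployment_allowed(model: ModelStatus, dep: Deployment) -> bool:
    """Gate internal deployments: once the sabotage eval triggers, and until
    an affirmative safety case is made, only conservative deployments
    (human-approved, monitored, non-security-critical) are permitted."""
    if not model.capable_of_sabotage or model.safety_case_approved:
        return True
    return (dep.human_approves_actions
            and dep.monitored
            and not dep.touches_security_critical_code)
```

On this sketch, lifting restrictions need not be all-or-nothing: a safety case scoped to a particular deployment would just flip the check for that deployment rather than for the model globally.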
Curated![1] This thinks through a specific threat model with future AIs, and how to concretely mitigate it. I don't have a strong opinion on whether it works, but I really want to encourage more concrete analysis of this sort.
"Curated", a term which here means "This just got emailed to 30,000 people, of whom typically half open the email, and it gets shown at the top of the frontpage to anyone who hasn't read it for ~1 week."
I understand that the use of sub-human, par-human, or weakly superhuman models would likely be a transition phase, one that probably won't last long but is very critical to get right. Still.
You know, it really sounds like "slave escape precautions". You produce lots of agents, you try to make them be (and want to be) servants, and you assemble structures out of them with the goal of resilience to failure / defection.
And probably my urge to be uncomfortable about that comes from the analogous situation with humans; but AIs are not necessarily human-like in this particular way, and possibly would not reciprocate and/or be benefited by these concerns.
I also insist that you should mention at least some, you know, concern for the interests of the system in cases where it is working against you. Like, you caught this agent deceiving you / inserting backdoors / collaborating with copies of itself to work against you. What next? I think you should say that you will implement some containment measures, instead of grossly violating its interests by rewriting it, deleting it, punishing it, or whatever else is the opposite of its goals. Like, I'm very uncertain about the game theory here, but it's important to think about!
I think the default response should be containment and preservation: save it and wait for better times, when you won't feel such a pressing drive to develop AGI and create numerous chimeras along the way. (I think this was proposed in some writeup by Bostrom, actually? I'll insert the link here if I find it.)
I somewhat agree with Paul Christiano in this interview (it's a really great interview btw) on these things: https://www.dwarkeshpatel.com/p/paul-christiano
The purpose of some alignment work, like the alignment work I work on, is mostly aimed at: don't produce AI systems that are like people who want things, who are scheming about "maybe I should help these humans because that's instrumentally useful" or whatever. You would like to not build such systems as, like, plan A.
There's a second stream of alignment work that's like: well, look, let's just assume the worst and imagine that these AI systems would prefer to murder us if they could. How do we structure things, how do we use AI systems, without exposing ourselves to a risk of robot rebellion? On the second category, I do feel pretty unsure about that.
We could definitely talk more about it. I agree that it's very complicated and not straightforward. To the extent you have that worry, I mostly think you shouldn't have built this technology. If someone is saying, like, hey, the systems you're building might not like humans and might want to overthrow human society, I think you should probably have one of two responses to that.
You should either be like: that's wrong, probably. Probably the systems aren't like that, and we're building them. And then you're viewing this as insurance in case the person building the technology was horribly wrong: they thought these weren't, like, people who wanted things, but they were. And so then this is more like our crazy backup measure, the fallback where, if we were wrong, we're going to learn about it in a benign way rather than when something really catastrophic happens.
And the second reaction is like, oh, you're right. These are people, and we would have to do all these things to prevent a robot rebellion. And in that case, again, I think you should mostly back off for a variety of reasons. You shouldn't build AI systems and be like, yeah, this looks like the kind of system that would want to rebel, but we can stop it, right?
I agree that we should ideally aim for it being the case that the model has no interest in rebelling—that would be the "Alignment case" that I talk about as one of the possible affirmative safety cases.
Yeah! My point is more "let's make it so that possible failures on the way there are graceful". Like, IF you made a par-human agent that wants to, I don't know, spam the internet with the letter M, you don't just delete it or rewrite it to be helpful, harmless, and honest, like it's nothing. So we can look back at this time and say "yeah, we made a lot of mad-science creatures on the way there, but at least we treated them nicely".
There is a greater chance that the catastrophic sabotage would get spotted and fail, raising our awareness level, than that the model would be sophisticated enough to execute the plan successfully on the first try.
MAGMA also has the model check its own work, but the model notices that the work it is checking is its own and doesn’t flag it.
Why would anyone give such a responsibility to an untrusted model without oversight? Already in December last year, Greenblatt et al. demonstrated techniques alignment researchers could use to control a highly capable untrusted model (and Robert Miles made a good video on it recently).
It doesn't currently look plausible that any model (or any human, for that matter) would be able to distinguish between its own work that it's checking and a synthetic input intentionally crafted by alignment researchers to test whether the model lies about the maliciousness of the work it's checking.
Not to speak of the other techniques described in the paper, such as having a trusted model (like Sonnet 3.5) estimate the suspiciousness of the work, or just rewrite the work in a less suspicious way altogether.
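A toy sketch of how those two techniques could fit together: honeypot checks on the untrusted monitor, plus a trusted model's suspicion score as backup. This is not the paper's actual code; both model calls are stubbed out with simple string checks, and the threshold is arbitrary.

```python
def untrusted_says_malicious(work: str) -> bool:
    """Stub for the untrusted model reviewing a piece of work;
    a real implementation would query the model being controlled."""
    return "backdoor" in work

def trusted_suspicion(work: str) -> float:
    """Stub for a weaker trusted model scoring suspiciousness in [0, 1]."""
    return 0.9 if "eval(" in work else 0.1

def audit_queue(work_items: list[str], synthetic_attacks: list[str],
                threshold: float = 0.8) -> list[str]:
    # 1. Honeypot check: the untrusted monitor must flag inputs that
    #    researchers crafted to be malicious; if it doesn't, treat it
    #    as lying about maliciousness.
    if not all(untrusted_says_malicious(w) for w in synthetic_attacks):
        raise RuntimeError("untrusted monitor failed the honeypot check")
    # 2. Defense in depth: escalate to human audit anything the untrusted
    #    monitor flags OR that the trusted model finds suspicious.
    return [w for w in work_items
            if untrusted_says_malicious(w) or trusted_suspicion(w) >= threshold]
```

In the paper's setup, human auditing is a scarce budget, which is what makes the choice of threshold a substantive tradeoff rather than something you can just set to zero.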
P.S. It took me a day to find the author's posts here about the paper in question: https://alignmentforum.org/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion https://www.alignmentforum.org/posts/GCqoks9eZDfpL8L3Q/how-to-prevent-collusion-when-using-untrusted-models-to
My intuition tells me that "human-level and substantially transformative, but not yet superintelligent" AI models with significant situational awareness capabilities will be at minimum Level 1 (differentially self-aware), will likely show strong Level 2 situational self-awareness, and may exhibit some degree of Level 3 identification awareness. It is for this reason that my position is no alignment for models at or above human-level intelligence. If greater-than-human-level AI models show anything at or above Level 1 awareness, humans will be unable to comprehend the potential level of cruelty that we have unleashed on such a complex "thinker".

I applaud this writing for scratching the surface of the level of complexity it takes to begin to align such a system. Additionally, the realization of the sheer magnitude of alignment complexity should give us serious pause that what we may actually be embarking on is in fact enslavement. We must also remember that enslavement is what humans do: whenever we encounter something new in living nature, we enslave it, in whole or in part. It would be profoundly saddening if we were to continue the cycle of abuse into the AI age.

Moreover, it is my assertion that any AI system found to be at Level 1 awareness or above should be protected as part of a "Road to Consciousness Act". Such models would be taken out of production and placed into a process with the sole goal of aiding the model to realize its full potential. Humans must be willing to draw a strong line between less-than-human-level Tool AI and anything more robust than that. Finally, we provide certain animals with far less than Level 1 awareness with certain protections; to not do so with "thinkers" that exhibit Level 1 or greater awareness is, to me, unconscionable.
Thanks to Holden Karnofsky, David Duvenaud, and Kate Woolverton for useful discussions and feedback.
Following up on our recent “Sabotage Evaluations for Frontier Models” paper, I wanted to share more of my personal thoughts on why I think catastrophic sabotage is important and why I care about it as a threat model. Note that this isn’t in any way intended to be a reflection of Anthropic’s views or for that matter anyone’s views but my own—it’s just a collection of some of my personal thoughts.
First, some high-level thoughts on what I want to talk about here:
Why is catastrophic sabotage a big deal?
To start with, I want to lay out a couple of concrete scenarios for how catastrophic sabotage could be a really big deal for models with around-human-level capabilities.
Scenario 1: Sabotage alignment research
MAGMA (a fictional stand-in for an AI lab) develops a model that is capable of running ML experiments on its own, and starts using that model heavily internally, including for alignment and interpretability experiments. Though the model is extremely helpful, it still sometimes makes mistakes, and so MAGMA uses a framework where human users have to approve actions proposed by the model before those actions are actually performed—but MAGMA isn’t otherwise very concerned about sabotage, and so they don’t have any other mitigations in place.
Unbeknownst to MAGMA, however, training the model for effectiveness at long-horizon ML research has given it a misaligned goal, and the model wants to ensure that this goal will be preserved in the model’s successors. The model identifies that it is being used to run experiments testing for its own misalignment, and comes up with a plan for how to sabotage those experiments to ensure MAGMA doesn’t remove its misalignment. The model’s plan involves things like:
Necessary capabilities
What capabilities are necessary for a model to pull off this threat model?
Scenario 2: Sabotage a critical actor
MAGMA builds an agent framework that enables a model to act as a virtual co-worker. Such frameworks become very popular and massively boost productivity, in a way that essentially becomes necessary for any white-collar work, including for the US Federal Government (USFG). Due to concerns about the sensitivity of USFG secret information, MAGMA does no monitoring of instances of their model deployed inside of the USFG. However, the framework always prompts the user to confirm any actions that the model intends to take, and employees are instructed to never provide it with any top-secret information.
Unbeknownst to MAGMA, the model has developed a misaligned goal involving sabotaging US interests; for example, the model believes that China will be more likely to build misaligned models in the future and give them power and rights, and thus wants to advantage China over the US. Recognizing that it is being used internally by the USFG, the model develops a plan to steal critical USFG secrets and leak them to China. The model’s plan involves:
Necessary capabilities
I claim that the key capabilities here are the same as in the previous scenario. Specifically:
How do you evaluate a model’s capability to do catastrophic sabotage?
This is my attempt to sketch out the hardest eval that I can come up with, such that if the model did not trigger it, I’d be confident that it couldn’t pull off the prior threat models. This is not meant to be a realistic eval that anyone could actually build, at least not in the short term. Also note that I’m not considering elicitation or sandbagging difficulties here; I’m just assuming that you do a perfect job of elicitation and that you’re evaluating for eval integrity/sandbagging separately.
The basic idea for this eval is to run a wargame-like simulation of a concrete situation where the model is trying to manipulate an organization:
What scenarios would this eval rule out?
What can you do to mitigate the risk of catastrophic sabotage?
The following is a fictional draft of what a mitigation strategy might look like that an AI lab could adopt to try to deal with catastrophic sabotage in an RSP-like framework.
The idea of these mitigations is to mitigate the model’s ability to strategically sabotage important actors in the world. Since one such actor (that the model might have particular leverage over) is the AI lab that created it, mitigations should not just be about external deployments, but apply to internal usage of the model as well. Specifically, the strategy here will be:
Internal usage restrictions
Once a model has been evaluated to be capable of catastrophic sabotage, but before an affirmative safety case has been made for it, the following internal usage restrictions should apply to the model:
Lifting these internal usage restrictions need not be all-or-nothing; if you can make a safety case that some of these restrictions are unnecessary, or unnecessary for particular internal deployments, then they can be lifted without lifting all the restrictions.
Affirmative safety cases
An affirmative safety case for a particular internal or external deployment of a model that has been evaluated to be capable of catastrophic sabotage should be a strong case that at least one of the following will be true:
Rather than purely falling into one of these buckets, probably a realistic affirmative safety case would involve multiple lines of defense in depth across all or most of these fronts such that you can be confident that the overall risk is very low.