The cleanest argument that current-day AI models will not cause a catastrophe is probably that they lack the capability to do so. However, as capabilities improve, we’ll need new tools for ensuring that AI models won’t cause a catastrophe even if we can’t rule out the capability. Anthropic’s Responsible Scaling Policy (RSP) categorizes levels of risk of AI systems into different AI Safety Levels (ASL), and each level has associated commitments aimed at mitigating the risks. Some of these commitments take the form of affirmative safety cases, which are structured arguments that the system is safe to deploy in a given environment. Unfortunately, it is not yet obvious how to make a safety case to rule out certain threats that arise once AIs have sophisticated strategic abilities. The goal of this post is to present some candidates for what such a safety case might look like.
This is a post by Roger Grosse on Anthropic's new Alignment Science Blog. The post is full of disclaimers about how it isn't an official plan and doesn't speak for the org (and that it's inadequate: "none of the sketches presented here fully succeeds in addressing the sabotage risk"). But presumably it's Anthropic's best sketch of ASL-4 safety cases.
The three safety cases are Mechanistic Interpretability, AI Control, and Incentives Analysis.
Regardless of how good these safety cases are, it's good when labs share their thinking on safety stuff; yay Anthropic.
Glad to see public writing on what safety cases could look like!
I liked the section "Common precautions". Points like "wherever possible, run supervised fine-tuning on the eval", "run the eval on multiple model sizes", "ensure that we don’t train against measurements" and "all generated outputs are monitored" are commonly acknowledged as useful/important (even if obvious), so it's nice they were explicitly mentioned here.
In the AI control sketch, there's a part that I find unsatisfactory:
It's a strong assumption that we have such a simulation procedure, at least if we are talking about General External Deployment (GED). [Edited later: As Buck says, we actually don't even need this assumption for trusted monitoring. I'm still leaving this up, since untrusted monitoring does rely on such assumptions.] In the Limitations section later, you write
but I think that even this is still understating the issue: For example, during GED a user might ask a model to proofread a yet-unpublished graduate-level textbook on analytic number theory she has been writing. It would be very costly to generate a diverse set of such inputs for the evaluation, and automatic generation runs into the problem that discrimination is easier than generation: the model may be able to tell synthetic inputs from real ones more easily than we can produce convincing synthetic ones.
The best solution for this that I'm aware of is to train and use a trusted model to flag inputs that likely weren't in the evaluation set. Now, you mention a related technique in "[P2.5] Optional Steps."
but I think this is far from enough: in the "textbook on analytic number theory" example, it's extremely hard to hide the fact that this input would have been costly to simulate.[1]
My answer would be to have a human oversee the model's actions on inputs flagged by the trusted model, and use such inputs to continually train and evaluate the model. But this is costly and not privacy-preserving. (The cost is acceptable if these inputs are rare enough, though. And if such inputs are common, one needs to put more effort into bridging the evaluation-deployment gap.)
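To make the routing idea concrete, here is a minimal Python sketch of "flag, then send to a human". Everything in it is an assumption for illustration: `FlagRouter`, `toy_novelty_score`, and the threshold are hypothetical names and values, and the word-count scorer is only a stand-in for a trusted model that estimates how unlike the evaluation set an input is. It is not how the post or any lab implements this.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FlagRouter:
    """Routes inputs to human review when a trusted scorer flags them.

    novelty_score is a hypothetical stand-in for a trusted model that
    returns a value in [0, 1]; higher means "less like the eval set".
    """
    novelty_score: Callable[[str], float]
    threshold: float = 0.8
    review_queue: List[str] = field(default_factory=list)
    seen: int = 0

    def route(self, user_input: str) -> str:
        self.seen += 1
        if self.novelty_score(user_input) >= self.threshold:
            # Flagged inputs are kept so they can later be used to
            # extend training and evaluation data.
            self.review_queue.append(user_input)
            return "human_review"
        return "auto"

    def flag_rate(self) -> float:
        # Fraction of inputs sent to humans; a rough proxy for oversight cost.
        return len(self.review_queue) / self.seen if self.seen else 0.0


def toy_novelty_score(text: str) -> float:
    # Placeholder heuristic (assumption): more words counts as more novel.
    return min(len(text.split()) / 20.0, 1.0)


router = FlagRouter(novelty_score=toy_novelty_score, threshold=0.5)
short_decision = router.route("hi there")
long_decision = router.route(" ".join(["word"] * 12))
```

The `flag_rate` value is the quantity the cost argument turns on: if it stays small, human oversight of flagged inputs is affordable, and if it grows, that is a sign the evaluation set needs to move closer to deployment.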
I also worry about multimodality (e.g. native computer usage) making it harder to 1) make realistic simulations and 2) modify inputs in real time to hide information.
Thanks, that clears it up. Indeed I agree that "take real trajectories, then edit the attacks in" is relatively promising, and didn't track this when writing my first comment (oops) - thanks for flagging it.