The cleanest argument that current-day AI models will not cause a catastrophe is probably that they lack the capability to do so. However, as capabilities improve, we’ll need new tools for ensuring that AI models won’t cause a catastrophe even if we can’t rule out the capability. Anthropic’s Responsible Scaling Policy (RSP) categorizes levels of risk of AI systems into different AI Safety Levels (ASL), and each level has associated commitments aimed at mitigating the risks. Some of these commitments take the form of affirmative safety cases, which are structured arguments that the system is safe to deploy in a given environment. Unfortunately, it is not yet obvious how to make a safety case to rule out certain threats that arise once AIs have sophisticated strategic abilities. The goal of this post is to present some candidates for what such a safety case might look like.
This is a post by Roger Grosse on Anthropic's new Alignment Science Blog. The post is full of disclaimers about how it isn't an official plan and doesn't speak for the org (and that it's inadequate: "none of the sketches presented here fully succeeds in addressing the sabotage risk"). But presumably it's Anthropic's best sketch of ASL-4 safety cases.
The three safety cases are Mechanistic Interpretability, AI Control, and Incentives Analysis.
Regardless of how good these safety cases are, it's good when labs share their thinking on safety stuff; yay Anthropic.
I only read through the beginning and their control safety case. Some thoughts:
I don't think one needs to generate hyper-realistic-looking scenarios for monitoring, and I'm confused by their reasoning for believing one does. I guess it depends on the fact that they need to elicit the best possible capabilities from the model during the eval, but I don't think this is necessary for the reasons I stated above.
General Thoughts:
I think this is pretty good and shows knowledge of the core points of control. I think Antrhopic's control protocols are easily beaten, and there are much better things to do. They made it seem like control is much harder to implement than I think it will be. I don't think a perfect Elicitation Strategy or Simulation Procedure is needed for control to work in the settings laid out. Once we have much scarier AIs and deploy them in more crazy environments, we will likely need better techniques, but their setting seems to be before this.
I agree if there are privacy guarantees, then you can't get data from the real world, so you'd need honeypots. I guess I'm much more scared of the internal deployments where I don't think privacy guarantees are as likely. I'm trying to respond to someone who is like, "Control seems hard because you have to generate real-world looking input, and that seems hard" and I'm like "You can just use the real-world inputs and filter then with a monitor to find the most scary ones and perform auditing"
Yeah, I guess I'm saying that if you don't have a great way to el... (read more)