The cleanest argument that current-day AI models will not cause a catastrophe is probably that they lack the capability to do so. However, as capabilities improve, we’ll need new tools for ensuring that AI models won’t cause a catastrophe even if we can’t rule out the capability. Anthropic’s Responsible Scaling Policy (RSP) categorizes levels of risk of AI systems into different AI Safety Levels (ASL), and each level has associated commitments aimed at mitigating the risks. Some of these commitments take the form of affirmative safety cases, which are structured arguments that the system is safe to deploy in a given environment. Unfortunately, it is not yet obvious how to make a safety case to rule out certain threats that arise once AIs have sophisticated strategic abilities. The goal of this post is to present some candidates for what such a safety case might look like.
This is a post by Roger Grosse on Anthropic's new Alignment Science Blog. The post is full of disclaimers about how it isn't an official plan and doesn't speak for the org (and that it's inadequate: "none of the sketches presented here fully succeeds in addressing the sabotage risk"). But presumably it's Anthropic's best sketch of ASL-4 safety cases.
The three safety cases are Mechanistic Interpretability, AI Control, and Incentives Analysis.
Regardless of how good these safety cases are, it's good when labs share their thinking on safety stuff; yay Anthropic.
The interpretability safety case section seems to make the rather strong claim that the activation of known sabotage-related features is necessary for sabotage to occur (emphasis mine):
Afaict, the way they establish this link (that a feature's activation is necessary for a behavior to occur) is through ablation studies. Basically, if deactivating a feature prevents a model organism from committing a dangerous behavior, then they conclude that the feature causes the behavior. And although this sort of test does seem like it will generate Bayesian evidence about which features contribute to sabotage capacity, I'm confused about why the author thinks it would generate strong Bayesian evidence, much less identify the conditions necessary for sabotage with ~certainty.
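For concreteness, here is roughly the shape of such an ablation test. This is a minimal sketch assuming an SAE-instrumented model; the layer index, feature index, and the `model` / `sae` interfaces are hypothetical placeholders, not Anthropic's actual tooling:

```python
import torch

FEATURE_IDX = 1234   # hypothetical index of the candidate "sabotage" feature
LAYER = 12           # hypothetical layer the SAE was trained on

def ablate_feature(resid, sae):
    """Project out the candidate feature's contribution to the residual stream."""
    acts = sae.encode(resid)                         # feature activations, [..., n_features]
    direction = sae.decoder.weight[:, FEATURE_IDX]   # the feature's decoder direction, [d_model]
    return resid - acts[..., FEATURE_IDX, None] * direction

def run_with_ablation(model, sae, prompt_tokens):
    # Assumes the layer's forward output is the residual-stream tensor;
    # a forward hook that returns a value replaces that output.
    handle = model.layers[LAYER].register_forward_hook(
        lambda module, inp, out: ablate_feature(out, sae)
    )
    try:
        with torch.no_grad():
            return model(prompt_tokens)
    finally:
        handle.remove()

# The test: if the model organism stops sabotaging on the defection prompts once
# the feature is ablated, that's evidence the feature is causally involved -- but
# not that its activation is necessary for every possible sabotage pathway.
```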
Like, it seems plausible to me that some other, unknown combination of features might create sabotage-like behavior, or that some relevant features were missed, or that there are other mechanisms, not well described as features, which can cause sabotage, etc. And my guess is that the evidence this sort of testing does generate will become weaker over time (i.e., once our safety precautions start to really matter), since more intelligent systems will presumably be able to cause sabotage in more numerous and creative ways.
Some of these potential failure modes are mentioned, and I get that this is just a preliminary sketch. But still, I'm confused about where the confidence is coming from: why does the author expect this process to yield high confidence in safety?
I think that if they:
- Find a feature that is necessary for defection in all of the cases of defection in a diverse and realistic set of model organisms
- Locate that feature with very little optimization on the defection cases
I'd be fairly confident (>~80%) that it would catch any particular deployment-time defection attempt from an ASL-4 schemer. This would cut risk from the deployment by >5x when paired with compelling incrimination and a shutdown. This is a substantial risk reduction, and I might even buy this safety case if there weren't better o…
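A rough back-of-the-envelope for how the ~80% and >5x figures connect, under the simplifying assumption that each defection attempt is independently caught with probability p, and that a catch triggers incrimination and shutdown:

risk reduction ≈ 1 / (1 − p) ≈ 1 / (1 − 0.8) = 5×

i.e., catching at least ~80% of attempts leaves at most ~20% of the original per-attempt risk.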