The cleanest argument that current-day AI models will not cause a catastrophe is probably that they lack the capability to do so. However, as capabilities improve, we’ll need new tools for ensuring that AI models won’t cause a catastrophe even if we can’t rule out the capability. Anthropic’s Responsible Scaling Policy (RSP) categorizes levels of risk of AI systems into different AI Safety Levels (ASL), and each level has associated commitments aimed at mitigating the risks. Some of these commitments take the form of affirmative safety cases, which are structured arguments that the system is safe to deploy in a given environment. Unfortunately, it is not yet obvious how to make a safety case to rule out certain threats that arise once AIs have sophisticated strategic abilities. The goal of this post is to present some candidates for what such a safety case might look like.
This is a post by Roger Grosse on Anthropic's new Alignment Science Blog. The post is full of disclaimers about how it isn't an official plan and doesn't speak for the org (and that it's inadequate: "none of the sketches presented here fully succeeds in addressing the sabotage risk"). But presumably it's Anthropic's best sketch of ASL-4 safety cases.
The three safety cases are Mechanistic Interpretability, AI Control, and Incentives Analysis.
Regardless of how good these safety cases are, it's good when labs share their thinking on safety stuff; yay Anthropic.
Not sure I understand what you are saying.
If we can construct fake honeypots that are indistinguishable - to the policy - from real opportunities, and do this at scale, then that could suffice: we could drive the policy's P(deployment | great opportunity) very low. (This relies on how many high-stakes we can create and how many the model will see in deployment.) If this is what you are saying, I agree.
If we can construct fake attacks that are indistinguishable - to the untrusted monitor - from real attacks, then that possibly suffices: we could drive the monitor's prior P(deployment | attack) very low. (We have to worry about some other sources of evidence, though.) I guess one could call these honeypots for the monitor, and if this is what you are saying, I agree.
If you are saying that we don't need either of these - i.e. that we can get by with honeypots/attacks that are noticeably fake - then that's news to me, and I don't see how you can do that.