I mostly disagree with your criticisms.
Note that ASLs are defined by risk relative to baseline, excluding other advanced AI systems. This means that a model that initially merits ASL-3 containment and deployment measures for national security reasons might later be reduced to ASL-2 if defenses against national security risks (such as biological or cyber defenses) advance, or if dangerous information becomes more widely available. However, to avoid a “race to the bottom”, the latter should not include the effects of other companies’ language models; just because other language models pose a catastrophic risk does not mean it is acceptable for ours to.
I think it's sensible to reduce models to ASL-2 if defenses against the threat become available (in the same way that it makes sense to demote pathogens from BSL-4 to BSL-3 once treatments become available), but I'm concerned about the "dangerous information becomes more widely available" clause. Suppose you currently can't get slaughterbot schematics off Google; if those become available, I am not sure it then becomes okay for models to provide users with slaughterbot schematics. (Specifically, I don't want companies that make models which are 'safe' except that they leak dangerous information X to have an incentive to cause dangerous information X to become available through other means.)
[There's a related, slightly more subtle point here; supposing you can currently get instructions on how to make a pipe bomb on Google, it can actually reduce security for Claude to explain to users how to make pipe bombs if Google is recording those searches and supplying information to law enforcement / the high-ranked sites on Google search are honeypot sites and Anthropic is not. The baseline is not just "is the information available?" but "who is noticing you accessing the information?".]
4. I mean, superior alternatives are always preferred. I am moderately optimistic about "just stop" plans, and am not yet convinced that "scale until our tests tell us to stop" is dramatically superior to "stop now."
(Like, I think the hope here is to have an AI summer while we develop alignment methods / other ways to make humanity more prepared for advanced AI; it is not clear to me that doing that with the just-below-ASL-3 model is all that much better than doing it with the ASL-2 models we have today.)
Thanks.
[Busy now but I hope to reply to the rest later.]
As someone with experience in BSL-3 labs, BSL feels like a good metaphor to me. The big issue with the RSP proposal is that it's still just a set of voluntary commitments that could undermine progress on real risk management by giving policymakers a way to make it look like they've done something without really doing anything. It would be much better with input from risk management professionals.
Announcement, Policy v1.0, evhub's argument in favor on LW.

These are my personal thoughts; in the interest of full disclosure, one of my housemates and several of my friends work at Anthropic; my spouse and I hold OpenAI units (but are financially secure without them).

This post has three main parts: applause for things done right, a summary / review of RSPs in general, and then specific criticisms and suggestions of what to improve about Anthropic's RSP.
First, the things to applaud. Anthropic’s RSP makes two important commitments: that they will manage their finances to allow for pauses as necessary, and that a single directly responsible individual will ensure the commitments are met and a quarterly report on them is made. Both of those are the sort of thing that represents genuine organizational commitment rather than lip service. I think it's great for companies to be open about what precautions they're taking to ensure the development of advanced artificial intelligence benefits humanity, even if I don’t find those policies fully satisfactory.
Second, the explanation. Following the model of biosafety levels, where labs must meet defined standards in order to work with specific dangerous pathogens, Anthropic suggests AI safety levels, or ASLs, with corresponding standards for labs working with dangerous models. While the makers of BSL could list specific pathogens to populate each tier, ASL tiers must necessarily be speculative. Previous generations of models are ASL-1 (BSL-1 corresponds to no threat of infection in healthy adults), current models like Claude count as ASL-2 (BSL-2 corresponds to moderate health hazards, like HIV), and the next tier of models, which either substantially increase baseline levels of catastrophic risk or are capable of autonomy, count as ASL-3 (BSL-3 corresponds to potentially lethal inhalable diseases, like SARS-CoV-1 and 2).
While BSL tops out at 4, ASL is left unbounded for now, with a commitment to define ASL-4 before using ASL-3 models.[1] This means having a defined ceiling of what capabilities would call for increased investments in security practices at all levels, while not engaging in too much armchair speculation about how AI development will proceed.
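To make the correspondence easier to scan, here is the tiering restated as a minimal sketch in Python; the structure and wording are just a summary of the paragraph above, not an official definition from the policy.

```python
# Illustrative restatement of the ASL/BSL analogy described above; not an
# official definition from Anthropic's policy.
ASL_TIERS = {
    1: "Previous generations of models "
       "(cf. BSL-1: no threat of infection in healthy adults)",
    2: "Current models such as Claude "
       "(cf. BSL-2: moderate health hazards, like HIV)",
    3: "Models that substantially increase baseline catastrophic risk or are "
       "capable of autonomy (cf. BSL-3: potentially lethal inhalable diseases)",
    # 4 and above: deliberately left undefined for now, to be specified before
    # ASL-3 models are used.
}
```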
The idea behind RSPs is that rather than pausing at an arbitrary start date for an arbitrary amount of time (or simply shutting it all down), labs use capability thresholds to determine when to start model-specific pauses and security thresholds to determine when to unpause development on that model. RSPs are meant to demand active efforts to determine whether or not models are capable of causing catastrophic harm, rather than simply developing them blind. They seem substantially better than scaling without an RSP or equivalent.
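As a concrete (and heavily simplified) illustration of that pause/unpause structure, here is a minimal sketch in Python; all names and values are hypothetical, and the real policy involves evaluations and containment standards far richer than a pair of stub functions.

```python
# A heavily simplified sketch of the RSP decision structure described above:
# capability evaluations decide when to pause scaling a model, and security
# standards decide when to unpause. All names and values are hypothetical.
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    asl_required: int = 2   # ASL that the model's measured capabilities call for
    paused: bool = False

def run_capability_evals(model: Model) -> int:
    """Placeholder for dangerous-capability evaluations; returns the required ASL."""
    return 2  # stub: pretend the evals show the model still fits ASL-2

def lab_meets_standard(asl: int) -> bool:
    """Placeholder for checking containment/deployment measures against an ASL."""
    return asl <= 2  # stub: pretend the lab currently meets only the ASL-2 standard

def update_scaling_decision(model: Model) -> None:
    model.asl_required = run_capability_evals(model)
    if not lab_meets_standard(model.asl_required):
        # Capability threshold crossed before matching security measures exist:
        # pause further scaling of this model.
        model.paused = True
    else:
        # Security threshold met for the required ASL: scaling may (re)start.
        model.paused = False

claude_next = Model(name="hypothetical-next-model")
update_scaling_decision(claude_next)
print(claude_next)
```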
Third, why am I not yet satisfied with Anthropic’s RSP? Criticisms and suggestions in roughly decreasing order of importance:
Overall, this feels to me like a move towards adequacy, and it's good to reward those moves; I appreciate that Anthropic's RSP has the feeling of a work-in-progress as opposed to being presented as clearly sufficient to the task.
BSL-4 diseases have basically the same features as BSL-3 diseases, except that there are also no available vaccines or treatments. Also, all extraterrestrial samples are BSL-4 by default, to a standard more rigorous than any current BSL-4 lab could meet.
The implied belief is that model capabilities primarily come from scale during training, and that our scaling laws (and various other things) are sufficient to predict model capabilities. Advances in how models are deployed might break this, as would the scaling laws failing to predict capabilities.
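For concreteness, a minimal sketch of the kind of extrapolation this relies on: fit a simple power law to loss measurements from smaller training runs and predict loss at a larger compute budget. The functional form and all numbers below are made up for illustration and are not Anthropic's actual methodology; the point above is precisely that this kind of prediction can fail.

```python
# Illustrative scaling-law extrapolation: fit loss ≈ a * C^(-b) + c to a few
# (compute, loss) points and predict loss at a larger compute budget.
# All numbers are made up; this is not Anthropic's actual methodology.
import numpy as np
from scipy.optimize import curve_fit

compute = np.array([1e18, 1e19, 1e20, 1e21])   # training FLOPs (hypothetical)
loss = np.array([3.2, 2.7, 2.35, 2.1])         # evaluation loss (hypothetical)

x = compute / compute[0]  # normalize compute for numerical stability

def power_law(x, a, b, c):
    # Irreducible loss c plus a term that shrinks as compute grows.
    return a * x ** (-b) + c

params, _ = curve_fit(power_law, x, loss, p0=[1.5, 0.2, 1.7])
predicted = power_law(1e23 / compute[0], *params)
print(f"Predicted loss at 1e23 FLOPs: {predicted:.2f}")
```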
It does have some convenience costs: if the baseline were set in 2019, for example, then the model might not be able to talk about coronavirus, even though AI development and the pandemic were independent.