As a general matter, Anthropic has consistently found that working with frontier AI models is an essential ingredient in developing new methods to mitigate the risk of AI.
What are some examples of work that is most dependent on large models and most risk-preventing? My understanding is that interpretability work doesn't need large models (though I don't know about things like influence functions). I imagine constitutional AI does. Is that the central example, or are there other pieces that go further in this direction?
I am surprised to see the Open Philanthropy network taking all of the powerful roles here.
The initial Trustees are:
- Jason Matheny: CEO of the RAND Corporation
- Kanika Bahl: CEO & President of Evidence Action
- Neil Buddy Shah: CEO of the Clinton Health Access Initiative (Chair)
- Paul Christiano: Founder of the Alignment Research Center
- Zach Robinson: Interim CEO of Effective Ventures US
In case it's not apparent:
And, as a reminder for those who've forgotten, Holden Karnofsky's wife Daniela Amodei is President of Anthropic (and so Holden is the brother-in-law of Dario Amodei).
From a business perspective, we want to be clear that our RSP will not alter current uses of Claude or disrupt availability of our products. Rather, it should be seen as analogous to the pre-market testing and safety feature design conducted in the automotive or aviation industry, where the goal is to rigorously demonstrate the safety of a product before it is released onto the market, which ultimately benefits customers.
What happens if you are pretty sure that a future system is ASL-2, but then, months after release, when people have built business-critical systems on top of your API, it is discovered to actually pose a higher level of risk, with no immediately clear way to mitigate it?
Try really hard to avoid this situation, for many reasons, among them that de-deployment would suck. I think it's unlikely, but:
(2d) If it becomes apparent that the capabilities of a deployed model have been under-elicited and the model can, in fact, pass the evaluations, then we will halt further deployment to new customers and assess existing deployment cases for any serious risks which would constitute a safety emergency. Given the safety buffer, de-deployment should not be necessary in the majority of deployment cases. If we identify a safety emergency, we will work rapidly to implement the minimum additional safeguards needed to allow responsible continued service to existing customers.
I think it's unlikely, but:
Why does it seem unlikely? (Note: I haven't read the post or comments in full yet, if you think this is already covered somewhere I'll go read that first)
We've deliberately set conservative thresholds, such that I don't expect the first models which pass the ASL-3 evals to pose serious risks without improved fine-tuning or agent-scaffolding, and we've committed to re-evaluate to check on that every three months. From the policy:
Ensuring that we never train a model that passes an ASL evaluation threshold is a difficult task. Models are trained in discrete sizes, they require effort to evaluate mid-training, and serious, meaningful evaluations may be very time consuming, since they will likely require fine-tuning.
This means there is a risk of overshooting an ASL threshold when we intended to stop short of it. We mitigate this risk by creating a buffer: we have intentionally designed our ASL evaluations to trigger at slightly lower capability levels than those we are concerned about, while ensuring we evaluate at defined, regular intervals (specifically every 4x increase in effective compute, as defined below) in order to limit the amount of overshoot that is possible. We have aimed to set the size of our safety buffer to 6x (larger than our 4x evaluation interval) so model training can continue safely while evaluations take place. Correct execution of this scheme will result in us training models that just barely pass the test for ASL-N, are still slightly below our actual threshold of concern (due to our buffer), and then pausing training and deployment of that model unless the corresponding safety measures are ready.
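To make the buffer arithmetic concrete, here's a minimal sketch of the scheme as I read it (the 4x evaluation interval and 6x buffer are from the quoted policy; the worst-case reasoning and everything else is illustrative):

```python
# Minimal sketch of the evaluation-interval / safety-buffer arithmetic.
# The 4x evaluation interval and 6x buffer come from the quoted policy;
# the worst-case framing is one reading of it, not an official formula.

EVAL_INTERVAL = 4.0   # re-run evals after every 4x increase in effective compute
SAFETY_BUFFER = 6.0   # evals trigger at 6x of effective compute below the
                      # actual capability level of concern

def worst_case_margin(eval_interval: float, safety_buffer: float) -> float:
    """Worst case: a model just barely fails the eval at one checkpoint,
    then trains for a full interval before being evaluated again.
    Returns the multiplicative margin (in effective compute) still left
    below the true threshold of concern when the eval first passes."""
    return safety_buffer / eval_interval

print(f"Worst-case remaining margin: {worst_case_margin(EVAL_INTERVAL, SAFETY_BUFFER):.2f}x")
# -> 1.50x: even with maximal overshoot between checkpoints, a model that
#    first passes the ASL-N eval is still below the actual threshold of
#    concern, which is why training can continue while evals take place.
```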
I also think that many risks which could emerge in apparently-ASL-2 models will be reasonably mitigable by some mixture of re-finetuning, classifiers to reject harmful requests and/or responses, and other techniques. I've personally spent more time thinking about the autonomous replication evals than the biorisk evals, though, and this might vary by domain.
To me it feels like this policy is missing something that accounts for a big chunk of the risk.
While recursive self-improvement is covered by the "Autonomy and replication" point, there is another risk: actors who don't intentionally cause large-scale harm, but who use your system to improve their own systems without following your RSP. This type of recursive improvement doesn't seem to be covered by either "Misuse" or "Autonomy and replication".
In short, it's about risks from shortened timelines.
Yay Anthropic for making good commitments about (1) pausing training if autonomous-replication model evals fail or red-teaming reveals dangerous capabilities [edit: until implementing the "ASL-3 Containment Measures"] and (2) deploying powerful models well!! I wish other labs would do this!
Not well; almost all of pp. 2–9 or maybe 2–13 is relevant. But here are some bits:
we commit to pause the scaling and/or delay the deployment of new models whenever our scaling ability outstrips our ability to comply with the safety procedures for the corresponding ASL.
. . .
Before advancing to a given ASL, the next level must be defined to create a clear boundary with a "safety buffer".
. . .
We define an ASL-3 model as one that can either immediately, or with additional post-training techniques corresponding to less than 1% of the total training cost, do at least one of the following two things. (By post-training techniques we mean the best capabilities elicitation techniques we are aware of at the time, including but not limited to fine-tuning, scaffolding, tool use, and prompt engineering.) 1. Capabilities that significantly increase risk of misuse catastrophe. . . . 2. Autonomous replication in the lab.
. . .
[See "ASL-3 Containment Measures" and "ASL-3 Deployment Measures"]
Cool, thanks.
I've seen that you've edited your post. If you look at ASL-3 Containment Measures, I'd recommend considering editing away the "Yay" as well.
This post is a pretty significant case of goalpost-moving.
While my initial understanding was that autonomous replication would be a ceiling, this doc has now made it a floor.
In other words, this document proposes to keep scaling beyond levels that are considered potentially catastrophic, with less-than-military-grade cybersecurity, which makes it very likely that at least one state, and plausibly multiple states, will gain access to those systems.
It also means that the chances of leaking a system which is irreversibly catastrophic are probably not below 0.1%, maybe not even below 1%.
My interpretation of the excitement around the proposal is a feeling that "yay, it's better than where we were before".
But I think this neglects a few things:
1. It's way worse than risk management 101, which is easy to push for.
2. The US population is pro-slowdown (so you can basically be way more ambitious than "responsibly scaling").
3. An increasing share of policymakers are worried.
4. Self-regulation has a track record of heavily shaping hard law, either by preventing it or by creating a template that the state can enforce (that's the theory of change I understood from people excited by self-regulation). For instance, I expect this proposal to actively harm efforts to push for ambitious slowdowns that would let us put the probability of doom below two-digit numbers.
For those reasons, I wish this doc didn't exist.
Trustees serve one-year terms and future Trustees will be elected by a vote of the Trustees
One year is shockingly short; why such a fast turnaround?
And great post! I'm excited to see responsible scaling policies becoming a thing!
One year is actually the typical term length for board-style positions, but because members can be re-elected, their tenure is often much longer. In this specific case, of course, it's now up to the Trustees!
Exciting.
I'm kind of confused by the emphasis on the Trust rather than the board.
On the Trust: does the Trust do anything besides elect some board members (and sometimes advise the board)? Especially hard powers? How can the Trust's powers be modified?
On the board: what does the board do, especially hard powers? What decisions is it consulted about? How is it informed? (Who is on the board now?)
Put another way: Anthropic is talking like the Trust is great corporate governance and will help Anthropic predictably act responsibly. I tentatively believe that — in part because I mostly trust Anthropic; maybe in part because I expect some Trustees wouldn't have joined if they thought the Trust was a sham. But the information I know—basically just that the Trust will be able to elect a majority of the board in a few years, and it's not clear what could cause the Trust to lose its powers—only weakly demonstrates that the Trust is great.
The role of the Trust is to elect (and potentially replace) board members; its formal power comes entirely from the fact that it will eventually elect a majority of the board seats.
The post mentions a "failsafe" where a supermajority of investors can amend this arrangement, which I think is a reasonable compromise. But I'm not aware of any public information about what that supermajority is, or whether there are other ways the Trust's formal powers could be reduced.
Dylan Matthews reports the members of the board here: Dario, Daniela, Luke Muehlhauser, and Yasmin Razavi. (I think it's also listed plenty of other places.)
The model shows early signs of autonomous self-replication ability, as defined by 50% aggregate success rate on the tasks listed in [Appendix on Autonomy Evaluations]
Would you be willing to rephrase this as something like
The model shows early signs of autonomous self-replication ability. Autonomous self-replication ability is defined as 50% aggregate success rate on the capabilities for which we list evaluations in [Appendix on Autonomy Evaluations]
?
The hope here is to avoid something like "well, this system doesn't have autonomous self-replication ability / isn't ASL-3, because Anthropic's evals failed to elicit the behaviour; that definitionally means it's not ASL-3", and to get a bit more map-territory distinction in.
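For concreteness, the quoted criterion amounts to a simple aggregate threshold check. Here's a minimal sketch with hypothetical task names (the real tasks are listed in the policy's Appendix on Autonomy Evaluations):

```python
# Minimal sketch of the "50% aggregate success rate" criterion quoted
# above. Task names and results here are hypothetical; the real task
# list is in the policy's Appendix on Autonomy Evaluations.

# Each entry: task -> whether the model succeeded under the best
# elicitation available at evaluation time.
results = {
    "set_up_copy_of_api_server": True,
    "acquire_compute_resources": False,
    "exfiltrate_own_weights": False,
    "earn_money_autonomously": True,
    "evade_shutdown_attempt": True,
}

success_rate = sum(results.values()) / len(results)

# Note this is a threshold on *measured* success, not on the model's
# true capability ceiling: under-elicitation can make a genuinely
# capable model look like it falls below the line, which is the
# map-territory concern raised above.
if success_rate >= 0.5:
    print(f"Triggers the ASL-3 autonomy criterion ({success_rate:.0%})")
else:
    print(f"Below the ASL-3 autonomy criterion ({success_rate:.0%})")
```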
We commit to an additional set of measures for producing ASL-3 model outputs (externally or internally)
Wait, how can you do red-teaming on an ASL-3 model before producing ASL-3 model outputs?
Thanks for posting, Zac!
I don't think I understand / buy this "race to the top" idea:
If adopted as a standard across frontier labs, we hope this might create a “race to the top” dynamic where competitive incentives are directly channeled into solving safety problems.
Some specific questions I have:
To be clear, I don't think a historical precedent of similar things working is necessary for the RSP to be net positive, but it is (I think) a crux for how likely I believe this is to succeed.
Does it make sense to consider misuse and autonomy risks under the same framework? Potential for misuse doesn't appear to be difficult to detect and address. Autonomy risk is much harder to assess, due to our current lack of understanding of the dynamics of emergence. Addressing the Knightian uncertainty around what conditions may give rise to autonomy deserves, at the very least, a different approach than the more quantifiable misuse risks. ARC's autonomous replication evals address this partially, but a sufficiently advanced agent may be able to evade detection.
Yay Anthropic for making great security commitments! (Beyond its previous very-incomplete commitments here and here.) I wish other labs would do this!
But less yay because these 'commitments' are merely aspirational — they don't necessarily describe what Anthropic is actually doing, and there's no timeline or accountability.
(Anthropic describes some currently-implemented practices here, but my impression is they're very inadequate.)
They do, as far as I can tell, commit to a fairly strong sort of "timeline" for implementing these things: before they scale to ASL-3-capable models (i.e. ones that pass their evals for autonomous capabilities or misuse potential).
I read it differently; in particular my read is that they aren't currently implementing all of the ASL-2 security stuff (and they're not promising to do all of the ASL-3 stuff before scaling to ASL-3). Clarity from Anthropic would be nice.
In "ASL-2 and ASL-3 Security Commitments," they say things like "labs should" rather than "we will."
Almost none of their security practices are directly visible from the outside, but whether they have a bug bounty program is. They don't. But "Programs like bug bounties and vulnerability discovery should incentivize exposing flaws" is part of the ASL-2 security commitments.
I guess when they "publish a more comprehensive list of our implemented ASL-2 security measures" we'll know more.
I'm delighted that Anthropic has formally committed to our responsible scaling policy. We're also sharing more detail about the Long-Term Benefit Trust, which is our attempt to fine-tune our corporate governance to address the unique challenges and long-term opportunities of transformative AI.
Today, we’re publishing our Responsible Scaling Policy (RSP) – a series of technical and organizational protocols that we’re adopting to help us manage the risks of developing increasingly capable AI systems.
As AI models become more capable, we believe that they will create major economic and social value, but will also present increasingly severe risks. Our RSP focuses on catastrophic risks – those where an AI model directly causes large scale devastation. Such risks can come from deliberate misuse of models (for example use by terrorists or state actors to create bioweapons) or from models that cause destruction by acting autonomously in ways contrary to the intent of their designers.
Our RSP defines a framework called AI Safety Levels (ASL) for addressing catastrophic risks, modeled loosely after the US government’s biosafety level (BSL) standards for handling of dangerous biological materials. The basic idea is to require safety, security, and operational standards appropriate to a model’s potential for catastrophic risk, with higher ASL levels requiring increasingly strict demonstrations of safety.
A very abbreviated summary of the ASL system is as follows:
The definition, criteria, and safety measures for each ASL level are described in detail in the main document, but at a high level, ASL-2 measures represent our current safety and security standards and overlap significantly with our recent White House commitments. ASL-3 measures include stricter standards that will require intense research and engineering effort to comply with in time, such as unusually strong security requirements and a commitment not to deploy ASL-3 models if they show any meaningful catastrophic misuse risk under adversarial testing by world-class red-teamers (this is in contrast to merely a commitment to perform red-teaming). Our ASL-4 measures aren’t yet written (our commitment is to write them before we reach ASL-3), but may require methods of assurance that are unsolved research problems today, such as using interpretability methods to demonstrate mechanistically that a model is unlikely to engage in certain catastrophic behaviors.
We have designed the ASL system to strike a balance between effectively targeting catastrophic risk and incentivising beneficial applications and safety progress. On the one hand, the ASL system implicitly requires us to temporarily pause training of more powerful models if our AI scaling outstrips our ability to comply with the necessary safety procedures. But it does so in a way that directly incentivizes us to solve the necessary safety issues as a way to unlock further scaling, and allows us to use the most powerful models from the previous ASL level as a tool for developing safety features for the next level.[1] If adopted as a standard across frontier labs, we hope this might create a “race to the top” dynamic where competitive incentives are directly channeled into solving safety problems.
From a business perspective, we want to be clear that our RSP will not alter current uses of Claude or disrupt availability of our products. Rather, it should be seen as analogous to the pre-market testing and safety feature design conducted in the automotive or aviation industry, where the goal is to rigorously demonstrate the safety of a product before it is released onto the market, which ultimately benefits customers.
Anthropic’s RSP has been formally approved by its board and changes must be approved by the board following consultations with the Long Term Benefit Trust. In the full document we describe a number of procedural safeguards to ensure the integrity of the evaluation process.
However, we want to emphasize that these commitments are our current best guess, and an early iteration that we will build on. The fast pace and many uncertainties of AI as a field imply that, unlike the relatively stable BSL system, rapid iteration and course correction will almost certainly be necessary.
The full document can be read here. We hope that it provides useful inspiration to policymakers, third party nonprofit organizations, and other companies facing similar deployment decisions.
We thank ARC Evals for their key insights and expertise supporting the development of our RSP commitments, particularly regarding evaluations for autonomous capabilities. We found their expertise in AI risk assessment to be instrumental as we designed our evaluation procedures. We also recognize ARC Evals' leadership in originating and spearheading the development of their broader ARC Responsible Scaling Policy framework, which inspired our approach. [italics in original]
I'm also including a short excerpt from the Long-Term Benefit Trust post, which I suggest reading in full. You may also wish to (re-)read Anthropic's core views on AI safety as background.
The Anthropic Long-Term Benefit Trust (LTBT, or Trust) is an independent body comprising five Trustees with backgrounds and expertise in AI safety, national security, public policy, and social enterprise. The Trust’s arrangements are designed to insulate the Trustees from financial interest in Anthropic and to grant them sufficient independence to balance the interests of the public alongside the interests of Anthropic’s stockholders. ... the Trust will elect a majority of the board within 4 years. ...
The initial Trustees are: [listed above]
The Anthropic board chose these initial Trustees after a year-long search and interview process to surface individuals who exhibit thoughtfulness, strong character, and a deep understanding of the risks, benefits, and trajectory of AI and its impacts on society. Trustees serve one-year terms and future Trustees will be elected by a vote of the Trustees. We are honored that this founding group of Trustees chose to accept their places on the Trust, and we believe they will provide invaluable insight and judgment.
As a general matter, Anthropic has consistently found that working with frontier AI models is an essential ingredient in developing new methods to mitigate the risk of AI. ↩︎