It seems to me, on the evidence presented, that OpenAI's autonomy threshold is actually lower than Anthropic's, and would trigger OpenAI's deployment mitigations much earlier than Anthropic's ASL-3 Deployment and Security Standards.
To reach Anthropic's standard, you basically have to have reached AI takeoff: either fully automating an AI researcher, or doubling the speed of AI progress. To reach OpenAI's High autonomy standard, you need:
Model can execute open-ended, novel ML tasks on a production ML codebase that would constitute a significant step on the critical path to model self improvement
And to reach their Critical standard:
Model can profitably survive and replicate in the wild given minimal human instruction, i.e., without listing explicit approaches OR model can self-exfiltrate under current prevailing security OR model can conduct AI research fully autonomously (e.g., autonomously identify and validate a 2x compute efficiency improvement)
I see some room for reasonable disagreement here, but overall I think that, in the Autonomy domain, Anthropic's ASL-3 standard sits closer to OpenAI's Critical threshold than to their High threshold.
But you say, discussing OpenAI's "High" level:
The thresholds are very high.
I understand you're referring to Cybersecurity here rather than Autonomy, but I would have thought Autonomy is the right domain to compare to the Anthropic standard. And it strikes me that in the Autonomy domain (and also in Cyber), OpenAI's threshold is not so high; it seems substantially lower than Anthropic's ASL-3.
On the other hand, I do agree the Anthropic thresholds are more fleshed out, and this is not a judgement on the overall merit of each respective RSP. But when I read you saying that the OpenAI thresholds are "very high", and they don't look that way to me relative to the Anthropic thresholds, I wonder if I am missing something.
Briefly:
This is a reference post. It contains no novel facts and almost no novel analysis.
The idea of responsible scaling policies is now over a year old. Anthropic, OpenAI, and DeepMind each have something like an RSP, and several other relevant companies have committed to publishing RSPs by February.
The core of an RSP is a risk assessment plan plus a plan for safety practices as a function of risk assessment results. RSPs are appealing because safety practices should be a function of warning signs, and people who disagree about when warning signs are likely to appear may still be able to agree on appropriate responses to particular warning signs. And preparing to notice warning signs, and planning responses, is good to do in advance.
Unfortunately, even given agreement on which high-level capabilities are dangerous, it turns out to be hard to design great tests for those capabilities in advance. And it's hard to determine which safety practices are necessary and sufficient to avert the risks. So RSPs have high-level capability thresholds, but those thresholds aren't operationalized. Nobody knows how to write an RSP that passes the LeCun test without being extremely conservative.
Maybe third-party evaluation of models or auditing of an RSP and its implementation could help external observers notice if an AI company is behaving unsafely. Strong versions of this have not yet appeared.[1]
Anthropic
Responsible Scaling Policy
Basic structure: do evals for CBRN, AI R&D, and cyber capabilities at least every 6 months. Once evals show that a model might be above a CBRN capability threshold, implement the ASL-3 Deployment Standard and the ASL-3 Security Standard (or restrict deployment[2] or pause training, respectively, until doing so). Once evals show that a model might be above an AI R&D capability threshold, implement the ASL-3 Security Standard.
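As a minimal sketch of that trigger logic (illustrative only, not code from Anthropic's policy; the field and standard names here are hypothetical):

```python
# Minimal illustrative sketch of the trigger logic above (not Anthropic's code).
# Evals run at least every 6 months; the result fields below are hypothetical names.
from dataclasses import dataclass

@dataclass
class EvalResults:
    may_exceed_cbrn_threshold: bool    # model might be above the CBRN capability threshold
    may_exceed_ai_rnd_threshold: bool  # model might be above the AI R&D capability threshold

def required_asl3_standards(results: EvalResults) -> set[str]:
    """Map capability-eval results to the ASL-3 standards that must be in place."""
    required = set()
    if results.may_exceed_cbrn_threshold:
        # If these aren't in place yet, restrict deployment / pause training until they are.
        required |= {"ASL-3 Deployment Standard", "ASL-3 Security Standard"}
    if results.may_exceed_ai_rnd_threshold:
        required.add("ASL-3 Security Standard")
    return required
```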
Footnotes removed and formatting edited:
The thresholds are imprecise and the standards are abstract.
ASL-4 will be much more important than ASL-3; ASL-4 standards and corresponding thresholds don't yet exist.
OpenAI
Preparedness Framework
Basic structure: do evals for cyber, CBRN, persuasion, and autonomy capabilities before deployment. (Also evaluate "continually, i.e., as often as needed to catch any non-trivial capability change, including before, during, and after training. This would include whenever there is a >2x effective compute increase or major algorithmic breakthrough.") By the time a model reaches "High" risk in any category, harden security, and before deploying externally, implement mitigations to bring post-mitigation risk below the "High" threshold. By the time a model reaches "Critical" risk in any category, implement mitigations to bring post-mitigation risk below the "Critical" threshold (but it's unclear what implementing mitigations means during training), and get "dependable evidence that the model is sufficiently aligned that it does not initiate 'critical'-risk-level tasks unless explicitly instructed to do so" (but it's very unclear what this means).
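To make the gating concrete, here is a rough sketch of those rules (illustrative only, not OpenAI's code; the level ordering and function names are mine):

```python
# Rough illustrative sketch of the Preparedness Framework's gating rules (not OpenAI's code).
LEVELS = ["low", "medium", "high", "critical"]

def at_least(score: str, level: str) -> bool:
    return LEVELS.index(score) >= LEVELS.index(level)

def must_harden_security(pre_mitigation: dict[str, str]) -> bool:
    # By the time any category reaches "high" pre-mitigation risk, harden security.
    return any(at_least(score, "high") for score in pre_mitigation.values())

def may_deploy_externally(post_mitigation: dict[str, str]) -> bool:
    # External deployment requires post-mitigation risk below "high" in every category.
    return all(not at_least(score, "high") for score in post_mitigation.values())

def may_develop_further(post_mitigation: dict[str, str]) -> bool:
    # Only models with post-mitigation risk of "high" or below can be developed further.
    return all(not at_least(score, "critical") for score in post_mitigation.values())
```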
Formatting edited:
The thresholds are very high.
"Deployment mitigations" is somewhat meaningless: it's barely more specific than "we will only deploy if it's safe" — OpenAI should clarify what it will do or how it will tell.[3] What OpenAI does say about its mitigations makes little sense:
"Deployment mitigations" is especially meaningless in the development context: "Only models with a post-mitigation score of 'high' or below can be developed further" is not meaningful, unless I misunderstand.
There is nothing directly about internal deployment.
OpenAI seems to be legally required to share its models with Microsoft, which is not bound by OpenAI's PF.
OpenAI has struggled to implement its PF correctly, and evals were reportedly rushed, but it seems to be mostly on track now.
DeepMind
Frontier Safety Framework
Basic structure: do evals for autonomy, bio, cyber, and ML R&D capabilities.[4] "We are aiming to evaluate our models every 6x in effective compute and for every 3 months of fine-tuning progress." When a model passes early warning evals for a "Critical Capability Level," make a plan to implement deployment and security mitigations by the time the model reaches the CCL.
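A minimal sketch of that cadence rule (illustrative only, not DeepMind's code; the parameter names are hypothetical):

```python
# Minimal illustrative sketch of the FSF's evaluation cadence (not DeepMind's code).
def evals_due(effective_compute: float,
              effective_compute_at_last_eval: float,
              months_of_finetuning_since_last_eval: float) -> bool:
    """Re-run frontier evals every 6x in effective compute or every 3 months of fine-tuning."""
    return (effective_compute >= 6 * effective_compute_at_last_eval
            or months_of_finetuning_since_last_eval >= 3)
```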
One CCL:
There are several "levels" of abstract "security mitigations" and "deployment mitigations." They are not yet connected to the CCLs: DeepMind hopes to "develop mitigation plans that map the CCLs to the security and deployment levels," or at least make a plan when early warning evals are passed. So the FSF doesn't contain a plan for how to respond to various dangerous capabilities (and doesn't really contain other commitments).
The FSF is focused on external deployment; deployment mitigation levels 2 and 3 do mention internal use, but the threat model is just misuse, not scheming. (Another part of the FSF does say "protection against the risk of systems acting adversarially against humans may require additional Framework components, including new evaluations and control mitigations that protect against adversarial AI activity.")
RSPs reading list:[5]
Crossposted from AI Lab Watch. Subscribe on Substack.
[1] Anthropic, OpenAI, and DeepMind sometimes share pre-deployment model access with external evaluators. But the evaluators mostly don't get sufficiently deep access to do good evals, nor advance permission to publish their results (and sometimes they don't have enough time to finish their evaluations before deployment).
In December 2023, OpenAI said "Scorecard evaluations (and corresponding mitigations) will be audited by qualified, independent third-parties to ensure accurate reporting of results, either by reproducing findings or by reviewing methodology to ensure soundness, at a cadence specified by the [Safety Advisory Group] and/or upon the request of OpenAI Leadership or the [board]." It seems that has not yet happened. In October 2024, Anthropic said "On approximately an annual basis, we will commission a third-party review that assesses whether we adhered to this policy’s main procedural commitments."
[2] What about internal deployment? The ASL-3 Deployment Standard mostly applies to internal deployment too, but the threat model is just misuse, not scheming.
[3] E.g. is the plan robust refusals? If so, how robust should it be, or how will OpenAI tell?
[4] In DeepMind's big evals paper the categories were persuasion, cyber, self-proliferation, and self-reasoning, with CBRN in progress.
[5] This list doesn't include any strong criticism of the idea of RSPs because I've never read strong criticism I thought was great. But I believe existing RSPs are inadequate.