I am most interested in understanding their ASL-3 security commitments in more detail. My sense was that Anthropic was unlikely to stick with its original commitments there, and I am curious whether they have indeed changed.
Old ASL-3 standard: see original RSP pp. 7-8 and 21-22. New ASL-3 standard. Quotes below.
Also note the original RSP said
We will publish a more comprehensive list of our implemented ASL-3 security measures below (with additional components not listed here) following the [RAND] report’s publication.
The RAND report was published in late May but this list never appeared.
Also the old RSP was about "containment" rather than just "security": containment is supposed to address risk of model self-exfiltration in addition to risk of weights being stolen. (But not really at ASL-3.) The new RSP is just about security.
Old ASL-3 security:
At ASL-3, labs should harden security against non-state attackers and provide some defense against state-level attackers. We commit to the following security themes. Similarly to ASL-2, this summary previews the key security measures at a high level and is based on the forthcoming RAND report. We will publish a more comprehensive list of our implemented ASL-3 security measures below (with additional components not listed here) following the report’s publication.
These requirements are cumulative above the ASL-2 requirements.
- At the software level, there should be strict inventory management tracking all software components used in development and deployment. Adhering to specifications like SSDF and SLSA, which includes a secure build pipeline and cryptographic signature enforcement at deployment time, must provide tamper-proof infrastructure. Frequent software updates and compliance monitoring must maintain security over time.
- On the hardware side, sourcing should focus on security-minded manufacturers and supply chains. Storage of sensitive weights must be centralized and restricted. Cloud network infrastructure must follow secure design patterns.
- Physical security should involve sweeping premises for intrusions. Hardware should be hardened to prevent external attacks on servers and devices.
- Segmentation should be implemented throughout the organization to a high threshold limiting blast radius from attacks. Access to weights should be indirect, via managed interfaces rather than direct downloads. Software should place limitations like restricting third-party services from accessing weights directly. Employees must be made aware that weight interactions are monitored. These controls should scale as an organization scales.
- Ongoing monitoring such as compromise assessments and blocking of malicious queries should be both automated and manual. Limits must be placed on the number of inferences for each set of credentials. Model interactions that could bypass monitoring must be avoided.
- Organizational policies must aim to enforce security through code, limiting reliance on manual compliance.
- To scale to meet the risk from people-vectors, insider threat programs should be hardened to require multi-party controls and incentivize reporting risks. Endpoints should be hardened to run only allowed software.
- Pen-testing, diverse security experience, concrete incident experience, and funding for substantial capacity all should contribute. A dedicated, resourced security red team with ongoing access to design and code must support testing for insider threats. Effective honeypots should be set up to detect attacks.
New ASL-3 security:
When a model must meet the ASL-3 Security Standard, we will evaluate whether the measures we have implemented make us highly protected against most attackers’ attempts at stealing model weights.
We consider the following groups in scope: hacktivists, criminal hacker groups, organized cybercrime groups, terrorist organizations, corporate espionage teams, internal employees,[1] and state-sponsored programs that use broad-based and non-targeted techniques (i.e., not novel attack chains).
The following groups are out of scope for the ASL-3 Security Standard because further testing (as discussed below) should confirm that the model would not meaningfully increase their ability to do harm: state-sponsored programs that specifically target us (e.g., through novel attack chains or insider compromise) and a small number (~10) of non-state actors with state-level resourcing or backing that are capable of developing novel attack chains that utilize 0-day attacks.
To make the required showing, we will need to satisfy the following criteria:
- Threat modeling: Follow risk governance best practices, such as use of the MITRE ATT&CK Framework to establish the relationship between the identified threats, sensitive assets, attack vectors and, in doing so, sufficiently capture the resulting risks that must be addressed to protect model weights from theft attempts. As part of this requirement, we should specify our plans for revising the resulting threat model over time.
- Security frameworks: Align to and, as needed, extend industry-standard security frameworks for addressing identified risks, such as disclosure of sensitive information, tampering with accounts and assets, and unauthorized elevation of privileges with the appropriate controls. This includes:
- Perimeters and access controls: Building strong perimeters and access controls around sensitive assets to ensure AI models and critical systems are protected from unauthorized access. We expect this will include a combination of physical security, encryption, cloud security, infrastructure policy, access management, and weight access minimization and monitoring.
- Lifecycle security: Securing links in the chain of systems and software used to develop models, to prevent compromised components from being introduced and to ensure only trusted code and hardware is used. We expect this will include a combination of software inventory, supply chain security, artifact integrity, binary authorization, hardware procurement, and secure research development lifecycle.
- Monitoring: Proactively identifying and mitigating threats through ongoing and effective monitoring, testing for vulnerabilities, and laying traps for potential attackers. We expect this will include a combination of endpoint patching, product security testing, log management, asset monitoring, and intruder deception techniques.
- Resourcing: Investing sufficient resources in security. We expect meeting this standard of security to require roughly 5-10% of employees being dedicated to security and security-adjacent work.
- Existing guidance: Aligning where appropriate with existing guidance on securing model weights, including Securing AI Model Weights, Preventing Theft and Misuse of Frontier Models (2024); security recommendations like Deploying AI Systems Securely (CISA/NSA/FBI/ASD/CCCS/GCSB/GCHQ), ISO 42001, CSA’s AI Safety Initiative, and CoSAI; and standards frameworks like SSDF, SOC 2, NIST 800-53.
- Audits: Develop plans to (1) audit and assess the design and implementation of the security program and (2) share these findings (and updates on any remediation efforts) with management on an appropriate cadence. We expect this to include independent validation of threat modeling and risk assessment results; a sampling-based audit of the operating effectiveness of the defined controls; periodic, broadly scoped, and independent testing with expert red-teamers who are industry-renowned and have been recognized in competitive challenges.
- Third-party environments: Document how all relevant models will meet the criteria above, even if they are deployed in a third-party partner’s environment that may have a different set of safeguards.
The new standard is more vague and meta, and Anthropic indeed abandoned various specifics from the original RSP. I think it is very far from passing the LeCun test; it's more like talking about security themes than making an object-level if-then commitment. I don't think Anthropic's security at ASL-3 is a huge deal, and I expect Anthropic to be quite non-LeCun-y, but I think this is just too vague for me to feel good about.
In May Anthropic said "around 8% of all Anthropic employees are now working on security-adjacent areas and we expect that proportion to grow further as models become more economically valuable to attackers." Apparently they don't expect to increase that for ASL-3.
Edit: I changed my mind. I agree with Ryan's comment below. I wish we lived in a world where it's optimal to make strong object-level if-then commitments, but we don't, since we don't know which mitigations will be best to focus on. Tying hands to implement specific mitigations would waste resources. Better to make more meta commitments. (Strong versions require external auditing.)
[1] We will implement robust insider risk controls to mitigate most insider risk, but consider mitigating risks from highly sophisticated state-compromised insiders to be out of scope for ASL-3. We are committed to further enhancing these protections as a part of our ASL-4 preparations.
I agree on abandoning various specifics, but I would note that the new standard is much more specific (less vague) on what needs to be defended against and what validation process and threat modeling process should be.
(E.g., rather than "non-state actors", the RSP more specifically says which groups are and aren't in scope.)
I overall think the new proposal is notably less vague on the most important aspects, though I agree it won't pass the LeCun test because its guidance around auditing is insufficiently precise. Hopefully this can be improved in a future version or for future ASLs.
Oops, I forgot https://www.anthropic.com/rsp-updates. This is great. I really like that Anthropic shares "non-binding descriptions of our future ASL-3 safeguard plans."
Anthropic will "routinely" do a preliminary assessment: check whether it's been 6 months (or >4x effective compute) since the last comprehensive assessment, and if so, do a comprehensive assessment. "Routinely" is problematic. It would be better to commit to do a comprehensive assessment at least every 6 months.
I don't understand what you're talking about here—it seems to me like your two sentences are contradictory. You note that the RSP says we will do a comprehensive assessment at least every 6 months—and then you say it would be better to do a comprehensive assessment at least every 6 months.
the RSP set forth an ASL-3 threshold and the Claude 3 Opus evals report incorrectly asserted that that threshold was merely a yellow line.
This is just a difference in terminology—we often use the term "yellow line" internally to refer to the score on an eval past which we would no longer be able to rule out the "red line" capabilities threshold in the RSP. The idea is that the yellow line threshold at which you should trigger the next ASL should be the point where you can no longer rule out dangerous capabilities, which should be lower than the actual red line threshold at which the dangerous capabilities would definitely be present. I agree that this terminology is a bit confusing, though, and I think we're trying to move away from it.
You note that the RSP says we will do a comprehensive assessment at least every 6 months—and then you say it would be better to do a comprehensive assessment at least every 6 months.
I thought the whole point of this update was to specify when you start your comprehensive evals, rather than when you complete your comprehensive evals. The old RSP implied that evals must complete at most 3 months after the last evals were completed, which is awkward if you don't know how long comprehensive evals will take, and is presumably what led to the 3-day violation in the most recent round of evals.
(I think this is very reasonable, but I do think it means you can't quite say "we will do a comprehensive assessment at least every 6 months".)
There's also the point that Zach makes below that "routinely" isn't specified and implies that the comprehensive evals may not even start by the 6 month mark, but I assumed that was just an unfortunate side effect of how the section was written, and the intention was that evals will start at the 6 month mark.
(I agree that the intention is surely no more than 6 months; I'm mostly annoyed for legibility—things like this make it harder for me to say "Anthropic has clearly committed to X" for lab-comparison purposes—and LeCun-test reasons)
Thanks.
Canonical linkpost: https://www.lesswrong.com/posts/Q7caj7emnwWBxLECF/anthropic-s-updated-responsible-scaling-policy.
I haven't yet formed an opinion on the key questions, including are the thresholds and mitigations reasonable and adequate. I'll edit this post later.
Anthropic's new version of its RSP is here at last.
Summary of changes.
Initial reactions:
The new framework involves "preliminary assessments" and "comprehensive assessments." Anthropic will "routinely" do a preliminary assessment: check whether it's been 6 months (or >4x effective compute) since the last comprehensive assessment, and if so, do a comprehensive assessment. "Routinely" is problematic. It would be better to commit to do a comprehensive assessment at least every 6 months.
This is weaker than the original RSP, which said "During model training and fine-tuning, Anthropic will conduct an evaluation of its models for next-ASL capabilities both (1) after every 4x jump in effective compute, including if this occurs mid-training, and (2) every 3 months to monitor fine-tuning/tooling/etc improvements." I think 6 months seems fine for now (maybe not if AI progress becomes faster/crazier in the future), but the safety buffer should be bigger. Anthropic explains: "We adjusted the comprehensive assessment cadence to 4x Effective Compute or six months of accumulated post-training enhancements (this was previously three months). We found that a three-month cadence forced teams to prioritize conducting frequent evaluations over more comprehensive testing and improving methodologies."
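To make the cadence concrete, here is a minimal sketch of the trigger as I read it: a comprehensive assessment is due once six months have passed or effective compute has grown 4x since the last one. Everything named here is hypothetical (the `LastComprehensiveAssessment` record, the function, treating six months as 183 days, and using ">=" rather than ">" for the 4x comparison); the RSP does not specify any of these details, and it does not say how often the check itself runs.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class LastComprehensiveAssessment:
    """Hypothetical record of the most recent comprehensive assessment."""
    completed_on: date
    effective_compute: float  # arbitrary units; only the ratio matters


def comprehensive_assessment_due(
    last: LastComprehensiveAssessment,
    today: date,
    current_effective_compute: float,
    max_days: int = 183,            # assumption: "six months" taken as 183 days
    compute_multiple: float = 4.0,  # the 4x effective-compute trigger
) -> bool:
    """Return True if either trigger has fired since the last assessment."""
    time_elapsed = (today - last.completed_on).days >= max_days
    compute_scaled = (
        current_effective_compute >= compute_multiple * last.effective_compute
    )
    return time_elapsed or compute_scaled


last = LastComprehensiveAssessment(date(2024, 1, 1), effective_compute=1.0)
print(comprehensive_assessment_due(last, date(2024, 3, 1), 2.0))   # neither trigger
print(comprehensive_assessment_due(last, date(2024, 3, 1), 4.0))   # compute trigger
print(comprehensive_assessment_due(last, date(2024, 7, 15), 1.5))  # time trigger
```

Note that this only says when an assessment becomes due. Whether the clock runs from the start or the completion of the previous assessment, and how soon after the trigger the assessment must begin, are exactly the parts the text leaves open.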
ASL-3 deployment mitigations have become more meta — more like we'll make a safety case. (Compare to original.) (This was expected; see e.g. The Checklist: What Succeeding at AI Safety Will Involve.) This is OK; figuring out exact mitigations and how-to-verify-them in advance is hard.
But it's inconsistent with wanting the RSP to pass the LeCun test — for it to be sufficient for other labs to adopt the RSP (or for the RSP to tie Anthropic's hands much). And it means the procedural checks are super important. But the protocol for ASL/mitigation/deployment decisions isn't much more than CEO and RSO decide. A more ambitious procedural approach would involve strong third-party auditing.
I really like that Anthropic shares "non-binding descriptions of our future ASL-3 safeguard plans."
New capability thresholds:
The CBRN threshold triggers ASL-3 deployment and security mitigations. The autonomous AI R&D threshold triggers ASL-3 security mitigations.
New:
Old:
I think the idea behind the new footnote is fine, but I wish it was different in a few ways:
This is in tension with both the last evals report[1] and today's update that "Some of our evaluations lacked some basic elicitation techniques such as best-of-N or chain-of-thought prompting." I've asked for clarification; for now I'm skeptical.
"At minimum, we will perform basic finetuning for instruction following, tool use, minimizing refusal rates." I appreciate details like this.
Nondisparagement: it's cool that they put their stance in a formal written policy, but I wish they just wouldn't use nondisparagement:
Anthropic acknowledges an issue I pointed out.
As far as I can tell, this description is wrong; it was not an ambiguity; the RSP set forth an ASL-3 threshold and the Claude 3 Opus evals report incorrectly asserted that that threshold was merely a yellow line. I would call this a lie but when I've explained the issue to some relevant Anthropic people they've seemed to genuinely not understand it. But not understanding your RSP, when someone explains it to you, is pretty bad. (To be clear, Anthropic didn't cross the threshold; the underlying issue is not huge.)
Anthropic missed the opportunity to say something stronger on external model evals than "Findings from partner organizations and external evaluations of our models (or similar models) should also be incorporated into the final assessment, when available."