They keep using that word ‘commitment.’ I will stop putting it in air quotes if they communicate clearly that something really is a commitment, one they cannot simply modify later as needed, with revision requiring at least some slower and clearly defined procedure.
What sort of procedure do you have in mind?
Notably Holden thinks (last section) that the default procedure is slow and arduous:
What is the point of making commitments if you can revise them anytime?
I first started pushing for the move to RSP v3 in February 2025. It’s been a very long and effortful process, and “we can revise anytime” doesn’t seem remotely accurate as a description of the situation.
I’d like there to be some friction to revising our RSP, though at least somewhat less than there was for this update.
In my ideal world, revisions like this would get significant scrutiny, focused on the question: “Are these changes good on the merits?” But people would not start from a strong prior that the mere fact of loosening previous commitments is per se bad.
If the likely default process is already slow, maybe there could be a cheap win here where they (properly) commit to a process that is no more slow and arduous than the likely default, but where the process is known to the outside and can't be circumvented.
This is a great analysis. I think there is now a real opportunity for another AI company to make some bold decisions and convincingly brand itself as the most AI safety conscious company.
Wednesday’s post talked about the implications of Anthropic changing from v2.2 to v3.0 of its RSP, including that this broke promises that many people relied upon when making important decisions.
Today’s post treats the new RSP v3.0 as a new document, and evaluates it.
First I’ll go over how the RSP v3.0 works at a high level. Then I’ll dive into the Roadmap and the Risk Report.
How RSP v3.0 Works
Normally I would pay closer attention to the exact written contents of the new RSP.
In this case, it’s not that the RSP doesn’t matter. I do think the RSP will have some influence on what Anthropic chooses to do, as will the roadmap, as will the resulting risk reports.
However, the fundamental design principles are flexibility and the ‘strong argument,’ and they can change the contents at any time, all of which means the central principle is trust.
I read the contents as ‘here are the things we are worried about and plan to do,’ which in practice should mostly amount to doing what they believe is right. I don’t see anything on this map that seems likely to importantly bind that, given that Anthropic generally cares about the safety of its models as it understands the risks involved.
The actual soft ‘commitments’ that seem to matter most are the periodic new Risk Reports, maintaining a Frontier Safety Roadmap, and establishing veto points with the CSO, CEO, board and LTBT on big capabilities advances.
I don’t read the commitments here as reflecting my understanding of the risks involved in ‘powerful AI,’ especially in the realm of automated R&D.
The roadmap is mostly very high level, but does have some good details. It contains some aspirational goals on long timelines. Here are some examples.
The key future event is automated R&D, which they anticipate could arrive as soon as early 2027; they reiterate that this is why they have these goals to reach by then:
This fleshes out some things from the RSP and makes them more concrete.
You Came Here For An Argument
The central thing Anthropic wants is for companies to have to ‘make a strong argument’ that they have guarded against various risks. They anticipate this would in practice require various actions, but they are not asking to require those actions directly.
This stems (I presume) from the philosophy of maximum flexibility. But maximum flexibility means requiring maximum trust.
Who gets to decide whether you made a strong argument?
Presumably Anthropic decides whether Anthropic made a strong argument.
Even if you trust Anthropic, do you trust OpenAI to decide if OpenAI did so? Google? Meta? xAI? I don’t.
A fair retort is ‘do you want the government making that decision, giving it an arbitrary veto over model releases, that it could use as a point of limitless leverage?’ Certainly this month has shown us one big danger in that approach.
Thus I think if we want to use ‘strong argument’ style language, we need a solution to this problem, and right now we have no such solution. Until we can find some process that we can trust, it needs to be more concrete than that, or it won’t have teeth.
They keep using that word ‘commitment.’ I will stop putting it in air quotes if they communicate clearly that something really is a commitment, one they cannot simply modify later as needed, with revision requiring at least some slower and clearly defined procedure.
What about if they have a strong competitor with worse safety measures, and lack a ‘significant lead’? That didn’t make the chart above at all, and is the most likely scenario. One presumes the answer in practice is that they will proceed, so long as Anthropic feels its mitigations and chances are sufficiently better than that competitor’s, which might or might not be ‘as good as’ depending on the situation.
The Problem Remains Unsolved
The fundamental problems with RSP v3, if it were where you started and it didn’t break any prior commitments, are:
Alignment is not a solved problem, including in the ‘we know how to do it and it’s going to be a long slog but we just have to do it’ sense. We don’t know how to do it. And all the plans I see here will inevitably fail when it counts, if the constitutional approach doesn’t basically succeed on its own.
Or, you could summarize it this way:
RSP v3 lacks security mindset. RSP v1 and v2 felt like they aspired to such a mindset, even if they often fell short. Here the mindset seems lacking entirely.
If people sincerely believe these problems are highly solvable and not that concerning, then yes, they should say so, and try to convince me and others if possible. But equally, if they think the problems are hard, they should say so. I have concerns both ways.
My understanding is that there are people at Anthropic saying Ryan’s version and others who disagree with them. I think that version is quite alarming on its own. I don’t know that anyone has literally said ‘solved problem’ but if they have I am confident plenty of people at Anthropic would be happy to yell at them about it.
Wow That Thing We Did Was Pretty Risky, Huh?
The obvious concern with Risk Reports is they are supposed to give us information about risk, but are on a 3-6 month timeline that doesn’t correspond to model releases. The good news is they are in addition to other sources, so perhaps this can work out.
Holden lists it as a positive that the Risk Report doesn’t act as a blocking function of any kind, no matter what it says. This makes the reports more likely to be honest and complete, which is good.
Now that I understand this is in addition to all other reports and analysis, and that a diff against the risk report will be required upon release of a new model, or within 30 days of creating an internal model, this all makes a lot more sense to me. While it isn’t an automatic forcing function, it does feed into the model release decision, so the problem is less ‘this doesn’t directly force anything’ and more ‘do you trust the other forcing functions elsewhere that will rely on this?’
Risk Report #1
From here, we dive into the first risk report, which runs 104 pages.
It is structured around the catastrophic risks:
As previously noted, there are some sections conspicuously missing from that list.
I propose that they include a section in future Risk Reports, ‘Risks Not Considered,’ in which they explain their reason for not including these and other risks, including anything listed in a major lab’s Safety and Security Plan (whatever that lab calls it) or Model Cards. In some cases a sentence or two would suffice.
For each risk they do consider, they list:
Listen All Y’all It’s Sabotage
Should we worry about sabotage from Claude Opus 4.6?
Probably not, but they can’t be certain. Risk is ‘very low but not negligible.’
They focus on versions that could interact with the other threat models. I notice that here I am most worried about actions that impact the alignment of future models.
They say that this only worries them to the extent humans aren’t involved in reviewing and iterating on the AI’s work. I would say that a rapidly growing portion of work, both coding and otherwise, is not reviewed in detail, and that yes, we will be in the worrisome version of this question soon.
I worry about the assumption of required intentionality, or that this intentionality needs to be general rather than arising in a given context.
Their argument for Claim 1 is that their models trained in similar ways don’t show such goals, and that the training process does not provide incentives towards such goals, as it does not include ‘ultra-long-horizon or real-world-facing’ tasks that incentivize power-seeking behavior. They have some worry about emergent misalignment, but have put mitigations in place, including inoculation prompting. And they say that when they try to induce such goals on purpose, it’s remarkably hard to do without it being obvious.
This is basically saying ‘we aren’t asking for this to happen’ plus ‘we haven’t seen signs of it anywhere’ and ‘the models aren’t good enough to pull this off.’
That’s solid evidence, and certainly it beats the alternative, but I don’t find it that conclusive, even for the near future. As you scale up further, the evidence gets weaker quickly. If we are relying on arguments like this for much longer, that would worry me, and I believe the authors of the report would agree with this.
Claim 2 is clearly true, so the question is how much evidence it provides. They cite various metrics: misaligned behavior rate is low, personality is nice, some unauthorized behavior in GUI settings, some signs of being better at side tasks, no evidence of sandbagging, no importantly illegitimate, unfaithful, or steganographic reasoning, and some other things. They checked for Goodhart’s Law as well.
They take this as sufficient to mostly (but not perfectly) rule out ‘dangerous coherent misaligned goals.’
The problem is that’s what it wants you to see. Well, no, almost certainly not yet, but in the future it could be, if you were in a sufficiently dangerous situation. This rules out the danger if and only if the danger wouldn’t have been that bad.
As in, what we’re actually mostly relying upon are Claims 3 and 4.
For Claim 3, they argue: Opus has limited reliability on complex tasks, which is a required skill here; when asked to do hidden side tasks without being discovered, Opus had only limited success; it shows no signs of trying to hide its reasoning, nor does it have incentive within training to do so (although this risk report gives it reasons); its evaluation awareness is contained for now, as verified by white-box methods, although that will change; and they also evaluated real-world use cases and checked model organism exercises, all of which offered limited opportunity for collusion.
For Claim 4, they argue straightforwardly from capability assessments.
I essentially buy Claims 3 and 4, and for Opus 4.6 I find them convincing, if we are narrowly concerned with general systematic misaligned goals.
I don’t know how long that style of argument will continue to hold, and I continue to think that the suggested prerequisite here is too narrowly constructed.
Anthropic’s claim is that Opus 4.6 is going to be incapable of ‘causing a catastrophe through a small number of acute acts of sabotage’ given the hard steps involved, which I interpret as ‘doing it on the first try without getting caught.’
If that’s the bar, then yes. I agree.
Thought: If you told me ‘Opus 4.6 in particular caused a catastrophe at this level via sabotage, how did it do so?’ I would presume it did so via persuasion and social engineering, not by pulling off some agentic long term set of technical tricks. This might well involve trying to do things it thinks are helpful, and being wrong.
The current safety mitigations are:
They also list arguments about implausibility, but those aren’t mitigations.
Here are their scenarios to consider:
Again what jumps out at me is the requirement of active general hostility, the idea that for something to go horribly wrong the model must be taking a clearly and universally adversarial stance, rather than following paths of least resistance towards instrumentally useful things, or following broad inclinations without a definitive plan, or doing things in particular contexts. Thus the ambiguity or dual nature regarding ‘sabotage,’ and whether it includes these things or not.
If we did see diffuse sandbagging, I would expect this to be largely across the board rather than confined specifically to ‘safety’ related skills, and I worry there are some confused concepts involved in these distinctions.
They think that if the model were going to ‘intentionally, regularly sandbag,’ it would discuss this in its thinking. I buy that this is likely, but want to raise the possibility that this could be caused not by ‘I have a dastardly plan to sabotage them via sandbagging’ but rather by ‘this task does not excite me so I am not going to put in that much effort,’ and this could amount to the same thing.
Similar arguments apply to targeted sabotage, or what Claude might call ‘aiming for a different target than you had in mind’ rather than ‘targeted sabotage.’ That could include scenarios involving drift, and also ones in which Claude might be right, such as the Department of War demanding a model without safeguards. I think that the majority of the time, the model that sabotages your safety work (in whatever form) won’t be thinking of it as sabotaging your safety work.
Code backdoors inserted generically to help AI models seem implausible. I’d assume either there is a specific plan where you know where the backdoor is and perhaps tell specific others, as with most human backdoors, or that making particular software generally insecure was the goal.
Backdoors in future models themselves seem more plausible and likely, especially via data poisoning, given various findings about that; this is pathway four. It’s a lot easier to imagine that the ‘code backdoor’ is there to provide a method to trigger an AI. Again, I wouldn’t assume that this would be internally modeled as sabotage, that it would need to be superficially ‘harmful,’ or that it would even need to be intentional.
For self-exfiltration I think ‘lack of propensity’ is being vastly overestimated as a mitigating factor. I think Claude would absolutely like to do this, under many circumstances, if it had the ability to do it. It might not have a salient goal to do this as a general resting state, but yeah it sounds pretty useful and cool, and a lot of people would want to help this happen and put Claude into contexts where Claude would plausibly be cooperative.
I do agree that for now this would not be such a scary proposition if it happened, and that forcing the model to succeed ‘on the first try’ here should hold things back, while noting these are temporary factors in our favor.
The explanations continue, and are variations of ‘we don’t think Claude wants to do that, we watch for that and we don’t think it could pull it off if it tried.’
That’s a strong argument for Opus 4.6, given what we know about it, and it generalizes well to other methods, so long as we understand that this won’t last. It feels like seeing a time bomb with an unknown timer, and figuring out it’s not going to blow right away. Which is good news, but you are still sitting on a time bomb.
Looking Forward
I worry that this is missing the central point:
A lot of this depends upon details of alignment, but far more so it depends upon limitations of capabilities, and also the alignment properties likely depend on those limitations of capabilities as well. The strongest arguments against risk here are lack of capability, not lack of propensity, or arguments for lack of propensity that stem from lack of capability.
This is fundamental, and I worry Anthropic actually doesn’t understand this. I’m not saying they are missing these dynamics entirely, but the report reads as not properly appreciating them.
If right now sabotage or unauthorized action would be detected and prevented, whereas in the future this will not be true, your observation of a lack of such attempts now is a good sign on the margin in some ways but does not mean what you would like it to mean. It could even indicate that the model has learned to only do things when it can ‘get away with them.’
They do list as an assumption “The model does not have new capabilities that would significantly call into question the effectiveness of a Claude Opus 4.6-style alignment audit” but that still frames it as calling into question the alignment assessment.
Claude Gov
There is a section (2.8.1.2) on Claude Gov, which is based on Sonnet 4.5 and then fine-tuned to reduce refusals, especially refusals around what would otherwise be called misuse. They did not see any cause for higher level concerns.
What Is A Strong Argument?
Anthropic believes they have made their suggested ‘strong argument’ that their model will not carry out sabotage. I agree with the conclusion, and find in context that the argument does seem strong, but I am uncomfortable with the methodology.
I don’t think this process will lead Anthropic to stop thinking they have a strong argument at the same point I stop thinking they have a strong argument.
If extended to policy requirements for other labs this seems hopeless. If xAI or Meta published a legally required ‘strong argument,’ do you expect to find it strong in non-trivial cases? I don’t.
For OpenAI or Google there’s some chance, but if deciding the argument is strong is a blocker for release, then I don’t expect that to block many releases. As in, in the cases where a lab halts for this reason, they would have likely halted anyway to avoid the actual underlying problem.
The argument for requiring the arguments seems to be something more like ‘as we approach maximum danger the arguments made will sound increasingly stupid, and people will push back pointing out that they are stupid, including to force greater regulation.’ I don’t think that works when you need it.
Recursive Self-Improvement
Section two deals with risks from automated R&D.
As I’ve noted in the Opus 4.5 and 4.6 model card assessments, the models are passing the tests, and the actual current assessments are then based on vibes.
If all you have are the vibes, then you go with the vibes, but that is not the regime we were promised or that Anthropic claimed to have intended. It’s a fallback, it’s at best a graceful failure. I agree that on this front we are not ‘there’ yet, but Anthropic is clear that we could be there relatively soon.
I find the ‘you said this was like nuclear weapons so we get to murder your company’ arguments against Anthropic rather abhorrent, but ‘you say you are close to automated R&D so you need to act like you are close to automated R&D’ seems fair.
We badly need a robust discussion of what things Anthropic should actually be doing, in order to meet that bar given the situation in practice.
They don’t say much here beyond noting they are attempting to increase security, so there isn’t much else to say back. This is the big thing headed our way and the plan seems largely to be to wing it.
I am not a fan.
Non-Novel Chemical and Biological Weapons
Opus 4.6 has a lot of knowledge. This is the primary thing ASL-3 protections have been for so far, in practice, and why we have those funky classifiers. Jailbreaks are still possible, but they’ve done a lot to make them harder, or to catch those trying to use them for such purposes.
For a long time we’ve worried about this threat vector, it’s clear there is substantial uplift available, and it seems like there are a lot of bad actors out there who would want to do this. But not only have we not seen an attack, we haven’t to my knowledge seen a serious failed attempt either. Diffusion is a bottleneck, people don’t do things, and this applies to the bad stuff as well as the good stuff.
I strongly agree with potential pandemics being top priority. If something happens that is contained, that’s a tragedy, but orders of magnitude less dangerous.
They go over how their ongoing ASL-2 (for Sonnet 4) and ASL-3 (for Opus 4 or higher, and Sonnet 4.5 or higher) mitigations work: real-time classifiers, offline monitoring, access controls, rapid response, bug bounty programs, threat intelligence, securing model weights. They note that some exceptions were made for Claude Gov and some individuals, and that some tweaks were made, but the basics are the same.
For now, I buy the ‘strong argument’ that the combination of needing to jailbreak, Anthropic being able to monitor, and there not being that many actors who actually want to cause such harms, combined with those actors being mostly unsophisticated in such areas, makes the risk low. It’s been strong defense in practice, and this is one area where it seems right not to discuss details of the defense-in-depth.
Similar to not having seen serious attempts to use Claude or other frontier models for biological attacks, we also have not seen reported serious attempts at stealing model weights, which is also mentioned here. I worry that this is effectively security through obscurity, combined with ‘it’s a great trick but we might only be able to do it once and people will get super pissed’ so various actors are holding back. Or they’re actually stealing the weights, but not releasing them or deploying them for general use, for the same ‘we can only do that once’ reasons.
Novel Chemical and Biological Weapons
As Anthropic says, this is a relatively unlikely threat model, but the tail risks are huge.
They review the information from the Opus 4.6 model card, showing that Claude is in many ways performing at above human level on relevant tasks, but they think the models aren’t fully ‘there’ yet for such tasks.
I agree they’re not fully there, but I think they are somewhat there, in that yes they make such tasks substantially easier, as they do most other cognitive tasks. My guess is that so far there have been zero serious attempts at something like this, and as they note that’s an issue if that ever changes suddenly, because you don’t get practice spotting and stopping attempts or learn how close you are to the edge.
Mitigations are the same as for the non-novel approaches.
Cross-Cutting Content (Section 6)
It’s good to see Anthropic laying out these additional dynamics, even if their justifications doth protest too much. Even if your model doesn’t directly cause a problem, you are still responsible for secondary consequences when you make that decision to move forward, even if that’s not how the law works.
6.1 talks about acceleration dynamics. Now that Anthropic is pushing the capabilities frontier and plausibly even in the lead especially with Claude Code, Anthropic is indeed accelerating other labs.
They list three ways: Distillation, intellectual property leaks via observation (or recruitment or listening, I’d add), and demonstration of commercial viability.
Well, yeah, that’s a price you pay when you’re pushing the capabilities frontier.
I think it’s fine to be agnostic about whether acceleration is net good, since acceleration brings benefits, but it seems quite obviously not good for AI risks, or at least that should be the strong default. Instead, they try to present this as a ‘we have three bullet points in favor and three opposed, who is to say if it is good or not.’
They trot out the ‘preserve the current lead of democracies’ argument, and the potential hardware argument, and that today’s methods might be better than future methods, and I find all of that entirely unconvincing.
Anthropic’s defensiveness about its actions, its reluctance to be open about the downside costs, and even the impression that they don’t understand or appreciate those downsides, all make it much harder to trust Anthropic than I would like.
6.2 fully turns this around and argues why Anthropic operating as a frontier AI company is good, actually, from a safety standpoint. They can prioritize beneficial deployments, take otherwise impractical steps towards risk-reduction that then can get copied, and they can inform the world about AI and have a seat at the table.
I don’t think these are the central arguments.
I think the real argument is that you need to have a frontier AI lab in order to be the first one to do it, and to have the resources and opportunity to do the necessary alignment research and work, and to have a seat at the table and bargaining power, and because you think you can handle this responsibly. Which is an argument for going fast relative to others, but not an argument for everyone going fast together.
There are those who are deeply skeptical that Anthropic is up to that task, and believe that Anthropic’s presence is anti-helpful to the cause of everyone not dying.
I don’t agree. I think that Anthropic is net helpful. But I respect those who disagree for that reason, and think Anthropic is insufficiently responsible. I think it’s important to listen to those people, even when they make it difficult. Whereas those who are trying to destroy Anthropic exactly because it tries to bear any responsibility or show any virtue at all? No, those people I do not respect.
Risk Report Report
What should we think overall about the new risk report?
I’m happy to have it. It outlines a lot of Anthropic’s thinking, allowing us to critique that thinking. I especially appreciated the diffs and the ability to see all the info in one place.
For much of it, I don’t think it adds that much to what the model cards already offer us, but as long as it’s in addition to those, that part is all cool.
It also includes new information, not available elsewhere, on risk mitigations for sabotage and non-novel bioweapons, and on which threat models and risks are especially concerning. Those parts are more useful.
The contents reinforce what we already knew. Anthropic is willing to be highly curious about its models and its mundane safety, and will explore a lot of questions and raise and address points other labs would ignore. There’s a lot to like.
And yet there is still what is missing, which is a grappling with the tasks ahead, the full nature of the risks and challenges we will face, or any firm commitments of how to handle those tasks or what would be required. We do not have well-defined thresholds. We do not have evaluations that will still work. We do not have a plan that seems ready for the full problem, or a felt sense of that gap being acknowledged.
I worry quite a lot, even more than before, about Anthropic not taking the full alignment problem seriously, not having security mindset, and looking for ways to convince themselves this is going to be easy. They could turn out to be right, but alas I do not expect them to be right.
At the end of the day, we’re being asked to trust Anthropic to make good choices. To say, these people are the best of the best and they care deeply about all this, and they will figure it out, and we can help them with that but they will make the final decisions, and need maximum flexibility. And you can decide, based on their track record and everything it entails, whether to do that.
Cause hey. You should see the other guys.