Imo the natural reading of "public deployment" in the Anthropic RSP is about whether a member of the general public can access the model, which isn't the case currently.
I agree that the literal text of Section 1.2.2 weakly implies that this is considered a public deployment, but it seems way more likely that there just wasn't much thought put into that one particular paragraph — e.g. maybe it was copied from an earlier system card and nobody noticed the implication, or maybe it was drafted when they planned to release to the public and didn't get updated when the plans changed.
Anthropic really needs to cut it out with the pattern of
At this point I have very little patience for any claim that people misunderstood Anthropic's commitment and actually it was weaker than that. If Anthropic doesn't want people holding them to commitments they didn't intend to make, then they need to stop making ambiguous commitments.
*I realize you [Rohin] don't work at Anthropic; my point is that I don't think Anthropic is allowed to use the defense that you're suggesting they could use.
I think the strongest argument here is that Anthropic themselves refer to the section of the RSP that says they have to do a risk report when they "publicly deploy" a model, when they talk about why they are releasing the current risk report:
And if we release a model that is “significantly more capable” than those discussed in the prior Risk Report, we must “publish a discussion (in our System Card or elsewhere) of how that model’s capabilities and propensities affect or change analysis in the Risk Report.”
"significantly more capable" is a quote from this paragraph:
When we publicly deploy a model that we determine is significantly more capable than any of the models covered in the most recent Risk Report, we will publish a discussion (in our System Card or elsewhere) of how that model’s capabilities and propensities affect or change analysis in the Risk Report.
It's not a perfectly airtight case, but it seems to me that Anthropic is saying in the first paragraph that they consider the Mythos release to be the kind of thing that would trigger the second paragraph, which would make it a "public deployment".
I agree the common-sense reading of "public deployment" could reasonably not apply to the present situation (though it's IMO a bit of a stretch), but I think given these paragraphs, it seems like Anthropic themselves think it met the relevant threshold.
I agree it's a reasonable reading, but I'm not sure I'm sold on "natural", since in that case there's no category defined for the situation Anthropic finds themselves in with Mythos.
It seems a little awkward for Anthropic to have deployed the model to a small number of external parties and for the RSP to have nothing to say about that, while it does have things to say about "internal only" and "generally available" deployments. (Though I do think this kind of categorization should be driven by risk modeling, not vice versa.)
Separately, I think both the text of Section 1.2.2 and the entire Alignment Risk Update being conducted suggest that Anthropic was treating something about this release the way they'd treat a public deployment, but as you say it's possible they were planning on a general release but changed their minds, and so happened to have done all the necessary work already.
I and some other people noticed a potential discrepancy in Anthropic's announcement of Claude Mythos. The version of the RSP that was operative over the relevant period of time (3.0) included a section (3.1) that suggested some internal deployments would require Anthropic to publish a discussion of that model's effect on the analysis in their previously-published Risk Reports within 30 days.
A separate issue that Claude Opus noticed while I was writing this post is that Anthropic's earlier release to "a small set of external customers via a limited research access program" might count as a public deployment, which would trigger the same publishing requirement immediately. I will argue this one first, since I think the case here is stronger.
Anthropic only published their Alignment Risk Update on April 7th, when they publicly announced Claude Mythos.
Did Anthropic mess up?
tl;dr: they probably messed up on the public deployment thing, and it's unclear whether they messed up on the 30-day internal deployment thing. My guess is that Anthropic would argue they're in the clear on the 30-day one, but this depends on some interpretations that are at least slightly favorable to them. I don't know how they'd argue the public deployment one. Relatedly, the RSP has some gaps and ambiguities that should probably be fixed. In some sense this is all nitpicking over details that I don't think matter much for x-risk, except to the extent that it reveals an organizational inability to create and follow a checklist. That is a bad skill to be lacking.
Requirement to publish "discussion" when publicly deployed
Anthropic's RSP requires that they publish "discussion" of how their risk analysis would change, when they publicly deploy a model that's significantly more capable than their previous models.
Anthropic probably believes the public deployment condition was met when they shared access to Mythos with the "40 additional organizations" they describe in their announcement of Project Glasswing. At a minimum they think this condition was met by the time of their public announcement, based on the "RSP decision-making" section in their system card.
However, they gave their "launch partners" access to Mythos weeks earlier. The potential blast radius of their launch partners is approximately "the whole internet". Anthropic does not define "public deployment", but I do not think there is a principled definition under which a release to a limited set of 40 additional organizations counts but their earlier release to their "launch partners" does not.
I think this is a violation under most plausible readings of the source text and surrounding circumstances. Peek inside if you want all the gory details.
More detailed analysis with reference to source text
Here is the full text of the RSP's section 3.1:
Here, we care about the first Timing condition.
Anthropic considers themselves to have publicly deployed the model by the time of the announcement on April 7th. This is implied by section 1.2.2, "RSP decision-making" in their system card:
Two sources of evidence suggest that Anthropic's launch partners received access to Mythos well before April 7th. The first is the Alignment Risk Update for Claude Mythos, which says:
The second is the Project Glasswing announcement, which also includes testimonials from some of their launch partners (bolding mine):
Now, the question is whether this counts as a "public deployment". The strongest argument I can imagine Anthropic making is that the "40 additional organizations" received access on April 7th, and that is what they're counting as a public deployment[1]. If so, this seems like an extremely unprincipled distinction to me. Anthropic clearly considers some deployments narrower than "generally available" to be "public deployments". The most natural boundary I can imagine is between "internal deployment" - purely within Anthropic - and "external deployment" - anything outside of Anthropic. If there are external deployments which nonetheless don't count as public deployments, this seems like a choice more motivated by economic and organizational realpolitik than by a principled judgment downstream of specific risk considerations. And in this case, their list of launch partners seems pretty close to the worst possible set of targets to release a potentially misaligned model to, if that sort of thing matters under your threat model[2].
Maybe there's a principled argument for why the early release to their launch partners didn't count, but the subsequent[3] release to additional organizations did. But right now this seems like a failure to me.
Requirement to publish "discussion" within 30 days of a qualified internal deployment
Anthropic's RSP also requires that they publish the same kind of "discussion" within 30 days of an internal deployment of a model they judge poses "significant risks above and beyond those posed by our public models", for four categories of risk outlined in their RSP.
Anthropic internally deployed Mythos on February 24th. Their Alignment Risk Update says that "the overall risk is very low, but higher than for previous models" for one of the four risk categories. They do not anywhere spell out in plain language whether they believe Mythos fulfills the criteria described above.
Opus thinks it's a violation. I think it's iffy. There's a really huge amount of ambiguity in much of the relevant wording. Peek inside if you want all the gory details.
More detailed analysis with reference to source text
Mythos was internally deployed on February 24th, so the earliest "deadline" that Anthropic could have had to publish that discussion was March 25th[4], if they "determined" that they internally deployed an in-scope model on February 24th. From this we can infer that they did not make that determination between February 24th and March 8th (which is the last day that such a determination would have required publication of a discussion before April 7th, when they published the discussion as required by the first condition).
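As a sanity check on the date math above (a sketch; the post doesn't state the year, so a non-leap year like 2025 is my assumption, and per footnote [4] the deployment day counts as day 1, making the deadline 29 days later):

```python
from datetime import date, timedelta

# Assumed year: 2025 (not stated in the post); February has 28 days.
deploy = date(2025, 2, 24)    # internal deployment of Mythos
announce = date(2025, 4, 7)   # public announcement / discussion published

# Counting the determination day as day 1 of the 30-day window,
# the publication deadline is the determination date plus 29 days.
earliest_deadline = deploy + timedelta(days=29)
print(earliest_deadline)  # 2025-03-25

# Latest determination date whose deadline still falls strictly
# before the April 7th publication:
latest_determination = announce - timedelta(days=30)
print(latest_determination)  # 2025-03-08
```

A determination made on March 8th would have a deadline of April 6th, one day before the discussion was actually published; any determination from March 9th onward has a deadline of April 7th or later, which is consistent with the inference in the paragraph above.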
Immediately we see a problem: the RSP doesn't say anything about when such a determination must be made, relative to the internal deployment. I, personally, hope that this determination needs to be made before the internal deployment, especially given the "High-stakes sabotage opportunities" category of capability thresholds as a source of increased risk. If that's not the case, there needs to be an explicit deadline, else it turns into a get-out-of-jail-free card.
Let's put that aside for now and check whether Mythos is, in fact, "in-scope" at all.
Mythos is not available to the general public, so we need to rely on Anthropic's own evaluations here. Fortunately, section 1.2.2 of the system card contains a summary of whether the model poses increased risks downstream of capabilities within the 4 categories of capability thresholds described above. (More detail is included in Section 2, "RSP Evaluations".) Anthropic claims:
Let me quote the relevant part of Section 2.1.3.1 (bolding mine):
Does this count as posing "significant risks above and beyond those posed by our public models"? I dunno, man, I have no idea what they meant by "significant". Let's ask Claude Opus 4.6:
Claude Opus 4.6
The short answer: I think there's a genuinely strong case that Mythos Preview should have been treated as in-scope, and that Anthropic's position relies on an interpretation of "significant" that's in tension with their own stated findings.
The reasoning:
The critical chain is: RSP §3.1 defines in-scope internal models as those posing "significant risks above and beyond those posed by our public models," with footnote 4 narrowing this to risks from the four capability thresholds in Section 1. The "High-stakes sabotage opportunities" threshold is the relevant one here.
Now look at what Anthropic's own documents say about Mythos Preview on this exact dimension:
The tension is this: Anthropic's alignment risk update makes a detailed case that Mythos Preview poses categorically new alignment-related risks relative to their public models — risks that fall squarely within the "High-stakes sabotage opportunities" threshold. The document's entire reason for existing is that Mythos Preview's risk profile is materially different from previous models on this dimension. It's hard to read the alignment risk update and conclude that the risks it describes are not "above and beyond those posed by our public models."
Anthropic's apparent defense would be that "significant" means something like "non-trivially large in absolute terms" rather than "meaningfully elevated relative to prior models." Under this reading, a risk that is "very low, but higher" wouldn't qualify as "significant." But this reading has problems:
Where I'm less certain:
The footnote 4 narrowing — "risks arising from the capability thresholds" — could be read to mean "risks that arise because the model is approaching or crossing those thresholds," rather than "any risk that falls under those categories." Under that stricter reading, since Anthropic determined Mythos Preview doesn't cross any of the thresholds, maybe it wouldn't qualify. But this reading would make the in-scope provision nearly vacuous for internal models, since crossing a threshold would trigger much stronger commitments anyway.
I also want to flag a structural issue your post identifies that I think is underappreciated: the RSP is silent on when the determination of in-scope status must be made relative to internal deployment. This is a significant gap. Given that the "High-stakes sabotage opportunities" threshold is specifically about risks from internal use of AI systems, it seems like the determination should precede or coincide with deployment, not follow it at Anthropic's discretion. The fact that Anthropic admits they didn't understand the risk level when they first deployed Mythos internally makes this gap more concerning, not less.
Bottom line: I think the most natural reading of the RSP, combined with Anthropic's own characterization of Mythos Preview's risk profile, suggests it should have been treated as in-scope. The counterarguments exist but require reading "significant" in a way that undermines the purpose of the provision. Whether this constitutes a "violation" depends on how much interpretive discretion you think the RSP grants Anthropic — which, as you note, is itself a problem.
I think I disagree with Opus' list of problems with its simulated Anthropic's reading of the trigger condition:
I also disagree with the strength of its conclusion.
The bit about Autonomy threat model 1 seems like the most compelling section of the argument, to me. At a minimum, it seems unclear if this does or does not fulfill the requirement.
List of RSP Issues
Here is a list of problems with the way the RSP is written, which were revealed by this release and system card:
Does any of this matter?
I mean, maybe. Like I said at the beginning, I think zooming in on the details of the RSP is in some sense missing the point - trying to take the RSP as object is mostly making a mistake that's already been made a few times before. Even if we were still on RSP v1 my view would probably be something like, "this is based on a bunch of assumptions I don't agree with that dramatically curtail any positive effect it might have".
But I do think there are some tiny slivers of hope contained in worlds where Anthropic is an organization that can more reliably follow its own checklists, which might at some point include halting, melting, and catching fire. And some people have hopes routing through worlds that depend on rapidly bootstrapping up a chain of progressively more capable automated alignment researchers with minimal human supervision. To the extent that the remaining human supervision is doing any work, it will be operating under similar (or worse) time pressure, but trying to make much harder judgment calls than "does this count as a public deployment". I think getting the easy stuff right is in fact predictive of getting the hard stuff right.
If those 40 additional organizations received access before April 7th, this seems like a pretty straightforward failure to comply with that bullet point in the RSP, given that Anthropic considers something about their current deployment to count as a public deployment.
It mostly doesn't on mine. I'd guess it matters at least slightly more under "Anthropic's" threat model, to the extent that Anthropic as an organization can be said to have a coherent threat model. (Decision-making-power weighted average over founder and employee threat models?)
Which, I remind you, we're not even sure is the case - we don't know when the additional organizations received access.
If you count February 24th as one of the days, which I think must be the case.
"Specifically, risks arising from the capability thresholds in our recommendations for industry-wide safety (see Section 1)."
The phrase in RSP v3.1 (published April 2nd) is "significant risks beyond those posed by models that are covered by a prior Risk Report". The content of the footnote after "risks" is the same. The phrasing in v3.1 might have excluded Mythos since there was no Risk Report to compare anything to, but that doesn't matter since v3.1 was published after the 30-day deadline would have passed.