I think I probably agree, although I feel somewhat wary about it. My main hesitations are:
All of that said, I do broadly agree with the set of arguments, and I think it’s a really cool activity for people to write up what they believe. I’m glad they did it. But I’m not sure how comfortable I feel about sending it to people who haven’t thought much about AI.
"It's plausible that things could go much faster than this, but as a prediction about what will actually happen, humanity as a whole probably doesn't want things to get incredibly crazy so fast, and so we're likely to see something tamer." I basically agree with that.
I feel confused about how this squares with Dario’s view that AI is "inevitable," and "driven by powerful market forces." Like, if humanity starts producing a technology which makes practically all aspects of life better, the idea is that this will just… stop? I’m sure some people will be scared of how fast it’s going, but it’s hard for me to see the case for the market in aggregate incentivizing less of a technology which fixes ~all problems and creates tremendous value. Maybe the idea, instead, is that governments will step in...? Which seems plausible to me, but as Ryan notes, Dario doesn’t say this.
Thanks, I think you’re right on both points—that the old RSP also didn’t require pre-specified evals, and that the section about Capability Reports just describes the process for non-threshold-triggering eval results—so I’ve retracted those parts of my comment; my apologies for the error. I’m on vacation right now so was trying to read quickly, but I should have checked more closely before commenting.
That said, it does seem to me like the “if/then” relationships in this RSP have been substantially weakened. The previous RSP contained enough wiggle room that I didn’t interpret it as imposing real constraints on Anthropic’s actions; but it did at least seem to me to be aiming at well-specified “if’s,” i.e., ones which depended on the results of specific evaluations. Like, the previous RSP describes their response policy as: “If an evaluation threshold triggers, we will follow the following procedure” (emphasis mine), where the trigger for autonomous risk happens if “at least 50% of the tasks are passed.”
In other words, the “if’s” in the first RSP seemed more objective to me; the current RSP strikes me as a downgrade in that respect. Now, instead of an evaluation threshold, the “if” is determined by some opaque internal process at Anthropic that the document largely doesn’t describe. I think in practice this is what was happening before—i.e., that the policy basically reduced to Anthropic crudely eyeballing the risk—but it’s still disappointing to me to see this level of subjectivity more actively codified into policy.
My impression is also that this RSP is more Orwellian than the first one, and this is part of what I was trying to gesture at. Not just that their decision process has become more ambiguous and subjective, but that the whole thing seems designed to be glossed over, such that descriptions of risks won’t really load in readers’ minds. This RSP seems much sparser on specifics, and much heavier on doublespeak—e.g., they use the phrase “unable to make the required showing” to mean “might be terribly dangerous.” It also seems to me to describe many things too vaguely to easily argue against. For example, they claim they will “explain why the tests yielded such results,” but my understanding is that this is mostly not possible yet, i.e., that it’s an open scientific question, for most such tests, why their models produce the behavior they do. But without knowing what “tests” they mean, nor the sort of explanations they’re aiming for, it’s hard to argue with; I’m suspicious this is intentional.
In the previous RSP, I had the sense that Anthropic was attempting to draw red lines—points at which, if models passed certain evaluations, Anthropic committed to pause and develop new safeguards. That is, if evaluations triggered, then they would implement safety measures. The “if” was already sketchy in the first RSP, as Anthropic was allowed to “determine whether the evaluation was overly conservative,” i.e., they were allowed to retroactively declare red lines green. Indeed, with such caveats it was difficult for me to see the RSP as much more than a declared intent to act responsibly, rather than a commitment. But the updated RSP seems to be far worse, even, than that: the “if” is no longer dependent on the outcomes of pre-specified evaluations, but on the personal judgment of Dario Amodei and Jared Kaplan.
Indeed, such red lines are now made more implicit and ambiguous. There are no longer predefined evaluations; instead, employees design and run them on the fly, and compile the resulting evidence into a Capability Report, which is sent to the CEO for review. A CEO who, to state the obvious, is hugely incentivized to decide to deploy models, since refraining from doing so might jeopardize the company.
This seems strictly worse to me. Some room for flexibility is warranted, but this strikes me as almost maximally flexible, in that practically nothing is predefined—not evaluations, nor safeguards, nor responses to evaluations. This update makes the RSP more subjective, qualitative, and ambiguous. And if Anthropic is going to make the RSP weaker, I wish this were noted more as an apology, or along with a promise to rectify this in the future. Especially because after a year, Anthropic presumably has more information about the risk than before. Why, then, is even more flexibility needed now? What would cause Anthropic to make clear commitments?
I also find it unsettling that the ASL-3 risk threshold has been substantially changed, and the reasoning for this is not explained. In the first RSP, a model was categorized as ASL-3 if it was capable of various precursors for autonomous replication. Now, this has been downgraded to a “checkpoint,” a point at which they promise to evaluate the situation more thoroughly, but don’t commit to taking any particular actions:
We replaced our previous autonomous replication and adaption (ARA) threshold with a “checkpoint” for autonomous AI capabilities. Rather than triggering higher safety standards automatically, reaching this checkpoint will prompt additional evaluation of the model’s capabilities and accelerate our preparation of stronger safeguards.
This strikes me as a big change. The ability to self-replicate is already concerning, but the ability to perform AI R&D seems potentially catastrophic, risking loss of control or extinction. Why does Anthropic now think this shouldn’t count as ASL-3? Why have they substituted this criterion with a substantially riskier one instead?
Dario estimates the probability of something going “really quite catastrophically wrong, on the scale of human civilization” as between 10% and 25%. He also thinks this might happen soon, perhaps between 2025 and 2027. It seems obvious to me that a policy this ambiguous, this dependent on figuring things out on the fly, this beset with such egregious conflicts of interest, is a radically insufficient means of managing risk from a technology which poses so grave and imminent a threat to our world.
Basically I just agree with what James said. But I think the steelman is something like: you should expect shorter (or no) pauses with an RSP if all goes well, because the precautions are matched to the risks. Like, the labs aim to develop safety measures which keep pace with the dangers introduced by scaling, and if they succeed at that, then they never have to pause. But even if they fail, they're also expecting that building frontier models will help them solve alignment faster. I.e., either way the overall pause time would probably be shorter?
It does seem like in order to not have this complaint about the RSP, though, you need to expect that it's shorter by a lot (like by many months or years). My guess is that the labs do believe this, although not for amazing reasons. Like, the answer which feels most "real" to me is that this complaint doesn't apply to RSPs because the labs aren't actually planning to do a meaningful pause.
Does the category “working/interning on AI alignment/control” include safety roles at labs? I’d be curious to see that statistic separately, i.e., the percentage of MATS scholars who went on to work in any role at labs.
Similarly in Baba is You: when people don't have a crisp understanding of the puzzle, they tend to grasp at straws and motivatedly-reason their way into accepting sketchy sounding premises. But, the true solution to a level often feels very crisp and clear and inevitable.
A few of the scientists I’ve read about have realized their big ideas in moments of insight (e.g., Darwin for natural selection, Einstein for special relativity). My current guess about what’s going on is something like: as you attempt to understand a concept you don’t already have, you’re picking up clues about what the shape of the answer is going to look like (i.e., constraints). Once you have these constraints in place, your mind is searching for something which satisfies all of them (both explicitly and implicitly), and insight is the thing that happens when you find a solution that does.
At least, this is what it feels like for me when I play Baba is You (i.e., when I have the experience you’re describing here). I always know when a fake solution is fake, because it’s really easy to tell that it violates one of the explicit constraints the game has set out (although sometimes in desperation I try it anyway :p). But it’s immediately clear when I've landed on the right solution (even before I execute it), because all of the constraints I’ve been holding in my head get satisfied at once. I think that’s the “clicking” feeling.
Darwin’s insight about natural selection was also shaped by constraints. His time on the Beagle had led him to believe that “species gradually become modified,” but he was pretty puzzled as to how the changes were being introduced. If you imagine a beige lizard that lives in the sand, for instance, it seems pretty clear that it isn’t the lizard itself (its will) which causes its beigeness, nor is it the sand that directly causes the coloring (as in, physically causes it within the lizard’s lifetime). But then, how are changes introduced, if not by the organism, and not by the environment directly? He was stuck on this for a while, when: “I can remember the very spot in the road, whilst in my carriage, when to my joy the solution occurred to me.”
There’s more to Darwin’s story than that, but I do think it has elements of the sort of thing you're describing here. Jeff Hawkins also describes insight as a constraint satisfaction problem pretty explicitly (I might’ve gotten this idea from him), and he experienced it when coming up with the idea of a thousand brains.
Anyway, I don’t have a strong sense of how crucial this sort of thing is to novel conceptual inquiry in general, but I do think it’s quite interesting. It seems like one of the ways that someone can go from pre-paradigmatic grasping around for clues to a fully formed solution.
I'm somewhat confused about when these evaluations are performed (i.e., how much safety training the model has undergone). OpenAI's paper says: "Red teamers had access to various snapshots of the model at different stages of training and mitigation maturity starting in early August through mid-September 2024," so it seems like this evaluation was probably performed several times. Were these results obtained only prior to safety training or after? The latter seems more concerning to me, so I'm curious.
You actually want evaluators to have as much skin in the game as other employees so that when they take actions that might shut the company down or notably reduce the value of equity, this is a costly signal.
Further, it's good if evaluators are just considered normal employees and aren't separated out in any specific way. Then, other employees at the company will consider these evaluators to be part of their tribe and will feel some loyalty. (Also reducing the chance that risk evaluators feel like they are part of a rival "AI safety" tribe.) This probably has a variety of benefits in terms of support from the company. For example, when evaluators make a decision which is very costly for the company, it is more likely to be respected by other employees.
This situation seems backwards to me. Like, presumably the ideal scenario is that a risk evaluator estimates the risk in an objective way, and then the company takes (hopefully predefined) actions based on that estimate. The outcome of this interaction should not depend on social cues like how loyal they seem, or how personally costly it was for them to communicate that information. To the extent it does, I think this is evidence that the risk management framework is broken.
The interpretability safety case section seems to make the rather strong claim that the activation of known sabotage-related features is necessary for sabotage to occur (emphasis mine):
Afaict, the way that they establish this link (that a feature activation is necessary for a behavior to occur) is through ablation studies. Basically, if deactivating a feature prevents a model organism from committing a dangerous behavior, then they conclude that the feature causes the behavior. And although it does seem like this sort of test will generate Bayesian evidence about which features contribute to sabotage-capacity, I’m confused why the author thinks this would generate strong Bayesian evidence, much less identify the necessity conditions for sabotage with ~certainty?
Like, it seems plausible to me that some other unknown combination of features might create sabotage-like behavior, or that some relevant features were missed, or that there were other, ill-described-as-features mechanisms which can cause sabotage, etc. And my guess is that the evidence this sort of testing does generate will become weaker over time—i.e., once our safety precautions start to really matter—since more intelligent systems will presumably be able to cause sabotage in more numerous and creative ways.
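To make the gap between "ablation stops the behavior" and "the feature is necessary" concrete, here is a minimal toy sketch. Everything in it is hypothetical: the two boolean "features," the `sabotages` function, and the single model organism are illustrative stand-ins, not anything taken from the safety case or from a real interpretability pipeline. It assumes sabotage can arise through either a known feature or an unmonitored one.

```python
from itertools import product

# Hypothetical toy "model": sabotage occurs if the known, monitored feature
# OR some unknown, unmonitored mechanism is active.
def sabotages(known_feature: bool, unknown_feature: bool) -> bool:
    return known_feature or unknown_feature

# Ablation: force the known feature off and leave everything else untouched.
def ablate_known(known_feature: bool, unknown_feature: bool) -> bool:
    return sabotages(False, unknown_feature)

# Assumption: the model organisms we studied only use the known pathway.
model_organisms = [(True, False)]

# On those organisms, ablating the known feature always stops the sabotage.
ablation_stops_sabotage = all(
    sabotages(k, u) and not ablate_known(k, u) for k, u in model_organisms
)

# Across all internal states, look for sabotage without the known feature.
counterexamples = [
    (k, u)
    for k, u in product([False, True], repeat=2)
    if sabotages(k, u) and not k
]

print(ablation_stops_sabotage)  # True
print(counterexamples)          # [(False, True)]
```

In this toy setup the ablation test passes on every organism studied, yet a state exists where sabotage happens with the known feature off. The ablation result is evidence that the feature contributes to sabotage in the cases examined, but it says nothing about pathways the model organisms never exercised.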
Some of these potential setbacks are mentioned, and I get that this is just a preliminary sketch. But still, I’m confused where the confidence is coming from—why does the author expect this process to yield high confidence of safety?