I'm the chief scientist at Redwood Research.
I think I roughly stand behind my perspective in this dialogue. I feel somewhat more cynical now than I did at the time of the dialogue, perhaps partly due to actual updates from the world and partly because I was trying to argue for the optimistic case here, which put me in a somewhat different frame.
Here are some ways my perspective differs now:
I think the post could directly say "voluntary RSPs seem unlikely to suffice (and wouldn't be pauses done right), but ...".
I agree it does emphasize the importance of regulation pretty strongly.
Part of my perspective is that the title implies a conclusion which isn't quite right, so it would have been good (at least with the benefit of hindsight) to clarify this explicitly, at least to the extent you agree with me.
This post seems mostly reasonable in retrospect, except that it doesn't specifically note that it seems unlikely that voluntary RSP commitments would result in AI companies unilaterally pausing until they were able to achieve broadly reasonable levels of safety. I wish the post more strongly emphasized that regulation was a key part of the picture---my view is that "voluntary RSPs are pauses done right" is wrong, but "RSPs via (international) regulation are pauses done right" seems like it could be roughly right. That said, I do think that purely voluntary RSPs are pretty reasonable and useful, at least if the relevant company is transparent about when they would proceed despite being unable to achieve a reasonable level of safety.
As of the start of 2025, I think we know more information that makes this plan look worse.[1] I don't see a likely path to ensuring that 80% of companies have a reasonable RSP in short timelines. (For instance, not even Anthropic has expanded their RSP to include ASL-4 requirements about 1.5 years after the RSP came out.) And, beyond this, I think the current regulatory climate is such that we might not get RSPs enforced in durable regulation[2] applying to at least US companies in short timelines even if 80% of companies had good RSPs.
I edited to add the first sentence of this paragraph for clarity. ↩︎
The EU AI Act is the closest thing at the moment, but it might not be very durable, as the EU doesn't have that much leverage over tech companies. Also, it wouldn't be very surprising if components of it end up being very unreasonable, such that companies are basically forced to ignore parts of it or exit the EU market. ↩︎
Anthropic releasing their RSP was an important change in the AI safety landscape. The RSP was likely a substantial catalyst for policies like RSPs—which contain if-then commitments and more generally describe safety procedures—becoming more prominent. In particular, OpenAI now has a beta Preparedness Framework, Google DeepMind has a Frontier Safety Framework (though without any concrete publicly known policies yet), many companies agreed to the Seoul commitments which require making a similar policy, and SB-1047 required safety and security protocols.
However, I think the way Anthropic presented their RSP was misleading in practice (at least misleading to the AI safety community): it doesn't strictly require pausing, nor do I expect Anthropic to pause until they have sufficient safeguards in practice. I discuss why I think pausing until sufficient safeguards are in place is unlikely, at least in timelines as short as Dario's (Dario Amodei is the CEO of Anthropic), in my earlier post.
I also have serious doubts about whether the LTBT will serve as a meaningful check to ensure Anthropic serves the interests of the public. The LTBT has seemingly done very little thus far—it has appointed only 1 board member despite being able to appoint 3 of the 5 board members (a majority), and it is down to only 3 members. And none of its members have technical expertise related to AI. (The LTBT trustees seem altruistically motivated and seem like they would be thoughtful about questions of how to widely distribute the benefits of AI, but this is different from being able to evaluate whether Anthropic is making good decisions with respect to AI safety.)
Additionally, in this article, Anthropic's general counsel Brian Israel seemingly claims that the board probably couldn't fire the CEO (currently Dario) in a case where the board believed doing so would greatly reduce profits to shareholders[1]. Almost all of a board's hard power comes from being able to fire the CEO, so if this claim were true, it would greatly undermine the ability of the board (and the LTBT which appoints the board) to ensure that Anthropic, a public benefit corporation, serves the interests of the public in cases where this conflicts with shareholder interests. In practice, I think this claim by the general counsel of Anthropic is likely false: because Anthropic is a public benefit corporation, the board could fire the CEO and win in court even if they openly thought this would massively reduce shareholder value (so long as the board could show they used a reasonable process to consider shareholder interests and decided that the public interest outweighed them in this case). Regardless, Brian Israel making such claims is evidence that the LTBT won't provide a meaningful check on Anthropic in practice.
On the RSP, this post says:
On the one hand, the ASL system implicitly requires us to temporarily pause training of more powerful models if our AI scaling outstrips our ability to comply with the necessary safety procedures.
While I think this exact statement might be technically true, people have sometimes interpreted this quote and similar statements as a claim that Anthropic would pause until their safety measures sufficed for more powerful models. I think Anthropic isn't likely to do this; in particular:
Anthropic and Anthropic employees often use similar language to this quote when describing the RSP, potentially contributing to a poor sense of what will happen. My impression is that lots of Anthropic employees just haven't thought about this, and believe that Anthropic will behave much more cautiously than I think is plausible (and more cautiously than I think is prudent given other actors).
While I focus on Anthropic in this comment, it is worth emphasizing that the policies and governance of other AI companies seem substantially worse. xAI, Meta, and DeepSeek have no public safety policies at all, though they have said they will make a policy like this. Google DeepMind has published that they are working on a frontier safety framework with commitments, but thus far they have just listed potential threat models corresponding to model capabilities and security levels, without committing to security for specific capability levels. OpenAI has the beta Preparedness Framework, but the current security requirements seem inadequate, and the required mitigations and assessment process are unspecified other than saying that the post-mitigation risk must be medium or below prior to deployment and high or below prior to continued development. I don't expect OpenAI to keep the spirit of this commitment in short timelines. OpenAI, Google DeepMind, xAI, Meta, and DeepSeek all have clearly much worse governance than Anthropic.
Given these concerns about the RSP and the LTBT, what do I think should happen? First, I'll outline some lower-cost measures that seem relatively robust, and then I'll outline more expensive measures that don't seem obviously good (at least not obviously good to strongly prioritize) but would be needed to make the situation no longer problematic.
Lower-cost measures:
Unfortunately, these measures aren't straightforwardly independently verifiable based on public knowledge. As far as I know, some of these measures could already be in place.
More expensive measures:
Here are some relevant objections to my points and my responses:
The article says: "However, even the board members who are selected by the LTBT owe fiduciary obligations to Anthropic's stockholders, Israel says. This nuance means that the board members appointed by the LTBT could probably not pull off an action as drastic as the one taken by OpenAI's board members last November. It's one of the reasons Israel was so confidently able to say, when asked last Thanksgiving, that what happened at OpenAI could never happen at Anthropic. But it also means that the LTBT ultimately has a limited influence on the company: while it will eventually have the power to select and remove a majority of board members, those members will in practice face similar incentives to the rest of the board." This indicates that the board couldn't fire the CEO if they thought this would greatly reduce profits to shareholders, though it is somewhat unclear. ↩︎
I think this is very different from RSPs: RSPs are more like "if everyone is racing ahead (and so we feel we must also race), there is some point where we'll still choose to unilaterally stop racing".
In practice, I don't think any currently existing RSP-like policy will result in a company doing this as I discuss here.
Some people seem to have updated towards a narrower US-China gap around the time of transformative AI if transformative AI is soon, due to recent releases from DeepSeek. However, since I expect frontier AI companies in the US will have inadequate security in short timelines and China will likely steal their models and algorithmic secrets, I don't consider the current success of China's domestic AI industry to be that much of an update. Furthermore, if DeepSeek or other Chinese companies were in the lead and didn't open-source their models, I expect the US would steal their models and algorithmic secrets. Consequently, I expect these actors to be roughly equal in short timelines, except in their available compute and potentially in how effectively they can utilize AI systems.
I do think that the Chinese AI industry looking more competitive makes security look somewhat less appealing (and especially less politically viable) and suggests that their adaptation time to stolen models and/or algorithmic secrets will be shorter. Marginal improvements in security still seem important, and ensuring high levels of security at least prior to ASI (and ideally earlier!) is still very important.
Using the breakdown of capabilities I outlined in this prior post, the rough picture I expect is something like:
Given this, I expect that key early models will be stolen, including models that can fully substitute for human experts, and so the important differences between actors will mostly be driven by compute, adaptation time, and utilization. Of these, compute seems most important, particularly given that adaptation and utilization time can be accelerated by the AIs themselves.
This analysis suggests that export controls are particularly important, but they would need to apply to hardware used for inference rather than just attempting to prevent large training runs through memory bandwidth limitations or similar restrictions.
Seems very sensitive to the type of misalignment, right? As an extreme example, suppose literally all AIs have long-run and totally inhuman preferences with linear returns. Such AIs might instrumentally decide to be as useful as possible (at least in domains other than safety research) for a while prior to a treacherous turn.
- Following instructions never to design a chemical weapon with probability at least 99.999% is also a capability.
This requires a capability, but it also requires a propensity. For example, smart humans are all capable of avoiding armed robbery with pretty high reliability, but some of them commit armed robbery despite being told not to at an earlier point in their life. You could say these robbers didn't have the capability to follow instructions, but this would be an atypical use of these (admittedly fuzzy) words.
FWIW, I think recursive self-improvement via just software (a software-only singularity) is reasonably likely to be feasible (perhaps 55%), but this alone doesn't suffice for takeoff being arbitrarily fast.
Further, even an objectively very fast takeoff (von Neumann-level to superintelligence in 6 months) can be enough time to win a war, etc.
Making more statements would also be fine! I wouldn't mind if there were just clarifying statements even if the original statement had some problems.
(To try to reduce the incentive to make fewer statements, I criticized other labs for not having policies at all.)