ryan_greenblatt

I'm the chief scientist at Redwood Research.

Comments

Yes. But also, I'm afraid that Anthropic might solve this problem by just making fewer statements (which seems bad).

Making more statements would also be fine! I wouldn't mind if there were just clarifying statements even if the original statement had some problems.

(To try to reduce the incentive to make fewer statements, I criticized other labs for not having policies at all.)

I think I roughly stand behind my perspective in this dialogue. I feel somewhat more cynical than I did at the time of this dialogue, perhaps partially due to actual updates from the world and partially because I was trying to argue for the optimistic case here, which put me in a somewhat different frame.

Here are some ways my perspective differs now:

  • I wish I had said something like: "AI companies probably won't actually pause unilaterally, so the hope for voluntary RSPs has to be building consensus or helping to motivate developing countermeasures". I don't think I would have disagreed with this statement in the past, or at least I wouldn't have fully disagreed with it, and it seems like important context.
  • I think in practice, we're unlikely to end up with specific tests that are defined in advance and aren't goodhartable or cheatable. I do think that control could in principle be defined in advance and be hard to goodhart using external evaluation, but I don't expect companies to commit to specific tests which are hard to goodhart or cheat. They could make procedural commitments to third-party review which are hard to cheat, something like: "this third party will review the available evidence (including our safety report and all applicable internal knowledge) and then make a public statement about the level of risk and whether there is important information which should be disclosed to the public". (I could outline this proposal in more detail; it's mostly not my original idea.)
  • I'm somewhat more interested in companies focusing on things other than safety cases and commitments: either trying to get evidence of risk that might be convincing to others (in worlds where these risks are large) or working on at-the-margin safety interventions from a cost-benefit perspective.

I think the post could directly say "voluntary RSPs seem unlikely to suffice (and wouldn't be pauses done right), but ...".

I agree it does emphasize the importance of regulation pretty strongly.

Part of my perspective is that the title implies a conclusion which isn't quite right, and so it would have been good (at least with the benefit of hindsight) to clarify this explicitly, at least to the extent you agree with me.

This post seems mostly reasonable in retrospect, except that it doesn't specifically note that it seems unlikely that voluntary RSP commitments would result in AI companies unilaterally pausing until they were able to achieve broadly reasonable levels of safety. I wish the post more strongly emphasized that regulation was a key part of the picture---my view is that "voluntary RSPs are pauses done right" is wrong, but "RSPs via (international) regulation are pauses done right" seems like it could be roughly right. That said, I do think that purely voluntary RSPs are pretty reasonable and useful, at least if the relevant company is transparent about when they would proceed despite being unable to achieve a reasonable level of safety.

As of now at the start of 2025, I think we know more information that makes this plan look worse.[1] I don't see a likely path to ensuring 80% of companies have a reasonable RSP in short timelines. (For instance, not even Anthropic has expanded their RSP to include ASL-4 requirements about 1.5 years after the RSP came out.) And, beyond this, I think the current regulatory climate is such that we might not get RSPs enforced in durable regulation[2] applying to at least US companies in short timelines even if 80% of companies had good RSPs.


  1. I edited to add the first sentence of this paragraph for clarity. ↩︎

  2. The EU AI Act is the closest thing at the moment, but it might not be very durable as the EU doesn't have that much leverage over tech companies. Also, it wouldn't be very surprising if components of it end up being very unreasonable such that companies are basically forced to ignore parts of it or exit the EU market. ↩︎

Anthropic releasing their RSP was an important change in the AI safety landscape. The RSP was likely a substantial catalyst for policies like RSPs—which contain if-then commitments and more generally describe safety procedures—becoming more prominent. In particular, OpenAI now has a beta Preparedness Framework, Google DeepMind has a Frontier Safety Framework (though without concrete publicly-known policies yet), many companies agreed to the Seoul commitments which require making a similar policy, and SB-1047 required safety and security protocols.

However, I think the way Anthropic presented their RSP was misleading in practice (at least misleading to the AI safety community): it doesn't strictly require pausing, nor do I expect Anthropic to pause until they have sufficient safeguards in practice. I discuss why I think pausing until sufficient safeguards are in place is unlikely, at least in timelines as short as Dario's (Dario Amodei is the CEO of Anthropic), in my earlier post.

I also have serious doubts about whether the LTBT will serve as a meaningful check to ensure Anthropic serves the interests of the public. The LTBT has seemingly done very little thus far: it has appointed only 1 board member despite being able to appoint 3 of the 5 board members (a majority), and it is down to only 3 members. And none of its members have technical expertise related to AI. (The LTBT trustees seem altruistically motivated and seem like they would be thoughtful about questions of how to widely distribute the benefits of AI, but this is different from being able to evaluate whether Anthropic is making good decisions with respect to AI safety.)

Additionally, in this article, Anthropic's general counsel Brian Israel seemingly claims that the board probably couldn't fire the CEO (currently Dario) in a case where the board believed doing so would greatly reduce profits to shareholders[1]. Almost all of a board's hard power comes from being able to fire the CEO, so if this claim were true, it would greatly undermine the ability of the board (and the LTBT which appoints the board) to ensure Anthropic, a public benefit corporation, serves the interests of the public in cases where this conflicts with shareholder interests. In practice, I think this claim by the general counsel of Anthropic is likely false: because Anthropic is a public benefit corporation, the board could fire the CEO and win in court even if they openly thought this would massively reduce shareholder value (so long as the board could show they used a reasonable process to consider shareholder interests and decided that the public interest outweighed them in this case). Regardless, Brian Israel making such claims is evidence that the LTBT won't provide a meaningful check on Anthropic in practice.

Misleading communication about the RSP

On the RSP, this post says:

On the one hand, the ASL system implicitly requires us to temporarily pause training of more powerful models if our AI scaling outstrips our ability to comply with the necessary safety procedures.

While I think this exact statement might be technically true, people have sometimes interpreted this quote and similar statements as a claim that Anthropic would pause until their safety measures sufficed for more powerful models. I think Anthropic isn't likely to do this; in particular:

  • The RSP leaves open the option of revising it to reduce required countermeasures (so pausing is only required until the policy is changed).
  • The quote implies that the required countermeasures would suffice for ensuring a reasonable level of safety, but given that commitments still haven't been made for ASL-4 (the level at which existential or near-existential risks become plausible) and there aren't clear procedural reasons to expect countermeasures to suffice to ensure a reasonable level of safety, I don't think we should assume this will be the case.
  • Protections for ASL-3 are defined vaguely, rather than via some sort of credible and independent risk analysis process (in addition to best-guess countermeasures) and a requirement to ensure risk is sufficiently low with respect to this process. Perhaps ASL-4 requirements will differ; something more procedural seems particularly plausible, as I don't see a route to outlining specific tests in advance for ASL-4.
  • As mentioned earlier, I expect that if Anthropic ends up being able to build transformatively capable AI systems (as in, AI systems capable of obsoleting all human cognitive labor), they'll fail to provide assurance of a reasonable level of safety. That said, it's worth noting that insofar as Anthropic is actually a more responsible actor (as I currently tentatively think is the case), this choice is probably overall good from my perspective—though I wish their communication was less misleading.

Anthropic and Anthropic employees often use language similar to this quote when describing the RSP, potentially contributing to a poor sense of what will happen. My impression is that lots of Anthropic employees just haven't thought about this and believe that Anthropic will behave much more cautiously than I think is plausible (and more cautiously than I think is prudent given other actors).

Other companies have worse policies and governance

While I focus on Anthropic in this comment, it is worth emphasizing that the policies and governance of other AI companies seem substantially worse. xAI, Meta, and DeepSeek have no public safety policies at all, though they have said they will make a policy like this. Google DeepMind has published that they are working on a frontier safety framework with commitments, but thus far they have just listed potential threat models corresponding to model capabilities and security levels without committing to security for specific capability levels. OpenAI has the beta Preparedness Framework, but the current security requirements seem inadequate, and the required mitigations and assessment process are unspecified other than saying that the post-mitigation risk must be medium or below prior to deployment and high or below prior to continued development. I don't expect OpenAI to keep the spirit of this commitment in short timelines. OpenAI, Google DeepMind, xAI, Meta, and DeepSeek all have clearly much worse governance than Anthropic.

What could Anthropic do to address my concerns?

Given these concerns about the RSP and the LTBT, what do I think should happen? First, I'll outline some lower cost measures that seem relatively robust and then I'll outline more expensive measures that don't seem obviously good (at least not obviously good to strongly prioritize) but would be needed to make the situation no longer be problematic.

Lower cost measures:

  • Have the leadership clarify its views to Anthropic employees (at least alignment science employees) on questions like: "How likely is Anthropic to achieve an absolutely low (e.g., 0.25%) lifetime level of risk (according to various third-party safety experts) if AIs that obsolete top human experts are created in the next 4 years?", "Will Anthropic aim to have an RSP that would be the policy that a responsible developer would follow in a world with reasonable international safety practices?", and "How likely is Anthropic to exit from its RSP commitments if this is needed to be a competitive frontier model developer?".
  • Clearly communicate to Anthropic employees (or at least a relevant subset of Anthropic employees) the circumstances in which the board could (and should) fire the CEO due to safety/public interest concerns. Additionally, explain the leadership's policies with respect to cases where the board does fire the CEO—does the leadership of Anthropic commit to not fighting such an action?
  • Have an employee liaison to the LTBT who provides the LTBT with more information that isn't filtered through the CEO or current board members. Ensure this employee is quite independent-minded, has expertise on AI safety (and ideally security), and ideally is employed by the LTBT rather than Anthropic.

Unfortunately, these measures aren't straightforwardly independently verifiable based on public knowledge. As far as I know, some of these measures could already be in place.

More expensive measures:

  • In the above list, I explain various types of information that should be communicated to employees. Ensure that this information is communicated publicly, including in relevant places like the RSP.
  • Ensure the LTBT has 2 additional members with technical expertise in AI safety or minimally in security.
  • Ensure the LTBT appoints the board members it can currently appoint and that these board members are independent from the company and have their own well-formed views on AI safety.
  • Ensure the LTBT has an independent staff including technical safety experts, security experts, and independent lawyers.

Likely objections and my responses

Here are some relevant objections to my points and my responses:

  • Objection: "Sure, but from the perspective of most people, AI is unlikely to be existentially risky soon, so from this perspective it isn't that misleading to think of deviating from safe practices as an edge case." Response: To the extent Anthropic has views, Anthropic has the view that existentially risky AI is reasonably likely to be soon and Dario espouses this view. Further, I think this could be clarified in the text: the RSP could note that these commitments are what a responsible developer would do if we were in a world where being a responsible developer was possible while still being competitive (perhaps due to all relevant companies adopting such policies or due to regulation).
  • Objection: "Sure, but if other companies followed a similar policy then the RSP commitments would hold in a relatively straightforward way. It's hardly Anthropic's fault if other companies force it to be more reckless than it would like." Response: This may be true, but it doesn't mean that Anthropic isn't being potentially misleading in their description of the situation. They could instead directly describe the situation in less misleading ways.
  • Objection: "Sure, but obviously Anthropic can't accurately represent the situation publicly. That would result in bad PR and substantially undermine their business in other ways. To the extent you think Anthropic is a good actor, you shouldn't be pressuring good actors like them to take actions that will make them differentially less competitive than worse actors." Response: This is pretty fair, but I still think Anthropic could at least avoid making substantially misleading statements and ensure employees are well informed (at least for employees for whom this information is very relevant to their job decision-making). I think it is a good policy to correct misleading statements that result in differentially positive impressions and result in the safety community taking worse actions, because not having such a policy in general would result in more exploitation of the safety community.

  1. The article says: "However, even the board members who are selected by the LTBT owe fiduciary obligations to Anthropic's stockholders, Israel says. This nuance means that the board members appointed by the LTBT could probably not pull off an action as drastic as the one taken by OpenAI's board members last November. It's one of the reasons Israel was so confidently able to say, when asked last Thanksgiving, that what happened at OpenAI could never happen at Anthropic. But it also means that the LTBT ultimately has a limited influence on the company: while it will eventually have the power to select and remove a majority of board members, those members will in practice face similar incentives to the rest of the board." This indicates that the board couldn't fire the CEO if they thought this would greatly reduce profits to shareholders, though it is somewhat unclear. ↩︎

I think this is very different from RSPs: RSPs are more like "if everyone is racing ahead (and so we feel we must also race), there is some point where we'll still choose to unilaterally stop racing".

In practice, I don't think any currently existing RSP-like policy will result in a company doing this as I discuss here.

DeepSeek's success isn't much of an update toward a smaller US-China gap in short timelines because security was already a limiting factor

Some people seem to have updated towards a narrower US-China gap around the time of transformative AI if transformative AI is soon, due to recent releases from DeepSeek. However, since I expect frontier AI companies in the US will have inadequate security in short timelines and China will likely steal their models and algorithmic secrets, I don't consider the current success of China's domestic AI industry to be that much of an update. Furthermore, if DeepSeek or other Chinese companies were in the lead and didn't open-source their models, I expect the US would steal their models and algorithmic secrets. Consequently, I expect these actors to be roughly equal in short timelines, except in their available compute and potentially in how effectively they can utilize AI systems.

I do think that the Chinese AI industry looking more competitive makes security look somewhat less appealing (and especially less politically viable) and makes it look like their adaptation time to stolen models and/or algorithmic secrets will be shorter. Marginal improvements in security still seem important, and ensuring high levels of security prior to at least ASI (and ideally earlier!) is still very important.

Using the breakdown of capabilities I outlined in this prior post, the rough picture I expect is something like:

  • AIs that can 10x accelerate AI R&D labor: Security is quite weak (perhaps <=SL3 as defined in "Securing Model Weights"), so the model is easily stolen if relevant actors want to steal it. Relevant actors are somewhat likely to know AI is a big enough deal that stealing it makes sense, but AI is not necessarily considered a top priority.
  • Top-Expert-Dominating AI: Security is somewhat improved (perhaps <=SL4), but still pretty doable to steal. Relevant actors are more aware, and the model probably gets stolen.
  • Very superhuman AI: I expect security to be improved by this point (partially via AIs working on security), but effort on stealing the model could also plausibly be unprecedentedly high. I currently expect security implemented before this point to suffice to prevent the model from being stolen.

Given this, I expect that key early models will be stolen, including models that can fully substitute for human experts, and so the important differences between actors will mostly be driven by compute, adaptation time, and utilization. Of these, compute seems most important, particularly given that adaptation and utilization time can be accelerated by the AIs themselves.

This analysis suggests that export controls are particularly important, but they would need to apply to hardware used for inference rather than just attempting to prevent large training runs through memory bandwidth limitations or similar restrictions.

Seems very sensitive to the type of misalignment, right? As an extreme example, suppose literally all AIs have long-run and totally inhuman preferences with linear returns. Such AIs might instrumentally decide to be as useful as possible (at least in domains other than safety research) for a while prior to a treacherous turn.

  1. Following instructions never to design a chemical weapon with probability at least 99.999% is also a capability.

This requires a capability, but also requires a propensity. For example, smart humans are all capable of avoiding armed robbery with pretty high reliability, but some of them commit armed robbery despite being told not to at an earlier point in their life. You could say these robbers didn't have the capability to follow instructions, but this would be an atypical use of these (admittedly fuzzy) words.

FWIW, I think recursive self-improvement via just software (a software-only singularity) is reasonably likely to be feasible (perhaps 55%), but this alone doesn't suffice for takeoff being arbitrarily fast.

Further, even objectively very fast takeoff (von Neumann to superintelligence in 6 months) can be enough time to win a war etc.
