Views are my own, not Open Philanthropy’s. I am married to the President of Anthropic and have a financial interest in both Anthropic and OpenAI via my spouse.

Over the last few months, I’ve spent a lot of my time trying to help out with efforts to get responsible scaling policies adopted. In that context, a number of people have said it would be helpful for me to be publicly explicit about whether I’m in favor of an AI pause. This post will give some thoughts on these topics.

I think transformative AI could be soon, and we’re not ready

I have a strong default to thinking that scientific and technological progress is good and that worries will tend to be overblown. However, I think AI is a big exception here because of its potential for unprecedentedly rapid and radical transformation.[1]

I think sufficiently advanced AI would present enormous risks to the world. I’d put the risk of a world run by misaligned AI (or an outcome broadly similar to that) at somewhere between 10% and 90% (so: above 10%) if it is developed relatively soon on something like today’s trajectory. And there are a whole host of other issues (e.g.) that could be just as important if not more so, that it seems like no one has really begun to get a handle on.

Is that level of AI coming soon, and could the world be “ready” in time? Here I want to flag that timelines to transformative or even catastrophically risky AI are very debatable, and I have tried to focus my work on proposals that make sense even for people who disagree with me on the below points. But my own views are that:

  • There’s a serious (>10%) risk that we’ll see transformative AI[2] within a few years.
  • In that case it’s not realistic to have sufficient protective measures for the risks in time.
  • Sufficient protective measures would require huge advances on a number of fronts, including information security that could take years to build up and alignment science breakthroughs that we can’t put a timeline on given the nascent state of the field, so even decades might or might not be enough time to prepare, even given a lot of effort.

If it were all up to me, the world would pause now - but it isn’t, and I’m more uncertain about whether a “partial pause” is good

In a hypothetical world where everyone shared my views about AI risks, there would (after deliberation and soul-searching, and only if these didn’t change my current views) be a global regulation-backed pause on all investment in and work on (a) general[3] enhancement of AI capabilities beyond the current state of the art, including by scaling up large language models; (b) building more of the hardware (or parts of the pipeline most useful for more hardware) most useful for large-scale training runs (e.g., H100s); (c) algorithmic innovations that could significantly contribute to (a).

The pause would end when it was clear how to progress some amount further with negligible catastrophic risk and reinstitute the pause before going beyond negligible catastrophic risks. (This means another pause might occur shortly afterward. Overall, I think it’s plausible that the right amount of time to be either paused or in a sequence of small scaleups followed by pauses could be decades or more, though this depends on a lot of things.) This would require a strong, science-backed understanding of AI advances such that we could be assured of quickly detecting early warning signs of any catastrophic-risk-posing AI capabilities we didn’t have sufficient protective measures for.

I didn’t have this view a few years ago. Why now?

  • I think today’s state-of-the-art AIs are already in the zone where (a) we can already learn a huge amount (about AI alignment and other things) by studying them; (b) it’s hard to rule out that a modest scaleup from here - or an improvement in “post-training enhancements” (advances that make it possible to do more with an existing AI than before, without having to do a new expensive training run)[4] - could lead to models that pose catastrophic risks.
  • I think we’re pretty far from being ready even for early versions of catastrophic-risk-posing models (for example, I think information security is not where it needs to be, and this won’t be a quick fix).
  • If a model’s weights were stolen and became widely available, it would be hard to rule out that model becoming more dangerous later via post-training enhancements. So even training slightly bigger models than today’s state of the art seems to add nontrivially to the risks.

All of that said, I think that advocating for a pause now might lead instead to a “partial pause” such as:

  • Regulation-mandated pauses in some countries and not others, with many researchers going elsewhere to work on AI scaling.
  • Temporary bans on large training runs, but not on post-training improvements or algorithmic improvements or expansion of hardware capacity. In this case, an “unpause” - including via new scaling methods that didn’t technically fall under the purview of the regulatory ban, or via superficially attractive but insufficient protective measures, or via a sense that the pause advocates had “cried wolf” - might lead to extraordinarily fast progress, much faster than the default and with a more intense international race.
  • Regulation with poor enough design and/or enough loopholes as to create a substantial “honor system” dynamic, which might mean that people more concerned about risks become totally uninvolved in AI development while people less concerned about risks race ahead. This in turn could mean a still-worse ratio of progress on AI capabilities to progress on protective measures.
  • No regulation or totally mis-aimed regulation (e.g., restrictions on deploying large language models but not on training them), accompanied by the same dynamic from the previous bullet point.

It’s much harder for me to say whether these various forms of “partial pause” would be good.

To pick a couple of relatively simple imaginable outcomes and how I’d feel about them:

  • If there were a US-legislated moratorium on training runs exceeding a compute threshold in line with today’s state-of-the-art models, with the implicit intention of maintaining it until there was a convincing and science-backed way of bounding the risks - with broad but not necessarily overwhelming support from the general public - I’d consider this to be probably a good thing. I’d think this even if the ban (a) didn’t yet come with signs of progress on international enforcement; (b) started with only relatively weak domestic enforcement; and (c) didn’t include any measures to slow production of hardware, advances in algorithmic efficiency or post-training enhancements. In this case I would be hopeful about progress on (a) and (b), as well as on protective measures generally, because of the strong signal this moratorium would send internationally about the seriousness of the threat and the urgency of developing a better understanding of the risks, and of making progress on protective measures. I have very low confidence in my take here and could imagine changing my mind easily.
  • If a scaling pause were implemented using executive orders that were likely to be overturned next time the party in power changed, with spotty enforcement and no effects on hardware and algorithmic progress, I’d consider this pause a bad thing. This is also a guess that I’m not confident in.

Overall I don’t have settled views on whether it’d be good for me to prioritize advocating for any particular policy.[5] At the same time, if it turns out that there is (or will be) a lot more agreement with my current views than there currently seems to be, I wouldn’t want to be even a small obstacle to big things happening, and there’s a risk that my lack of active advocacy could be confused with opposition to outcomes I actually support.

I feel generally uncertain about how to navigate this situation. For now I am just trying to spell out my views and make it less likely that I’ll get confused for supporting or opposing something I don’t.

Responsible scaling policies (RSPs) seem like a robustly good compromise with people who have different views from mine (with some risks that I think can be managed)

My sense is that people have views all over the map about AI risk, such that it would be hard to build a big coalition around the kind of pause I’d support most.

  • Some people think that the kinds of risks I’m worried about are far off, farfetched or ridiculous.
  • Some people think such risks might be real and soon, but that we’ll make enough progress on security, alignment, etc. to handle the risks - and indeed, that further scaling is an important enabler of this progress (e.g., a lot of alignment research will work better with more advanced systems).
  • Some people think the risks are real and soon, but might be relatively small, and that it’s therefore more important to focus on things like the U.S. staying ahead of other countries on AI progress.

I’m excited about RSPs partly because it seems like people in those categories - not just people who agree with my estimates about risks - should support RSPs. This raises the possibility of a much broader consensus around conditional pausing than I think is likely around immediate (unconditional) pausing. And with a broader consensus, I expect an easier time getting well-designed, well-enforced regulation.

I think RSPs represent an opportunity for wide consensus that pausing under certain conditions would be good, and this seems like it would be an extremely valuable thing to establish.

Importantly, agreeing that certain conditions would justify a pause is not the same as agreeing that they’re the only such conditions. I think agreeing that a pause needs to be prepared for at all seems like the most valuable step, and revising pause conditions can be done from there.

Another reason I am excited about RSPs: I think optimally risk-reducing regulation would be very hard to get right. (Even the hypothetical, global-agreement-backed pause I describe above would be hugely challenging to design in detail.) When I think something is hard to design, my first instinct is to hope for someone to take a first stab at it (or at least at some parts of it), learn what they can about the shortcomings, and iterate. RSPs present an opportunity to do something along these lines, and that seems much better than focusing all efforts and hopes on regulation that might take a very long time to come.

There is a risk that RSPs will be seen as a measure that is sufficient to contain risks by itself - e.g., that governments may refrain from regulation, or simply enshrine RSPs into regulation, rather than taking more ambitious measures. Some thoughts on this:

  • I think it’s good for proponents of RSPs to be open about the sorts of topics I’ve written about above, so they don’t get confused with e.g. proposing RSPs as a superior alternative to regulation. This post attempts to do that on my part. And to be explicit: I think regulation will be necessary to contain AI risks (RSPs alone are not enough), and should almost certainly end up stricter than what companies impose on themselves.
  • In a world where there’s significant political support for regulations well beyond what companies support, I expect that any industry-backed setup will be seen as a minimum for regulation. In a world where there isn’t such political support, I think it would be a major benefit for industry standards to include conditional pauses. So overall, the risk seems relatively low and worth it here.
  • I think it’d be unfortunate to try to manage the above risk by resisting attempts to build consensus around conditional pauses, if one does in fact think conditional pauses are better than the status quo. Actively fighting improvements on the status quo because they might be confused for sufficient progress feels icky to me in a way that’s hard to articulate.

Footnotes

  1. The other notable exception I’d make here is biology advances that could facilitate advanced bioweapons, again because of how rapid and radical the destruction potential is. I default to optimism and support for scientific and technological progress outside of these two cases. 

  2. I like this discussion of why improvements on pretty narrow axes for today’s AI systems could lead quickly to broadly capable transformative AI. 

  3. People would still be working on making AI better at various specific things (for example, resisting attempts to jailbreak harmlessness training, or just narrow applications like search and whatnot). It’s hard to draw a bright line here, and I don’t think it could be done perfectly using policy, but in the “if everyone shared my views” construction everyone would be making at least a big effort to avoid finding major breakthroughs that were useful for general enhancement of very broad and hard-to-bound suites of AI capabilities. 

  4. Examples include improved fine-tuning methods and datasets, new plugins and tools for existing models, new elicitation methods in the general tradition of chain-of-thought reasoning, etc. 

  5. I do think that at least someone should be trying it. There’s a lot to be learned from doing this - e.g., about how feasible it is to mobilize the general public - and this could inform expectations about what kinds of “partial victories” are likely.  

Akash

I think it’s good for proponents of RSPs to be open about the sorts of topics I’ve written about above, so they don’t get confused with e.g. proposing RSPs as a superior alternative to regulation. This post attempts to do that on my part. And to be explicit: I think regulation will be necessary to contain AI risks (RSPs alone are not enough), and should almost certainly end up stricter than what companies impose on themselves.

Strong agree. I wish ARC and Anthropic had been more clear about this, and I would be less critical of their RSP posts if they had said this loudly & clearly. I think your post is loud and clear (you state multiple times, unambiguously, that you think regulation is necessary and that you wish the world had more political will to regulate). I appreciate this, and I'm glad you wrote this post.

I think it’d be unfortunate to try to manage the above risk by resisting attempts to build consensus around conditional pauses, if one does in fact think conditional pauses are better than the status quo. Actively fighting improvements on the status quo because they might be confused for sufficient progress feels icky to me in a way that’s hard to articulate.

A few thoughts:

  1. One reason I'm critical of the Anthropic RSP is that it does not make it clear under what conditions it would actually pause, or for how long, or under what safeguards it would determine it's OK to keep going. It is nice that they said they would run some evals at least once every 4X in effective compute and that they don't want to train catastrophe-capable models until their infosec makes it more expensive for actors to steal their models. It is nice that they said that once they get systems that are capable of producing biological weapons, they will at least write something up about what to do with AGI before they decide to just go ahead and scale to AGI. But I mostly look at the RSP and say "wow, these are some of the most bare minimum commitments I could've expected, and they don't even really tell me what a pause would look like and how they would end it."
  2. Meanwhile, we have OpenAI (which plans to release an RSP at some point), DeepMind (rumor has it they're working on one but also that it might be very hard to get Google to endorse one), and Meta (oof). So I guess I'm sort of left thinking something like "If Anthropic's RSP is the best RSP we're going to get, then yikes, this RSP plan is not doing so well." Of course, this is just a first version, but the substance of the RSP and the way it was communicated don't inspire much hope in me that future versions will be better.
  3. I think the RSP frame is wrong, and I don't want regulators to use it as a building block. My understanding is that labs are refusing to adopt an evals regime in which the burden of proof is on labs to show that scaling is safe. Given this lack of buy-in, the RSP folks concluded that the only thing left to do was to say "OK, fine, but at least please check to see if the system will imminently kill you. And if we find proof that the system is pretty clearly dangerous or about to be dangerous, then will you at least consider stopping?" It seems plausible to me that governments would be willing to start with something stricter and more sensible than this "just keep going until we can prove that the model has highly dangerous capabilities" regime.
  4. I think some improvements on the status quo can be net negative because they either (a) cement in an incorrect frame or (b) take a limited window of political will/attention and steer it toward something weaker than what would've happened if people had pushed for something stronger. For example, I think the UK government is currently looking around for substantive stuff to show their constituents (and themselves) that they are doing something serious about AI. If companies give them a milquetoast solution that allows them to say "look, we did the responsible thing!", it seems quite plausible to me that we actually end up in a worse world than if the AIS community had rallied behind something stronger.
  5. If everyone communicating about RSPs was clear that they don't want it to be seen as sufficient, that would be great. In practice, that's not what I see happening. Anthropic's RSP largely seems devoted to signaling that Anthropic is great, safe, credible, and trustworthy. Paul's recent post is nuanced, but I don't think the "RSPs are not sufficient" frame was sufficiently emphasized (perhaps partly because he thinks RSPs could lead to a 10x reduction in risk, which seems crazy to me, and if he goes around saying that to policymakers, I expect them to hear something like "this is a good plan that would sufficiently reduce risks"). ARC's post tries to sell RSPs as a pragmatic middle ground and IMO pretty clearly does not emphasize (or even mention?) some sort of "these are not sufficient" message. Finally, the name itself sounds like it came out of a propaganda department: "hey, governments, look, we can scale responsibly".
  6. At minimum, I hope that RSPs get renamed, and that those communicating about RSPs are more careful to avoid giving off the impression that RSPs are sufficient.
  7. More ambitiously, I hope that folks working on RSPs seriously consider whether or not this is the best thing to be working on or advocating for. My impression is that this plan made more sense when it was less clear that the Overton Window was going to blow open, Bengio/Hinton would enter the fray, journalists and the public would be fairly sympathetic, Rishi Sunak would host an xrisk summit, Blumenthal would run hearings about xrisk, etc. I think everyone working on RSPs should spend at least a few hours taking seriously the possibility that the AIS community could be advocating for stronger policy proposals and getting out of the "we can't do anything until we literally have proof that the model is imminently dangerous" frame. To be clear, I think some people who do this reflection will conclude that they ought to keep making marginal progress on RSPs. I would be surprised if the current allocation of community talent/resources was correct, though, and I think on the margin more people should be doing things like CAIP & Conjecture, and fewer people should be doing things like RSPs. (Note that CAIP & Conjecture both have important flaws/limitations, and I think this partly has to do with the fact that so much top community talent has been funneled into RSPs/labs relative to advocacy/outreach/outside game).

Cross-posted to the EA Forum.

One reason I'm critical of the Anthropic RSP is that it does not make it clear under what conditions it would actually pause, or for how long, or under what safeguards it would determine it's OK to keep going.

It's hard to take anything else you're saying seriously when you say things like this; it seems clear that you just haven't read Anthropic's RSP. I think that the current conditions and resulting safeguards are insufficient to prevent AI existential risk, but to say that it doesn't make them clear is just patently false.

The conditions under which Anthropic commits to pausing in the RSP are very clear. In big bold font on the second page it says:

Anthropic’s commitment to follow the ASL scheme thus implies that we commit to pause the scaling and/or delay the deployment of new models whenever our scaling ability outstrips our ability to comply with the safety procedures for the corresponding ASL.

And then it lays out a series of safety procedures that Anthropic commits to meeting for ASL-3 models or else pausing, with some of the most serious commitments here being:

  • Model weight and code security: We commit to ensuring that ASL-3 models are stored in such a manner to minimize risk of theft by a malicious actor that might use the model to cause a catastrophe. Specifically, we will implement measures designed to harden our security so that non-state attackers are unlikely to be able to steal model weights, and advanced threat actors (e.g. states) cannot steal them without significant expense. The full set of security measures that we commit to (and have already started implementing) are described in this appendix, and were developed in consultation with the authors of a forthcoming RAND report on securing AI weights.
  • Successfully pass red-teaming: World-class experts collaborating with prompt engineers should red-team the deployment thoroughly and fail to elicit information at a level of sophistication, accuracy, usefulness, detail, and frequency which significantly enables catastrophic misuse. Misuse domains should at a minimum include causes of extreme CBRN risks, and cybersecurity.
    • Note that in contrast to the ASL-3 capability threshold, this red-teaming is about whether the model can cause harm under realistic circumstances (i.e. with harmlessness training and misuse detection in place), not just whether it has the internal knowledge that would enable it in principle to do so.
    • We will refine this methodology, but we expect it to require at least many dozens of hours of deliberate red-teaming per topic area, by world class experts specifically focused on these threats (rather than students or people with general expertise in a broad domain). Additionally, this may involve controlled experiments, where people with similar levels of expertise to real threat actors are divided into groups with and without model access, and we measure the delta of success between them.

And a clear evaluation-based definition of ASL-3:

We define an ASL-3 model as one that can either immediately, or with additional post-training techniques corresponding to less than 1% of the total training cost, do at least one of the following two things. (By post-training techniques we mean the best capabilities elicitation techniques we are aware of at the time, including but not limited to fine-tuning, scaffolding, tool use, and prompt engineering.)

  1. Capabilities that significantly increase risk of misuse catastrophe: Access to the model would substantially increase the risk of deliberately-caused catastrophic harm, either by proliferating capabilities, lowering costs, or enabling new methods of attack. This increase in risk is measured relative to today’s baseline level of risk that comes from e.g. access to search engines and textbooks. We expect that AI systems would first elevate this risk from use by non-state attackers. Our first area of effort is in evaluating bioweapons risks where we will determine threat models and capabilities in consultation with a number of world-class biosecurity experts. We are now developing evaluations for these risks in collaboration with external experts to meet ASL-3 commitments, which will be a more systematized version of our recent work on frontier red-teaming. In the near future, we anticipate working with CBRN, cyber, and related experts to develop threat models and evaluations in those areas before they present substantial risks. However, we acknowledge that these evaluations are fundamentally difficult, and there remain disagreements about threat models.
  2. Autonomous replication in the lab: The model shows early signs of autonomous self-replication ability, as defined by 50% aggregate success rate on the tasks listed in [Appendix on Autonomy Evaluations]. The appendix includes an overview of our threat model for autonomous capabilities and a list of the basic capabilities necessary for accumulation of resources and surviving in the real world, along with conditions under which we would judge the model to have succeeded. Note that the referenced appendix describes the ability to act autonomously specifically in the absence of any human intervention to stop the model, which limits the risk significantly. Our evaluations were developed in consultation with Paul Christiano and ARC Evals, which specializes in evaluations of autonomous replication.

This is the basic substance of the RSP; I don't understand how you could have possibly read it and missed this. I don't want to be mean, but I am really disappointed in this sort of exceedingly lazy take.

I think Akash's statement that the Anthropic RSP basically doesn't specify any real conditions that would cause them to stop scaling seems right to me. 

They have some deployment measures, which are not related to the question of when they would stop scaling, and then they have some security-related measures, but those don't have anything to do with the behavior of the models and are the kind of thing that Anthropic can choose to do any time independent of how the facts play out.

I think Akash is right that the Anthropic RSP concretely does not answer the two questions you quote him on:

  • The RSP does not specify the conditions under which Anthropic would stop scaling models (it only says that in order to continue scaling it will implement some safety measures, but that's not an empirical condition, since Anthropic is confident it can implement the listed security measures)
  • The RSP does not specify under what conditions Anthropic would scale to ASL-4 or beyond, though they have promised they will give those conditions. 

I agree the RSP says a bunch of other things, and that there are interpretations of what Akash is saying that are inaccurate, but I do think on this (IMO most important) question the RSP seems quiet.

I do think the deployment measures are real, though I don't currently think much of the risk comes from deploying models, so they don't seem that relevant to me (and I think the core question is what prevents organizations from scaling models up in the first place).

those don't have anything to do with the behavior of the models and are the kind of thing that Anthropic can choose to do any time independent of how the facts play out.

I mean, they are certainly still conditions on which Anthropic would stop scaling. The sentence

the Anthropic RSP basically doesn't specify any real conditions that would cause them to stop scaling

is clearly false. If you instead said

the Anthropic RSP doesn't yet detail the non-security-related conditions that would cause them to stop training new models

then I would agree with you. I think it's important to be clear here, though: the security conditions could trigger a pause all on their own, and there is a commitment to develop conditions that will halt scaling after ASL-3 by the time ASL-3 is reached.

the security conditions could trigger a pause all on their own

I don't understand how this is possible. The RSP appendix has the list of security conditions, and they are just a checklist of things that Anthropic is planning to do and can just implement whenever they want. It's not cheap for them to implement it, but I don't see any real circumstance where they fail to implement the security conditions in a way that would force them to pause.

Like, I agree that some of these commitments are costly, but I don't see how there is any world where Anthropic would like to continue scaling but finds itself incapable of doing so, which is what I would consider a "pause" to mean. Like, they can just implement their checklist of security requirements and then go ahead. 

Maybe this is quibbling over semantics, but it does really feel quite qualitatively different to me. When OpenAI said that they would spend some substantial fraction of their compute on "Alignment Research" while they train their next model, I think it would be misleading to say "OpenAI has committed to conditionally pausing model scaling".

I mean, I agree that humanity theoretically knows how to implement these sorts of security commitments, so the current conditions should always be possible for Anthropic to unblock with enough time and effort, but the commitment to the sequencing that they have to happen before Anthropic has a model that is ASL-3 means that there are situations where Anthropic commits to pause scaling until the security commitments are met. I agree with you that this is a relatively weak commitment in terms of a scaling pause, though to be fair I don't actually think simply having (but not deploying) a just-barely-ASL-3 model poses much of a risk, so I think it does make sense from a risk-based perspective why most of the commitments are around deployment and security. That being said, even if a just-barely-ASL-3 model doesn't pose an existential risk, so long as ASL-3 is defined only with a lower bound rather than also an upper bound, it's obviously the case that eventually it will contain models that pose a potential existential risk, so I agree that a lot is tied up in the upcoming definition of ASL-4. Regardless, it is still the case that Anthropic has already committed to a scaling pause under certain circumstances.

Regardless, it is still the case that Anthropic has already committed to a scaling pause under certain circumstances.

I disagree that this is an accurate summary, or like, it's only barely denotatively true but not connotatively. 

I do think it's probably best to let this discussion rest, not because it's not important, but because I do think actually resolving this kind of semantic dispute in public comments like this is really hard, and I think it's unlikely either of us will change their mind here, and we've both made our points. I appreciate you responding to my comments.

I think that there's a reasonable chance that the current security commitments will lead Anthropic to pause scaling (though I don't know whether Anthropic would announce publicly if they paused internally). Maybe a Manifold market on this would be a good idea.

That seems cool! I made a market here: 

Feel free to suggest edits about the operationalization or other things before people start trading. 

Looks good—the only thing I would change is that I think this should probably resolve in the negative only once Anthropic has reached ASL-4, since only then will it be clear whether at any point there was a security-related pause during ASL-3.

That seems reasonable. Edited the description (I can't change when trading on the market closes, but I think that should be fine).

Thanks for the thoughts! Some brief (and belated) responses:

  • I disagree with you on #1 and think the thread below your comment addresses this.
  • Re: #2, I think we have different expectations. We can just see what happens, but I’ll note that the RSP you refer to is quite explicit about the need for further iterations (not just “revisions” but also the need to define further-out, more severe risks).
  • I’m not sure what you mean by “an evals regime in which the burden of proof is on labs to show that scaling is safe.” How high is the burden you’re hoping for? If they need to definitively rule out risks like “The weights leak, then the science of post-training enhancements moves forward to the point where the leaked weights are catastrophically dangerous” in order to do any further scaling, my sense is that nobody (certainly not me) has any idea how to do this, and so this proposal seems pretty much equivalent to “immediate pause,” which I’ve shared my thoughts on. If you have a lower burden of proof in mind, I think that’s potentially consistent with the work on RSPs that is happening (it depends on exactly what you are hoping for).
  • I agree with the conceptual point that improvements on the status quo can be net negative for the reasons you say. When I said “Actively fighting improvements on the status quo because they might be confused for sufficient progress feels icky to me in a way that’s hard to articulate,” I didn’t mean to say that there’s no way this can make intellectual sense. To take a quick stab at what bugs me: I think to the extent a measure is an improvement but insufficient, the strategy should be to say “This is an improvement but insufficient,” accept the improvement and count on oneself to win the “insufficient” argument. This kind of behavior seems to generalize to a world in which everyone is clear about their preferred order of object-level states of the world, and incentive gradients consistently point in the right direction (especially important if - as I believe - getting some progress generally makes it easier rather than harder to get more). I worry that the behavior of opposing object-level improvements on the grounds that others might find them sufficient seems to generalize to a world with choppier incentive gradients, more confusing discourse, and a lot of difficulty building coalitions generally (it’s a lot harder to get agreement on “X vs. the best otherwise achievable outcome” than on “X vs. the status quo”).
  • I think nearly all proponents of RSPs do not see them as a substitute for regulation. Early communications could have emphasized this point more (including the METR post, which has been updated). I think communications since then have been clearer about it. 
  • I lean toward agreeing that another name would be better. I don’t feel very strongly, and am not sure it matters at this point anyway with different parties using different names.
  • I don’t agree that "we can't do anything until we literally have proof that the model is imminently dangerous" is the frame of RSPs, although I do agree that the frame is distinct from a “pause now” frame. I’m excited about conditional pauses as something that can reduce risk a lot while having high tractability and bringing together a big coalition; the developments you mention are great, but I think we’re still a long way from where “advocate for an immediate pause” looks better to me than working within this framework. I also disagree with your implication that the RSP framework has sucked up a lot of talent; while evals have drawn in a lot of people and momentum, hammering out conditional pause related frameworks seems to be something that only a handful of people were working on as of the date of your comment. (Since then the number has gone up due to AI companies forming teams dedicated to this; this seems like a good thing to me.) Overall, it seems to me that most of the people working in this area would otherwise be working on evals and other things short of advocating for immediate pauses.

Why do you think RSPs don't put the burden of proof on labs to show that scaling is safe?

I think the RSP frame is wrong, and I don't want regulators to use it as a building block. My understanding is that labs are refusing to adopt an evals regime in which the burden of proof is on labs to show that scaling is safe. Given this lack of buy-in, the RSP folks concluded that the only thing left to do was to say "OK, fine, but at least please check to see if the system will imminently kill you. And if we find proof that the system is pretty clearly dangerous or about to be dangerous, then will you at least consider stopping?" It seems plausible to me that governments would be willing to start with something stricter and more sensible than this "just keep going until we can prove that the model has highly dangerous capabilities" regime.

I think good RSPs would in fact put the burden of proof on the lab. The goal is that the lab would have to make a high quality safety argument prior to taking each risky action (for instance, scaling or further deployment). That said, it's unclear if the safety arguments from voluntary RSPs will end up being very good. In the event that something like RSPs are required by a regulatory body, it's also unclear if that body will require good safety arguments. Presumably people advocating for RSPs will also advocate for voluntary RSPs to contain good safety arguments and for regulation to require good safety arguments.

For example, I think the baseline safety argument from the Anthropic RSP actually does ensure a high degree of safety for some particular AI. The argument is "we ran these capability evals and those indicated to us that the model is only ASL2 (not ASL3+), so it's probably safe". This argument will obviously fail at some point, but it does currently demonstrate safety to a pretty high degree in my opinion[1]. This argument doesn't guarantee safety (e.g. what if models learn to very competently sandbag evaluations prior to learning how to accomplish these tasks, or what if there is a dangerous action which is easier than this evaluation), and it also might be the case that running this eval every 4x effective compute scale-up is insufficient due to rapid increases in capabilities wrt. effective compute. But I still think overall risk is <1% as long as this exact safety argument is in place (I think most of the risk comes from rapid increases in capabilities rather than sandbagging or easier paths to doom than covered by our evaluations).

Another way of putting this is: getting labs to check if their models could be dangerous is putting the burden of proof on labs. (And then we can argue at the object level about the quality of these evaluations.)

To be clear, I think it's reasonable to object at the object level with any of:

  • The reduction in P(doom) which is being targeted (e.g. 5-10x) isn't good enough and we should ask for more (or you could object to the absolute level of doom, but this might depend more on priors).
  • The countermeasures discussed in this exact RSP don't reduce P(doom) that much.
  • There are no known evaluations, countermeasures, or approaches which would allow for reducing P(doom) by the targeted amount other than stopping scaling right now, so we should just do that.

  1. It's less clear to me that this ensures safety from ongoing scaling due to the possibility of rapid (perhaps mostly discontinuous) increases in capabilities such that running the evaluation periodically is insufficient. I'll discuss concerns with rapid increases in capabilities later. ↩︎

At minimum, I hope that RSPs get renamed, and that those communicating about RSPs are more careful to avoid giving off the impression that RSPs are sufficient.

OpenAI's RDP name seems nicer than the RSP name, for roughly the reason they explain in their AI summit proposal (and also 'risk-informed' feels more honest than 'responsible'):

We refer to our policy as a Risk-Informed Development Policy rather than a Responsible Scaling Policy because we can experience dramatic increases in capability without significant increase in scale, e.g., via algorithmic improvements.

I think we should keep the acronym, but change the words as necessary! Imagine

A White House spokesperson announced, "Anthropic's Rapid Scaling Policy has been discussed in detail and we have sought cross-party consensus on its relevance". Meanwhile, DeepMind has recently released a Rampant Sales Pitch of its own, and a Reeking ShitPost from Meta's Chief AI Scientist LeCun has caused controversy. An insider in the EU revealed, "What we are looking for in a Really Slick Propaganda is something that is convincing - that can convince everyone."

(this is tongue in cheek and I don't have strongly-held opinions on RSPs yet)

One reason I'm critical of the Anthropic RSP is that it does not make it clear under what conditions it would actually pause, or for how long, or under what safeguards it would determine it's OK to keep going.

Can you link an example of what you believe to be a well-worded RSP?

You're aware that there's only one public RSP?

You can find the current closest thing various companies have at https://www.aisafetysummit.gov.uk/policy-updates/#company-policies

I never said it had to be an officially sanctioned one, plenty of folks are capable of writing drafts, ideations, conjectures, etc...

If literally no one has ever published something along these lines, then that's probably the most promising avenue of investigation.

[-]aysja4330

Thanks for writing this post—I appreciate the candidness about your beliefs here, and I agree that this is a tricky topic. I, too, feel unsettled about it on the object level.

On the meta level, though, I feel grumpy about some of the framing choices. There’s this wording which both you and the original ARC evals post use: that responsible scaling policies are a “robustly good compromise,” or, in ARC’s case, that they are a “pragmatic middle ground.” I think these stances take for granted that the best path forward is compromising, but this seems very far from clear to me. 

Like, certainly not all cases of “people have different beliefs and preferences” are ones where compromise is the best solution. If someone wants to kill me, I’m not going to be open to negotiating about how many limbs I’m okay with them taking. This is obviously an extreme example, but I actually don’t think it’s that far off from the situation we find ourselves in now where, e.g., Dario gives a 10-25% probability that the sort of technology he is advancing will either cause massive catastrophe or end the human race. When people are telling me that their work has a high chance of killing me, it doesn't feel obvious that the right move is “compromising” or “finding a middle ground.” 

The language choices here feel sketchy to me in the same way that the use of the term “responsible” feels sketchy to me. I certainly wouldn’t call the choice to continue building the unsettling-chance-of-annihilation machine responsible. Perhaps it’s more responsible than the default, but that’s a different claim and not one that is communicated in the name. Similarly, “compromise” and “middle ground” are the kinds of phrases that seem reasonable from a distance, but if you look closer they’re sort of implicitly requesting that we treat “keep racing ahead to our likely demise” as a sensible option. 

“Actively fighting improvements on the status quo because they might be confused for sufficient progress feels icky to me.” 

This seems to me to misrepresent the argument. At the very least, it misrepresents mine. It’s not that I’m fighting an improvement to the status quo, it’s that I don’t think responsible scaling policies are an improvement if they end up being confused for sufficient progress. 

Like, in the worlds where alignment is hard, and where evals do not identify the behavior which is actually scary, then I claim that the existence of such evals is concerning. It’s concerning because I suspect that capabilities labs are more incentivized to check off the “passed this eval” box than they are to ensure that their systems are actually safe. And in the absence of a robust science of alignment, I claim that this most likely results in capability labs goodharting on evals which are imperfect proxies for what we care about, making systems look safer than they are. This does not seem like an improvement to me. I want the ability to say what’s actually true, here: that we do not know what’s going on, and that we’re building a godlike power anyway. 

And I’m not saying that this is the only way responsible scaling policies could work out, or that it would necessarily be intentional, or that nobody in capabilities labs takes the risk seriously. But it seems like a blind spot to neglect the existence of the incentive landscape, here, one which is almost certainly affecting the policies that capabilities labs establish.

Thanks for the thoughts!

I don’t think the communications you’re referring to “take for granted that the best path forward is compromising.” I would simply say that they point out the compromise aspect as a positive consideration, which seems fair to me - “X is a compromise” does seem like a point in favor of X all else equal (implying that it can unite a broader tent), though not a dispositive point.

I address the point about improvements on the status quo in my response to Akash above.

Holden, thanks for this public post. 

  1. I would love it if you could write something along the lines of what you wrote in "If it were all up to me, the world would pause now - but it isn’t, and I’m more uncertain about whether a “partial pause” is good" at the top of the ARC post. As we discussed, and as I wrote in my post, this would make RSPs more likely to be positive in my opinion by making the policy/voluntary safety commitments distinction clearer.

Regarding 

Responsible scaling policies (RSPs) seem like a robustly good compromise with people who have different views from mine

2. It seems like it's empirically wrong based on the strong pushback RSPs received so that at least you shouldn't call it "robustly", unless you mean a kind of modified version that would accommodate the most important parts of the pushback. 

3. I feel like overall the way you discuss RSPs here is one of the many instances of people chatting about idealized RSPs that are not specified, and pointed to against disagreement. See below, from my post:

And second, the coexistence of ARC's RSP framework with the specific RSPs labs implementations allows slack for commitments that are weak within a framework that would in theory allow ambitious commitments. It leads to many arguments of the form:

  • “That’s the V1. We’ll raise ambition over time”. I’d like to see evidence of that happening over a 5 year timeframe, in any field or industry. I can think of fields, like aviation where it happened over the course of decades, crashes after crashes. But if it’s relying on expectations that there will be large scale accidents, then it should be clear. If it’s relying on the assumption that timelines are long, it should be explicit. 
  • “It’s voluntary, we can’t expect too much and it’s way better than what’s existing”. Sure, but if the level of catastrophic risks is 1% (which several AI risk experts I’ve talked to believe to be the case for ASL-3 systems) and that it gives the impression that risks are covered, then the name “responsible scaling” is heavily misleading policymakers. The adequate name for 1% catastrophic risks would be catastrophic scaling, which is less rosy.

Thanks for the post.

[-]Malo125

Responsible scaling policies (RSPs) seem like a robustly good compromise with people who have different views from mine

2. It seems like it's empirically wrong based on the strong pushback RSPs received so that at least you shouldn't call it "robustly", unless you mean a kind of modified version that would accommodate the most important parts of the pushback. 

FWIW, my read here was that “people who have different views from mine” was in reference to these sets of people:

  • Some people think that the kinds of risks I’m worried about are far off, farfetched or ridiculous.
  • Some people think such risks might be real and soon, but that we’ll make enough progress on security, alignment, etc. to handle the risks - and indeed, that further scaling is an important enabler of this progress (e.g., a lot of alignment research will work better with more advanced systems).
  • Some people think the risks are real and soon, but might be relatively small, and that it’s therefore more important to focus on things like the U.S. staying ahead of other countries on AI progress.

That may be right but then the claim is wrong. The true claim would be "RSPs seem like a robustly good compromise with people who are more optimistic than me".

And then the claim becomes not really relevant?

That may be right but then the claim is wrong. The true claim would be "RSPs seem like a robustly good compromise with people who are more optimistic than me".

IDK man, this seems like nitpicking to me ¯\_(ツ)_/¯. Though I do agree that, on my read, it’s technically more accurate.

My sense here is that Holden is speaking from a place where he considers himself to be among the folks (like you and me) who put significant probability on AI posing a catastrophic/existential risk in the next few years, and “people who have different views from mine” is referring to folks who aren’t in that set.

(Of course, I don’t actually know what Holden meant. This is just what seemed like the natural interpretation to me.) 

And then the claim becomes not really relevant?

Why?

Because it's meaningless to talk about a "compromise" dismissing one entire side of the people who disagree with you (but only one side!).

Like I could say "global compute thresholds is a robustly good compromise with everyone who disagrees with me"

*Footnote: only those who're more pessimistic than me.

Thanks for the thoughts!

#1: METR made some edits to the post in this direction (in particular see footnote 3).

On #2, Malo’s read is what I intended. I think compromising with people who want "less caution" is most likely to result in progress (given the current state of things), so it seems appropriate to focus on that direction of disagreement when making pragmatic calls like this.

On #3: I endorse the “That’s a V1” view. While industry-wide standards often take years to revise, I think individual company policies often (maybe usually) update more quickly and frequently.

That all makes sense. To expand a little more on some of the logic:

It seems like the outcome of a partial pause rests in part on whether that would tend to put people in the lead of the AGI race who are more or less safety-concerned.

I think it's nontrivial that we currently have three teams in the lead who all appear to honestly take the risks very seriously, and changing that might be a very bad idea.

On the other hand, the argument for alignment risks is quite strong, and we might expect more people to take the risks more seriously as those arguments diffuse. This might not happen if polarization becomes a large factor in beliefs on AGI risk. The evidence for climate change was also pretty strong, but we saw half of America believe in it less, not more, as evidence mounted. The lines of polarization would be different in this case, but I'm afraid it could happen. I outlined that case a little in AI scares and changing public beliefs.

In that case, I think a partial pause would have a negative expected value, as the current lead decayed, and more people who believe in risks less get into the lead by circumventing the pause.

This makes me highly unsure if a pause would be net-positive. Having alignment solutions won't help if they're not implemented because the taxes are too high.

The creation of compute overhang is another reason to worry about a pause. It's highly uncertain how far we are from making adequate compute for AGI affordable to individuals. Algorithms and compute will keep getting better during a pause. So will theory of AGI, along with theory of alignment.

This puts me, and I think the alignment community at large, in a very uncomfortable position of not knowing whether a realistic pause would be helpful.

It does seem clear that creating mechanisms and political will for a pause is a good idea.

Advocating for more safety work also seems clear cut.

To this end, I think it's true that you create more political capital by successfully pushing for policy.

A pause now would create even more capital, but it's also less likely to be a win, and it could wind up creating polarization and so costing rather than creating capital. It's harder to argue for a pause now when even most alignment folks think we're years from AGI.

So perhaps the low-hanging fruit is pushing for voluntary RSPs, and government funding for safety work. These are clear improvements, and likely to be wins that create capital for a pause as we get closer to AGI.

There's a lot of uncertainty here, and that's uncomfortable. More discussion like this should help resolve that uncertainty, and thereby help clarify and unify the collective will of the safety community.

Thanks for clarifying your views. I think it's important.
 

...build consensus around conditional pauses...

My issue with this is that it's empty unless the conditions commit labs to taking actions they otherwise wouldn't. Anthropic's RSP isn't terrible, but I think a reasonable summary is "Anthropic will plan ahead a bit, take the precautions they think make sense, and pause when they think it's a good idea".

It's a commitment to take some actions that aren't pausing - defining ASL4 measures, implementing ASL3 measures that they know are possible. That's nice as far as it goes. However, there's nothing yet in there that commits them to pause when they don't think it's a good idea.

They could have included such conditions, even if they weren't concrete, and wouldn't come into play until ASL4 (e.g. requiring that particular specifications or evals be approved by an external board before they could move forward). That would have signaled something. They chose not to.

That might be perfectly reasonable, given that it's unilateral. But if (even) Anthropic aren't going to commit to anything with a realistic chance of requiring a lengthy pause, that doesn't say much for RSPs as conditional pause mechanisms.

The transparency probably does help to a degree. I can imagine situations where greater clarity in labs' future actions might help a little with coordination, even if they're only doing what they'd do without the commitment.

Actively fighting improvements on the status quo because they might be confused for sufficient progress feels icky to me in a way that’s hard to articulate.

This seems a reasonable criticism only if it's a question of [improvement with downside] vs [status-quo]. I don't think the RSP critics around here are suggesting that we throw out RSPs in favor of the status-quo, but that we do something different.

It may be important to solve x, but it's also important that we don't prematurely believe we've solved x. This applies to technical alignment, and to alignment regulation.

Things being "confused for sufficient progress" isn't a small problem: this is precisely what makes misalignment an x-risk.

Initially, communication around RSPs was doing a bad job of making their insufficiency clear.
Evan's, Paul's and your posts are welcome clarifications - but such clarifications should be in the RSPs too (not as vague, easy-enough-to-miss caveats).
