Thanks for writing this, Evan! I think it's the clearest writeup of RSPs & their theory of change so far. However, I remain pretty disappointed in the RSP approach and the comms/advocacy around it.
I plan to write up more opinions about RSPs, but one I'll express for now is that I'm pretty worried that the RSP dialogue is suffering from motte-and-bailey dynamics. One of my core fears is that policymakers will walk away with a misleadingly positive impression of RSPs. I'll detail this below:
What would a good RSP look like?
Strongly agree with almost all of this.
My main disagreement is that I don't think the "What would a good RSP look like?" description is sufficient without explicit conditions beyond evals. In particular, we should expect that our suite of tests will be insufficient at some point, absent hugely improved understanding - and we shouldn't expect to understand how and why it's insufficient before reality punches us in the face.
Therefore, it's not enough to show: [here are tests covering all the problems anyone thought of, and reasons why we expect them to work as intended].
We also need strong evidence that there'll be no catastrophe-inducing problems we didn't think of. (not [none that SotA methods could find], not [none discoverable with reasonable cost]; none)
This can't be implicit, since it's a central way that we die.
If it's hard/impractical to estimate, then we should pause until we can estimate it more accurately
This is the kind of thing that I expect to be omitted from RSPs as a matter of course, precisely because we lack the understanding to create good models/estimates, legible tests etc. That doesn't make it ok. Blank map is not blank territory.
If we're thinking...
Therefore, it's not enough to show: [here are tests covering all the problems anyone thought of, and reasons why we expect them to work as intended]. We also need strong evidence that there'll be no catastrophe-inducing problems we didn't think of. (not [none that SotA methods could find], not [none discoverable with reasonable cost]; none) This can't be implicit, since it's a central way that we die. If it's hard/impractical to estimate, then we should pause until we can estimate it more accurately
This is the kind of thing that I expect to be omitted from RSPs as a matter of course, precisely because we lack the understanding to create good models/estimates, legible tests etc. That doesn't make it ok. Blank map is not blank territory.
Yeah, I agree—that's why I'm specifically optimistic about understanding-based evals, since I think they actually have the potential to force us to catch unknown unknowns here, the idea being that they require you to prove that you actually understand your model to a level where you'd know if there were anything wrong that your other evals might miss.
...Define ASLs or similar now rather than waiting until we're much closer to achieving them. Waiting
IANAL, but I think that this is currently impossible due to anti-trust regulations.
I don't know anything about anti-trust enforcement, but it seems to me that this might be a case where labs should do it anyways & delay hypothetical anti-trust enforcement by fighting in court.
I happen to think that the Anthropic RSP is fine for what it is, but it just doesn't actually make any interesting claims yet. The key thing is that they're committing to actually having ASL-4 criteria and a safety argument in the future. From my perspective, the Anthropic RSP effectively is an outline for the sort of thing an RSP could be (run evals, have safety buffer, assume continuity, etc) as well as a commitment to finish the key parts of the RSP later. This seems ok to me.
I would have preferred if they had included tentative proposals for ASL-4 evaluations and what their current best safety plan/argument for ASL-4 looks like (using just current science, no magic). Then, they could explain that this plan wouldn't be sufficient for reasonable amounts of safety (insofar as this is what they think).
Right now, they just have a bulleted list for ASL-4 countermeasures, but this is the main interesting thing to me. (I'm not really sold on substantial risk from systems which aren't capable of carrying out that harm mostly autonomously, so I don't think ASL-3 is actually important except as setup.)
What would a good RSP look like?
- Clear commitments along the lines of "we promise to run these 5 specific tests to evaluate these 10 specific dangerous capabilities."
- Clear commitments regarding what happens if the evals go off (e.g., "if a model scores above a 20 on the Hubinger Deception Screener, we will stop scaling until it has scored below a 10 on the relatively conservative Smith Deception Test."; see the toy sketch after this list)
- Clear commitments regarding the safeguards that will be used once evals go off (e.g., "if a model scores above a 20 on the Cotra Situational Awareness Screener, we will use XYZ methods and we believe they will be successful for ABC reasons.")
- Clear evidence that these evals will exist, will likely work, and will be conservative enough to prevent catastrophe
- Some way of handling race dynamics (such that Bad Guy can't just be like "haha, cute that you guys are doing RSPs. We're either not going to engage with your silly RSPs at all, or we're gonna publish our own RSP but it's gonna be super watered down and vague").
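As a toy illustration of the threshold-gated commitment in the second bullet (the eval names above are hypothetical, and the numbers are placeholders rather than any lab's actual policy), the gating logic might look something like this:

```python
# Hypothetical thresholds from the example commitment above.
PAUSE_THRESHOLD = 20   # pause scaling if the screener score exceeds this
RESUME_THRESHOLD = 10  # resume only once the conservative test scores below this

def may_continue_scaling(screener_score: float,
                         conservative_test_score: float,
                         currently_paused: bool) -> bool:
    """Return True if this toy policy permits further scaling."""
    if currently_paused:
        # Once paused, only the more conservative test can lift the pause.
        return conservative_test_score < RESUME_THRESHOLD
    # Otherwise, pause as soon as the screener goes off.
    return screener_score <= PAUSE_THRESHOLD
```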
Yeah, of course this would be nice. But the reason that ARC and Anthropic didn't write this 'good RSP' isn't that they're reckless, but because writing such an RSP is a hard op...
Thanks, Zach! Responses below:
But the reason that ARC and Anthropic didn't write this 'good RSP' isn't that they're reckless, but because writing such an RSP is a hard open problem
I agree that writing a good RSP is a hard open problem. I don't blame ARC for not having solved the "how can we scale safely" problem. I am disappointed in ARC for communicating about this poorly (in their public blog post, and [speculative/rumor-based] maybe in their private government advocacy as well).
I'm mostly worried about the comms/advocacy/policy implications. If Anthropic and ARC had come out and said "look, we have some ideas, but the field really isn't mature enough and we really don't know what we're doing, and these voluntary commitments are clearly insufficient, but if you really had to ask us for our best-guesses RE what to do if there is no government regulation coming, and for some reason we had to keep racing toward god-like AI, here are our best guesses. But please note that this is woefully insufficient and we would strongly prefer government intervention to buy enough time so that we can have actual plans."
I also expect most of the (positive or negative) impact of the recent RSP post...
i would love to see competing RSPs (or, better yet, RTDPs, as @Joe_Collman pointed out in a cousin comment).
It seems to me like the main three RSP posts (ARC's, Anthropic's, and yours) are (perhaps unintentionally?) painting an overly-optimistic portrayal of RSPs.
I mean, I am very explicitly trying to communicate what I see as the success story here. I agree that there are many ways that this could fail—I mention a bunch of them in the last section—but I think that having a clear story of how things could go well is important to being able to work to actually achieve that story.
On top of that, the posts seem to have this "don't listen to the people who are pushing for stronger asks like moratoriums-- instead please let us keep scaling and trust industry to find the pragmatic middle ground" vibe.
I want to be very clear that I've been really happy to see all the people pushing for strong asks here. I think it's a really valuable thing to be doing, and what I'm trying to do here is not stop that but help it focus on more concrete asks.
...I would be more sympathetic to the RSP approach if it was like "well yes, we totally think it'd great to have a moratorium or a global compute cap or a kill switch or a federal agency monitoring risks or a licensing regime", and we also think this RS
To be clear, I definitely agree with this. My position is not “RSPs are all we need”, “pauses are bad”, “pause advocacy is bad”, etc.—my position is that getting good RSPs is an effective way to implement a pause: i.e. “RSPs are pauses done right.”
Some feedback on this: my expectation upon seeing your title was that you would argue, or that you implicitly believe, that RSPs are better than other current "pause" attempts/policies/ideas. I think this expectation came from the common usage of the phrase "done right" to mean that other people are doing it wrong or at least doing it suboptimally.
Sorry yeah I could have explained what I meant further. The way I see it:
‘X is the most effective way that I know of’ = X tops your ranking of the different ways, but could still be below a minimum threshold (e.g. X doesn’t have to even properly work, it could just be less ineffective than all the rest). So one could imagine someone saying “X is the most effective of all the options I found and it still doesn’t actually do the job!”
‘X is an effective way’ = ‘X works, and it works above a certain threshold’.
‘X is Y done right’ = ‘X works and is basically the only acceptable way to do Y,’ where it’s ambiguous or contextual as to whether ‘acceptable’ means that it at least works, that it’s effective, or sth like ‘it’s so clearly the best way that anyone doing the 2nd best thing is doing something bad’.
In terms of explicit claims:
"So one extreme side of the spectrum is build things as fast as possible, release things as much as possible, maximize technological progress [...].
The other extreme position, which I also have some sympathy for, despite it being the absolutely opposite position, is you know, Oh my god this stuff is really scary.
The most extreme version of it was, you know, we should just pause, we should just stop, we should just stop building the technology for, indefinitely, or for some specified period of time. [...] And you know, that extreme position doesn't make much sense to me either."
Dario Amodei, Anthropic CEO, explaining his company's "Responsible Scaling Policy" on the Logan Bartlett Podcast on Oct 6, 2023.
Starts at around 49:40.
This example is not a claim by ARC, though; it seems important to keep track of that in a discussion of what ARC did or didn't claim, even if others making such claims is also relevant.
I’m sympathetic to the idea that it would be good to have concrete criteria for when to stop a pause, were we to start one. But I also think it’s potentially quite dangerous, and corrosive to the epistemic commons, to expect such concreteness before we’re ready to give it.
I’m first going to zoom out a bit—to a broader trend which I’m worried about in AI Safety, and something that I believe evaluation-gating might exacerbate, although it is certainly not the only contributing factor.
I think there is pressure mounting within the field of AI Safety to produce measurables, and to do so quickly, as we continue building towards this godlike power under an unknown timer of unknown length. This is understandable, and I think can often be good, because in order to make decisions it is indeed helpful to know things like “how fast is this actually going” and to assure things like “if a system fails such and such metric, we'll stop.”
But I worry that in our haste we will end up focusing our efforts under the streetlight. I worry, in other words, that the hard problem of finding robust measurements—those which enable us to predict the behavior and safety of AI systems wi...
But I also think it’s potentially quite dangerous, and corrosive to the epistemic commons, to expect such concreteness before we’re ready to give it.
As I mention in the post, we do have the ability to do concrete capabilities evals right now. What we can't do are concrete safety evals, which I'm very clear about not expecting us to have right now.
And I'm not expecting that we eventually solve the problem of building good safety evals either—but I am describing a way in which things go well that involves a solution to that problem. If we never solve the problem of understanding-based evals, then my particular sketch doesn't work as a way to make things go well: but that's how any story of success has to work right now given that we don't currently know how to make things go well. And actually telling success stories is an important thing to do!
If you have an alternative success story that doesn't involve solving safety evals, tell it! But without any alternative to my success story, critiquing it just for assuming a solution to a problem we don't yet have a solution to—which every success story has to do—seems like an extremely unfair criticism.
...It also seems to me like somethin
But without any alternative to my success story, critiquing it just for assuming a solution to a problem we don't yet have a solution to—which every success story has to do—seems like an extremely unfair criticism.
When assumptions are clear, it's not valuable to criticise the activity of daring to consider what follows from them. When assumptions are an implicit part of the frame, they become part of the claims rather than part of the problem statement, and their criticism becomes useful for all involved, in particular making them visible. Putting burdens on criticism such as needing concrete alternatives makes relevant criticism more difficult to find.
This was just an example of a process I expect labs wouldn't commit to, not (necessarily!) a suggestion.
The key criterion isn't even appropriate levels of understanding, but rather appropriate levels of caution - and of sufficient respect for what we don't know. The criterion [...if aysja thinks it's good for global safety] may well be about as good as [...if Eliezer thinks it's good for global safety].
It's much less about [This person knows], than about [This person knows that no-one knows, and has integrated this knowledge into their decision-making].
Importantly, a cautious person telling an incautious person "you really need to be cautious here" is not going to make the incautious person cautious (perhaps slightly more cautious than their baseline - but it won't change the way they think).
A few other thoughts:
Thanks for writing this up.
I agree that the issue is important, though I'm skeptical of RSPs so far, since we have one example and it seems inadequate - to the extent that I'm positively disposed, it's almost entirely down to personal encounters with Anthropic/ARC people, not least yourself. I find it hard to reconcile the thoughtfulness/understanding of the individuals with the tone/content of the Anthropic RSP. (of course I may be missing something in some cases)
Going only by the language in the blog post and the policy, I'd conclude that they're an excuse to continue scaling while being respectably cautious (though not adequately cautious). Granted, I'm not the main target audience - but I worry about the impression the current wording creates.
I hope that RSPs can be beneficial - but I think much more emphasis should be on the need for positive demonstration of safety properties, that this is not currently possible, and that it may take many years for that to change. (mentioned, but not emphasized in the Anthropic policy - and without any "many years" or similar)
It's hard to summarize my concerns, so apologies if the following ends up somewhat redundant.
I'll focus on your post f...
I'm mostly not going to comment on Anthropic's RSP right now, since I don't really want this post to become about Anthropic's RSP in particular. I'm happy to talk in more detail about Anthropic's RSP maybe in a separate top-level post dedicated to it, but I'd prefer to keep the discussion here focused on RSPs in general.
One of my main worries with RSPs is that they'll be both [plausibly adequate as far as governments can tell] and [actually inadequate]. That's much worse than if they were clearly inadequate.
I definitely share this worry. But that's part of why I'm writing this post! Because I think it is possible for us to get good RSPs from all the labs and governments, but it'll take good policy and advocacy work to make that happen.
My main worry here isn't that we'll miss catastrophic capabilities in the near term (though it's possible). Rather it's the lack of emphasis on this distinction: that tests will predictably fail to catch problems, and that there's a decent chance some of them fail before we expect them to.
I agree that this is a serious concern, though I think that at least in the case of capabilities evaluations, it should be solvable. Though it'll require tho...
I think we at least do know how to do effective capabilities evaluations
This seems an overstatement to me:
Where the main risk is misuse, we'd need to know that those doing the testing have methods for eliciting capabilities that are as effective as anything people will come up with later. (including the most artful AutoGPT 3.0 setups etc)
It seems reasonable to me to claim that "we know how to do effective [capabilities given sota elicitation methods] evaluations", but that doesn't answer the right question.
Once the main risk isn't misuse, then we have to worry about assumptions breaking down (no exploration hacking / no gradient hacking / [assumption we didn't realize we were relying upon]). Obviously we don't expect these to break yet, but I'd guess that we'll be surprised the first time they do break.
I expect your guess on when they will break to be more accurate than mine - but that [I don't have much of a clue, so I'm advocating extreme caution] may be the more reasonable policy.
...My concern with trying to put something like [understanding-based evals] into an RSP right now is that it'll end up evaluating the wrong thing: since we don't yet know how to effectively evaluate unders
The point is that advocating for a “pause” is nebulous and non-actionable
Setting aside the potential advantages of RSPs, this strikes me as a pretty weird thing to say. I understand the term "pause" in this context to mean that you stop building cutting-edge AI models, either voluntarily or due to a government mandate. In contrast, "RSP" says you eventually do that but you gate it on certain model sizes and test results and unpause it under other test results. This strikes me as a bit less nebulous, but only a bit.
I'm not quite sure what's going on here - it's possible that the term "pause" has gotten diluted? Seems unfortunate if so.
I think the problem is that nobody really has an idea for what the resumption condition should be for a pause, and nobody's willing to defend the (actually actionable) six-month pause proposal.
the FLI letter asked for “pause for at least 6 months the training of AI systems more powerful than GPT-4” and i’m very much willing to defend that!
my own worry with RSPs is that they bake in (and legitimise) the assumptions that a) near term (eval-less) scaling poses trivial xrisk, and b) there is a substantial period during which models trigger evals but are existentially safe. you must have thought about them, so i’m curious what you think.
that said, thank you for the post, it’s a very valuable discussion to have! upvoted.
Sure, but I guess I would say that we're back to nebulous territory then—how much longer than six months? When if ever does the pause end?
i agree that, if hashed out, the end criteria may very well resemble RSPs. still, i would strongly advocate for scaling moratorium until widely (internationally) acceptable RSPs are put in place.
I'd be very surprised if there was substantial x-risk from the next model generation.
i share the intuition that the current and next LLM generations are unlikely to be an xrisk. however, i don't trust my (or anyone else's) intuitions strongly enough to say that there's a less than 1% xrisk per 10x scaling of compute. in expectation, that's killing 80M existing people -- people who are unaware that this is happening to them right now.
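for reference, the arithmetic behind that 80M figure (taking the world population as roughly 8 billion):

$$0.01 \times 8 \times 10^9 = 8 \times 10^7 \text{ expected deaths per 10x scaling of compute}$$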
Do you think if Anthropic (or another leading AGI lab) unilaterally went out of its way to prevent building agents on top of its API, would this reduce the overall x-risk/p(doom) or not?
Probably, but Anthropic is actively working in the opposite direction:
This means that every AWS customer can now build with Claude, and will soon gain access to an exciting roadmap of new experiences - including Agents for Amazon Bedrock, which our team has been instrumental in developing.
Currently available in preview, Agents for Amazon Bedrock can orchestrate and perform API calls using the popular AWS Lambda functions. Through this feature, Claude can take on a more expanded role as an agent to understand user requests, break down complex tasks into multiple steps, carry on conversations to collect additional details, look up information, and take actions to fulfill requests. For example, an e-commerce app that offers a chat assistant built with Claude can go beyond just querying product inventory – it can actually help customers update their orders, make exchanges, and look up relevant user manuals.
Obviously, Claude 2 as a conversational e-commerce agent is not going to pose catastrophic risk, b...
Is the idea that an indefinite pause is unactionable? If so, I'm not sure why you think that.
I think AI Safety Levels are a good idea, but evals-based classification needs to be complemented by compute thresholds to mitigate the risks of loss of control via deceptive alignment. Here is a non-nebulous proposal.
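A minimal sketch of how the two gates could compose (the compute cap, eval names, and function here are illustrative placeholders, not the linked proposal):

```python
# Toy illustration: a scaling run is permitted only if BOTH the
# evals-based check and a hard compute threshold allow it.

TRAINING_COMPUTE_CAP_FLOP = 1e26  # placeholder cap, not an actual policy number

def scaling_permitted(eval_scores: dict[str, float],
                      danger_thresholds: dict[str, float],
                      planned_training_flop: float) -> bool:
    """Both conditions must hold: every eval stays under its danger
    threshold, and the planned run stays under the compute cap."""
    evals_ok = all(eval_scores[name] < danger_thresholds[name]
                   for name in danger_thresholds)
    compute_ok = planned_training_flop < TRAINING_COMPUTE_CAP_FLOP
    return evals_ok and compute_ok
```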
On RSPs vs pauses, my basic take is that hardcore pauses are better than RSPs and RSPs are considerably better than weak pauses.
Best: we first prevent hardware progress and stop H100 manufacturing for a bit, then we prevent AI algorithmic progress, and then we stop scaling (ideally in that order). Then, we heavily invest in long run safety research agendas and hold the pause for a long time (20 years sounds good to start). This requires heavy international coordination.
I think good RSPs are worse than this, but probably much better than just having a lab pause scaling.
It's possible that various actors should explicitly state that hardcore pauses would be better (insofar as they think so).
- A capabilities evaluation is defined as “a model evaluation designed to test whether a model could do some task if it were trying to. ...
- A safety evaluation is defined as “a model evaluation designed to test under what circumstances a model would actually try to do some task. ...
I propose changing the term for this second type of evaluation to "propensity evaluations". I think this is a better term as it directly fits the definition you provided: "a model evaluation designed to test under what circumstances a model would actually try to do some task".
Moreover, I think that both capabilities evaluations and propensity evaluations can be types of safety evaluations. Therefore, it's misleading to label only one of them as "safety evaluations". For example, we could construct a compelling safety argument for current models using solely capability evaluations.
Either can be sufficient for safety: a strong argument based on capabilities (we've conclusively determined that the AI is too dumb to do anything very dangerous) or a strong argument based on propensity (we have a theoretically robust and empirically validated case that our training process will result in an AI that never attempts to do anything harmful).
Alternatively, a moderately strong argument based on capabilities combined with a moderately strong argument based on propensity can be sufficient, provided that the evidence is sufficiently independent.
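As a toy illustration of why independence matters (numbers purely illustrative): suppose each argument on its own has a 10% chance of being mistaken. A catastrophe requires both arguments to have failed (the model must both be capable of the harm and inclined to attempt it), so

$$P(\text{catastrophe}) \le P(\text{both arguments fail}) = 0.1 \times 0.1 = 0.01$$

if the failure modes are independent, whereas with fully correlated failure modes the bound degrades back to roughly $0.1$.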
I sometimes refer to capability based arguments as control arguments.
Then, we can name two lines of defense: a control line (the AI couldn't cause a catastrophe even if it tried) and a propensity line (the AI wouldn't try).
It's possible to develop techniques which advance either the control line of defense or the propensity line of defense. Of course, many research directions are useful for both.
I expect that in practice, we're not very close to being able to make good propensity arguments (for instance, theory and interpretability both seem to me to be unlikely to establish this in the next several doublings of total R&D effort). However, we're not that far off from building quite powerful control based countermeasures. I think these control based countermeasures could scale to establishing barely acceptable safety arguments (e.g. 1-5% doom per year) for around human level AI systems while using these AI systems to accelerate software based R&D (e.g. alignment research) by >30x.
Further, as Evan noted, control style arguments seem much more straightforward to evaluate (though various complications can result from exploration and gradient hacking). So, I think the feedback loops on control look quite promising.
(Redwood Research, where I work, is currently pursuing several control style projects and we should be putting out various things on this soon.)
I'm not going to respond to everything you're saying here right now. It's pretty likely I won't end up responding to everything you're saying at any point; so apologies for that.
Here are some key claims I want to make:
- "So coordination to do better than this would be great".
- I'd be curious to know what you'd want to aim for here - both in a mostly ideal world, and what seems most expedient.
As far as the ideal, I happened to write something about this in another comment yesterday. Excerpt:
Best: we first prevent hardware progress and stop H100 manufacturing for a bit, then we prevent AI algorithmic progress, and then we stop scaling (ideally in that order). Then, we heavily invest in long run safety research agendas and hold the pause for a long time (20 years sounds good to start). This requires heavy international coordination.
As far as expedient, something like:
Both of these are unlikely to perfectly succeed, but seems like good directions to p...
if you are advocating for a pause, then presumably you have some resumption condition in mind that determines when the pause would end [...] just advocate for that condition being baked into RSPs
Resume when the scientific community has a much clearer idea about how to build AGIs that don't pose a large extinction risk for humanity. This consideration can't be turned into a benchmark right now, hence the technical necessity for a pause to remain nebulous.
RSPs are great, but not by themselves sufficient. Any impression that they are sufficient bundles irresponsible neglect of the less quantifiable risks with the useful activity of creating benchmarks.
(comment crossposted from EA forum)
Very interesting post! But I'd like to push back. The important things about a pause, as envisaged in the FLI letter, for example, are that (a) it actually happens, and (b) the pause is not lifted until there is an affirmative demonstration that the risk has been addressed. The FLI pause call was not, in my view, on the basis of any particular capability or risk, but because of the out-of-control race to do larger giant scaling experiments without any reasonable safety assurances. This pause should still happen, and it should not be lifted until there is a way in place to assure that safety. Many of the things FLI hoped could happen during the pause are happening — there is huge activity in the policy space developing standards, governance, and potentially regulations. It's just that now those efforts are racing the un-paused technology.
In the case of "responsible scaling" (for which I think the ideas of "controlled scaling" or "safety-first scaling" would be better), what I think is very important is that there not be a presumption that the pause will be temporary, and lifted "once" the right mitigations are in place. We may well hit a point (and may be there...
most potentially dangerous capabilities should be highly correlated, such that measuring any of them should be okay. Thus, I think it should be fine to mostly focus on measuring the capabilities that are most salient to policymakers and most clearly demonstrate risks.
Once labs are trying to pass capability evaluations, they will spend effort trying to suppress the specific capabilities being evaluated*, so I think we'd expect them to stop being so highly correlated.
* If they try methods of more generally suppressing the kinds of capabilities that might be dangerous, I think they're likely to test them most on the capabilities being evaluated by RSPs.
My guess is that the hard "Pause" advocates are focussed on optimizing for actions that are "necessary" and "sufficient" for safety, but at the expense of perhaps not being politically or practically "feasible".
Whereas, "Responsible Scaling Policies" advocates may instead describe actions that are "necessary", and more "feasible" however are less likely to be "sufficient".
The crux of this disagreement might be related to how feasible, or how sufficient each of these two pathways respectively are?
Absent any known pathways that solve all three, I'm glad peop...
From my reading of ARC Evals' example of a "good RSP", RSPs set a standard that roughly looks like: "we will continue scaling models and deploying if and only if our internal evals team fails to empirically elicit dangerous capabilities. If they do elicit dangerous capabilities, we will enact safety controls just sufficient for our models to be unsuccessful at, e.g., creating Super Ebola."
This is better than a standard of "we will scale and deploy models whenever we want," but still has important limitations. As noted by the "coordinated pausing" paper, it...
The problem with a naive implementation of RSPs is that we're trying to build a safety case for a disaster that we fundamentally don't understand and where we haven't even produced a single disaster example or simulation.
To be more specific, we don't know exactly which bundles of AI capabilities and deployments will eventually result in a negative outcome for humans. Worse, we're not even trying to answer that question - nobody has run an "end of the world simulator" and as far as I am aware there are no plans to do that.
Without such a model it's very diff...
Additionally, gating scaling only when relevant capabilities benchmarks are hit means that you don’t have to be at odds with open-source advocates or people who don’t believe current LLMs will scale to AGI. Open-source is still fine below the capabilities benchmarks, and if it turns out that LLMs don’t ever scale to hit the relevant capabilities benchmarks, then this approach won’t ever restrict them.
Can you clarify whether this is implying that open-source capability benchmark thresholds will be at the same or similar levels to closed-source ones? That is...
If the model is smart enough, you die before writing the evals report; if it’s just kinda smart, you don’t find it to be too intelligent and die after launching your scalable oversight system that, as a whole, is smarter than individual models.
An international moratorium on all training runs that could stumble on something that might kill everyone is much more robust than regulations around evaluated capabilities of already trained models
Edit: Huh. I would’ve considered the above to be relatively uncontroversial on LessWrong. Can someone explain where I’m wrong?
COI: I am a research scientist at Anthropic, where I work on model organisms of misalignment; I was also involved in the drafting process for Anthropic’s RSP. Prior to joining Anthropic, I was a Research Fellow at MIRI for three years.
Thanks to Kate Woolverton, Carson Denison, and Nicholas Schiefer for useful feedback on this post.
Recently, there’s been a lot of discussion and advocacy around AI pauses—which, to be clear, I think is great: pause advocacy pushes in the right direction and works to build a good base of public support for x-risk-relevant regulation. Unfortunately, at least in its current form, pause advocacy seems to lack any sort of coherent policy position. Furthermore, what’s especially unfortunate about pause advocacy’s nebulousness—at least in my view—is that there is a very concrete policy proposal out there right now that I think is basically necessary as a first step here, which is the enactment of good Responsible Scaling Policies (RSPs). And RSPs could very much live or die right now based on public support.
If you’re not familiar with the concept of an RSP, the central idea of RSPs is evaluation-gated scaling—that is, AI labs can only scale models depending on some set of evaluations that determine whether additional scaling is appropriate. ARC’s definition is:
How do we make it to a state where AI goes well?
I want to start by taking a step back and laying out a concrete plan for how we get from where we are right now to a policy regime that is sufficient to prevent AI existential risk.
The most important background here is my “When can we trust model evaluations?” post, since knowing the answer to when we can trust evaluations is extremely important for setting up any sort of evaluation-gated scaling. The TL;DR there is that it depends heavily on the type of evaluation:
With that as background, here’s a broad picture of how things could go well via RSPs (note that everything here is just one particular story of success, not necessarily the only story of success we should pursue or a story that I expect to actually happen by default in the real world):
Reasons to like RSPs
Obviously, the above is only one particular story for how things go well, but I think it’s a pretty solid one. Here are some reasons to like it:
How do RSPs relate to pauses and pause advocacy?
In my opinion, RSPs are pauses done right: if you are advocating for a pause, then presumably you have some resumption condition in mind that determines when the pause would end. In that case, just advocate for that condition being baked into RSPs! And if you have no resumption condition—you want a stop rather than a pause—I empathize with that position but I don’t think it’s (yet) realistic. As I discussed above, it requires labs and governments to sacrifice too much present value (rather than just potential future value), isn’t legibly risk-based, doesn’t provide early wins, etc. Furthermore, I think the best way to actually make a full stop happen is still going to look like my story above, just with RSP thresholds that are essentially impossible to meet.
Furthermore, I want to be very clear that I don’t mean “stop pestering governments and focus on labs instead”—we should absolutely try to get governments to adopt RSP-like policies and get as strong conditions as possible into any RSP-like policies that they adopt. What separates pause advocacy from RSP advocacy isn’t who it’s targeted at, but the concreteness of the policy recommendations that it’s advocating for. The point is that advocating for a “pause” is nebulous and non-actionable—“enact an RSP” is concrete and actionable. Advocating for labs and governments to enact as good RSPs as possible is a much more effective way to actually produce concrete change than highly nebulous pause advocacy.
Furthermore, RSP advocacy is going to be really important! I’m very worried that we could fail at any of the steps above, and advocacy could help substantially. In particular: