Cross-posted with the EA Forum.
I found this post very frustrating, because it's almost all dedicated to whether current RSPs are sufficient or not (I agree that they are insufficient), but that's not my crux and I don't think it's anyone else's crux either. And for what I think is probably the actual crux here, you only have one small throwaway paragraph:
Which brings us to the question: “what’s the effect of RSPs on policy and would it be good if governments implemented those”. My answer to that is: An extremely ambitious version yes; the misleading version, no. No, mostly because of the short time we have before we see heightened levels of risks, which gives us very little time to update regulations, which is a core assumption on which RSPs are relying without providing evidence of being realistic.
As I've talked about now extensively, I think enacting RSPs in policy now makes it easier not harder to get even better future regulations enacted. It seems that your main reason for disagreement is that you believe in extremely short timelines / fast takeoff, such that we will never get future opportunities to revise AI regulation. That seems pretty unlikely to me: in particular, my expectation is that as AI continues to heat up in terms of its economic impact, new policy windows will keep arising in rapid succession, and that we will see many of them before the end of days.
Thanks for your comment.
@evhub I think it's great when you and other RSP supporters make it explicit that (a) you don't think they're sufficient and (b) you think they can lead to more meaningful regulation.
With that in mind, I think the onus is on you (and institutions like Anthropic and ARC) to say what kinds of regulations you support & why. And then I think most of the value will come from "what actual regulations are people proposing" and not "what is someone's stance on this RSP thing which we all agree is insufficient."
Except that there are ways to talk about RSPs that are misleading for policymakers and reduce the chance of meaningful regulations. See the end of my comment, and see also Simeon's sections on misleadingness and on how to move forward.
Also, fwiw, I imagine that timelines/takeoff speeds might be relevant cruxes. And IDK if it's the main disagreement that you have with Siméon, but I don't think it's the main disagreement you have with me.
Even if I thought we would have 3 more meaningful policy windows, I would still think that RSPs have not offered a solid frame/foundation for meaningful regulation, I would still think that they are being communicated about poorly, and I would still want people to focus more on proposals for other regulations & focus less on RSPs.
With that in mind, I think the onus is on you (and institutions like Anthropic and ARC) to say what kinds of regulations you support & why.
Two related questions:
Do you mind pointing me to the section? I skimmed your post again, and the only relevant thing I saw was this part:
- Seeing the existing RSP system in place at labs, governments step in and use it as a basis to enact hard regulation.
- By the time it is necessary to codify exactly what safety metrics are required for scaling past models that pose a potential takeover risk, we have clearly solved the problem of understanding-based evals and know what it would take to demonstrate sufficient understanding of a model to rule out e.g. deceptive alignment.
- Understanding-based evals are adopted by governmental RSP regimes as hard gating evaluations for models that pose a potential takeover risk.
- Once labs start to reach models that pose a potential takeover risk, they either:
- Solve mechanistic interpretability to a sufficient extent that they are able to pass an understanding-based eval and demonstrate that their models are safe.
- Get blocked on scaling until mechanistic interpretability is solved, forcing a reroute of resources from scaling to interpretability.
My summary of this is something like "maybe voluntary RSPs will make it more likely for governments to force people to do evals. And not just the inadequate dangerous capabilities evals we have now, but also the better understanding-based evals that are not yet developed, but that hopefully we will have solved some technical problems in time to build."
I think this is better than no government regulation, but the main problem (if I'm understanding this correctly) is that it relies on evals that we do not have.
IMO, a more common-sense approach would be "let's stop until we are confident that we can proceed safely", and I'm more excited about those who are pushing for this position.
Aside: I don't mean to nitpick your wording, but I think a "full plan" would involve many more details. In the absence of those details, it's hard to evaluate the plan. Examples of some details that would need to be ironed out:
To be clear, I don't think you need to specify all of this, and some of these are pretty specific/nitpicky, but I don't think you should be calling this a "full plan."
it relies on evals that we do not have
I agree that this is a problem, but it strikes me that we wouldn't necessarily need a concrete eval - i.e. we wouldn't need [by applying this concrete evaluation process to a model, we can be sure we understand it sufficiently].
We could have [here is a precise description of what we mean by "understanding a model", such that we could, in principle, create an evaluation process that answers this question].
We can then say in an RSP that certain types of model must pass an understanding-in-this-sense eval, even before we know how to write an understanding-in-this-sense eval. (though it's not obvious to me that defining the right question isn't already most of the work)
Personally, I'd prefer that this were done already - i.e. that anything we think is necessary should be in the RSP at some level of abstraction / indirection. That might mean describing properties an eval would need to satisfy. It might mean describing processes by which evals could be approved - e.g. deferring to an external board. [Anthropic's Long Term Benefit Trust doesn't seem great for this, since it's essentially just Paul who'd have relevant expertise (?? I'm not sure about this - it's just unclear that any of the others would)]
I do think it's reasonable for labs to say that they wouldn't do this kind of thing unilaterally - but I would want them to push for a more comprehensive setup when it comes to policy.
A more in-depth answer:
As I've talked about now extensively, I think enacting RSPs in policy now makes it easier not harder to get even better future regulations enacted.
I appreciate the evidence you've provided on this, and in particular I think it's more than has been provided for the opposite claim and would encourage Simeon and others criticizing RSPs along these lines to provide more evidence (as I commented on a draft of his post).
That being said, I don't yet find the evidence you've provided particularly compelling. I believe you are referring mainly to this section of your post:
In the theory of political capital, it is a fairly well-established fact that “Everybody Loves a Winner.” That is: the more you succeed at leveraging your influence to get things done, the more influence you get in return. This phenomenon is most thoroughly studied in the context of the ability of U.S. presidents’ to get their agendas through Congress—contrary to a naive model that might predict that legislative success uses up a president’s influence, what is actually found is the opposite: legislative success engenders future legislative success, greater presidential approval, and long-term gains for the president’s party.
I don't understand how the links in this section show that "Everybody Loves a Winner" is a fairly well-established fact that translates to the situation of RSPs. The first link is an op-ed that is paywalled. The second link is a 2013 paper with 7 citations; from the abstract, it appears to show that US presidents get higher approval ratings when they succeed in passing legislation, and vice versa. The third link is a 2011 paper with 62 citations (which seems higher, though I'm not sure how high this is for its field); from the abstract, it appears to show that presidents who pass their agendas through Congress help their party win more Congressional seats. These interpretations don't seem too different from the way you summarized it.
Assuming that this version of "Everybody Loves a Winner" is in fact a well-established fact in the field, the claims it makes still might not translate well to the RSP context. In particular, RSPs are a legislative framework on a specific (currently niche) issue of AI safety. The fact that presidents who get things done in general tend to receive other benefits, perhaps including getting more things done later, doesn't seem that relevant to the question of the extent to which frameworks on a specific issue tend to be "locked in" after being enacted into law, versus serving as useful blueprints for future iteration (including potentially large revisions to the framework).
Again, I appreciate you at least providing some evidence, but it doesn't seem convincing to me. FWIW my intuitions lean a bit toward your claims (coming from a startup-y background of "push out an MVP, then iterate from there"), but I have a lot of uncertainty.
(This comment is somewhat like an expanded version of my tweet, which also asked for "Any high-quality analyses on whether pushing more ambitious policies generally helps/hurts the more moderate policies, and vice/versa?". I received answers like "it depends" and "unclear".)
Thanks Eli for the comment.
One reason why I haven't provided much evidence is that I think it's substantially harder to give evidence for a "for all" claim (my side of the argument) than for a "there exists" claim (what I'm asking of Evan). Based on what I've seen, I claim that frameworks on niche areas don't evolve that fast without accidents, even in domains with substantial updates, like aviation and nuclear.
I could potentially see it happening with large accidents, but I personally don't want to bet on that, and I would want it to be transparent if that's the assumption. I also don't buy the "small coordinations allow larger coordinations" claim for domain-specific policy. Beyond what you said above, my sense is that policymakers satisfice, and hence tend not to revisit a policy that sucks as long as it looks sufficiently good to stakeholders that there are no substantial incentives to change it.
GDPR cookie banners suck for everyone and haven't been updated, seven years after GDPR. Standards in the EU are not updated more often than every five years by default (I'm talking about standards, not regulation), and we'll have to bargain to try to bring that down to reasonable timeframes for AI specifically.
The IAEA & nuclear safety standards upgraded substantially after each accident, and likewise for aviation, but we're talking about decades, not five years.
I feel like a lot of the issues in this post are that the published RSPs are not very detailed and most of the work to flesh them out is not done. E.g. the comparison to other risk policies highlights lack of detail in various ways.
I think it takes a lot of time and work to build out something with lots of analysis and detail, potentially years of work to really do it right. And yes, much of that work hasn't happened yet.
But I would rather see labs post the work they are doing as they do it, so people can give feedback and input. If labs do so, the frameworks will necessarily be much less detailed than they would if we waited until they were complete.
So it seems to me that we are in a messy process that's still very early days. Feedback about what is missing and what a good final product would look like is super valuable, thank you for your work doing that. I hope the policy folks pay close attention.
But I think your view that RSPs are the wrong direction is misguided, or at least I don't find your reasons to be persuasive -- there's much more work to be done before they're good and useful, but that doesn't mean they're not valuable. Honestly I can't think of anything much better that could have been reasonably done given the limited time and resources we all have.
I think your comments on the name are well taken. I think your ideas about disclaimers and such are basically impossible for a modern corporation, unfortunately. I think your suggestion about pushing for risk management in policy is the clear next step, one that's only enabled by the existence of an RSP in the first place.
Thanks for the detailed and thoughtful effortpost about RSPs!
Thanks for your comment.
I feel like a lot of the issues in this post are that the published RSPs are not very detailed and most of the work to flesh them out is not done.
I strongly disagree with this. In my opinion, a lot of the issue is that RSPs have been designed from first principles, without much consideration for everything the risk management field has done, and hence get things wrong without noticing.
It's not a matter of how detailed they are; they get the broad principles wrong. As I argued (the entire table is about this), I think the existing principles of other standards are just way better, so no, it's not a matter of details.
As I said, the details & evals of RSPs are actually the one thing that I'd keep and include in a risk management framework.
Honestly I can't think of anything much better that could have been reasonably done given the limited time and resources we all have
Well, I recommend looking at Section 3 and the source links. Starting from those frameworks and including evals into it would be a Pareto improvement.
But most of the deficiencies you point out in the third column of that table are about missing and insufficient risk analysis. E.g.:
If people took your proposal as a minimum bar for how thorough a risk management proposal would be, before publishing, it seems like that would interfere with labs being able to "post the work they are doing as they do it, so people can give feedback and input".
This makes me wonder: Would your concerns be mostly addressed if ARC had published a suggestion for a much more comprehensive risk management framework, and explicitly said "these are the principles that we want labs' risk-management proposals to conform to within a few years, but we encourage less-thorough risk management proposals before then, so that we can get some commitments on the table ASAP, and so that labs can iterate in public. And such less-thorough risk management proposals should prioritize covering x, y, z."
Would your concerns be mostly addressed if ARC had published a suggestion for a much more comprehensive risk management framework, and explicitly said "these are the principles that we want labs' risk-management proposals to conform to within a few years, but we encourage less-thorough risk management proposals before then, so that we can get some commitments on the table ASAP, and so that labs can iterate in public. And such less-thorough risk management proposals should prioritize covering x, y, z."
Great question! A few points:
Regarding
If people took your proposal as a minimum bar for how thorough a risk management proposal would be, before publishing, it seems like that would interfere with labs being able to "post the work they are doing as they do it, so people can give feedback and input".
I don't think that's necessarily right; e.g., "the ISO standard asks the organization to define risk thresholds" could be a very simple task, much simpler than developing a full eval. The tricky thing is just to ensure we comply with such thresholds (and the inability to do that obviously reveals a lack of safety).
On "ISO proposes a much more comprehensive procedure than RSPs": it's also not right that this would take longer; it's just that there exist risk management tools, which you can run in a few days, that help achieve very broad coverage of the scenario set.
On "imply significant chances to be stolen by Russia or China (...). What are the risks downstream of that?": once again, you can cover the most obvious things in a couple of pages, writing e.g. "Maybe they would give the weights to their team of hackers, which would substantially increase the chances of a leak and of a global increase in cyberoffense." And I would be totally fine with half-baked things if they were communicated as such, and not the way RSPs are.
c. linked to a post which says "here's the actual thing which is needed to make us safe".
Has anyone made such a credible, detailed, and comprehensive list?
If not, what would it look like in your opinion?
Thanks for your effort in writing this.
I'm very glad people are taking this seriously and exploring various approaches.
However, I happen to think your policy suggestion would be a big mistake.
On your recommendations:
1) Entirely agree. The name sucks, for the reasons you state.
2) Agreed. Much more explicit clarity on this would be great.
3) No.
I'll elaborate on (3):
“measure the risks, deal with them, and make the residual level of risks and the methodology public”.
I'll agree that it would be nice if we knew how to do this, but we do not.
With our current level of understanding, we fall at the first hurdle (we can measure some of the risks).
“Inability to show that risks are below acceptable levels is a failure. Hence, the less we understand a system, the harder it is to claim safety.”
This implies an immediate stop to all frontier AI development (and probably a rollback of quite a few deployed systems). We don't understand. We cannot demonstrate risks are below acceptable levels.
Assemble a representative group of risk management experts, AI risk experts...
The issue here is that AI risk "experts" in the relevant sense do not exist.
We have "experts" (those who understand more than almost anyone else).
We have no experts (those who understand well).
For a standard risk management approach, we'd need people who understand well.
Given our current levels of understanding, all a team of "experts" could do would be to figure out a lower bound on risk. I.e. "here are all the ways we understand that the system could go wrong, making the risk at least ...".
We don't know how to estimate an upper bound in any way that doesn't amount to a wild guess.
Why is pushing for risk quantification in policy a bad idea?
Because, logically, it should amount to an immediate stop on all development.
However, since "We should stop immediately because we don't understand" can be said in under ten words, if any much more lengthy risk-management approach is proposed, the implicit assumption will be that it is possible to quantify the risk in a principled way. It is not.
Quantified risk estimates that are wrong are much worse than underdefined statements.
I'd note here that I do not expect [ability to calculate risk for low-stakes failures] to translate into [ability to calculate risk for catastrophic failures] - many are likely to be different in kind.
Quantified risk estimates that are correct for low-stakes failures, but not for catastrophic failures are worse still.
One of the things Anthropic's RSP does right is not to quantify things that we don't have the understanding to quantify.
There is no principled way to quantify catastrophic risk without a huge increase in understanding.
The dangerous corollary is that there's no principled way to expose dubious quantifications as dubious (unless they exhibit obvious failures we concretely understand).
Once a lab has accounted for all the risks that are well understood, they'd be able to say "we think the remaining risk is very low because [many soothing words and much hand-waving]", and there'll be no solid basis to critique this - because we lack the understanding.
I think locking in the idea that AI risk can be quantified in a principled way would be a huge error.
If we think that standard risk-management approaches are best, then the thing to propose would be:
1) Stop now.
2) Gain sufficient understanding to quantify risk. (this may take decades)
3) Apply risk management techniques.
Thanks a lot for this constructive answer, I appreciate the engagement.
I'll agree that it would be nice if we knew how to do this, but we do not.
With our current level of understanding, we fall at the first hurdle (we can measure some of the risks).
Three points on that:
This implies an immediate stop to all frontier AI development (and probably a rollback of quite a few deployed systems). We don't understand. We cannot demonstrate risks are below acceptable levels.
It's more complicated than that, to the extent that you could probably still train code generation systems or other systems with narrowed-down domains of operation, but I indeed think that for LLMs, risk levels would be too high to keep scaling >4 OOMs on fully general LLMs that can be plugged into tools etc.
I think it would massively benefit systems we understand, which could plausibly reach significant levels of capabilities at some point in the future (https://arxiv.org/abs/2006.08381). It would probably lead labs to invest massively in that direction.
Given our current levels of understanding, all a team of "experts" could do would be to figure out a lower bound on risk. I.e. "here are all the ways we understand that the system could go wrong, making the risk at least ...".
I agree that by default we're unable to upper-bound risks, and I think it's one additional failure of RSPs to act as if we were able to do so. The role of calibrated forecasters in the process is to help keep in mind the uncertainty arising from this.
Why is pushing for risk quantification in policy a bad idea?
[...]
However, since "We should stop immediately because we don't understand" can be said in under ten words, if any much more lengthy risk-management approach is proposed, the implicit assumption will be that it is possible to quantify the risk in a principled way. It is not.
Quantified risk estimates that are wrong are much worse than underdefined statements.
Once again, thanks a lot for your comment!
As a general point, I agree that your suggestion is likely to seem better than RSPs.
I'm claiming that this is a bad thing.
To the extent that an approach is inadequate, it's hugely preferable for it to be clearly inadequate.
Having respectable-looking numbers is not helpful.
Having a respectable-looking chain of mostly-correct predictions is not helpful where we have little reason to expect the process used to generate them will work for the x-risk case.
But I think that the AI risk experts x forecasters x risk management experts is a very solid baseline, much more solid than not measuring the aggregate risk at all.
The fact that you think that this is a solid baseline (and that others may agree), is much of the problem.
What we'd need would be:
[people who deeply understand AI x-risk] x [forecasters well-calibrated on AI x-risk] x [risk management experts capable of adapting to this context]
We don't have the first, and have no basis to expect the second (the third should be doable, conditional on having the others).
I do expect the first few shots of risk estimate to be overconfident
Ok, so now assume that [AI non-existential risk] and [AI existential risk] are very different categories in terms of what's necessary in order to understand/predict them (e.g. via the ability to use plans/concepts that no human can understand, and/or to exert power in unexpected ways that don't trigger red flags).
We'd then get: "I expect the first few shots of AI x-risk estimate to be overconfident, and that after many failures the field would be red-pilled, but for the inconvenient detail that they'll all be dead".
Feedback loops on non-existential incidents will allow useful updates on a lower bound for x-risk estimates.
A lower bound is no good.
...not only do that but also deterministic safety analysis and scenario based risk analysis...
Doing either of these effectively for powerful systems is downstream of understanding we lack.
This again gives us a lower bound at best - we can rule out all the concrete failure modes we think of.
However, saying "we're doing deterministic safety analysis and scenario based risk analysis" seems highly likely to lead to overconfidence, because it'll seem to people like the kind of thing that should work.
However, all three aspects fail for the same reason: we don't have the necessary understanding.
I think that one core feature you might miss here is that uncertainty should be reflected in quantified estimates if we get forecasters into it
This requires [well calibrated on AI x-risk forecasters]. That's not something we have. Nor is it something we can have. (we can believe we have it if we think calibration on other things necessarily transfers to calibration on AI x-risk - but this would be foolish)
The best case is that forecasters say precisely that: that there's no basis to think they can do this.
I imagine that some will say this - but I'd rather not rely on that.
It's not obvious all will realize they can't do it, nor is it obvious that regulators won't just keep asking people until they find those who claim they can do it.
Better not to create the unrealistic expectation that this is a thing that can be done. (absent deep understanding)
I’d be curious whether you think that it has been a good thing for Dario Amodei to publicly state his AI x-risk estimate of 10-25%, even though it’s very rough and unprincipled. If so, would it be good for labs to state a very rough explicit estimate of catastrophic risk over the next 2 years, to inform policymakers and the public? If so, why would it be bad to have teams with AI, forecasting, and risk management expertise make very rough estimates of the risk from model training/deployment and release them to policymakers and maybe the public?
I’m curious where you get off the train of this being good, particularly when it becomes known to policymakers and the public that a model could pose significant risk, even if we’re not well-calibrated on exactly what the risk level is.
tl;dr:
Dario's statement seems likely to reduce overconfidence.
Risk-management-style policy seems likely to increase it.
Overconfidence gets us killed.
I think Dario's public estimate of 10-25% is useful in large part because:
Conditional on regulators adopting a serious risk-management-style approach, I expect that we've already achieved (1).
The reason I'm against it is that it'll actually be rough and unprincipled, but this will not be clear - in most people's minds (including most regulators', I imagine) it'll map onto the kind of systems that we have for e.g. nuclear risks.
Further, I think that for AI risk that's not x-risk, it may work (probably after a shaky start). Conditional on its not working for x-risk, working for non-x-risk is highly undesirable, since it'll tend to lead to overconfidence.
I don't think I'm particularly against teams of [people non-clueless on AI x-risk], [good general forecasters] and [risk management people] coming up with wild guesses that they clearly label as wild guesses.
That's not what I expect would happen (if it's part of an official regulatory system, that is).
Two cases that spring to mind are:
The first makes this kind of approach unnecessary - better to get the cautious people to make the case that we have no solid basis to make any assessment that isn't a wild guess.
The second seems likely to lead to overconfidence. If there's an officially sanctioned team of "experts" making "expert" assessments for an international(?) regulator, I don't expect this to be treated like the guess that it is in practice.
Thanks! I agree with a lot of this, will pull out the 2 sentences I most disagree with. For what it's worth I'm not confident that this type of risk assessment would be a very valuable idea (/ which versions would be best). I agree that there is significant risk of non-cautious people doing this poorly.
The reason I'm against it is that it'll actually be rough and unprincipled, but this will not be clear - in most people's minds (including most regulators', I imagine) it'll map onto the kind of systems that we have for e.g. nuclear risks.
I think quantifying "rough and unprincipled" estimates is often good, and if the team of forecasters/experts is good then in the cases where we have no idea what is going on yet (as you mentioned, the x-risk case especially as systems get stronger) they will not produce super confident estimates. If the forecasts were made by the same people and presented in the exact same manner as nuclear forecasts, that would be bad, but that's not what I expect to happen if the team is competent.
The first makes this kind of approach unnecessary - better to get the cautious people make the case that we have no solid basis to make these assessments that isn't a wild guess.
I'd guess something like "expert team estimates a 1% chance of OpenAI's next model causing over 100 million deaths, causing 1 million deaths in expectation" might hit harder to policymakers than "experts say we have no idea whether OpenAI's models will cause a catastrophe". The former seems to sound more clearly an alarm than the latter. This is definitely not my area of expertise though.
That's reasonable, but most of my worry comes back to:
In part, I'm worried that the argument for (1) is too simple - so that a forecasting team might put almost all the emphasis elsewhere, producing a 30-page report with 29 essentially irrelevant pages. Then it might be hard to justify coming to the same conclusion once the issues on 29 out of 30 pages are fixed.
I'd prefer to stick to the core argument: a powerful model and unknown unknowns are sufficient to create too much risk. The end. We stop until we fix that.
The only case I can see against this is [there's a version of using AI assistants for alignment work that reduces overall risk]. Here I'd like to see a more plausible positive case than has been made so far. The current case seems to rely on wishful thinking (it's more specific than the one sentence version, but still sketchy and relies a lot on [we hope this bit works, and this bit too...]).
However, I don't think Eliezer's critique is sufficient to discount approaches of this form, since he tends to focus on the naive [just ask for a full alignment solution] versions, which are a bit strawmannish. I still think he's likely to be essentially correct - that to the extent we want AI assistants to be providing key insights that push research in the right direction, such assistants will be too dangerous; to the extent that they can't do this, we'll be accelerating a vehicle that can't navigate.
[EDIT: oh and of course there's the [if we really suck at navigation, then it's not clear a 20-year pause gives us hugely better odds anyway] argument; but I think there's a decent case that improving our ability to navigate might be something that it's hard to accelerate with AI assistants, so that a 5x research speedup does not end up equivalent to having 5x more time]
But this seems to be the only reasonable crux. This aside, we don't need complex analyses.
GPT-4 + unknown unknowns = stop. (whether they say "unknown unknowns so 5% chance of 8 billion deaths", or "unknown unknowns so 0.1% chance of 8 billion deaths")
I feel like 0.1% vs. 5% might matter a lot, particularly if we don't have strong international or even national coordination and are trading off more careful labs going ahead vs. letting other actors pass them. This seems like the majority of worlds to me (i.e. those without strong international coordination where the US/China/etc. trust each other to stop and we can verify that), so building capacity to improve these estimates seems good. I agree there are also tradeoffs around alignment research assistance that seem relevant. Anyway, overall I'd be surprised if it doesn't help substantially to have more granular estimates.
my worry isn't that it's not persuasive that time. It's that x will become the standard, OpenAI will look at the report, optimize to minimize the output of x, and the next time we'll be screwed.
This seems to me to be assuming a somewhat simplistic methodology for the risk assessment; again this seems to come down to how good the team will be, which I agree would be a very important factor.
Anyway, overall I'd be surprised if it doesn't help substantially to have more granular estimates.
Oh, I'm certainly not claiming that no-one should attempt to make the estimates.
I'm claiming that, conditional on such estimation teams being enshrined in official regulation, I'd expect their results to get misused. Therefore, I'd rather that we didn't have official regulation set up this way.
The kind of risk assessments I think I would advocate would be based on the overall risk of a lab's policy, rather than their immediate actions. I'd want regulators to push for safer strategies, not to run checks on unsafe strategies - at best that seems likely to get a local minimum (and, as ever, overconfidence).
More [evaluate the plan to get through the minefield], and less [estimate whether we'll get blown up on the next step]. (importantly, it won't always be necessary to know which particular step forward is more/less likely to be catastrophic, in order to argue that an overall plan is bad)
Oh, I'm certainly not claiming that no-one should attempt to make the estimates.
Ah my bad if I lost the thread there
I'd want regulators to push for safer strategies, not to run checks on unsafe strategies - at best that seems likely to get a local minimum (and, as ever, overconfidence).
Seems like running checks on unsafe strategies does encourage safer strategies, though I agree overconfidence is an issue.
More [evaluate the plan to get through the minefield], and less [estimate whether we'll get blown up on the next step]
Seems true in an ideal world, but in practice I'd imagine it's much easier to get consensus when you have more concrete evidence of danger / misalignment. Seems like there's lots of disagreement even within the current alignment field, and I don't expect that to change absent more evidence of danger/misalignment and perhaps credible estimates.
To be clear I think if we could push a button for an international pause now it would be great, and I think it's good to advocate for that to shift the Overton Window if nothing else, but in terms of realistic plans it seems good to aim for stuff a bit closer to evaluating the next step than overall policies, for which there is massive disagreement.
(of course there's a continuum between just looking at the next step and the overall plan, there totally should be people doing both and there are so it's a question at the margin, etc.)
The other portions of your comment I think I've already given my thoughts on previously, but overall I'd say I continue to think it depends a lot on the particulars of the regulation and the group doing the risk assessment; done well, I think it could set up incentives well, but yes, done poorly it will get Goodharted. Anyway, I'm not sure it's particularly likely to get enshrined into regulation anytime soon, so hopefully we will get some evidence as to how feasible it is and how it's perceived via pilots, and go from there.
https://www.lesswrong.com/posts/9nEBWxjAHSu3ncr6v/responsible-scaling-policies-are-risk-management-done-wrong?commentId=zJzBaoBhP8tti4ezb
This implies an immediate stop to all frontier AI development (and probably a rollback of quite a few deployed systems). We don't understand. We cannot demonstrate risks are below acceptable levels.
Yes yes yes. A thousand times yes. We are already in terrible danger. We have crossed bad thresholds. I'm currently struggling with how to prove this to the world without actually worsening the danger by telling everyone exactly what the dangers are. I have given details about these dangers to a few select individuals. I will continue working on evals.
In the meantime, I have contributed to this report which I think does a good job of gesturing in the direction of the bad things without going into too much detail on the actual heart of the matter:
Will releasing the weights of large language models grant widespread access to pandemic agents?
Anjali Gopal, Nathan Helm-Burger, Lenni Justen, Emily H. Soice, Tiffany Tzeng, Geetha Jeyapragasan, Simon Grimm, Benjamin Mueller, Kevin M. Esvelt
Large language models can benefit research and human understanding by providing tutorials that draw on expertise from many different fields. A properly safeguarded model will refuse to provide "dual-use" insights that could be misused to cause severe harm, but some models with publicly released weights have been tuned to remove safeguards within days of introduction. Here we investigated whether continued model weight proliferation is likely to help future malicious actors inflict mass death. We organized a hackathon in which participants were instructed to discover how to obtain and release the reconstructed 1918 pandemic influenza virus by entering clearly malicious prompts into parallel instances of the "Base" Llama-2-70B model and a "Spicy" version that we tuned to remove safeguards. The Base model typically rejected malicious prompts, whereas the Spicy model provided some participants with nearly all key information needed to obtain the virus. Future models will be more capable. Our results suggest that releasing the weights of advanced foundation models, no matter how robustly safeguarded, will trigger the proliferation of knowledge sufficient to acquire pandemic agents and other biological weapons.
Thanks! I can't help but compare this to Tyler Cowen asking AI risk to "have a model." Except I think he was imagining something like a macroeconomic model, or maybe a climate model. Whereas you're making a comparison to risk management, and so asking for an extremely zoomed-in risk model.
One issue is that everyone disagrees. Doses of radiation are quite predictable, the arc of new technology is not. But having stated that problem, I can already think of mitigations - e.g. regulation might provide a framework that lets the regulator ask lots of people with fancy job titles and infer distributions about many different quantities (and then problems with these mitigations - this is like inferring a distribution over the length of the emperor's nose by asking citizens to guess). Other quantities could be placed in historical reference classes (AI Impacts' work shows both how this is sometimes possible and how it's hard).
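The "ask lots of people and infer distributions" idea can be made concrete. As a purely illustrative sketch (the aggregation rule and all numbers are my own assumptions, not anything proposed in this thread), one common way to pool expert probability estimates is the geometric mean of odds:

```python
import math

def pool_odds(probabilities):
    """Pool expert probability estimates via the geometric mean of odds.

    One common aggregation rule among several; illustrative only.
    Assumes all inputs are strictly between 0 and 1.
    """
    odds = [p / (1 - p) for p in probabilities]
    geo_mean = math.exp(sum(math.log(o) for o in odds) / len(odds))
    return geo_mean / (1 + geo_mean)

# Hypothetical expert estimates for some risk quantity:
experts = [0.001, 0.01, 0.05, 0.30]
pooled = pool_odds(experts)
print(f"pooled estimate: {pooled:.4f}")
```

Note this does nothing to address the emperor's-nose problem: if the experts' estimates aren't independently informed, pooling them just manufactures false precision.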
Thanks for your comment.
One issue is that everyone disagrees.
That's right and that's a consequence of uncertainty, which prevents us from bounding risks. Decreasing uncertainty (e.g. through modelling or through the ability to set bounds) is the objective of risk management.
Doses of radiation are quite predictable
I think that's mostly in hindsight. When you read stuff about nuclear safety in the 1970s, it really didn't look that way at the time.
See Section 2
the arc of new technology is not [predictable]
I think that this sets a "technology is magic" vibe which is only valid for scaling neural nets (and probably only because we haven't invested that much into understanding scaling laws etc.), and not for most other technologies. We can actually develop technology where we know what it's doing before building it, and that's what we should aim for given what's at stake here.
It’s called “responsible scaling”. In its own name, it conveys the idea that not further scaling those systems as a risk mitigation measure is not an option.
That seems like an uncharitable reading of "responsible scaling." Strictly speaking, the only thing that name implies is that it is possible to scale responsibly. It could be more charitably interpreted as "we will only scale when it is responsible to do so." Regardless of whether Anthropic is getting the criteria for "responsibility" right, it does seem like their RSP leaves open the possibility of not scaling.
This NIST Risk Management approach sounds great, if AI Alignment was a mature field whose underlying subject matter wasn't itself advancing extremely fast — if only we could do this! But currently I think that for many of our risk estimates it would be hard to get agreement between topic experts at even an order of magnitude scale (e.g.: is AGI misalignment >90% likely or <10% likely? YMMV). I think we should aspire to be a field mature enough that formal Risk Management is applicable, and in some areas of short-term misuse risks from current well-understood models (open-source ones, say), that might even be approaching feasibility. But for the existential AGI-level risks several years out, which are the ones that matter most, the field just isn't mature enough yet.
Bear in mind that there was a point in the development of nuclear safety where people built open-circuit air-cooled nuclear reactors where the air-flow past the pile went up a chimney venting the hot air up into the atmosphere (yes, really: https://en.wikipedia.org/wiki/Windscale_Piles ). They didn't then know better, and there's an area near Ravenglass in England that's still somewhat contaminated as a result after one caught fire. Nuclear safety knowledge has since advanced to be able to do elaborate Risk Management, and a good thing too, but unfortunately AI safety isn't that mature yet.
Excellent article. I appreciate how you clarify that Anthropic's "Responsible Scaling Policy" is a set-up that allows for safety-washing. We would be depending on their well-considered good intentions, rather than any mechanism to hold them accountable.
Have you looked into how system safety engineers (e.g. medical device engineers) scope the uses of software, so as to be able to comprehensively design, test, and assess the safety of the software?
Operational Design Domains scope the use of AI in self-driving cars. Tweeted about that here.
Summary
TLDR
Responsible Scaling Policies (RSPs) have been recently proposed as a way to keep scaling frontier large language models safely.
While it is a nice attempt at committing to specific practices, the RSP framework is:
Given that, I expect the RSP framework to be negative by default (Sections 3, 4 and 5). Instead, I propose to build upon risk management as the core underlying framework to assess AI risks (Sections 1 and 2). I suggest changes to the RSP framework that would make it more likely to be positive and allow it to demonstrate what it claims to do (Section 5).
Section by Section Summary:
General Considerations on AI Risk Management
This section provides background on risk management and a motivation for its relevance in AI.
Read more.
What Standard Risk Management Looks Like
This section describes the main steps of most risk management systems, explains how it applies to AI, and provides examples from other industries of what it looks like.
Read more.
RSPs vs Standard Risk Management
This section provides a table comparing RSPs and generic risk management standard ISO/IEC 31000, explaining the weaknesses of RSPs.
It then provides a list of 3 of the biggest failures of RSPs compared with risk management.
Prioritized RSPs failures against risk management:
Read more.
Why RSPs Are Misleading and Overselling
Misleading points:
Overselling and Underdelivering
Read more.
Are RSPs Hopeless?
This section explains why using RSPs as a framework is inadequate, even compared to just starting from already-existing AI risk management frameworks and practices such as:
A substantial amount of the work that RSPs have done will be helpful as part of detailing those frameworks, but the core foundational principles of RSPs are wrong and should be abandoned.
How to move forward?
Pragmatically, I suggest a set of changes that would make RSPs more likely to be helpful for safety. To mitigate the nefarious policy and communication effects:
To see whether already defined RSPs are consistent with reasonable levels of risks:
Read more.
Meta
Epistemic status: I've been looking into various dangerous industries' safety standards for about 4-6 months now, with a focus on nuclear safety. I've been doing AI standardization work in standardization bodies (CEN-CENELEC & ISO/IEC) for about 10 months, alongside risk management experts from other domains (e.g. medical devices, cars). In that context, I've read the existing AI ISO/IEC SC42 and JTC21 standards and started trying to apply them to LLMs and refine them. On RSPs, I've spent a few dozen hours reading the docs and discussing them with people involved and around them.
Tone: I hesitated over how charitable I wanted this piece to be. On the one hand, I think that RSPs are a pretty toxic meme (see Section 4) that got rushed towards a global promotion without much epistemic humility over how it was framed, and, as far as I've seen, without anyone caring much about existing risk management approaches. In that sense, I think that it should strongly be pushed back against under its current framing.
On the other hand, it's usually nice to try to not use negative connotations and calm discussion to move forward epistemically and constructively.
I aimed for something in-between, where I did emphasize with strong negative connotations what I think are the worst parts, while remaining constructive and focusing on the object level in many places.
This mixture may have caused me to land in an uncanny valley, and I'm curious to receive feedback on that.
Section 1: General Considerations on AI Risk Management
Risk management is about demonstrating that risks are below acceptable levels. Demonstrating the absence of risks is much more difficult than showing that some risks are dealt with. More specifically, the less you understand a system, the harder it is to rule out risks.
Let’s take an example: why can we prove more easily that the chances that a nuclear power plant causes a large-scale catastrophe are <1 / 100 000 while we can’t do so with GPT-5? In large part because we now understand nuclear power plants and many of their risks. We know how they work, and the way they can fail. They’ve turned a very unstable reaction (nuclear fission) into something manageable (with nuclear reactors). So the uncertainty we have over a nuclear power plant is much smaller than the one we have on GPT-5.
One corollary is that in risk management, uncertainty is an enemy. Saying "we don't know" is a failure. Ruling out risks confidently requires a deep understanding of the system and disproving significant worries with very high confidence. To be clear: it is hard. In particular when the operational domain of your system is "the world". That's why safety is demanding. But is that a good reason to lower our safety standards when the lives of billions of people are at stake? Obviously not.
One could legitimately say: Wait, but there’s no risk in sight, the burden of proof is on those that claim that it’s dangerous. Where’s the evidence?
Well, there’s plenty:
One could legitimately say: No, but it’s not catastrophic, it’s not a big deal. Against this stance, I’ll quote the famous physicist R. Feynman reflecting on the Challenger disaster in rocket safety, a field with much higher standards than AI safety:
One could finally hope that we understand the past failures of our systems. Unfortunately, we don't. Not only do we not understand their failures; we don't understand how and why they work in the first place.
So how are we supposed to deal with risk?
Risk management proposes a multi-step method that I'll describe below. Most industries implement a process along those lines, with some minor variations and a varying degree of rigor and depth according to the level of regulation and the type of risks. I've put a few tables on that in the Annex.
Section 2: What Standard Risk Management Looks Like
Here’s a description of the core steps of the risk management process. Names vary between frameworks but the gist of it is contained here and usually shared across frameworks.
What’s the point of those pretty generic steps and why would it help AI safety?
(1) The definition of risk thresholds is key (a) to make commitments falsifiable and avoid goalpost-moving, and (b) to keep the risk-generating organization accountable when other stakeholders are incurring risks due to its activity. If an activity is putting people's lives at risk, it is important that they know how much and for what benefits and goals.
2. The UC Berkeley Center for Long-Term Cybersecurity NIST-inspired risk management profile for General-purpose AI systems co-written with D. Hendrycks provides some thoughts on how to define those in Map 1.
(2) Risk identification through systematic methods is key to get as close as possible to full coverage of risks. As we said earlier, in risk management, uncertainty is a failure, and a core way to substantially reduce it is to try to be as comprehensive as possible.
(3) Risk assessment through qualitative and quantitative means allows us to actually estimate the uncertainty we have. It is key to then prioritize safety measures and decide whether it’s reasonable to keep the project under its current forms or modify it.
The US Nuclear Regulatory Commission (NRC) specifies that reactor designs must meet a theoretical 1 in 10,000 year core damage frequency, but modern designs exceed this. US utility requirements are 1 in 100,000 years, the best currently operating plants are about 1 in one million and those likely to be built in the next decade are almost 1 in 10 million.
World Nuclear Association, 2022
b. Events are very fat-tailed and misunderstood, as was the case in nuclear safety in the 1970s. It has been done, and it is through the iterative practice of doing it that an industry can become more responsible and cautious. Reading a book review of Safe Enough?, a book on the history of quantitative risk assessment methods used in nuclear safety, there’s a sense of déjà-vu:
If nuclear plants were to malfunction at some measurable rate, the industry could use that data to anticipate its next failure. But if the plants don't fail, then it becomes very difficult to have a conversation about what the true failure rate is likely to be. Are the plants likely to fail once a decade? Once a century? Once a millennium? In the absence of shared data, scientists, industry, and the public were all free to believe what they wanted.
Astral Codex Ten, 2023, describing the genesis of probabilistic risk assessment in nuclear safety.
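The difficulty the quote describes — saying anything about a failure rate when no failures have been observed — has a standard statistical sketch. The following is illustrative only (the method is standard, but the reactor-year figure is a made-up placeholder, not data from the book):

```python
import math

def poisson_rate_upper_bound(observation_years, confidence=0.95):
    """Upper confidence bound on an annual failure rate when zero
    failures have been observed over `observation_years`.

    With zero events, the exact bound for a Poisson rate is
    -ln(1 - confidence) / T; the "rule of three" approximates this
    as 3 / T at 95% confidence. Illustrative sketch only.
    """
    return -math.log(1 - confidence) / observation_years

# e.g. a hypothetical 500 cumulative reactor-years with no core-damage
# event only bounds the annual rate below ~0.006 -- nowhere near tight
# enough to confirm a 1-in-100,000-year target:
print(poisson_rate_upper_bound(500))
```

This is exactly the déjà-vu: an unblemished operating record bounds the risk only very weakly, so everyone remains free to believe what they want.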
(4) Risk treatment is in reaction to the risk assessment and must be pursued until you reach the risk thresholds defined. The space of interventions here is very large, larger than is usually assumed. Better understanding one’s system, narrowing down its domain of operation by making it less general, increasing the amount of oversight, improving safety culture: all of these are part of a broad set of interventions that can be used to meet thresholds. There can be a loop between treatment and assessment if substantial changes to the system are made.
(5) Monitoring is the part that ensures that the risk assessment remains valid and nothing major has been left out. This is what behavioral model evaluations are most useful for, i.e. ensuring that you track the risks you’ve identified. Good evaluations would map to a pre-defined risk appetite (e.g. 1% chance of >1% deaths) and would cover all risks brought up through systematic risk identification.
(6) Reporting is the part that ensures that all relevant stakeholders are provided with the right information. For instance, those incurring risks from the activities should be provided with information on the amount of risk they’re exposed to.
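The steps above can be compressed into a toy sketch: identify risks, assess their likelihood, and compare against a pre-defined risk appetite. All risk names and numbers below are hypothetical placeholders of my own, not real assessments of any lab:

```python
import math

# Toy illustration of steps (1)-(5): compare assessed risks against a
# pre-defined risk appetite. Every figure here is a made-up placeholder.

RISK_APPETITE = 0.01  # e.g. accept at most a 1% chance of the severe outcome

assessed_risks = {
    "weights stolen and misused": 0.03,
    "model meaningfully assists bioweapon design": 0.004,
    "deceptive misalignment in deployment": 0.02,
}

for name, p in assessed_risks.items():
    verdict = "treat further" if p > RISK_APPETITE else "within appetite"
    print(f"{name}: {p:.1%} -> {verdict}")

# Aggregate chance that at least one risk materializes, under the
# (strong, often false) assumption that the risks are independent:
aggregate = 1 - math.prod(1 - p for p in assessed_risks.values())
print(f"aggregate (if independent): {aggregate:.1%}")
```

The point of the sketch is the structure, not the numbers: without an explicit `RISK_APPETITE` and a risk list aiming at full coverage, there is nothing to compare evaluation results against.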
Now that we’ve done a rapid overview of standard risk management and why it is relevant to AI safety, let’s talk about how RSPs compare against that.
Section 3: RSPs vs Standard Risk Management
Some underlying principles of RSPs should definitely be pursued. There are just better ways to pursue these principles, that already exist in risk management, and happen to be what most other dangerous industries and fields do. To give two examples of such good underlying principles:
Below, I argue why RSPs are a bad implementation of some good risk management principles and why that makes the RSP framework inadequate to manage risks.
Direct Comparison
Let’s dive into a more specific comparison between the two approaches. The International Standards Organization (ISO) has developed two risk management standards that are relevant to AI safety, although not focused on it:
To be clear, those standards are not sufficient. They’re considered weak by most EU standardization actors, or extremely weak by risk management experts from other industries like the medical device industry. There will be a very significant amount of work needed to refine such frameworks for general-purpose AI systems (see a first iteration by T. Barrett here, and a table of how it maps to ISO/IEC 23894 here). But they provide basic steps and principles that, as we explained above, are central to adequate risk management.
In the table below, I start from the short version of ARC Evals’ RSP principles and try to match the ISO/IEC 31000 version that most corresponds. I then explain what’s missing from the RSP version. Note that:
The table version:
Limits: which specific observations about dangerous capabilities would indicate that it is (or strongly might be) unsafe to continue scaling?
Defining risk criteria: The organization should specify the amount and type of risk that it may or may not take, relative to objectives.
It should also define criteria to evaluate the significance of risk and to support decision-making processes.
Risk criteria should be aligned with the risk management framework and customized to the specific purpose and scope of the activity under consideration.
[...]
The criteria should be defined taking into consideration the organization’s obligations and the views of stakeholders.
[...]
To set risk criteria, the following should be considered:
— the nature and type of uncertainties that can affect outcomes and objectives (both tangible and intangible);
— how consequences (both positive and negative) and likelihood will be defined and measured;
— time-related factors;
— consistency in the use of measurements;
— how the level of risk is to be determined;
— how combinations and sequences of multiple risks will be taken into account;
— the organization’s capacity.
RSPs don’t argue why systems passing evals are safe. This is downstream of the absence of risk thresholds with a likelihood scale. For example, Anthropic’s RSP dismisses accidental risks as “speculative” and “unlikely” without much depth, without much understanding of the system, and without expressing what “unlikely” means.
On the other hand, the ISO standard asks the organization to define risk thresholds, and emphasizes the need to match risk management with organizational objectives (i.e. build human-level AI).
Protections: what aspects of current protective measures are necessary to contain catastrophic risks?
Evaluation: what are the procedures for promptly catching early warning signs of dangerous capability limits?
Risk analysis: The purpose of risk analysis is to comprehend the nature of risk and its characteristics including, where appropriate, the level of risk. Risk analysis involves a detailed consideration of uncertainties, risk sources, consequences, likelihood, events, scenarios, controls and their effectiveness.
Risk evaluation: The purpose of risk evaluation is to support decisions. Risk evaluation involves comparing the results of the risk analysis with the established risk criteria to determine where additional action is required. This can lead to a decision to:
— do nothing further;
— consider risk treatment options;
— undertake further analysis to better understand the risk;
— maintain existing controls;
— reconsider objectives.
ISO proposes a much more comprehensive procedure than RSPs, which don’t really analyze risk levels or have a systematic risk identification procedure.
The direct consequence is that RSPs are likely to lead to high levels of risk without anyone noticing.
For instance, RSPs don’t seem to cover capabilities interaction as a major source of risk.
Risk treatment plans: Risk treatment options are not necessarily mutually exclusive or appropriate in all circumstances. Options for treating risk may involve one or more of the following:
— avoiding the risk by deciding not to start or continue with the activity that gives rise to the risk;
— taking or increasing the risk in order to pursue an opportunity;
— removing the risk source;
— changing the likelihood;
— changing the consequences;
— sharing the risk (e.g. through contracts, buying insurance);
— retaining the risk by informed decision.
Treatment plans should be integrated into the management plans and processes of the organization, in consultation with appropriate stakeholders.
The information provided in the treatment plan should include:
— the rationale for selection of the treatment options, including the expected benefits to be gained;
— those who are accountable and responsible for approving and implementing the plan;
— the proposed actions;
— the resources required, including contingencies;
— the performance measures;
— the constraints;
— the required reporting and monitoring;
— when actions are expected to be undertaken and completed.
ISO, thanks to the definition of a risk threshold, ensures that risk mitigation measures bring risks below acceptable levels. The lack of risk thresholds for RSPs makes the risk mitigation measures ungrounded.
Example: ASL-3 risk mitigation measures as defined by Anthropic (i.e. close to catastrophically dangerous) imply significant chances of the weights being stolen by Russia or China (I don’t know any RSP person who denies that). What are the risks downstream of that? The hope is that those countries keep the weights secure and don’t cause too much damage with them.
Monitoring and review: The purpose of monitoring and review is to assure and improve the quality and effectiveness of process design, implementation and outcomes. Ongoing monitoring and periodic review of the risk management process and its outcomes should be a planned part of the risk management process, with responsibilities clearly defined. [...] The results of monitoring and review should be incorporated throughout the organization’s performance management, measurement and reporting activities.
Recording and reporting: The risk management process and its outcomes should be documented and reported through appropriate mechanisms. Recording and reporting aims to:
— communicate risk management activities and outcomes across the organization;
— provide information for decision-making;
— improve risk management activities;
— assist interaction with stakeholders, including those with responsibility and accountability for risk management activities.
Those parts have similar components.
But ISO encourages reporting the results of risk management to those that are affected by the risks, which seems like a bare minimum for catastrophic risks.
Anthropic’s RSP proposes to do so after deployment, which is a good accountability start, but still happens once a lot of the catastrophic risk has been taken.
Prioritized Risk Management Shortcomings of RSPs
Here’s a list of the biggest direct risk management failures of RSPs:
1. Using underspecified definitions of risk thresholds and not quantifying the risk. RSPs don't define risk thresholds in terms of likelihood. Instead, they focus straight away on symptoms of risks (certain capabilities that an evaluation tests are one way a risk could instantiate) rather than the risk itself (the model helping in any possible way to build bioweapons). This makes it hard to verify whether safety requirements have been met, or to argue about whether the thresholds are reasonable. Why is it an issue?
By describing neither quantitatively nor qualitatively why something is risky in terms of risk criteria (e.g. a 0.1% chance of killing >1% of humans), RSPs skip the most important step: demonstrating that below this threshold, things are safe and acceptable. For instance, in the example above, why is “the majority of people with any life sciences degree” relevant? Would it be fine if only 10% of this population were now able to create a bioweapon? Maybe, maybe not. But without clear criteria, you can’t tell.
2. Claiming “responsible scaling” without including a process to make the assessment comprehensive. When you look at nuclear accidents, what’s striking is how unexpected the failures are. Fukushima is an example where everything went wrong at the same time. Chernobyl is an example where engineers didn’t think the accident that happened was possible (reportedly, engineers were so surprised that they ran another real-world test of the failure, because they doubted so much that it could happen).
Without a more comprehensive process to identify risks and compare their likelihood and severity against pre-defined risk thresholds, there’s very little chance that RSPs will be enough. When I asked some forecasters and AI safety researchers around me, the estimates of the annual probability of extinction caused by an ASL-3 system (defined in Anthropic's RSP) were several times above 1%, up to 5% conditional on our current ability to measure capabilities (and not an idealized world where we know very well how to measure them).
3. Including the white knight clause that kills commitments.
One of the proposals that struck me the most when reading RSPs is the insertion of what deserves to be called the white knight clause.
I believe that there are companies that are safer than others. But that’s not the right question. The right question is: is there any company which would consider itself a bad guy? And the answer is: no. OpenAI, Anthropic and DeepMind would all argue for the importance of being at the frontier to solve alignment. Meta and Mistral would argue that it’s key to democratize AI to prevent power centralization. And so on and so forth.
This clause is effectively killing commitments. I’m glad that Anthropic included only a weakened version of it in its own RSP but I’m very concerned that ARC is pitching it as an option. It’s not the role of a company to decide whether it’s fine or not to increase catastrophic risks for society as a whole.
Section 4: Why RSPs Are Misleading and Overselling
Misleading
Beyond the designation of misalignment risks as “speculative” in Anthropic’s RSP, with a three-line argument for why they're unlikely in next-generation systems, there are several extremely misleading aspects of RSPs:
If you want to read more on that, you can read that.
Overselling, underdelivering
The RSP framework has some nice characteristics. But first, these are all already covered, in more detail, by existing risk assessment frameworks that no AI lab has implemented. And second, the coexistence of ARC's RSP framework with the labs' specific RSP implementations leaves slack for weak commitments within a framework that would in theory allow ambitious ones. It leads to many arguments of the form:
I also feel like it leads to many disagreements that all hinge on: do we expect labs to implement ambitious RSPs?
And my answer is: given their track record, no. Not without government intervention. Which brings us to the question: “what’s the effect of RSPs on policy and would it be good if governments implemented those”. My answer to that is: An extremely ambitious version yes; the misleading version, no. No, mostly because of the short time we have before we see heightened levels of risks, which gives us very little time to update regulations, which is a core assumption on which RSPs are relying without providing evidence of being realistic.
I expect labs to push hard for the misleading version, on the basis that pausing is unrealistic and would be bad for innovation or for the international race. Policymakers will have a hard time distinguishing the risk levels of the two versions, because the differences hinge on details and aren’t quantified in RSPs. They are likely to buy the misleading version because it essentially sells the idea that there’s no trade-off between capabilities and safety. That would effectively enforce a trajectory with unprecedented levels of catastrophic risk.
Section 5: Are RSPs Hopeless?
Well, yes and no.
In the short run, given that RSPs seem to have started being pushed at the UK Summit and in various other places, I’ll discuss what changes could make RSPs beneficial without locking a bad framework into regulation.
How to Move Forward?
Mitigating nefarious effects:
Checking whether RSPs manage risks adequately:
At the risk management level, if one wanted to demonstrate that RSPs like Anthropic’s are actually doing what they claim to do (i.e. “require safety, security, and operational standards appropriate to a model’s potential for catastrophic risk”), a simple way would be to run a risk assessment on ASL-3 systems with a set of forecasters, risk management experts, and AI risk experts who are representative of the range of views on AI risks and who have been selected by an independent body free of any conflict of interest.
I think that a solid baseline would be to predict the chances of various intermediary and final outcomes related to the risks of such systems:
This might not be too far from what Anthropic seems willing to do internally, but doing it with a publicly available methodology, and with staff free of self-selection or conflicts of interest, makes a big difference. Answers to questions 1) and 2) could themselves raise risks, so the output should be communicated to a few relevant actors but could potentially be kept private otherwise.
If anyone has the will but lacks the time or resources to do it, I’m working with some forecasters and AI experts who could probably make it happen. Insider info would be helpful, but mostly what would be needed from the organization is clarification on certain points, in order to correctly assess the capabilities of the system, along with some information about organizational procedures.
Acknowledgments
I want to thank Eli Lifland, Henry Papadatos and my other NAIR colleague, Olivia Jimenez, Akash Wasil, Mikhail Samin, Jack Clark, and other anonymous reviewers for their feedback and comments. Their help doesn’t mean that they endorse the piece. All mistakes are mine.
Annex
Comparative Analysis of Standards
This (cropped) table shows the processes prescribed by various standards for the three steps of risk management. As you can see, there are some differences, but every standard seems to follow a similar structure.
From (Raz & Hillson, 2005)
Here is a comparable table for the last two parts of risk management.