ARC Evals: Responsible Scaling Policies

Zach Stein-Perlman

LESSWRONG
LW

ARC Evals: Responsible Scaling Policies — LessWrong

40 ARC Evals: Responsible Scaling Policies

by Zach Stein-Perlman

28th Sep 2023

3 min read

40

This is a linkpost for https://evals.alignment.org/blog/2023-09-26-rsp/

We’ve been consulting with several parties¹ on responsible scaling² policies (RSPs). An RSP specifies what level of AI capabilities an AI developer is prepared to handle safely with their current protective measures, and conditions under which it would be too dangerous to continue deploying AI systems and/or scaling up AI capabilities until protective measures improve.
We think RSPs are one of the most promising paths forward for reducing risks of major catastrophes from AI. We’re excited to advance the science of model evaluations to help labs implement RSPs that reliably prevent dangerous situations, but aren’t unduly burdensome and don’t prevent development when it’s safe.
This page will explain the basic idea of RSPs as we see it, then discuss:

Why we think RSPs are promising. In brief (more below):
A pragmatic middle ground. RSPs offer a potential middle ground between (a) those who think AI could be extremely dangerous and seek things like moratoriums on AI development, and (b) who think that it’s too early to worry about capabilities with catastrophic potential. RSPs are pragmatic and threat-model driven: rather than arguing over the likelihood of future dangers, we can implement RSPs that commit to measurement (e.g., evaluations) and empirical observation - pausing deployment and/or development if specific dangerous AI capabilities emerge, until protective measures are good enough to handle them safely.
Knowing which protective measures to prioritize. RSPs can help AI developers move from broad caution-oriented principles to specific commitments, giving a framework for which protective measures (information security; refusing harmful requests; alignment research; etc.) they need to prioritize to safely continue development and deployment.
Evals-based rules and norms. In the longer term, we’re excited about evals-based AI rules and norms more generally: requirements that AI systems be evaluated for dangerous capabilities, and restricted when that’s the only way to contain the risks (until protective measures improve). This could include standards, third-party audits, and regulation. Voluntary RSPs can happen quickly, and provide a testbed for processes and techniques that can be adopted to make future evals-based regulatory regimes work well.
What we see as the key components of a good RSP, with sample language for each component. In brief, we think a good RSP should cover:
Limits: which specific observations about dangerous capabilities would indicate that it is (or strongly might be) unsafe to continue scaling?
Protections: what aspects of current protective measures are necessary to contain catastrophic risks?
Evaluation: what are the procedures for promptly catching early warning signs of dangerous capability limits?
Response: if dangerous capabilities go past the limits and it’s not possible to improve protections quickly, is the AI developer prepared to pause further capability improvements until protective measures are sufficiently improved, and treat any dangerous models with sufficient caution?
Accountability: how does the AI developer ensure that the RSP’s commitments are executed as intended; that key stakeholders can verify that this is happening (or notice if it isn’t); that there are opportunities for third-party critique; and that changes to the RSP itself don’t happen in a rushed or opaque way?
Adopting an RSP should be a strong and reliable signal that an AI developer will in fact identify when it’s too dangerous to keep scaling up capabilities, and react appropriately.

I'm excited about labs adopting RSPs for several reasons:

Making commitments and planning-in-advance is good for safety
Making commitments is good for coordination and helping other labs act well
Making commitments is good for transparency and helping outsiders predict what labs will do
Making safety commitments conditional on risks is nice — everyone, including people who think risk is very small, should be able to agree on warning signs that should lead to stronger safety practices or pausing scaling

Possible discussion on twitter here and here.

Responsible Scaling PoliciesAI GovernanceMETR (org)AI

Frontpage

40

ARC Evals: Responsible Scaling Policies

New Comment

10 comments, sorted by

top scoring

Click to highlight new comments since: Today at 9:28 PM

[-]Zach Stein-Perlman2y164

Habryka is pessimistic:

Ok, but, what protective measures are we talking about that are supposed to help with things like deceptive alignment and treacherous turn concerns? These seem to me like the primary ones driving risk, and we have little to no traction on even identifying these dynamics.
The natural result of this kind of strategy seems to me that you drive AI directly towards seeming less dangerous until you scaled them a lot, or they are proliferated enough that a takeover is easy, and then you die. This pretends we have more traction on alignment than we do.
The key component of this kind of strategy is to define a clear threshold which causes an approximately fully stop of AI development, and that threshold cannot be goodhearted or gamed. I don't see how the current "Responsible Scaling Policy" helps with that very much.
Despite the blogpost somewhat implying otherwise, I can't get over this feeling that the central story this is setting itself up as is "we will define some misalignment metrics, then we will make our AIs meet those metrics, then the AIs will be aligned".
Whereas the actual story that seems most likely to occur when you aim for that is "we will define some misalignment metrics, then we will goodheart those metrics or make our AIs deceptively aligned, then we scale them up a lot, then we have an AI takeover".

I hope to say more soon but I am optimistic that we can find thresholds below dangerous levels that "cannot be goodhearted or gamed" too much.

[-]Charlie Steiner2y4-2

I mean, the good news is that I expect that by focusing on evaluating capability of predictively trained models, (and potentially by doing more research on latent adversarial attacks on more agenty models), problems of deceptive alignment can be sidestepped in practice.

My worry is more along the lines of actually getting people to not build the cool next step AI that's tantalizingly over the horizon. Government may be a mechanism for enforcing some measure of institutional buy-in, though.

[-]Joe Collman2y*107

It's good news if you're right.
If you and others expect this and are wrong, it's very bad news.
Overconfidence in plausible-to-many schemes is one of the main ways we die.

I expect that by ... problems of deceptive alignment can be sidestepped in practice.

Why do you expect this? Is this a ~70% 'expect' or a ~99% 'expect'?
Do you expect capability evaluations to be robust (i.e. no false negatives)? Why?
Do you expect it'll be clear when we might start getting false negatives? Why?

What do you mean by "deceptive alignment" exactly? Does your definition cover every case of [this thing looked ~~aligned~~ [EDIT: safe] according to tests we believed were thorough, but later it killed us anyway]? If not, there's little comfort in knowing that the thing that killed us didn't meet some particular definition of deceptive alignment.

I can understand hoping that we'd sidestep most cases of deceptive alignment this way.
It's not clear to me how you get to expecting we'll avoid them all.

(And, of course, "We expect this will be ok" is not the bar to shoot for - though I imagine we all agree on that).

[-]Charlie Steiner2y52

Why do you expect this? Is this a ~70% 'expect' or a ~99% 'expect'?

Let's say 95%.

But I should probably clarify what "this" I'm supposed to be expecting here.

If you're trying to notice when AIs are dangerous in different ways by evaluating their capabilities, you face a thorny problem that you're probably way worse at eliciting capabilities than the aggregate effort of the public. And you're probably also less creative about ways that AIs can be dangerous. And there might be ecosystem effects, where AIs used together might be complementary in ways that bypass restrictions on single AIs.

My claim is that the AI deliberately deceiving you about its capabilities is not one of those thorny problems, in the current paradigm.

Do you expect capability evaluations to be robust (i.e. no false negatives)? Why?

I expect that we can make a false positive / false negative tradeoff. However, this is practically guaranteed to make correctly calibrated standards seem overly conservative to human intuition. This has obvious problems for getting people bought into "RSPs" that actually do their job.

Do you expect it'll be clear when we might start getting false negatives? Why?

I now suspect you're asking "do I expect it'll be clear when we start getting false negatives because the AI is situationally aware and is deceiving us." The previous answer should be interpreted as being about false negatives due to other issues as well.

I think that evaluating capabilities of an AI trained on predictive loss is so hard to deliberately deceive us about that we would be toast for other reasons before we ran any tests.

I can certainly imagine possible architectures that would have enough control over their internals to be able to sandbag on a test purely by knowing it was coming, but although someone would probably try to build AI like this, it does seem like the sort of thing you would notice, barring human deception to get around irksome safety oversight.

What do you mean by "deceptive alignment" exactly? Does your definition cover every case of [this thing looked aligned according to tests we believed were thorough, but later it killed us anyway]?

I'm unsure what you're getting at. I think "testing for alignment" isn't a very good goal. To succeed at AI alignment we should know ahead of time what we want the AI to be doing, and perhaps if this can be broken down into pieces we can test that the AI is doing each piece, and that would count as "testing for alignment." But I don't think there's some useful alignment test that's easier than knowing how to align an AI.

[-]Joe Collman2y10

I expect that we can make a false positive / false negative tradeoff.

So here you're talking about situations where a false negative doesn't have catastrophic consequences? Do you have reasons to believe catastrophic consequences are extremely unlikely beyond something like "it's not clear to me how catastrophic consequences would happen"?

I agree we can find a reasonable balance (which will appear too conservative to most) to address non-catastrophic issues - but this doesn't seem to help much. To the extent that it creates a false sense of security for more powerful systems, that's bad.

E.g. this kind of language seems tailor-made for a false sense of security:

At some point, more than one AI developer will have strong, practical, field-tested RSPs. Their practices (and the people who have been implementing them) will then become immensely valuable resources for broader evals-based rules and norms.

Everything's fine: we have "strong, practical, field-tested RSPs"!
(of course people shouldn't conclude that everything's fine - but they could be forgiven for asking why we weren't much clearer about the inadequacy of such RSPs ahead of time)

I think that evaluating capabilities of an AI trained on predictive loss is so hard to deliberately deceive us

We don't need "deliberately deceive us" for something as bad as deceptive alignment.
We only need to be robustly mistaken, not deliberately deceived.

A model doing something equivalent to this (though presumably not mechanistically like this) will do:

If (have influence over adequate resources)
- Search widely for ways to optimize for x.
Else
- Follow this heuristic-for-x that produces good behaviour on the training and test set.

So long as x isn't [target that produces aligned behaviour], and a sufficiently powerful system eventually gets [influence over adequate resources], but was never in this situation in training, that's enough for us to be screwed. (of course I expect something like [both "have influence over adequate resources" and "find better ways to aim for x" to be processes that are used at lower levels by heuristic-for-x])

I note in passing that extreme optimization for [accurately predict next token] does destroy the world (even in the myopic case, with side channels) - so in this case it's not as though a bad outcome hinges on the model optimizing for something other than that for which it was trained.

I stress that this example is to illustrate that [we are robustly mistaken] does not require [there is deliberate deception]. I don't claim an example of this particular form is probable.

I do claim that knowing there's no deliberate deception is insufficient - and more generally that we'll fail to think of all the tests that would be necessary. (absent fundamental breakthroughs)

Are you 95% on [by ... we can sidestep all problems of the form [this thing looked ~~aligned~~ safe according to tests we believed were thorough, but later it killed us anyway]]? (oh I meant 'safe' the first time, not 'aligned', apologies)
If not, sidestepping deception doesn't look particularly important.
If so, I remain confused by your level of confidence.

[-]Charlie Steiner2y30

So here you're talking about situations where a false negative doesn't have catastrophic consequences?

No, we'll have to make false positive / false negative tradeoffs about ending the world as well. We're unlucky like that.

I agree that false sense of security / safetywashing is a potential use of this kind of program.

A model doing something equivalent to this (though presumably not mechanistically like this) will do:
If (have influence over adequate resources)
Search widely for ways to optimize for x.
Else
Follow this heuristic-for-x that produces good behaviour on the training and test set.
So long as x isn't [target that produces aligned behaviour], and a sufficiently powerful system eventually gets [influence over adequate resources], but was never in this situation in training, that's enough for us to be screwed. (of course I expect something like [both "have influence over adequate resources" and "find better ways to aim for x" to be processes that are used at lower levels by heuristic-for-x])
I note in passing that extreme optimization for [accurately predict next token] does destroy the world (even in the myopic case, with side channels) - so in this case it's not as though a bad outcome hinges on the model optimizing for something other than that for which it was trained.
I stress that this example is to illustrate that [we are robustly mistaken] does not require [there is deliberate deception]. I don't claim an example of this particular form is probable.
I do claim that knowing there's no deliberate deception is insufficient - and more generally that we'll fail to think of all the tests that would be necessary. (absent fundamental breakthroughs)

I think this is a great point. For architectures that might learn this kind of behavior (ones that do self-reflection during inference), even somewhat-reliably evaluating their capabilities would require something like latent prompting - being able to search for what states of self-reflection would encourage them to display high capabilities.

I'm somewhat more confident in our ability to think of things to test for. If an AI has "that spark of generality," it can probably figure out how to strategically deceive humans and hack computers and other obvious danger signs.

If not, sidestepping deception doesn't look particularly important.
If so, I remain confused by your level of confidence.

I retain the right to be confident in unimportant things :P

Maybe it would help for me to point out that my comment was in reply to a big quote of Habryka warning about deceptive alignment as a fundamental problem with evals.

[-]Joe Collman2y32

Thanks, that's clarifying.

A couple of small points:

For architectures that might learn this kind of behavior (ones that do self-reflection during inference)

I think it's dangerous to assume that the kind of behaviour I'm pointing at requires explicit self-reflection during inference. That's the obvious example to illustrate the point - but I'm reluctant to assume [x is the obvious way to get y] implies [x is required for y].

Here again, I'd expect us to test for the obvious ways that make sense to us (e.g. simple, explicit mechanisms, and/or the behaviours they'd imply), leaving the possibility of getting blind-sided by some equivalent process based on a weird-to-us mechanism.

a big quote of Habryka warning about deceptive alignment

Ah, I see. He warned about "things like" deceptive alignment and treacherous turns. I guess you were thinking "things such as", and I was thinking "things resembling". (probably because that's what I tend to think about - I assume that if deceptive alignment is solved it'll be as a consequence of a more general approach that also handles [we are robustly mistaken] cases, so that thinking about only deception isn't likely to get us very far; of course I may be wrong :))

[-]Amalthea2y10

There's one piece of good news here: That there seems to be an effort to install infra-structure for monitoring and evaluation and building good habits. That itself seems to me definitely a move in the right direction, although if it is used for safety-washing could easily turn out to be very negative. On the other hand, it might create new leverage for the safety community by incentivizing labs to make their safety strategy concrete and thereby amenable to critique.

Beyond that, I feel like there's not much new information yet. It all depends on the details of the implementation. Both a total stop and an accelerationist attitude could in principle be pursued under this framework as far as I understand.

[-]Joe Collman2y86

Making safety commitments conditional on risks is nice

Making commitments conditional on risks would be nice.
Making commitments conditional on [risks that we notice] is clearly inadequate.

That this distinction isn't made in giant red letters by ARC Evals is disappointing.

Evals might well be great - conditional on clarity that, absent fundamental breakthroughs in our understanding, they can only tell us [model is dangerous], not [model is safe]. Without that clarity, both Evals generally, and RSPs specifically, seem likely to engender dangerous overconfidence.

Another thing I'd like to be clearer is that the following can both be true:

RSPs are a sensible approach for labs/nations to adopt unilaterally.
RSPs are inadequate as a target for global coordination.

I'm highly uncertain about (1).
(2) is obviously true. (obvious because the risk is much too high - not that it's impossible we get extremely lucky)

[-]Zach Stein-Perlman1y20Review for 2023 Review

This post presented the idea of RSPs and detailed thoughts on them, just after Anthropic's RSP was published. It's since become clear that nobody knows how to write an RSP that's predictably neither way too aggressive nor super weak. But this post, along with the accompanying Key Components of an RSP, is still helpful, I think.

Moderation Log