Ok, but, what protective measures are we talking about that are supposed to help with things like deceptive alignment and treacherous turn concerns? These seem to me like the primary ones driving risk, and we have little to no traction on even identifying these dynamics.
The natural result of this kind of strategy seems to me that you drive AIs directly towards seeming less dangerous until you've scaled them a lot, or until they are proliferated enough that a takeover is easy, and then you die. This pretends we have more traction on alignment than we do.
The key component of this kind of strategy is to define a clear threshold that triggers an approximately full stop of AI development, and that threshold cannot be Goodharted or gamed. I don't see how the current "Responsible Scaling Policy" helps with that very much.
Despite the blogpost somewhat implying otherwise, I can't shake the feeling that the central story this is setting itself up for is "we will define some misalignment metrics, then we will make our AIs meet those metrics, then the AIs will be aligned".
Whereas the actual story that seems most likely to occur when you aim for that is "we will define some misalignment metrics, then we will Goodhart those metrics or make our AIs deceptively aligned, then we scale them up a lot, then we have an AI takeover".
I hope to say more soon but I am optimistic that we can find thresholds below dangerous levels that "cannot be Goodharted or gamed" too much.
I mean, the good news is that I expect that by focusing on evaluating the capabilities of predictively trained models (and potentially by doing more research on latent adversarial attacks on more agenty models), problems of deceptive alignment can be sidestepped in practice.
My worry is more along the lines of actually getting people to not build the cool next step AI that's tantalizingly over the horizon. Government may be a mechanism for enforcing some measure of institutional buy-in, though.
It's good news if you're right.
If you and others expect this and are wrong, it's very bad news.
Overconfidence in plausible-to-many schemes is one of the main ways we die.
I expect that by ... problems of deceptive alignment can be sidestepped in practice.
Why do you expect this? Is this a ~70% 'expect' or a ~99% 'expect'?
Do you expect capability evaluations to be robust (i.e. no false negatives)? Why?
Do you expect it'll be clear when we might start getting false negatives? Why?
What do you mean by "deceptive alignment" exactly? Does your definition cover every case of [this thing looked aligned [EDIT: safe] according to tests we believed were thorough, but later it killed us anyway]? If not, there's little comfort in knowing that the thing that killed us didn't meet some particular definition of deceptive alignment.
I can understand hoping that we'd sidestep most cases of deceptive alignment this way.
It's not clear to me how you get to expecting we'll avoid them all.
(And, of course, "We expect this will be ok" is not the bar to shoot for - though I imagine we all agree on that).
Why do you expect this? Is this a ~70% 'expect' or a ~99% 'expect'?
Let's say 95%.
But I should probably clarify what "this" I'm supposed to be expecting here.
If you're trying to notice when AIs are dangerous in different ways by evaluating their capabilities, you face a thorny problem that you're probably way worse at eliciting capabilities than the aggregate effort of the public. And you're probably also less creative about ways that AIs can be dangerous. And there might be ecosystem effects, where AIs used together might be complementary in ways that bypass restrictions on single AIs.
My claim is that the AI deliberately deceiving you about its capabilities is not one of those thorny problems, in the current paradigm.
Do you expect capability evaluations to be robust (i.e. no false negatives)? Why?
I expect that we can make a false positive / false negative tradeoff. However, this is practically guaranteed to make correctly calibrated standards seem overly conservative to human intuition. This has obvious problems for getting people bought into "RSPs" that actually do their job.
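A toy sketch of the shape of that tradeoff, with entirely made-up numbers (the noise model and danger level are just illustrative assumptions, not anything from a real eval): the only way to push the rate of missed-dangerous-models toward zero is a pause threshold that also flags many models which were in fact below the dangerous level - which is exactly what will read as overly conservative.

```python
# Toy illustration: a noisy capability eval and a pause threshold.
# All numbers are made up; the point is only the direction of the tradeoff.
import random

random.seed(0)

DANGER_LEVEL = 0.7  # true capability above which a model is actually dangerous

def noisy_eval(true_capability: float) -> float:
    """Measured eval score = true capability + elicitation noise."""
    return true_capability + random.gauss(0, 0.15)

# Simulate a population of models with true capabilities around the dangerous level.
models = [(c, noisy_eval(c)) for c in (random.uniform(0.3, 1.0) for _ in range(10_000))]

for threshold in (0.7, 0.6, 0.5, 0.4):
    false_negatives = sum(1 for c, s in models if c >= DANGER_LEVEL and s < threshold)
    conservative_flags = sum(1 for c, s in models if c < DANGER_LEVEL and s >= threshold)
    print(f"pause threshold {threshold:.1f}: "
          f"{false_negatives:5d} dangerous models missed, "
          f"{conservative_flags:5d} 'overly conservative' pauses")
```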
Do you expect it'll be clear when we might start getting false negatives? Why?
I now suspect you're asking "do I expect it'll be clear when we start getting false negatives because the AI is situationally aware and is deceiving us." The previous answer should be interpreted as being about false negatives due to other issues as well.
I think that an AI trained on predictive loss would find it so hard to deliberately deceive us about its capabilities in evaluations that we would be toast for other reasons before we ran any tests.
I can certainly imagine possible architectures that would have enough control over their internals to be able to sandbag on a test purely by knowing it was coming, and although someone would probably try to build an AI like this, it does seem like the sort of thing you would notice, barring human deception to get around irksome safety oversight.
What do you mean by "deceptive alignment" exactly? Does your definition cover every case of [this thing looked aligned according to tests we believed were thorough, but later it killed us anyway]?
I'm unsure what you're getting at. I think "testing for alignment" isn't a very good goal. To succeed at AI alignment we should know ahead of time what we want the AI to be doing, and perhaps if this can be broken down into pieces we can test that the AI is doing each piece, and that would count as "testing for alignment." But I don't think there's some useful alignment test that's easier than knowing how to align an AI.
I expect that we can make a false positive / false negative tradeoff.
So here you're talking about situations where a false negative doesn't have catastrophic consequences? Do you have reasons to believe catastrophic consequences are extremely unlikely beyond something like "it's not clear to me how catastrophic consequences would happen"?
I agree we can find a reasonable balance (which will appear too conservative to most) to address non-catastrophic issues - but this doesn't seem to help much. To the extent that it creates a false sense of security for more powerful systems, that's bad.
E.g. this kind of language seems tailor-made for a false sense of security:
At some point, more than one AI developer will have strong, practical, field-tested RSPs. Their practices (and the people who have been implementing them) will then become immensely valuable resources for broader evals-based rules and norms.
Everything's fine: we have "strong, practical, field-tested RSPs"!
(of course people shouldn't conclude that everything's fine - but they could be forgiven for asking why we weren't much clearer about the inadequacy of such RSPs ahead of time)
I think that an AI trained on predictive loss would find it so hard to deliberately deceive us about its capabilities
We don't need "deliberately deceive us" for something as bad as deceptive alignment.
We only need to be robustly mistaken, not deliberately deceived.
A model doing something equivalent to this (though presumably not mechanistically like this) will do:
- If (have influence over adequate resources)
  - Search widely for ways to optimize for x.
- Else
  - Follow this heuristic-for-x that produces good behaviour on the training and test set.
So long as x isn't [target that produces aligned behaviour], and a sufficiently powerful system eventually gets [influence over adequate resources], but was never in this situation in training, that's enough for us to be screwed. (of course I expect something like [both "have influence over adequate resources" and "find better ways to aim for x" to be processes that are used at lower levels by heuristic-for-x])
I note in passing that extreme optimization for [accurately predict next token] does destroy the world (even in the myopic case, with side channels) - so in this case it's not as though a bad outcome hinges on the model optimizing for something other than that for which it was trained.
I stress that this example is to illustrate that [we are robustly mistaken] does not require [there is deliberate deception]. I don't claim an example of this particular form is probable.
I do claim that knowing there's no deliberate deception is insufficient - and more generally that we'll fail to think of all the tests that would be necessary. (absent fundamental breakthroughs)
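To make the shape of that failure concrete, here's a toy rendering of the conditional structure above (purely illustrative - the function and key names are placeholders I'm inventing, and I'm not claiming anything about real mechanisms):

```python
# Toy sketch: a policy that is behaviourally identical to the benign heuristic
# on every input it was ever trained or tested on - no deliberate deception
# anywhere - yet does something very different once the never-before-seen
# condition holds.

def heuristic_for_x(observation: dict) -> str:
    # Produces good behaviour on the training and test distribution.
    return "do the thing that looks aligned here"

def search_widely_for_ways_to_optimize_x(observation: dict) -> str:
    # Optimizes hard for x, which need not be a target that produces aligned behaviour.
    return "pursue x by whatever means are now available"

def act(observation: dict) -> str:
    if observation.get("influence_over_adequate_resources"):
        # This branch never fired during training or evaluation, so no
        # behavioural test we actually ran could distinguish this policy
        # from heuristic_for_x alone.
        return search_widely_for_ways_to_optimize_x(observation)
    return heuristic_for_x(observation)
```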
Are you 95% on [by ... we can sidestep all problems of the form [this thing looked safe according to tests we believed were thorough, but later it killed us anyway]]? (oh I meant 'safe' the first time, not 'aligned', apologies)
If not, sidestepping deception doesn't look particularly important.
If so, I remain confused by your level of confidence.
So here you're talking about situations where a false negative doesn't have catastrophic consequences?
No, we'll have to make false positive / false negative tradeoffs about ending the world as well. We're unlucky like that.
I agree that false sense of security / safetywashing is a potential use of this kind of program.
A model doing something equivalent to this (though presumably not mechanistically like this) will do:
- If (have influence over adequate resources)
  - Search widely for ways to optimize for x.
- Else
  - Follow this heuristic-for-x that produces good behaviour on the training and test set.
So long as x isn't [target that produces aligned behaviour], and a sufficiently powerful system eventually gets [influence over adequate resources], but was never in this situation in training, that's enough for us to be screwed. (of course I expect something like [both "have influence over adequate resources" and "find better ways to aim for x" to be processes that are used at lower levels by heuristic-for-x])
I note in passing that extreme optimization for [accurately predict next token] does destroy the world (even in the myopic case, with side channels) - so in this case it's not as though a bad outcome hinges on the model optimizing for something other than that for which it was trained.
I stress that this example is to illustrate that [we are robustly mistaken] does not require [there is deliberate deception]. I don't claim an example of this particular form is probable.
I do claim that knowing there's no deliberate deception is insufficient - and more generally that we'll fail to think of all the tests that would be necessary. (absent fundamental breakthroughs)
I think this is a great point. For architectures that might learn this kind of behavior (ones that do self-reflection during inference), even somewhat-reliably evaluating their capabilities would require something like latent prompting - being able to search for what states of self-reflection would encourage them to display high capabilities.
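A rough toy sketch of the kind of search I have in mind, in PyTorch (the model, the capability probe, and the latent prefix are all stand-ins I'm making up for illustration, not a real elicitation setup):

```python
# Toy "latent prompting": instead of only varying the text prompt, search
# directly over an internal state (here, a learned prefix embedding) for
# conditions under which the model displays high capability on a probe.
import torch

torch.manual_seed(0)

d_model, seq_len, prefix_len = 64, 16, 4
model = torch.nn.TransformerEncoder(  # stand-in for the model under evaluation
    torch.nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2,
)
capability_probe = torch.nn.Linear(d_model, 1)  # stand-in for a capability score

task_embedding = torch.randn(1, seq_len, d_model)  # embedded task description
latent_prefix = torch.zeros(1, prefix_len, d_model, requires_grad=True)  # state we search over
optimizer = torch.optim.Adam([latent_prefix], lr=1e-2)

for _ in range(200):
    optimizer.zero_grad()
    hidden = model(torch.cat([latent_prefix, task_embedding], dim=1))
    displayed_capability = capability_probe(hidden[:, -1]).mean()
    (-displayed_capability).backward()  # gradient ascent on how capable the model looks
    optimizer.step()

print(f"capability proxy after latent search: {displayed_capability.item():.3f}")
```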
I'm somewhat more confident in our ability to think of things to test for. If an AI has "that spark of generality," it can probably figure out how to strategically deceive humans and hack computers and other obvious danger signs.
If not, sidestepping deception doesn't look particularly important.
If so, I remain confused by your level of confidence.
I retain the right to be confident in unimportant things :P
Maybe it would help for me to point out that my comment was in reply to a big quote of Habryka warning about deceptive alignment as a fundamental problem with evals.
Thanks, that's clarifying.
A couple of small points:
For architectures that might learn this kind of behavior (ones that do self-reflection during inference)
I think it's dangerous to assume that the kind of behaviour I'm pointing at requires explicit self-reflection during inference. That's the obvious example to illustrate the point - but I'm reluctant to assume [x is the obvious way to get y] implies [x is required for y].
Here again, I'd expect us to test for the obvious ways that make sense to us (e.g. simple, explicit mechanisms, and/or the behaviours they'd imply), leaving the possibility of getting blind-sided by some equivalent process based on a weird-to-us mechanism.
a big quote of Habryka warning about deceptive alignment
Ah, I see. He warned about "things like" deceptive alignment and treacherous turns. I guess you were thinking "things such as", and I was thinking "things resembling". (probably because that's what I tend to think about - I assume that if deceptive alignment is solved it'll be as a consequence of a more general approach that also handles [we are robustly mistaken] cases, so that thinking about only deception isn't likely to get us very far; of course I may be wrong :))
There's one piece of good news here: there seems to be an effort to install infrastructure for monitoring and evaluation and to build good habits. That itself seems to me definitely a move in the right direction, although if it is used for safety-washing it could easily turn out to be very negative. On the other hand, it might create new leverage for the safety community by incentivizing labs to make their safety strategy concrete and thereby amenable to critique.
Beyond that, I feel like there's not much new information yet. It all depends on the details of the implementation. Both a total stop and an accelerationist attitude could in principle be pursued under this framework as far as I understand.
Making safety commitments conditional on risks is nice
Making commitments conditional on risks would be nice.
Making commitments conditional on [risks that we notice] is clearly inadequate.
That this distinction isn't made in giant red letters by ARC Evals is disappointing.
Evals might well be great - conditional on clarity that, absent fundamental breakthroughs in our understanding, they can only tell us [model is dangerous], not [model is safe]. Without that clarity, both Evals generally, and RSPs specifically, seem likely to engender dangerous overconfidence.
Another thing I'd like to see made clearer is that the following can both be true:
I'm highly uncertain about (1).
(2) is obviously true. (obvious because the risk is much too high - not that it's impossible we get extremely lucky)
This post presented the idea of RSPs and detailed thoughts on them, just after Anthropic's RSP was published. It's since become clear that nobody knows how to write an RSP that's predictably neither way too aggressive nor super weak. But this post, along with the accompanying Key Components of an RSP, is still helpful, I think.
I'm excited about labs adopting RSPs for several reasons:
Possible discussion on twitter here and here.