I’m not asking [...] that they make such a commitment highly public or legally binding
That seems like a self-defeating concession. If, as per your previous post, we want to exploit the "everybody likes a winner" pattern, we want there to be someone who is doing the liking. We want an audience that judges what we are and aren't allowed to do, and when other actors have and don't have to listen to us; and this audience has to be someone whose whims influence said actors' incentives.
The audience doesn't need to be the general public, sure. It might be other ML researchers, the senior management of other AI Labs, or even just the employees of a particular AI Lab. But it needs to be someone.
A second point: there needs to be a concrete "winner". These initiatives we're pushing have to visibly come from some specific group, that would accrue the reputation of a winner that can ram projects through. It can't come from a vague homogeneous mass of "alignment researchers". The hypothetical headlines[1] have to read "%groupname% convinces %list-of-AI-Labs% to implement %policy%", not "%list-of-AI-Labs% agree to implement %policy%". Else it won't work.
Such a commitment risks giving us a false sense of having addressed deceptive alignment
That's my biggest concern with your object-level idea here, yep. I think this concern will transfer to any object-level idea for our first "clear win", though: the implementation of any first clear win will necessarily not be up to our standards. Which is something I think we should just accept, and view that win as what it is: a stepping stone, of not particular value in itself.
Or, in other words: approximately all of the first win's value would be reputational. And we need to make very sure that we catch all this value.
I'd like us to develop a proper strategy/plan here, actually: a roadmap of the policies we want to implement, each next one more ambitious than the previous ones and only implementable due to that previous chain of victories, with some properly ambitious end-point like "all leading AI Labs take AI Risk up to our standards of seriousness".
That roadmap won't be useful on the object-level, obviously, but it should help develop intuitions for how "acquiring a winner's reputation" actually looks like, and how it doesn't.
In particular, each win should palpably expand our influence in terms of what we can achieve. Merely getting a policy implemented doesn't count: we have to visibly prevail.
Not that we necessarily want actual headlines, see the point about the general public not necessarily being the judging audience.
I think one could view the abstract idea of "coordination as a strategy for AI risk" as itself a winner, and the potential participants of future coordination as the audience--the more people believe that coordination can actually work, the more likely coordination is to happen. I'm not sure how much this should be taken into account.
The most straightforward way to produce evidence of a model’s deception is to find a situation where it changes what it’s doing based on the presence or absence of oversight. If we can find a clear situation where the model’s behavior changes substantially based on the extent to which its behavior is being overseen, that is clear evidence that the model was only pretending to be aligned for the purpose of deceiving the overseer.
This isn't clearly true to me, at least for one possible interpretation: "If the AI knows a human is present, the AI's behavior changes in an alignment-relevant way." I think there's a substantial chance this isn't what you had in mind, but I'll just lay out the following because I think it'll be useful anyways.
For example, I behave differently around my boss than alone at home, but I'm not pretending to be anything for the purpose of deceiving my boss. More generally, I think it'll be quite common for different contexts to activate different AI values. Perhaps the AI learned that if the overseer is watching, then it should complete the intended task (like cleaning up a room), because historically those decisions were reinforced by the overseer. This implies certain bad generalization properties, but not purposeful deception.
Yeah, that's a good point—I agree that the thing I said was a bit too strong. I do think there's a sense in which the models you're describing seem pretty deceptive, though, and quite dangerous, given that they have a goal that they only pursue when they think nobody is looking. In fact, the sort of model that you're describing is exactly the sort of model I would expect a traditional deceptive model to want to gradient hack itself into, since it has the same behavior but is harder to detect.
Great work! I hope more people take your direction, with concrete experiments and monitoring real systems as they evolve. The concern that doing this will backfire somehow simply must be dismissed as untimely perfectionism. It's too late at this point to shun iteration. We simply don't have time left for a Long Reflection about AI alignment, even if we did have the coordination to pull that off.
There must be some evidence that the initial appearance of alignment was due to the model actively trying to appear aligned only in the service of some ulterior goal.
"trying to appear aligned" seems imprecise to me - unless you mean to be more specific than the RFLO description. (see footnote 7: in general, there's no requirement for modelling of the base optimiser or oversight system; it's enough to understand the optimisation pressure and be uncertain whether it'll persist).
Are you thinking that it makes sense to agree to test for systems that are "trying to appear aligned", or would you want to include any system that is instrumentally acting such that it's unaltered by optimisation pressure?
It's not clear to me how you get to deceptive alignment 'that completely supersedes the explicit alignment'. That an AI would develop epiphenomenal goals and alignments, not understood by its creators, that it perceived as useful or necessary to pursue whatever primary goal it had been set, seems very likely. But while they might be in conflict with what we want it to do, I don't see how this emergent behavior could be such that it would be contradict the pursuit of satisfying whatever evaluation function the AI had been trained for in the beginning. Unless of course the AI made what we might consider a stupid mistake.
One design option that I haven't seen discussed (though I have not read everything ... maybe this falls in to the category of 'stupid newbie ideas') is that of trying to failsafe an AI by separating its evaluation or feedback in such a way that it can, once sufficiently superhuman, break in to the 'reward center' and essentially wirehead itself. If your AI is trying to move some calculated value to be as close to 1000 as possible, then once it understands the world sufficiently well it should simply conclude 'aha, by rooting this other box here I can reach nirvana!', follow through, and more or less consider its work complete. To our relief, in this case.
Of course this does nothing to address the problem of AI controlled by malicious human actors, which will likely become a problem well before any takeoff threshold is reached.
This post is a follow-up to “AI coordination needs clear wins.” Thanks to Ethan Perez, Kate Woolverton, Richard Ngo, Anne le Roux, Sam McCandlish, Adam Jermyn, and Danny Hernandez for useful conversations, comments, and feedback.
In this post, I want to propose a clear, concrete coordination task that I think might be achievable soon given the current landscape, would generate a clear coordination win, and that I think would be highly useful in and of itself. Specifically:
I want DeepMind, OpenAI, and Anthropic to commit to actively monitor and look for evidence of deceptive alignment in their models—as well as run experiments to try to predict when and where deceptive alignment might occur before it does.
Notably, I am specifically referring only to the narrow case of deceptive alignment here, not just any situation where models say false things. Deceptive alignment is specifically a situation where the reason the model looks aligned is because it is actively trying to game the training signal for the purpose of achieving some ulterior goal.[1]
I think this is a pretty minimal ask that would nevertheless be a clear win. Among all AI failure modes, deceptive alignment is one of the most unambiguously bad, which means, though I expect lots of disagreement on its likelihood, there should be little disagreement regarding its severity. Furthermore, things that I’m not asking for:
Nevertheless, as I stated, I think this would still be quite useful.
First, I think deceptive alignment is clearly one of the most concerning problems in AI alignment. Though they don’t always use the framing of deceptive alignment specifically, most presentations of AI risk feature the phenomenon of an AI that wants something in the world deceiving our oversight procedures to get that something. See: treacherous turns, schemers, influence-seeking behavior, power-seeking AI. Furthermore, I think there is a strong case for why deceptive alignment is reasonably likely to appear by default in advanced machine learning systems.
Second, I think that it is plausible that, if deceptive alignment is eventually going to be a problem, we’ll be able to see it and produce demonstrations of it before it becomes existentially dangerous. I’m less confident in this premise than the previous one, but my current sense is that the space of deceptively aligned models is large enough to include lots of deceptively aligned models that would be willing to display their deception in the presence of the right sort of honeypot. For example, you might be able to find deceptively aligned models that:[2]
Of course, just being able to detect deception doesn’t mean we’ll be able to solve it—and in fact in many cases it seems likely that our default ways of addressing deception would result in deception that just hides itself better—but I still think that being able to detect it before it becomes catastrophic would be a huge win, since it’d give us an opportunity to study the phenomenon directly in the laboratory. At the very least, having an example of the problem to study directly seems substantially better than having to get it right on the first try.
In particular, even though I’m skeptical that, once you have a deceptive model, you could plausibly train that deception away—rather than just making the model better at hiding its deception—knowing that a particular path to AGI leads to deceptive alignment gives us the opportunity to pivot to a substantially different path where the likelihood of deceptive alignment is as uncorrelated as possible relative to the previous path. For example, we could discover that RLHF (Reinforcement Learning from Human Feedback) schemes often yield deception in a way that e.g. language model pre-training + prompt-chaining doesn’t and switch over.
Furthermore, even if we can’t produce direct demonstrations of deceptive alignment prior to it becoming dangerous, I think we should at least be able to produce demonstrations of some of its precursors. Such deceptive precursor demonstrations could serve a similar role to a direct deception demonstration in helping us choose between different paths to AGI based on the presence of deceptive precursors in each path. Such precursor demonstrations might test questions like:
Though not directly evidence of deceptive alignment itself, deceptive alignment relies on the exact same sort of instrumental reasoning as would be required to exhibit the behaviors above. Thus, understanding when they appear could help us understand when deception is likely to appear as well—and thus looking for such precursors is a task that I would want any team looking for deceptive alignment to also be looking for as well.
Ideally, if every major lab had a team tasked with running experiments like the above, those results could be used to help guide us in terms of which paths to AGI are likely to be the least likely to yield deceptive alignment.
All of that being said, I do think that there are some pretty big potential downsides to pushing for a commitment like the above. In particular:
Nevertheless, I think there are a number of things that we can do to try to mitigate the above risks—specifically:
Particularly tricky is the third condition here that actually requires us to hash out right now as good an understanding as we can of what a demonstration of deceptive alignment might look like—and what sort of experiments we might be able to use to look for one. To that end, I’ll try to at least lay out some baseline criteria and scenarios for such a demonstration.
To start with, we have to establish what we’re looking for in terms of what would convince us that a model is deceptively aligned. Primarily, I think we’re looking for two things:
The key question, then, is what mechanisms there might be for producing such evidence. I think such mechanisms are mostly going to fall into two broad categories:
In my opinion, I think that (2) here—uncovering deception in advance—would be ideal, but seems to require relatively advanced interpretability tools that we might not ever actually get.[3] Though I think we should certainly be doing everything we can to work our way up to interpretability tools that are capable of something like that, it’s also worth planning for what to do if we don’t get there.
In the situation where the model in question takes the form of a large language model (or a large language model + a prompt), one strategy that might help substitute for interpretability tools is chain-of-thought-style techniques. For example, from “Conditioning Generative Models:”
To try to get a bit more concrete, here is a very incomplete list of some example demonstrations of deceptive alignment, ordered approximately from most obviously dangerous to least obviously dangerous:
Trying to do an experiment right now to get any of the demonstrations on the above list—with the exception of the last one—is almost certain to be negative right now. As I stated previously, however, I think there are a lot of very related experiments that someone working on the above would want to perform alongside testing for deceptive alignment directly that would be more likely to yield positive results in the short term, specifically those looking for deceptive precursors rather than deception directly. Furthermore, in my opinion, running lots of negative experiments is what safety is supposed to look like: aggressively testing for failure modes even if you think they’re probably not present.
Training signal here need not be the literal loss (though it could be)—it's just anything that we're trying to get the model to do. ↩︎
These sorts of deceptively aligned models become substantially more likely to be discovered if we do gain-of-function-style research to actively try to produce them, though such research also comes with its own obvious downsides such that it’s quite unclear whether it’s actually worth doing. ↩︎
Furthermore, trying to uncover deception in advance via interpretability tools could fail simply because there is no sense that a deceptively aligned model has to actively be thinking about its deception. A model that has never seen a situation where there is an opportunity to seize power need not be carefully planning out what it would do in such a situation any more than a factory cleaning robot need be planning for what to do if it someday found itself in a jungle instead of a factory. Nevertheless, the fact that the model has not been previously planning to seize power doesn’t imply that it wouldn’t if given the opportunity. In particular, a model could be deceptively aligned simply because it reasons that, in situations where there is a clear overseer, doing what it wants is a good general strategy for gaining power and influence in the world—without needing any explicit plans for later deception. ↩︎