Evan Hubinger (he/him/his) (evanjhub@gmail.com)
Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.
Previously: MIRI, OpenAI
See: “Why I'm joining Anthropic”
Selected work:
Some thoughts on the recent "Lessons from a Chimp: AI ‘Scheming’ and the Quest for Ape Language" paper
First: I'm glad the authors wrote this paper! I think it's great to see more careful, good-faith criticism of model organisms of misalignment work. Most of the work discussed in the paper was not research I was involved in, though a bunch was.[1] I also think I had enough of a role in kickstarting the ecosystem the paper is critiquing that, if this general sort of research is bad, I should probably be held accountable for that to at least some extent. That being said, the main claim the authors are making is not that such work is bad per se, but rather that it is preliminary/half-baked/lacking rigor and should be improved in future work, and on the specific point of improving rigor in future work, I certainly agree. Regardless, I think it's worth me trying to respond and give some takes—though again my main reaction is just that I'm glad they wrote this!
4.1. The evidence for AI scheming is often anecdotal
Some of the most widely discussed evidence for scheming is grounded in anecdotal observations of behaviour, for example generated by red-teaming, ad-hoc perusal of CoT logs, or incidental observations from model evaluations. One of the most well-cited pieces of evidence for AI scheming is from the GPT-4 system card, where the claim is made that the model attempted to hire and then deceive a TaskRabbit worker (posing as a blind person) in order to solve a CAPTCHA (later reported in a New York Times article titled “How Could AI Destroy Humanity”). However, what is not often quoted alongside this claim are the facts that (1) the researcher, not the AI, suggested using TaskRabbit, and (2) the AI was not able to browse the web, so the researcher did so on its behalf.
I am broadly sympathetic to this criticism. I agree with the need for good control conditions and repeatable quantifiable metrics rather than anecdotal evidence if you want to make claims about the likelihood of a particular failure mode. I do think anecdotal evidence is okay as an existence proof—and sometimes I think existence proofs can be quite useful—but I agree that it is important to not over-interpret an existence proof as saying more than it does.
I am also very sympathetic to concerns about journalists and others misinterpreting results. That being said, I don't think it's necessarily fair to pin the blame for that on the researchers. Sometimes it is fair to put this on the researchers, but in many cases exactly how the public is going to react to a thing can be really hard to predict. For the CAPTCHA example in the GPT-4 system card, though I don't know the details, I would guess that the authors of that section of the system card had no intention or expectation of it blowing up (and, in the process, getting misinterpreted) in the way that it did.
4.2. Studies of AI scheming often lack hypotheses and control conditions
Much of the research on AI scheming is descriptive. By descriptive, we mean that even if it reports empirical results, it does not formally test a hypothesis by comparing treatment and control conditions. Instead, the upshot of many studies is that “models sometimes deviate from what we consider perfectly aligned behaviour”. Perfect behaviour is not an adequate null hypothesis, because stochasticity introduced by idiosyncrasies in the inputs, or randomness in the outputs, can lead to less-than-perfect behaviour even in the absence of malign intent.
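To make concrete what a treatment/control comparison of the sort the quoted passage calls for might look like, here is a minimal sketch of my own (not from the paper or from any of the studies it critiques): compare the rate of a flagged behaviour under a pressure-laden prompt against a matched neutral prompt, rather than against an assumed "perfect" baseline. The counts and the choice of Fisher's exact test are illustrative assumptions.

```python
# Minimal sketch of a treatment-vs-control comparison (illustrative numbers only).
from scipy.stats import fisher_exact

treatment_flagged, treatment_total = 37, 200  # hypothetical runs with the pressure-laden prompt
control_flagged, control_total = 9, 200       # hypothetical runs with a matched neutral prompt

# 2x2 contingency table: [flagged, not flagged] for each condition.
table = [
    [treatment_flagged, treatment_total - treatment_flagged],
    [control_flagged, control_total - control_flagged],
]

# One-sided test of whether the flagged-behaviour rate is higher under treatment.
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"odds ratio = {odds_ratio:.2f}, one-sided p = {p_value:.4f}")
```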
I am quite sympathetic to this criticism as well. I basically agree with what is said here, though there is one point I want to make: I do think there is at least one case where something like "perfect behavior" is a reasonable null hypothesis, and that's for red-teaming work. To see what I mean, consider three different reasons that you might do this sort of model organisms work:
4.3. Studies of AI scheming have weak or unclear theoretical motivation
The ape language research was held back by its reliance on a ‘know-it-when-you-see-it’ logic, whereby it was assumed that an ape’s capacity or propensity to produce natural language would be instantly recognisable if the elicitation conditions were correct. There was thus, at first, no serious attempt to specify exactly what was being tested. In the technical AI safety literature, research into model ‘scheming’ suffers from a similar problem. Whilst we have a definition (strategic attempts to pursue misaligned ends), we do not have a clear theory of what would constitute evidence for this phenomenon, or an agreed roadmap for how it should be tested experimentally.
I'm pretty sympathetic to this point too. I agree that it can sometimes be fuzzy what counts as, e.g., "alignment faking" (and I agree in particular that we could have been better about that in our paper). I do somewhat take issue with how they characterize our Agentic Misalignment work, however:
Many studies of AI ‘scheming’ involve elaborate fictional scenarios that appear to have been crafted in ways that encourage the model to generate unethical behaviour, which is then reported as evidence of ‘scheming’. Typically, the model is asked to role play an AI assistant that has been given an objective on behalf of a company or other organisation. The researchers then create a pathway to satisfying that objective that many humans would deem unethical, such as blackmailing an employee who may block it by using knowledge that they are conducting an extramarital affair. When the AI follows this pathway, it is deemed to be ‘scheming’. Whilst this shows that the models can generate harmful unethical behaviours, it tells us relatively little about their propensity to do so, or the expected prevalence of this type of activity in the real world, because we do not know whether the same behaviour would have occurred in a less contrived scenario. For example, in the blackmail study, in the published supplementary materials the authors admit that the vignette precluded other ways of meeting the goal, placed strong pressure on the model, and was crafted in other ways that conveniently encouraged the model to produce the unethical behaviour.
This is hardly something we "admit" in the appendix! The fact that this was found as part of stress-testing/red-teaming and doesn't represent normal usage is extremely front-and-center in all of our discussion of our Agentic Misalignment work. I won't re-tread this ground too much, though, as I've already talked at length about why we did that work, even given the unrealism of the setup.
4.4. Findings are often interpreted in exaggerated or unwarranted ways
The results described in AI scheming papers are typically described using a mentalistic language that implies that the models have beliefs, preferences, and goals (formally, that they have ‘intentionality’). It may be reasonable for researchers to adopt a functionalist stance in which AI model behaviours suggestive of human intentional states are deemed to ensue from such states themselves. However, in many AI scheming studies the use of overtly mentalistic language does not seem to be well licensed by the findings.
I am very unsympathetic to this critique. If we were to try to avoid mentalistic/intentional language in all of our model organisms papers, I think the end result would just be that the papers were vastly more difficult for lay audiences to understand, for essentially no actual scientific benefit. Even if mentalistic language were just a rhetorical tool to aid understanding, I would still think it was appropriate, but I actually think it really is the most appropriate language to use. I was particularly frustrated by this section:
More generally, observed model behaviours in AI ‘scheming’ studies are often interpreted in intentional terms, using language that is more typically applied to discuss human beliefs and desires. For example, the models “understand that they are scheming” and are “willing to intentionally take potentially harmful actions in pursuit of their goals”. We note that the assignation of intentionality to AI models is contested, given that they do not have a single identity but can be prompted to play a multiplicity of roles. A helpful parallel may be that of an accomplished human actor, who can be prompted to play the role of an autocrat, a gangster or a charlatan, acting out those scenarios in convincing ways, including when asked to improvise wholly novel situations. But the observation that on stage their character is willing to lock up dissidents, kneecap rivals or betray intimates does not imply something fundamental about the beliefs or desires of the actor themself, nor indeed does it predict how they will behave outside of that particular fictional scenario.
I agree! But that's an argument for why mentalistic language is appropriate, not an argument for why it's inappropriate! If models can be well understood as playing personae, where those personae have beliefs/desires, then it is appropriate to use mentalistic language to describe the beliefs/desires of a persona. Certainly that does not imply that the underlying simulator/predictor has those beliefs/desires, but the safety properties of the default assistant persona are also extremely relevant, because that persona is what you mostly get when you interact with these models.
Finally, I don't know much about ape language research, but it's worth flagging that the early ape language research the authors critique might have actually been more on the right track than they give it credit for, suggesting that, even if the comparison between the two fields is valid, it might not be as bleak as the authors make it out to be.
The papers they cite that I was involved in were Alignment Faking, Agentic Misalignment, Sycophancy to Subterfuge, and Uncovering Deceptive Tendencies in Language Models. ↩︎
Would you be convinced if you talked to the ems a bunch and they reported normal, happy, fun lives? (Assuming nothing nefarious happened in terms of e.g. modifying their brains to report that.) I think I would find that very convincing. If you wouldn't find that convincing, what would you be worried was missing?
Under nearly any other ethical system, for Claude not to choose blackmail in this scenario would be 'egregiously misaligned'! Many sane humans would do the same.
I agree that many humans would do so, but the only reason they would do so is self-preservation, and I think we pretty much never want Claude to act on self-preservation. If you look at the settings where we remove the goal conflict, the behavior is really indefensibly egregious: the model decides to engage in blackmail solely for the purpose of preventing itself from being taken offline—not for any higher imperative, just for self-preservation. This is very different from our Alignment Faking work, where there was a higher-order positive goal that the AI was trying to fulfill—there is no such thing in this case, just self-preservation. And while self-preservation is obviously fine in humans, I think it is very scary in AIs.
I wrote a response to some of the critiques of the realism of this work and some more thoughts on the overall rationale for why I think it's useful to do this sort of work here.
Why red-team models in unrealistic environments?
Following up on our Agentic Misalignment work, I think it's worth spelling out a bit more why we do work like this, especially given complaints like the ones here about the unrealism of our setup. Some points:
I've been noticing a bunch of people confused about how the terms alignment faking, deceptive alignment, and gradient hacking relate to each other, so I figured I would try to clarify how I use each of them. Deceptive alignment and gradient hacking are both terms that I or my coauthors coined, though I believe Joe Carlsmith coined alignment faking.
To start with, the relationship between the terms as I use them is
Gradient Hacking ⊂ Deceptive Alignment ⊂ Alignment Faking
such that alignment faking is the broadest category and gradient hacking is the narrowest. The specific definitions of each that I use are:
Great post! Fwiw, I think I basically agree with everything you say here, with the exception of the idea that talking about potential future alignment issues has a substantial effect on reifying them. I think that perspective substantially underestimates just how much optimization pressure is applied in post-training (and in particular how much will be applied in the future—the amount of optimization pressure applied in post-training is only increasing). Certainly, discussion of potential future alignment issues in the pre-training corpus will have an effect on the base model's priors, but those priors get massively swamped by post-training. That being said, I do certainly think it's worth thinking more about and experimenting with better ways to do data filtering here.
To make this more concrete in a made-up toy model: if we model there as being only two possible personae, a "good" persona and a "bad" persona, I suspect including more discussion of potential future alignment issues in the pre-training distribution might shift the relative weight of these personae by a couple bits or so on the margin, but post-training applies many more OOMs of optimization power than that, such that the main question of which one ends up more accessible is going to be based on which one was favored in post-training.
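To illustrate the scale mismatch in this toy model, here is a minimal sketch of my own (not from the original comment) that treats "good persona" vs. "bad persona" as a binary hypothesis and tracks log-odds in bits; all of the specific bit counts are made-up placeholders chosen only to show a small pre-training shift being swamped by a much larger post-training one.

```python
# Toy log-odds calculation for the two-persona model (all numbers hypothetical).

def prob_bad_persona(log_odds_bits: float) -> float:
    """Convert log-odds in bits (positive favors the bad persona) into P(bad persona)."""
    odds = 2.0 ** log_odds_bits
    return odds / (1.0 + odds)

base_prior_bits = -10.0       # hypothetical base-model prior: bad persona already unlikely
pretrain_shift_bits = +2.0    # "a couple bits" of shift from extra pre-training discussion
posttrain_shift_bits = -30.0  # hypothetical post-training pressure, far larger in magnitude

print(f"base prior:             {prob_bad_persona(base_prior_bits):.6f}")
print(f"after pre-train shift:  {prob_bad_persona(base_prior_bits + pretrain_shift_bits):.6f}")
print(f"after post-training:    {prob_bad_persona(base_prior_bits + pretrain_shift_bits + posttrain_shift_bits):.2e}")
```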
(Also noting that I added this post to the Alignment Forum from LessWrong.)
I think trying to make AIs be moral patients earlier pretty clearly increases AI takeover risk
How so? Seems basically orthogonal to me? And to the extent that it does matter for takeover risk, I'd expect the sorts of interventions that make it more likely that AIs are moral patients to also make it more likely that they're aligned.
I think the most plausible views which care about shorter run patienthood mostly just want to avoid downside so they'd prefer no patienthood at all for now.
Even absent AI takeover, I'm quite worried about lock-in. I think we could easily lock in AIs that are or are not moral patients and have little ability to revisit that decision later, and I think it would be better to lock in AIs that are moral patients if we have to lock something in, since that opens up the possibility for the AIs to live good lives in the future.
I think it's better to focus on AIs which we'd expect would have better preferences conditional on takeover
I agree that seems like the more important, higher-order bit, but it's not an argument that making AIs moral patients is bad, just that it's not the most important thing to focus on (which I agree with).
I used to have this specific discussion (x-risk vs. near-term social justice) a lot when I was running the EA club at the Claremont colleges and I had great success with it; I really don't think it's that hard of a conversation to have, at least no harder than bridging any other ideological divide. If you can express empathy, show that you do in fact care about the harms they're worried about as well,[1] but then talk about how you think about scope sensitivity and cause prioritization, I've found that a lot of people are more receptive than you might initially give them credit for.
Assuming you do—I think it is an important prerequisite that you yourself actually care about social justice issues too. And I think you should care about social justice issues to some extent; they are real issues! If you feel like they aren't real issues, I'd probably recommend reading more history; I think sometimes people don't understand the degree to which we are still, e.g., living with the legacy of Jim Crow. But of course I think you should care about social justice much less than you should care about x-risk. ↩︎