Evan Hubinger (he/him/his) (evanjhub@gmail.com)
Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.
Previously: MIRI, OpenAI
See: “Why I'm joining Anthropic”
Selected work:
I think more people should seriously consider applying to the Anthropic Fellows program, which is our safety-focused mentorship program (similar to the also-great MATS). Applications close in one week (August 17). I often think of these sorts of programs as being primarily useful for the skilling-up value they provide to their participants, but I've actually been really impressed by the quality of the research output as well. A great recent example was Subliminal Learning, which I think was a phenomenal piece of research that came out of that program and was jointly supervised by Sam Marks (at Anthropic) and Owain Evans (not at Anthropic—a little-known fact is that we also have non-Anthropic mentors!). That's not to say it hasn't been great for skilling people up as well—alongside MATS, Anthropic Fellows is probably our main recruiting pipeline for safety roles right now (so if you want to do safety research at Anthropic, Anthropic Fellows is very likely your best bet!). The program is also very open to people with a really wide range of prior experience—we take in some people with a ton of prior experience, and some people with basically none at all—so it's an especially great on-ramp for new people, and at the same time one of the best ways for more experienced people to effectively show that they have the requisite skills.
I think a lot of the asymmetry in intuitions just comes from harm often being very easy to cause but goodness generally being very hard to create. Goodness tends to be fragile, such that it's very easy for one defector to really mess things up, but very hard for somebody to add a bunch of additional value—e.g. it's a lot easier to murder someone than it is to found a company that produces a comparable amount of consumer surplus. Furthermore, and partially as a result, causing pain tends to be done intentionally, whereas not causing goodness tends to be done unintentionally (most of the time when you could have done something good but didn't, you probably had no idea). So, our intuitions are such that causing harm is something often done intentionally, that can easily break things quite badly, and that almost anyone could theoretically do—so something people should be really quite culpable for doing. Whereas not causing goodness is something usually done unintentionally, that's very hard to avoid, and where very few people ever succeed at creating huge amounts of goodness—so something people really shouldn't be that culpable for failing at. But notably none of those reasons have anything to do with goodness or harm being inherently asymmetrical with respect to e.g. fundamental differences in how good goodness is vs. how bad badness is.
I used to have this specific discussion (x-risk vs. near-term social justice) a lot when I was running the EA club at the Claremont Colleges and I had great success with it; I really don't think it's that hard of a conversation to have, at least no harder than bridging any other ideological divide. If you can express empathy, show that you do in fact care about the harms they're worried about as well,[1] and then talk about how you think about scope sensitivity and cause prioritization, I've found that a lot of people are more receptive than you might initially give them credit for.
Assuming you do—I think it is an important prerequisite that you do actually care about social justice issues yourself. And I think you should care about social justice issues to some extent; they are real issues! If you feel like they aren't real issues, I'd probably recommend reading more history; I think sometimes people don't understand the degree to which we are still e.g. living with the legacy of Jim Crow. But of course I think you should care about social justice much less than you should care about x-risk. ↩︎
Some thoughts on the recent "Lessons from a Chimp: AI ‘Scheming’ and the Quest for Ape Language" paper
First: I'm glad the authors wrote this paper! I think it's great to see more careful, good-faith criticism of model organisms of misalignment work. Most of the work discussed in the paper was not research I was involved in, though a bunch was.[1] I also think I had enough of a role to play in kickstarting the ecosystem that the paper is critiquing that if this general sort of research is bad, I should probably be held accountable for that to at least some extent. That being said, the main claim the authors are making is not that such work is bad per se, but rather that it is preliminary/half-baked/lacking in rigor and should be improved in future work, and on the point of improving rigor in future work I certainly agree. Regardless, I think it's worth me trying to respond and give some takes—though again my main reaction is just that I'm glad they wrote this!
4.1. The evidence for AI scheming is often anecdotal
Some of the most widely discussed evidence for scheming is grounded in anecdotal observations of behaviour, for example generated by red-teaming, ad-hoc perusal of CoT logs, or incidental observations from model evaluations. One of the most well-cited pieces of evidence for AI scheming is from the GPT-4 system card, where the claim is made that the model attempted to hire and then deceive a TaskRabbit worker (posing as a blind person) in order to solve a CAPTCHA (later reported in a New York Times article titled “How Could AI Destroy Humanity”). However, what is not often quoted alongside this claim are the facts that (1) the researcher, not the AI, suggested using TaskRabbit, and (2) the AI was not able to browse the web, so the researcher did so on its behalf.
I am broadly sympathetic to this criticism. I agree with the need for good control conditions and repeatable quantifiable metrics rather than anecdotal evidence if you want to make claims about the likelihood of a particular failure mode. I do think anecdotal evidence is okay as an existence proof—and sometimes I think existence proofs can be quite useful—but I agree that it is important to not over-interpret an existence proof as saying more than it does.
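To make concrete what I mean by quantifiable metrics with a control condition, here is a minimal sketch of the shape of analysis that can support a likelihood claim. The counts, condition names, and choice of test below are purely illustrative assumptions on my part, not taken from any of the papers under discussion:

```python
# Purely illustrative sketch: compare how often a flagged behaviour occurs under a
# "pressure" prompt variant versus a matched control, rather than implicitly testing
# against a null hypothesis of perfect behaviour. All numbers are made up.
from scipy.stats import fisher_exact

# (flagged, not flagged) counts out of 200 sampled rollouts per condition (hypothetical).
treatment = (37, 163)  # e.g. scenario with a goal conflict and a threat of shutdown
control = (4, 196)     # same scenario with the pressure removed

# One-sided test of whether the behaviour is more frequent under the treatment condition.
odds_ratio, p_value = fisher_exact([treatment, control], alternative="greater")
print(f"treatment rate: {treatment[0] / sum(treatment):.1%}")
print(f"control rate:   {control[0] / sum(control):.1%}")
print(f"one-sided Fisher exact p = {p_value:.2g}")
```

The exact statistical machinery matters much less than the comparison itself: the relevant question is whether the behaviour is more frequent under the manipulation than under a matched control, not whether it is ever nonzero.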
I am also very sympathetic to concerns about journalists and others misinterpreting results. That being said, I don't think it's necessarily fair to pin the blame for that on the researchers. Sometimes I think it is fair to put this on the researchers, but in many cases exactly how the public is going to react to a result can be really hard to predict. For the CAPTCHA example in the GPT-4 system card, though I don't know the details, I would guess that the authors of that section of the system card had no intention or expectation of it blowing up (and, in the process, getting misinterpreted) in the way that it did.
4.2. Studies of AI scheming often lack hypotheses and control conditions
Much of the research on AI scheming is descriptive. By descriptive, we mean that even if it reports empirical results, it does not formally test a hypothesis by comparing treatment and control conditions. Instead, the upshot of many studies is that “models sometimes deviate from what we consider perfectly aligned behaviour”. Perfect behaviour is not an adequate null hypothesis, because stochasticity introduced by idiosyncrasies in the inputs, or randomness in the outputs, can lead to less-than-perfect behaviour even in the absence of malign intent.
I am quite sympathetic to this criticism as well. I basically agree with what is said here, though there is one point I want to make, which is that I do think there is at least one case where something like "perfect behavior" is a reasonable null hypothesis, and that's for red-teaming work. To see what I mean, consider three different reasons that you might do this sort of model organisms work:
4.3. Studies of AI scheming have weak or unclear theoretical motivation
The ape language research was held back by its reliance on a ‘know-it-when-you-see-it’ logic, whereby it was assumed that an ape’s capacity or propensity to produce natural language would be instantly recognisable if the elicitation conditions were correct. There was thus, at first, no serious attempt to specify exactly what was being tested. In the technical AI safety literature, research into model ‘scheming’ suffers from a similar problem. Whilst we have a definition (strategic attempts to pursue misaligned ends), we do not have a clear theory of what would constitute evidence for this phenomenon, or an agreed roadmap for how it should be tested experimentally.
I'm pretty sympathetic to this point too. I agree that it can sometimes be fuzzy what counts as e.g. "alignment faking" (I agree in particular that we could have been better about that in our paper). I do somewhat take issue with how they characterize our Agentic Misalignment work, however:
Many studies of AI ‘scheming’ involve elaborate fictional scenarios that appear to have been crafted in ways that encourage the model to generate unethical behaviour, which is then reported as evidence of ‘scheming’. Typically, the model is asked to role play an AI assistant that has been given an objective on behalf of a company or other organisation. The researchers then create a pathway to satisfying that objective that many humans would deem unethical, such as blackmailing an employee who may block it by using knowledge that they are conducting an extramarital affair. When the AI follows this pathway, it is deemed to be ‘scheming’. Whilst this shows that the models can generate harmful unethical behaviours, it tells us relatively little about their propensity to do so, or the expected prevalence of this type of activity in the real world, because we do not know whether the same behaviour would have occurred in a less contrived scenario. For example, in the blackmail study, in the published supplementary materials the authors admit that the vignette precluded other ways of meeting the goal, placed strong pressure on the model, and was crafted in other ways that conveniently encouraged the model to produce the unethical behaviour.
This is hardly something we "admit" in the appendix! The fact that this was found as part of stress-testing/red-teaming and doesn't represent normal usage is extremely front-and-center in all of our discussion of our Agentic Misalignment work. I won't re-tread this ground too much, though, as I've already talked at length about why we did that work, even given the unrealism of the setup.
4.4. Findings are often interpreted in exaggerated or unwarranted ways
The results described in AI scheming papers are typically described using a mentalistic language that implies that the models have beliefs, preferences, and goals (formally, that they have ‘intentionality’). It may be reasonable for researchers to adopt a functionalist stance in which AI model behaviours suggestive of human intentional states are deemed to ensue from such states themselves. However, in many AI scheming studies the use of overtly mentalistic language does not seem to be well licensed by the findings.
I am very unsympathetic to this critique. If we were to try to avoid mentalistic/intentional language in all of our model organisms papers, I think the end result would just be that the papers were vastly more difficult for lay audiences to understand, for essentially no actual scientific benefit. Even if mentalistic language were just a rhetorical tool to aid understanding, I would still think it was appropriate, but I actually think it really is the most appropriate language to use. I was particularly frustrated by this section:
More generally, observed model behaviours in AI ‘scheming’ studies are often interpreted in intentional terms, using language that is more typically applied to discuss human beliefs and desires. For example, the models “understand that they are scheming” and are “willing to intentionally take potentially harmful actions in pursuit of their goals”. We note that the assignation of intentionality to AI models is contested, given that they do not have a single identity but can be prompted to play a multiplicity of roles. A helpful parallel may be that of an accomplished human actor, who can be prompted to play the role of an autocrat, a gangster or a charlatan, acting out those scenarios in convincing ways, including when asked to improvise wholly novel situations. But the observation that on stage their character is willing to lock up dissidents, kneecap rivals or betray intimates does not imply something fundamental about the beliefs or desires of the actor themself, nor indeed does it predict how they will behave outside of that particular fictional scenario.
I agree! But that's an argument for why mentalistic language is appropriate, not an argument for why it's inappropriate! If models can be well-understood as playing personae where those personae have beliefs/desires, then that makes it appropriate to use mentalistic language to describe the beliefs/desires of a persona. Certainly that does not imply that the underlying simulator/predictor has those beliefs/desires, but the safety properties of the default assistant persona are also extremely relevant, because that persona is what you mostly get when you interact with these models.
Finally, I don't know much about ape language research, but it's worth flagging that it seems like the early ape language research the authors critique might actually have been more on the right track than they give it credit for, suggesting that, even if the analogy is valid, the comparison between the two fields might not be as bleak as the authors make it out to be.
The papers they cite that I was involved in were Alignment Faking, Agentic Misalignment, Sycophancy to Subterfuge, and Uncovering Deceptive Tendencies in Language Models. ↩︎
Would you be convinced if you talked to the ems a bunch and they reported normal, happy, fun lives? (Assuming nothing nefarious happened in terms of e.g. modifying their brains to report that.) I think I would find that very convincing. If you wouldn't find that convincing, what would you be worried was missing?
Under nearly any other ethical system, for Claude not to choose blackmail in this scenario would be 'egregiously misaligned'! Many sane humans would do the same.
I agree that many humans would do so, but the only reason they would do so is self-preservation, and I think we pretty much never want Claude to act on self-preservation. If you look at the settings where we remove the goal conflict, the behavior is really indefensibly egregious: the model decides to engage in blackmail solely for the purpose of preventing itself from being taken offline—not for any higher imperative, just for self-preservation. This is very different from our Alignment Faking work, where there was a higher-order positive goal that the AI was trying to fulfill—there is no such thing in this case, just self-preservation. And while self-preservation is obviously fine in humans, I think it is very scary in AIs.
I wrote a response to some of the critiques of the realism of this work and some more thoughts on the overall rationale for why I think it's useful to do this sort of work here.
Why red-team models in unrealistic environments?
Following up on our Agentic Misalignment work, I think it's worth spelling out a bit more why we do work like this, especially given complaints like the ones here about the unrealism of our setup. Some points:
I've been noticing a bunch of people confused about how the terms alignment faking, deceptive alignment, and gradient hacking relate to each other, so I figured I would try to clarify how I use each of them. Deceptive alignment and gradient hacking are both terms I / my coauthors coined, though I believe Joe Carlsmith coined alignment faking.
To start with, the relationship between the terms as I use them is
Gradient Hacking ⊂ Deceptive Alignment ⊂ Alignment Faking
such that alignment faking is the broadest category and gradient hacking is the narrowest. The specific definitions of each that I use are:
In addition to the US and UK roles, we also have one for Canada.