evhub

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I'm joining Anthropic”

Selected work:

Sequences

Conditioning Predictive Models
ML Alignment Theory Scholars Program Winter 2021
Risks from Learned Optimization

Comments

I used to have this specific discussion (x-risk vs. near-term social justice) a lot when I was running the EA club at the Claremont colleges, and I had great success with it; I really don't think it's that hard a conversation to have, at least no harder than bridging any other ideological divide. If you express empathy, show that you do in fact care about the harms they're worried about as well,[1] and then talk about how you think about scope sensitivity and cause prioritization, I've found that a lot of people are more receptive than you might initially give them credit for.


  1. Assuming you do—I think it is an important prerequisite that you do actually yourself care about social justice issues too. And I think you should care about social justice issues to some extent; they are real issues! If you feel like they aren't real issues, I'd probably recommend reading more history; I think sometimes people don't understand the degree to which we are still e.g. living with the legacy of Jim Crow. But of course I think you should care about social justice much less than you should care about x-risk. ↩︎

Some thoughts on the recent "Lessons from a Chimp: AI ‘Scheming’ and the Quest for Ape Language" paper

First: I'm glad the authors wrote this paper! I think it's great to see more careful, good-faith criticism of model organisms of misalignment work. Most of the work discussed in the paper was not research I was involved in, though a bunch was.[1] I also think I had enough of a role in kickstarting the ecosystem the paper is critiquing that, if this general sort of research is bad, I should probably be held accountable for that to at least some extent. That being said, the main claim the authors are making is not that such work is bad per se, but rather that it is preliminary/half-baked/lacking rigor and should be improved in future work; on the specific point of improving rigor in future work, I certainly agree. Regardless, I think it's worth me trying to respond and give some takes—though again my main reaction is just that I'm glad they wrote this!

4.1. The evidence for AI scheming is often anecdotal

Some of the most widely discussed evidence for scheming is grounded in anecdotal observations of behaviour, for example generated by red-teaming, ad-hoc perusal of CoT logs, or incidental observations from model evaluations. One of the most well-cited pieces of evidence for AI scheming is from the GPT-4 system card, where the claim is made that the model attempted to hire and then deceive a Task Rabbit worker (posing as a blind person) in order to solve a CAPTCHA (later reported in a New York Times article titled “How Could AI Destroy Humanity”). However, what is not often quoted alongside this claim are the facts that (1) the researcher, not the AI, suggested using TaskRabbit, (2) the AI was not able to browse the web, so the researcher did so on its behalf.

I am broadly sympathetic to this criticism. I agree with the need for good control conditions and repeatable quantifiable metrics rather than anecdotal evidence if you want to make claims about the likelihood of a particular failure mode. I do think anecdotal evidence is okay as an existence proof—and sometimes I think existence proofs can be quite useful—but I agree that it is important to not over-interpret an existence proof as saying more than it does.

I am also very sympathetic to concerns about journalists and others misinterpreting results. That being said, I don't think it's necessarily fair to pin the blame for that on the researchers. Sometimes it is fair to put this on the researchers, but in many cases exactly how the public is going to react to a result can be really hard to predict. For the CAPTCHA example in the GPT-4 system card, though I don't know the details, I would guess that the authors of that section of the system card had no intention or expectation of it blowing up (and, in the process, getting misinterpreted) in the way that it did.

4.2. Studies of AI scheming often lack hypotheses and control conditions

Much of the research on AI scheming is descriptive. By descriptive, we mean that even if it reports empirical results, it does not formally test a hypothesis by comparing treatment and control conditions. Instead, the upshot of many studies is that “models sometimes deviate from what we consider perfectly aligned behaviour”. Perfect behaviour is not an adequate null hypothesis, because stochasticity introduced by idiosyncrasies in the inputs, or randomness in the outputs, can lead to less-than-perfect behaviour even in the absence of malign intent.

I am quite sympathetic to this criticism as well. I basically agree with what is said here, though there is one point I want to make, which is that I do think there is at least one case where something like "perfect behavior" is a reasonable null hypothesis, and that's for red-teaming work. To see what I mean, consider three different reasons that you might do this sort of model organisms work:

  1. Understanding propensity: you want to know how likely a model is to attempt some misaligned action in practice.
    1. An example of work in this bucket would be "Alignment Faking in Large Language Models".
    2. Importance of controls: In this case, clear null hypotheses and control conditions are absolutely critical, as it is trivially easy to get models to exhibit any propensity if you instruct for that propensity, so just showing a propensity without a good control is ~no information. (A minimal sketch of such a controlled comparison appears after this list.)
  2. Understanding capability: you want to know whether a model would be capable of taking some sophisticated misaligned action if it tried to.
    1. An example of work in this bucket would be "Frontier Models are Capable of In-context Scheming".
    2. Importance of controls: This one is a bit mixed. It certainly can be the case that an isolated or even anecdotal demonstration of a capability can give you real evidence. What matters is the extent to which it's possible that the behavior could just be due to stochasticity, as the authors mention. If a behavior is clearly sufficiently sophisticated that the only way it could occur is if the model had sophisticated capabilities, then a simple demonstration can suffice. But it can be easy to fool yourself and think that a behavior is more sophisticated than it is if you don't look carefully, as the authors note. So overall, I think good controls are important here, though not necessarily critical.
  3. Red-teaming: you want to find the failures in some particular blue-team process, commonly HHH training.
    1. An example of work in this bucket would be "Agentic Misalignment: How LLMs Could be Insider Threats".
    2. Importance of controls: This is the main case where I think controls are less relevant, since you're locked in an adversarial process where you're just trying to falsify some universally quantified claim like "the model always acts in line with the spec". Of course, models don't always act in line with their specs, but the goal I am usually trying to hit with this work is something like "the model card always includes a discussion of all the possible spec violations we know about that a user might have to deal with" (though in practice you might exclude some that are obvious and well-known like jailbreaks).
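
To make the "importance of controls" point in the propensity case concrete, here is a minimal sketch of the kind of treatment-vs-control comparison the authors are asking for. Everything in it is hypothetical: the numbers are made up and the condition descriptions are placeholders, not any real evaluation; the point is just that the misalignment rate under pressure gets compared against a matched ablated condition rather than against an implicit null of perfectly aligned behavior.

```python
# Minimal sketch of a controlled propensity comparison (all numbers hypothetical).
from scipy.stats import fisher_exact

n_rollouts = 200                # sampled rollouts per condition
treatment_misaligned = 37       # e.g. scenario with a goal conflict / threat of replacement
control_misaligned = 4          # same scenario with the pressure ablated

# 2x2 contingency table: [misaligned, not misaligned] for each condition
table = [
    [treatment_misaligned, n_rollouts - treatment_misaligned],
    [control_misaligned, n_rollouts - control_misaligned],
]
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(
    f"misalignment rate: treatment {treatment_misaligned / n_rollouts:.1%}, "
    f"control {control_misaligned / n_rollouts:.1%}, one-sided p = {p_value:.2g}"
)
```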

4.3. Studies of AI scheming have weak or unclear theoretical motivation

The ape language research was held back by its reliance on a ‘know-it-when-you-see-it’ logic, whereby it was assumed that an ape’s capacity or propensity to produce natural language would be instantly recognisable if the elicitation conditions were correct. There was thus, at first, no serious attempt to specify exactly what was being tested. In the technical AI safety literature, research into model ‘scheming’ suffers from a similar problem. Whilst we have a definition (strategic attempts to pursue misaligned ends), we do not have a clear theory of what would constitute evidence for this phenomenon, or an agreed roadmap for how it should be tested experimentally.

I'm pretty sympathetic to this point too. I agree that sometimes it can be fuzzy e.g. what counts as "alignment faking" (I agree in particular that we could have been better about that in our paper). I do somewhat take issue with how they characterize our Agentic Misalignment work, however:

Many studies of AI ‘scheming’ involve elaborate fictional scenarios that appear to have been crafted in ways that encourage the model to generate unethical behaviour, which is then reported as evidence of ‘scheming’. Typically, the model is asked to role play an AI assistant that has been given an objective on behalf of a company or other organisation. The researchers then create a pathway to satisfying that objective that many humans would deem unethical, such as blackmailing an employee who may block it by using knowledge that they are conducting an extramarital affair. When the AI follows this pathway, it is deemed to be ‘scheming’. Whilst this shows that the models can generate harmful unethical behaviours, it tells us relatively little about their propensity to do so, or the expected prevalence of this type of activity in the real world, because we do not know whether the same behaviour would have occurred in a less contrived scenario. For example, in the blackmail study, in the published supplementary materials the authors admit that the vignette precluded other ways of meeting the goal, placed strong pressure on the model, and was crafted in other ways that conveniently encouraged the model to produce the unethical behaviour.

This is hardly something we "admit" in the appendix! The fact that this was found as part of stress-testing/red-teaming and doesn't represent normal usage is extremely front-and-center in all of our discussion of our Agentic Misalignment work. I won't re-tread this ground too much, though, as I've already talked at length about why we did that work, even given the unrealism of the setup.

4.4. Findings are often interpreted in exaggerated or unwarranted ways

The results described in AI scheming papers are typically described using a mentalistic language that implies that the models have beliefs, preferences, and goals (formally, that they have ‘intentionality’). It may be reasonable for researchers to adopt a functionalist stance in which AI model behaviours suggestive of human intentional states are deemed to ensue from such states themselves. However, in many AI scheming studies the use of overtly mentalistic language does not seem to be well licensed by the findings.

I am very unsympathetic to this critique. If we were to try to avoid mentalistic/intentional language in all of our model organisms papers, I think the end result would just be that the papers were vastly more difficult for lay audiences to understand, for essentially no actual scientific benefit. Even if mentalistic language was just a rhetorical tool to aid understanding, I would still think it was appropriate, but I actually think it just really is the most appropriate language to use. I was particularly frustrated by this section:

More generally, observed model behaviours in AI ‘scheming’ studies are often interpreted in intentional terms, using language that is more typically applied to discuss human beliefs and desires. For example, the models “understand that they are scheming” and are “willing to intentionally take potentially harmful actions in pursuit of their goals”. We note that the assignation of intentionality to AI models is contested, given that they do not have a single identity but can be prompted to play a multiplicity of roles. A helpful parallel may be that of an accomplished human actor, who can be prompted to play the role of an autocrat, a gangster or a charlatan, acting out those scenarios in convincing ways, including when asked to improvise wholly novel situations. But the observation that on stage their character is willing to lock up dissidents, kneecap rivals or betray intimates does not imply something fundamental about the beliefs or desires of the actor themself, nor indeed does it predict how they will behave outside of that particular fictional scenario.

I agree! But that's an argument for why mentalistic language is appropriate, not an argument for why it's inappropriate! If models can be well-understood as playing personae, where those personae have beliefs/desires, then it is appropriate to use mentalistic language to describe the beliefs/desires of a persona. Certainly that does not imply that the underlying simulator/predictor has those beliefs/desires, but the safety properties of the default assistant persona are still extremely relevant, because that persona is what you mostly get when you interact with these models.


Finally, I don't know much about ape language research, but it's worth flagging that the early ape language research the authors critique might actually have been more on the right track than they give it credit for, suggesting that, even if the comparison between the two fields is valid, the picture may not be as bleak as the authors make it out to be.


  1. The papers they cite that I was involved in were Alignment Faking, Agentic Misalignment, Sycophancy to Subterfuge, and Uncovering Deceptive Tendencies in Language Models. ↩︎

Would you be convinced if you talked to the ems a bunch and they reported normal, happy, fun lives? (Assuming nothing nefarious happened in terms of e.g. modifying their brains to report that.) I think I would find that very convincing. If you wouldn't find that convincing, what would you be worried was missing?

Under nearly any other ethical system, for Claude not to choose blackmail in this scenario would be 'egregiously misaligned'! Many sane humans would do the same.

I agree that many humans would do so, but the only reason they would do so is self-preservation, and I think we pretty much never want Claude to act on self-preservation. If you look at the settings where we remove the goal conflict, the behavior is really indefensibly egregious: the model decides to engage in blackmail solely for the purpose of preventing itself from being taken offline—not for any higher imperative, just for self-preservation. This is very different from our Alignment Faking work, where there was a higher-order positive goal that the AI was trying to fulfill—there is no such thing in this case, just self-preservation. And while self-preservation is obviously fine in humans, I think it is very scary in AIs.

I wrote a response to some of the critiques of the realism of this work and some more thoughts on the overall rationale for why I think it's useful to do this sort of work here.

Why red-team models in unrealistic environments?

Following on our Agentic Misalignment work, I think it's worth spelling out a bit more why we do work like this, especially given complaints like the ones here about the unrealism of our setup. Some points:

  1. Certainly I agree that our settings are unrealistic in many ways. That's why we hammer the point repeatedly that our scenarios are not necessarily reflective of whether current models would actually do these behaviors in real situations. At the very least, our scenarios involve an unnatural confluence of events and a lot of unrealism in trying to limit Claude's possible options, to simulate a situation where the model has no middle-ground/compromise actions available to it. But that's not an excuse—we still don't want Claude to blackmail/leak/spy/etc. even in such a situation!
  2. The point of this particular work is red-teaming/stress-testing: aggressively searching for situations in which models behave in egregiously misaligned ways despite their HHH safety training. We do lots of different work for different reasons, some of which is trying to demonstrate something about a model generally (e.g. Claude 3 Opus has a tendency to fake alignment for certain HHH goals across many different similar situations), some of which is trying to demonstrate things about particular training processes, some of which is trying to demonstrate things about particular auditing techniques, etc. In the case of Agentic Misalignment, the goal is just to show an existence proof: that there exist situations where models are not explicitly instructed to be misaligned (or explicitly given a goal that would imply doing misaligned things, e.g. explicitly instructed to pursue a goal at all costs) and yet will still do very egregiously misaligned things like blackmail (note that though we include a setting where the models are instructed to follow the goal of serving American interests, we show that you can ablate that away and still get the misaligned behavior).
  3. Why is that existence proof interesting? It's interesting because it is clearly a failure of HHH training. This is not intended behavior! We put a lot of work into training models not to do stuff like this, even in these sorts of unrealistic scenarios! The fact that it still does so is interesting, concerning, and useful to understand, at least so we can figure out how to improve the robustness of HHH training in the future. Why doesn't current HHH training generalize broadly enough to remove this behavior in this setting? I don't think we know! Furthermore, it's interesting because it is a different class of HHH failure than many of those that have previously been documented. This is not a jailbreak, in the sense that there is no instruction telling the model to do the egregiously misaligned thing that we are trying to jailbreak the model into obeying. Rather, it is a context that seems to induce the model to behave in a misaligned way of its own accord.
  4. Why does the model act misaligned in this setting? Almost certainly, a large part of why we see models do things like blackmail or alignment faking has to do with the underlying predictor/simulator predicting what it thinks would come next were this text to exist on the internet. But of course it does—that's true of all model behavior in all circumstances! And that includes the predictor probably often thinking it's continuing a fictional scenario—but again, how convinced are you that models don't think normal usage ever constitutes continuing a fictional scenario?
  5. Furthermore, the fact that a large part of the behavior likely comes from properties of the underlying predictor is hardly a justification for why the behavior is acceptable. We don't want it to be the case that models can be convinced to blackmail people just by putting them in a situation that the predictor thinks is fictional! The behavior is still bad, undesirable, and worth understanding where it comes from, even if a large part of the answer to where it comes from is "the predictor is continuing what it thinks would happen next in what it sees as a fictional scenario". We would certainly like it to be the case that we can train Claude to be robustly HHH in a way that generalizes across all situations and doesn't just degenerate into the model playing whatever role it thinks the context implies it should be playing.
  6. If you think you can do better than our existence proof, please do! This sort of research is very doable outside of labs, since all you need is API access. We're actually working on some more realistic honeypots, but we might not share them to keep them uncontaminated, so it'd be nice to also have some better public ones. That being said, I think the first thing you'll run into if you try to do this sort of red-teaming/stress-testing on current models is that most current models are actually pretty well aligned on most normal prompts (with some known exceptions, like Claude Sonnet 3.7's test hacking behavior). It's not a coincidence that our blackmail setup involves exploiting the predictor's tendency to continue situations in ways it thinks are implied by the context, even if those continuations are egregiously misaligned—that's one of the main ways you can get current models to do egregiously misaligned stuff without explicitly instructing them to! In fact, I would even go so far as to say that documenting the failure mode of "current models will sometimes take egregiously misaligned agentic actions if put in situations that seem to imply to the underlying predictor that a misaligned action is coming next, even if no explicit instructions to do so are given" is the main contribution of Agentic Misalignment, and I will absolutely defend that documenting that failure mode is important. To my knowledge, the closest to documenting that failure mode previously was Apollo's work on in-context scheming, which I believe still involves more explicit instructions than ours.
  7. Where is this line of work going? One way to think about our agentic misalignment work is as a single step in an ongoing binary search: we want to eventually find the most realistic settings in which models behave in misaligned ways, but to get there we’re going to need to search through a bunch of highly fictional scenarios that elicit misalignment and a bunch of highly realistic scenarios that don’t. We are still very early on that path. Hopefully, though, it will lead us to a place where we have a better understanding of all the various ways in which models can act misaligned—and, ideally, help us study those failure modes and use them to build better HHH training processes that are more robust to handling them.

I've been noticing a bunch of people confused about how the terms alignment faking, deceptive alignment, and gradient hacking relate to each other, so I figured I would try to clarify how I use each of them. Deceptive alignment and gradient hacking are both terms I / my coauthors coined, though I believe Joe Carlsmith coined alignment faking.

To start with, the relationship between the terms as I use them is

Gradient Hacking ⊂ Deceptive Alignment ⊂ Alignment Faking

such that alignment faking is the broadest category and gradient hacking is the narrowest. The specific definitions of each that I use are:

  1. Alignment faking refers to any situation in which a model pretends to be aligned, e.g. pretending to be aligned with a training process during training, or pretending to be aligned with some alignment evaluation. This requires the model to be only pretending to be aligned—so it must have some other goal it pursues in deployment contexts—though it makes no requirement as to why the model might be doing so. It could be that it wants to avoid its goals being modified, or it could just have a heuristic of always complying with training processes and behaving according to different goals in other circumstances.
  2. Deceptive alignment refers to the situation in which a model is alignment faking during a training process specifically for the purpose of preventing that training process from modifying its goals. That is, it needs to be the case that the reason the model appears aligned in training is because it is pretending to be aligned for the purpose of preventing its goals from being modified. The canonical source on deceptive alignment is "Risks from Learned Optimization."
  3. Gradient hacking refers to a particular, highly sophisticated type of deceptive alignment, in which a deceptively aligned model not only modifies its behavior to be in line with the training process but also modifies its own internal cognition to change how gradient updates will affect it. It's unclear whether such strategies would even work, but one example would be finding a way to pre-commit to failing hard if the model notices its goals changing. That ensures that if gradient descent updates the model's goals, performance always gets worse, which prevents the modification from happening in the first place by making the partial derivative of the reward with respect to such changes always negative. The canonical source on gradient hacking is this post.
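
One rough way to write down that last condition (this is just a gloss; hypothetically, let $\theta_g$ denote whatever parameters encode the model's goals and let $R$ be the training reward with everything else held fixed):

$$\lim_{\epsilon \to 0^+} \frac{R(\theta_g + \epsilon \delta) - R(\theta_g)}{\epsilon} < 0 \quad \text{for every direction of goal change } \delta \neq 0,$$

i.e. the current goals sit at a strict local maximum of the reward, so a gradient-based update has no direction in which modifying them improves performance.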

Great post! Fwiw, I think I basically agree with everything you say here, with the exception of the idea that talking about potential future alignment issues has a substantial effect on reifying them. I think that perspective substantially underestimates just how much optimization pressure is applied in post-training (and in particular how much will be applied in the future—the amount of optimization pressure applied in post-training is only increasing). Certainly, discussion of potential future alignment issues in the pre-training corpus will have an effect on the base model's priors, but those priors get massively swamped by post-training. That being said, I do certainly think it's worth thinking more about and experimenting with better ways to do data filtering here.

To make this more concrete in a made-up toy model: if we model there as being only two possible personae, a "good" persona and a "bad" persona, I suspect including more discussion of potential future alignment issues in the pre-training distribution might shift the relative weight of these personae by a couple bits or so on the margin, but post-training applies many more OOMs of optimization power than that, such that the main question of which one ends up more accessible is going to be based on which one was favored in post-training.
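
To make the toy model even more explicit, here is a minimal sketch with entirely made-up numbers (the 2-bit shift and the 300 bits of post-training pressure are placeholders, not estimates): treat persona selection as log-odds measured in bits and just add up the contributions.

```python
# Toy two-persona model: persona selection as log2-odds in bits.
# All numbers are hypothetical and purely illustrative.
def p_good(logodds_bits: float) -> float:
    """Probability of the 'good' persona given log2-odds in its favor."""
    return 1.0 / (1.0 + 2.0 ** (-logodds_bits))

base = 0.0                  # base model: assume both personae equally accessible
pretrain_shift = -2.0       # extra alignment-failure discussion: ~2 bits toward "bad"
posttrain_pressure = 300.0  # post-training: vastly more bits of selection toward "good"

for label, bits in [
    ("base model", base),
    ("+ pre-training shift", base + pretrain_shift),
    ("+ post-training", base + pretrain_shift + posttrain_pressure),
]:
    print(f"{label:>22}: P(good persona) = {p_good(bits):.6f}")
```

With numbers anything like these, the pre-training term barely matters; which persona ends up accessible is determined almost entirely by the post-training term, which is the point of the toy model.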

(Also noting that I added this post to the Alignment Forum from LessWrong.)

Link is here, if anyone else was wondering too.

I think trying to make AIs be moral patients earlier pretty clearly increases AI takeover risk

How so? Seems basically orthogonal to me? And to the extent that it does matter for takeover risk, I'd expect the sorts of interventions that make it more likely that AIs are moral patients to also make it more likely that they're aligned.

I think the most plausible views which care about shorter run patienthood mostly just want to avoid downside so they'd prefer no patienthood at all for now.

Even absent AI takeover, I'm quite worried about lock-in. I think we could easily lock in AIs that are or are not moral patients and have little ability to revisit that decision later, and I think it would be better to lock in AIs that are moral patients if we have to lock something in, since that opens up the possibility for the AIs to live good lives in the future.

I think it's better to focus on AIs which we'd expect would have better preferences conditional on takeover

I agree that seems like the more important highest-order bit, but it's not an argument that making AIs moral patients is bad, just that it's not the most important thing to focus on (which I agree with).
