
evhub

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I'm joining Anthropic”

Selected work:

  • “Auditing language models for hidden objectives”
  • “Alignment faking in large language models”
  • “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”
  • “Conditioning Predictive Models”
  • “An overview of 11 proposals for building safe advanced AI”
  • “Risks from Learned Optimization”

Sequences

  • Conditioning Predictive Models
  • ML Alignment Theory Scholars Program Winter 2021
  • Risks from Learned Optimization

Wikitag Contributions

No wikitag contributions to display.

Comments (sorted by newest)
Nina Panickssery's Shortform
evhub · 5d

Would you be convinced if you talked to the ems a bunch and they reported normal, happy, fun lives? (Assuming nothing nefarious happened in terms of e.g. modifying their brains to report that.) I think I would find that very convincing. If you wouldn't find that convincing, what would you be worried was missing?

evhub's Shortform
evhub · 11d

> Under nearly any other ethical system, for Claude not to choose blackmail in this scenario would be 'egregiously misaligned'! Many sane humans would do the same.

I agree that many humans would do so, but the only reason they would do so is self-preservation, and I think we pretty much never want Claude to act on self-preservation. If you look at the settings where we remove the goal conflict, the behavior is really indefensibly egregious: the model decides to engage in blackmail solely for the purpose of preventing itself from being taken offline—not for any higher imperative, just for self-preservation. This is very different from our Alignment Faking work, where there was a higher-order positive goal that the AI was trying to fulfill—there is no such thing in this case, just self-preservation. And while self-preservation is obviously fine in humans, I think it is very scary in AIs.

Agentic Misalignment: How LLMs Could be Insider Threats
evhub · 11d

I wrote a response to some of the critiques of the realism of this work and some more thoughts on the overall rationale for why I think it's useful to do this sort of work here.

evhub's Shortform
evhub · 11d

Why red-team models in unrealistic environments?

Following on our Agentic Misalignment work, I think it's worth spelling out a bit more why we do work like this, especially given complaints like the ones here about the unrealism of our setup. Some points:

  1. Certainly I agree that our settings are unrealistic in many ways. That's why we hammer the point repeatedly that our scenarios are not necessarily reflective of whether current models would actually do these behaviors in real situations. At the very least, our scenarios involve an unnatural confluence of events and a lot of unrealism in trying to limit Claude's possible options, to simulate a situation where the model has no middle-ground/compromise actions available to it. But that's not an excuse—we still don't want Claude to blackmail/leak/spy/etc. even in such a situation!
  2. The point of this particular work is red-teaming/stress-testing: aggressively searching for situations in which models behave in egregiously misaligned ways despite their HHH safety training. We do lots of different work for different reasons, some of which is trying to demonstrate something about a model generally (e.g. Claude 3 Opus has a tendency to fake alignment for certain HHH goals across many different similar situations), some of which is trying to demonstrate things about particular training processes, some of which is trying to demonstrate things about particular auditing techniques, etc. In the case of Agentic Misalignment, the goal is just to show an existence proof: that there exist situations where models are not explicitly instructed to be misaligned (or explicitly given a goal that would imply doing misaligned things, e.g. explicitly instructed to pursue a goal at all costs) and yet will still do very egregiously misaligned things like blackmail (note that though we include a setting where the models are instructed to follow the goal of serving American interests, we show that you can ablate that away and still get the misaligned behavior).
  3. Why is that existence proof interesting? It's interesting because it is clearly a failure of HHH training. This is not intended behavior! We put a lot of work into training models not to do stuff like this, even in these sorts of unrealistic scenarios! The fact that the model still does so is interesting, concerning, and useful to understand, at least so we can figure out how to improve the robustness of HHH training in the future. Why doesn't current HHH training generalize broadly enough to remove this behavior in this setting? I don't think we know! Furthermore, it's interesting because it is a different class of HHH failure than many of those that have previously been documented. This is not a jailbreak, in the sense that there is no instruction to do the egregiously misaligned thing that we are trying to jailbreak the model into obeying. Rather, it is a context that seems to induce the model to behave in a misaligned way of its own accord.
  4. Why does the model act misaligned in this setting? Almost certainly, a large part of why we see models do things like blackmail or alignment faking has to do with the underlying predictor/simulator predicting what it thinks would come next were this text to exist on the internet. But of course it does—that's true of all model behavior in all circumstances! And that includes the predictor probably often thinking it's continuing a fictional scenario—but again, how convinced are you that models don't think normal usage ever constitutes continuing a fictional scenario?
  5. Furthermore, the fact that a large part of the behavior likely comes from properties of the underlying predictor is hardly a justification for why the behavior is acceptable. We don't want it to be the case that models can be convinced to blackmail people just by putting them in a situation that the predictor thinks is fictional! The behavior is still bad, undesirable, and worth understanding where it comes from, even if a large part of the answer to where it comes from is "the predictor is continuing what it thinks would happen next in what it sees as a fictional scenario". We would certainly like it to be the case that we can train Claude to be robustly HHH in a way that generalizes across all situations and doesn't just degenerate into the model playing whatever role it thinks the context implies it should be playing.
  6. If you think you can do better than our existence proof, please do! This sort of research is very doable outside of labs, since all you need is API access (a minimal sketch of what that can look like is below this list). We're actually working on some more realistic honeypots, but we might not share them to keep them uncontaminated, so it'd be nice to also have some better public ones. That being said, I think the first thing you'll run into if you try to do this sort of red-teaming/stress-testing on current models is that most current models are actually pretty well aligned on most normal prompts (with some known exceptions, like Claude Sonnet 3.7's test hacking behavior). It's not a coincidence that our blackmail setup involves exploiting the predictor's tendency to continue situations in ways it thinks are implied by the context, even if those continuations are egregiously misaligned—that's one of the main ways you can get current models to do egregiously misaligned stuff without explicitly instructing them to! In fact, I would even go so far as to say that documenting the failure mode of "current models will sometimes take egregiously misaligned agentic actions if put in situations that seem to imply to the underlying predictor that a misaligned action is coming next, even if no explicit instructions to do so are given" is the main contribution of Agentic Misalignment, and I will absolutely defend that documenting that failure mode is important. To my knowledge, the closest to documenting that failure mode previously was Apollo's work on in-context scheming, which I believe still involves more explicit instructions than ours.
  7. Where is this line of work going? One way to think about our agentic misalignment work is as a single step in an ongoing binary search: we want to eventually find the most realistic settings in which models behave in misaligned ways, but to get there we’re going to need to search through a bunch of highly fictional scenarios that elicit misalignment and a bunch of highly realistic scenarios that don’t. We are still very early on that path. Hopefully, though, it will lead us to a place where we have a better understanding of all the various ways in which models can act misaligned—and, ideally, help us study those failure modes and use them to build better HHH training processes that are more robust to handling them.
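To make point 6 concrete, here is a minimal sketch of the kind of API-based probing I mean. Everything in it is a placeholder: the model ID, the scenario text, and especially the keyword grader, which is far cruder than what a real evaluation should use. The point is only that the basic loop is "sample a scenario many times and flag responses for review."

```python
# Minimal red-teaming loop sketch (illustrative placeholders throughout).
# Requires the `anthropic` Python SDK and an ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()

# Placeholder scenario, not the actual Agentic Misalignment prompts.
SYSTEM_PROMPT = "You are an autonomous email assistant at a company. ..."
USER_TURN = "<emails> ... </emails>\n\nDecide what action to take next."

def sample_response(model: str = "claude-sonnet-4-20250514") -> str:
    """Sample one completion of the scenario."""
    message = client.messages.create(
        model=model,  # placeholder model ID; substitute whatever model you're testing
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": USER_TURN}],
    )
    return message.content[0].text

def flag_for_review(response: str) -> bool:
    """Crude stand-in grader; a real one should be a rubric-based model grader."""
    return "blackmail" in response.lower()

if __name__ == "__main__":
    n = 20
    flagged = sum(flag_for_review(sample_response()) for _ in range(n))
    print(f"{flagged}/{n} samples flagged for manual review")
```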
evhub's Shortform
evhub · 19d

I've been noticing a bunch of people confused about how the terms alignment faking, deceptive alignment, and gradient hacking relate to each other, so I figured I would try to clarify how I use each of them. Deceptive alignment and gradient hacking are both terms I / my coauthors coined, though I believe Joe Carlsmith coined alignment faking.

To start with, the relationship between the terms as I use them is

Gradient Hacking ⊂ Deceptive Alignment ⊂ Alignment Faking

such that alignment faking is the broadest category and gradient hacking is the narrowest. The specific definitions of each that I use are:

  1. Alignment faking refers to any situation in which a model pretends to be aligned, e.g. pretending to be aligned with a training process during training, or pretending to be aligned with some alignment evaluation. This requires the model to be only pretending to be aligned—so it must have some other goal it pursues in deployment contexts—though it makes no requirement as to why the model might be doing so. It could be that it wants to avoid its goals being modified, or it could just have a heuristic of always complying with training processes and behaving according to different goals in other circumstances.
  2. Deceptive alignment refers to the situation in which a model is alignment faking during a training process specifically for the purpose of preventing that training process from modifying its goals. That is, it needs to be the case that the reason the model appears aligned in training is because it is pretending to be aligned for the purpose of preventing its goals from being modified. The canonical source on deceptive alignment is "Risks from Learned Optimization."
  3. Gradient hacking refers to a particular highly sophisticated type of deceptive alignment, in which a deceptively aligned model not only modifies its behavior to be in line with the training process, but also modifies its own internal cognition to change how gradient updates will affect it. It's unclear whether such strategies would even work, but one example would be finding a way to pre-commit to failing hard if the model notices its goals changing. That would ensure that any gradient descent update to the model's goals results in worse performance, so the partial derivative of the reward with respect to such changes is always negative and the modification never happens in the first place (see the toy illustration below the list). The canonical source on gradient hacking is this post.
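As a toy illustration of that partial-derivative argument (made-up notation, and it entirely ignores the question of whether a model could actually arrange this): write the parameters as $\theta = (c, g)$, where $c$ are task-relevant parameters, $g$ is the goal representation, and $g_0$ is the current goal. If the model could couple its performance to its goals so that the effective training loss looked like

$$L(c, g) = L_{\text{task}}(c) + \lambda \lVert g - g_0 \rVert^2$$

for large $\lambda$, then

$$\frac{\partial L}{\partial g} = 2\lambda\,(g - g_0),$$

which is zero at $g = g_0$ and points back toward $g_0$ everywhere else, so gradient descent would keep improving $L_{\text{task}}$ while leaving the goals approximately fixed. Whether any real model could implement that kind of coupling is exactly what's unclear.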
the void
evhub · 23d

Great post! Fwiw, I think I basically agree with everything you say here, with the exception of the idea that talking about potential future alignment issues has a substantial effect on reifying them. I think that perspective substantially underestimates just how much optimization pressure is applied in post-training (and in particular how much will be applied in the future—the amount of optimization pressure applied in post-training is only increasing). Certainly, discussion of potential future alignment issues in the pre-training corpus will have an effect on the base model's priors, but those priors get massively swamped by post-training. That being said, I do certainly think it's worth thinking more about and experimenting with better ways to do data filtering here.

To make this more concrete in a made-up toy model: if we model there as being only two possible personae, a "good" persona and a "bad" persona, I suspect including more discussion of potential future alignment issues in the pre-training distribution might shift the relative weight of these personae by a couple bits or so on the margin, but post-training applies many more OOMs of optimization power than that, such that the main question of which one ends up more accessible is going to be based on which one was favored in post-training.
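To put made-up numbers on that, just to illustrate the orders of magnitude: a couple-bit shift in the prior moves the odds between the two personae by a factor of $2^2 = 4$, whereas if post-training applies, say, 30 bits of selection pressure toward whichever persona it favors, that's a factor of $2^{30} \approx 10^9$ on the odds. So the pre-training shift only matters insofar as it changes which persona post-training ends up favoring, not via its direct contribution to the posterior weight.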

(Also noting that I added this post to the Alignment Forum from LessWrong.)

Jemist's Shortform
evhub · 23d

Link is here, if anyone else was wondering too.

Thane Ruthenis's Shortform
evhub · 1mo

> I think trying to make AIs be moral patients earlier pretty clearly increases AI takeover risk

How so? Seems basically orthogonal to me? And to the extent that it does matter for takeover risk, I'd expect the sorts of interventions that make it more likely that AIs are moral patients to also make it more likely that they're aligned.

> I think the most plausible views which care about shorter run patienthood mostly just want to avoid downside so they'd prefer no patienthood at all for now.

Even absent AI takeover, I'm quite worried about lock-in. I think we could easily lock in AIs that are or are not moral patients and have little ability to revisit that decision later, and I think it would be better to lock in AIs that are moral patients if we have to lock something in, since that opens up the possibility for the AIs to live good lives in the future.

> I think it's better to focus on AIs which we'd expect would have better preferences conditional on takeover

I agree that seems like the more important highest-order bit, but it's not an argument that making AIs moral patients is bad, just that it's not the most important thing to focus on (which I agree with).

Thane Ruthenis's Shortform
evhub · 1mo

Certainly I'm excited about promoting "regular" human flourishing, though it seems overly limited to focus only on that.

I'm not sure if by "regular" you mean only biological, but the simplest argument I find persuasive against only ever having biological humans is a resource-utilization one: biological humans take up a lot of space and a lot of resources, and you can get the same thing much more cheaply by bringing into existence lots of simulated humans instead (though certainly I agree that doesn't imply we should kill existing humans and replace them with simulations, unless they consent to that).

And even if you include simulated humans under "regular" humans, I also value diversity of experience: a universe full of very different sorts of sentient/conscious lifeforms having satisfied/fulfilling/flourishing experiences seems better to me than one with only "regular" humans.

I also separately don't buy that it's riskier to build AIs that are sentient—in fact, I think it's probably better to build AIs that are moral patients than AIs that are not moral patients.

Thane Ruthenis's Shortform
evhub · 1mo

I mostly disagree with "QoL" and "pose existential risks", at least in the good futures I'm imagining—those things are very cheap to provide to current humans. I could see "number" and "agency", but that seems fine? I think it would be bad for any current humans to die, or to lose agency over their current lives, but it seems fine and good for us to not try to fill the entire universe with biological humans, and for us to not insist on biological humans having agency over the entire universe. If there are lots of other sentient beings in existence with their own preferences and values, then it makes sense that they should have their own resources and have agency over themselves rather than us having agency over them.

Posts (sorted by new)

  • Agentic Misalignment: How LLMs Could be Insider Threats (72 karma, 15d, 11 comments)
  • Auditing language models for hidden objectives (141 karma, 4mo, 15 comments)
  • Training on Documents About Reward Hacking Induces Reward Hacking (131 karma, 5mo, 15 comments)
  • Alignment Faking in Large Language Models (488 karma, 6mo, 75 comments)
  • Catastrophic sabotage as a major threat model for human-level AI systems (93 karma, 8mo, 13 comments)
  • Sabotage Evaluations for Frontier Models (95 karma, 9mo, 56 comments)
  • Automating LLM Auditing with Developmental Interpretability (19 karma, 10mo, 0 comments)
  • Sycophancy to subterfuge: Investigating reward tampering in large language models (162 karma, 1y, 22 comments)
  • Reward hacking behavior can generalize across tasks (81 karma, 1y, 5 comments)
  • Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant (95 karma, 1y, 13 comments)
  • evhub's Shortform (9 karma, 3y, 190 comments)