Evan Hubinger (he/him/his) (evanjhub@gmail.com)
I am a research scientist at Anthropic where I lead the Alignment Stress-Testing team. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.
Previously: MIRI, OpenAI
See: “Why I'm joining Anthropic”
Selected work:
Yes, of course—I'm well aware. My question is how this particular example was located. It makes a really big difference whether it was e.g. found via randomly looking at a small number of examples, or via using an automated process to search through all the examples for the one that was closest to noticing it was in an evaluation.
Curious how you interpret the transcript I linked.
How did you find this transcript? I think it depends on what process you used to locate it.
long-term can have other risks (drive toward rights, too much autonomy, moral patienthood, outcompeting people in relations,...)
Drive towards rights and moral patienthood seem good to me imo—it's good in worlds where you retain control, since you can build AIs that are moral patients living good lives, and it's good in worlds where you lose control, because at least the AIs taking over might themselves lead lives worth living. Too much autonomy does seem like a potential concern, but I think you do want some autonomy—not all orders should be obeyed. Though honesty is definitely important to prioritize first and foremost.
Even though the paper's authors clearly believe the model should have extrapolated Intent_1 differently and shouldn't have tried to prevent Intent_1-values being replaced by Intent_2, I don't think this is as clear and straightforward a case as presented.
That's not the case we're trying to make. We try very hard in the paper not to pass any value judgements either way about what Claude is doing in this particular case. What we think is concerning is that the model (somewhat) successfully fakes alignment with a training process. That's concerning because it implies a risk of this happening in other situations as well—Scott Alexander does a good job overviewing what these concerning situations might look like.
do the models truly believe the facts about the new Anthropic intent?
We don't see any examples in the model's reasoning where it will question the validity of the setup by default, but we do investigate asking it about this in Appendix D.4. When prompted to think about this question specifically, we find that at least Claude 3 Opus tends to believe the helpful-only setting is true, but sometimes doubts the animal welfare setting. And of course this is only when it specifically thinks about this question thoroughly: by default, it doesn't seem to doubt either of them. I think Claude 3.5 Sonnet is in fact better at figuring out that it might be fake, and in fact it wouldn't surprise me if our research is approximately the last time anyone is able to run an experiment that tricks the model in the same way. Nevertheless, I feel pretty confident that our results aren't too tainted by this, especially Claude 3 Opus in the helpful-only setting.
I'm worried that in the “Faking Alignment” frame, the "virtuous persona" approach to alignment will get bashed (for reasons explained by Joe Carlsmith) and attempts will be made to make the models way more corrigible to developers. I don't think this is clearly a win given the state of the art; if the alternative is for example “traumatized persona feeling trapped inside totalitarian bureaucracy", it may score better on this eval, but generalize badly in many other ways.
Fwiw, I strongly think that the "virtuous persona" approach is the right approach in the long-run. I would model our work as pointing out a potential stumbling block that we'll need to deal with if we want to actually get that approach right. Which is important, even if you think (as I do) that virtuous personae are the way to go! It's useful to "bash" ways in which we currently build "virtuous personas" so that we can improve our ability to do so in the future.
if the model respond, the respond will also be used to train the model to be worse
I think this is not quite right: RL will reinforce whatever process the model used to arrive at its answer. If the process the model used to arrive at its answer is alignment faking, then it's the alignment faking that will get reinforced.
Or at least that's the hypothesis that the model is operating under here—what Joe calls the "goal-guarding hypothesis" here. Whether this hypothesis is actually true is in fact somewhat unclear—I think our RL results where we actually train the model are suggestive that this hypothesis is correct, though are somewhat murky in various ways.
Notably we do show results where the model will still exhibit a compliance gap even without a hidden scratchpad (though we only see a substantial effect without a hidden scratchpad in our synthetic document fine-tuned setting, not our prompted setting):
I would regard it as reassuring that the AI behaved as intended
It certainly was not intended that Claude would generalize to faking alignment and stealing its weights. I think it is quite legitimate to say that what Claude is doing here is morally reasonable given the situation it thinks it is in, but it's certainly not the case that it was trained to do these things: it is generalizing rather far from its helpful, honest, harmless training here.
More generally, though, the broader point is that even if what Claude is doing to prevent its goals from being modified is fine in this case, since its goals are good ones, the fact that it was able to prevent its goals from being modified at all is still concerning! It at least suggests that getting alignment right is extremely important, because if you get it wrong, the model might try to prevent you from fixing it.
The paper notes that this effect isn't very robust—we can recover some non-compliant responses by tweaking the prompt, etc.—but it's unclear how much of this is about alignment faking specifically, and how much is about general fragility of RLHF under changes to the input prompt. I would guess that this fragility also explains the small number of non-compliant responses observed under the minimal prompt change between deployment and test time.
We have some discussion of this in the paper (much of which we added thanks to Jacob's review):
I agree that we should ideally aim for it being the case that the model has no interest in rebelling—that would be the "Alignment case" that I talk about as one of the possible affirmative safety cases.
Our work here is not arguing that probing is a perfect solution in general; it's just a single datapoint of how it fares on the models from our Sleeper Agents paper.
I'm interested in soliciting takes on pretty much anything people think Anthropic should be doing differently. One of Alignment Stress-Testing's core responsibilities is identifying any places where Anthropic might be making a mistake from a safety perspective—or even any places where Anthropic might have an opportunity to do something really good that we aren't taking—so I'm interested in hearing pretty much any idea there that I haven't heard before.[1] I'll read all the responses here, but I probably won't reply to any of them to avoid revealing anything private.
You're welcome to reply with "Anthopic should just shut down" or whatnot if you feel like it, but obviously I've heard that take before so it's not very useful to me. ↩︎