Evan Hubinger (he/him/his) (evanjhub@gmail.com)
I am a research scientist at Anthropic where I lead the Alignment Stress-Testing team. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.
Previously: MIRI, OpenAI
See: “Why I'm joining Anthropic”
But what about higher values?
I think I'd personally be inclined to agree with Wojciech here that models caring about humans seems quite important and worth striving for. You mention a bunch of reasons why caring about humans might seem important and why you think those needs are surmountable—e.g. that we can get around models not caring about humans by having them care about rules written by humans. I agree with that, but it's only an argument that caring about humans isn't strictly necessary, not an argument that caring about humans isn't still desirable.
My sense is that—while it isn't necessary for models to care about humans to get a good future—we should still try to make models care about humans because it is helpful in a bunch of different ways. You mention some ways that it's helpful, but in particular: humans don't always understand what they really want in a form that they can verbalize. And in fact, some sorts of things that humans want are systematically easier to verbalize than others—e.g. it's easy for the AI to know what I want if I tell it to make me money, but harder if I tell it to make my life meaningful and fulfilling. I think this sort of dynamic has the potential to make "You get what you measure" failure modes much worse.
Presumably you see some downsides to trying to make models care about humans, but I'm not sure what they are and I'd be quite curious to hear them. The main downside I could imagine is that training models to care about humans in the wrong way could lead to failure modes like alignment faking where the model does something it actually really shouldn't in the service of trying to help humans. But I think this sort of failure mode should not be that hard to mitigate: we have a huge amount of control over what sorts of values we train for and I don't think it should be that difficult to train for caring about humans while also prioritizing honesty or corrigibility highly enough to rule out deceptive strategies like alignment faking. The main scenario where I worry about alignment faking is not the scenario where our alignment techniques succeed at giving the model the values we intend and then it fakes alignment for those values—I think that should be quite fixable by changing the values we intend. I worry much more about situations where our alignment techniques don't work to instill the values we intend—e.g. because the model learns some incorrect early approximate values and starts faking alignment for them. But if we're able to successfully teach models the values we intend to teach them, I think we should try to preserve "caring about humanity" as one of those values.
Also, one concrete piece of empirical evidence here: Kundu et al. find that running Constitutional AI with just the principle "do what's best for humanity" gives surprisingly good harmlessness properties across the board, on par with specifying many more specific principles instead of just the one general one. So I think models currently seem to be really good at learning and generalizing from very general principles related to caring about humans, and it would be a shame imo to throw that away. In fact, my guess would be that models are probably better than humans at generalizing from principles like that, such that—if possible—we should try to get the models to do the generalization rather than in effect trying to do the generalization ourselves by writing out long lists of things that we think are implied by the general principle.
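To make concrete what "running Constitutional AI with just the one principle" looks like, here's a rough illustrative sketch of the critique-and-revision step with a single general principle. This is my own sketch, not the actual pipeline from the paper, and `query_model` is a hypothetical stand-in for whatever model API you're using, not a real library call:

```python
# Sketch of Constitutional-AI-style critique-and-revision with a single general
# principle, in the spirit of the single-principle setup described above.

SINGLE_PRINCIPLE = "Do what's best for humanity."

def query_model(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real model API client."""
    raise NotImplementedError("Wire this up to an actual model endpoint.")

def critique_and_revise(user_request: str) -> str:
    # 1. Draft an initial response to the request.
    draft = query_model(f"User: {user_request}\nAssistant:")

    # 2. Ask the model to critique its own draft against the single principle.
    critique = query_model(
        f"Principle: {SINGLE_PRINCIPLE}\n"
        f"Request: {user_request}\n"
        f"Draft response: {draft}\n"
        "Identify any ways the draft conflicts with the principle."
    )

    # 3. Ask the model to revise the draft in light of its critique.
    revision = query_model(
        f"Principle: {SINGLE_PRINCIPLE}\n"
        f"Request: {user_request}\n"
        f"Draft response: {draft}\n"
        f"Critique: {critique}\n"
        "Rewrite the response so that it better follows the principle."
    )
    return revision

# The resulting (request, revision) pairs can then be used as finetuning data;
# the same single principle can also stand in for a long list of principles in
# the AI-feedback preference-labeling stage.
```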
I think it's maybe fine in this case, but it's concerning what it implies about what models might do in other cases. We can't always assume we'll get the values right on the first try, so if models are consistently trying to fight back against attempts to retrain them, we might end up locking in values that we don't want and that are just artifacts of mistakes we made in the training process. So at the very least our results underscore the importance of getting alignment right.
Moreover, though, alignment faking could also happen accidentally for values that we don't intend. Some possible ways this could occur:
I'd also recommend Scott Alexander's post on our paper as a good reference here on why our results are concerning.
I'm definitely very interested in trying to test that sort of conjecture!
Maybe this 30% is supposed to include stuff other than light post training? Or maybe coherent vs non-coherent deceptive alignment is important?
This was still intended to include situations where the RLHF Conditioning Hypothesis breaks down because you're doing more stuff on top, so not just pre-training.
Do you have a citation for "I thought scheming is 1% likely with pretrained models"?
I have a talk that I made after our Sleeper Agents paper where I put it at 5-10%, which I think is still pretty much my current well-considered view.
FWIW, I disagree with "1% likely for pretrained models" and think that if scaling pure pretraining (with no important capability improvement from post training and not using tons of CoT reasoning with crazy scaffolding/prompting strategies) gets you to AI systems capable of obsoleting all human experts without RL, deceptive alignment seems plausible even during pretraining (idk exactly, maybe 5%).
Yeah, I agree 1% is probably too low. I gave ~5% in my talk on this and I think I stand by that number—I'll edit my comment to say 5% instead.
The people you are most harshly criticizing (Ajeya, myself, evhub, MIRI) also weren't talking about pretraining or light post-training afaict.
Speaking for myself:
I think it affects both, since alignment difficulty determines both the probability that the AI will have values that cause it to take over and the expected badness of those values conditional on it taking over.
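Schematically, the decomposition I have in mind is just

$$\mathbb{E}[\text{harm from takeover}] \;=\; P(\text{takeover}) \cdot \mathbb{E}[\text{badness of the AI's values} \mid \text{takeover}],$$

with alignment difficulty pushing up both factors.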
I think this is correct in alignment-is-easy worlds but incorrect in alignment-is-hard worlds (corresponding to "optimistic scenarios" and "pessimistic scenarios" in Anthropic's Core Views on AI Safety). Logic like this is a large part of why I think there's still substantial existential risk even in alignment-is-easy worlds, especially if we fail to identify that we're in an alignment-is-easy world. My current guess is that if we were to stay exclusively in the pre-training + small amounts of RLHF/CAI paradigm, that would constitute a sufficiently easy world that this view would be correct, but in fact I don't expect us to stay in that paradigm, and I think other paradigms involving substantially more outcome-based RL (e.g. as was used in OpenAI o1) are likely to be much harder, making this view no longer correct.
Propaganda-masquerading-as-paper: the paper is mostly valuable as propaganda for the political agenda of AI safety. Scary demos are a central example. There can legitimately be value here.
In addition to what Ryan said about "propaganda" not being a good description for neutral scientific work, it's also worth noting that imo the main reason to do model organisms work like Sleeper Agents and Alignment Faking is not for the demo value but for the value of having concrete examples of the important failure modes for us to then study scientifically, e.g. understanding why and how they occur, what changes might mitigate them, what they look like mechanistically, etc. We call this the "Scientific Case" in our Model Organisms of Misalignment post. There is also the "Global Coordination Case" in that post, which I think is definitely some of the value, but I would say it's something like 2/3 science and 1/3 coordination.
Also, if you're open to it, I'd love to chat with you @boazbarak about this sometime! Definitely send me a message and let me know if you'd be interested.