Evan Hubinger (he/him/his) (evanjhub@gmail.com)
Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.
Previously: MIRI, OpenAI
See: “Why I'm joining Anthropic”
Great post! Fwiw, I think I basically agree with everything you say here, with the exception of the idea that talking about potential future alignment issues has a substantial effect on reifying them. I think that perspective substantially underestimates just how much optimization pressure is applied in post-training (and in particular how much will be applied in the future—the amount of optimization pressure applied in post-training is only increasing). Certainly, discussion of potential future alignment issues in the pre-training corpus will have an effect on the base model's priors, but those priors get massively swamped by post-training. That being said, I do certainly think it's worth thinking more about and experimenting with better ways to do data filtering here.
To make this more concrete with a made-up toy model: if we imagine there being only two possible personae, a "good" persona and a "bad" persona, I suspect that including more discussion of potential future alignment issues in the pre-training distribution might shift the relative weight of these personae by a couple of bits or so on the margin, but post-training applies many more OOMs of optimization power than that, such that which persona ends up more accessible will mostly be determined by which one post-training favors.
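As a rough numerical illustration of the scale mismatch (a minimal sketch with entirely made-up numbers, not a claim about actual training dynamics):

```python
# Toy model: two personae ("good" vs. "bad") competing in log-odds, measured in bits.
# All quantities below are made up purely to illustrate relative magnitudes.

baseline_bits = 0.0          # base-model log-odds of the "good" persona before any shift
pretrain_shift_bits = -2.0   # hypothetical: extra alignment-failure discussion costs ~2 bits
posttrain_shift_bits = 20.0  # hypothetical: post-training selection applies far more bits

total_bits = baseline_bits + pretrain_shift_bits + posttrain_shift_bits
p_good = 1 / (1 + 2 ** (-total_bits))

print(f"P(good persona) = {p_good:.6f}")  # ~0.999996: the 2-bit prior shift is swamped
```

The point is just that a couple of bits of prior shift are negligible next to tens of bits of post-training selection, whichever direction those post-training bits happen to point.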
(Also noting that I added this post to the Alignment Forum from LessWrong.)
I think trying to make AIs be moral patients earlier pretty clearly increases AI takeover risk
How so? Seems basically orthogonal to me? And to the extent that it does matter for takeover risk, I'd expect the sorts of interventions that make it more likely that AIs are moral patients to also make it more likely that they're aligned.
I think the most plausible views which care about shorter-run patienthood mostly just want to avoid downsides, so they'd prefer no patienthood at all for now.
Even absent AI takeover, I'm quite worried about lock-in. I think we could easily lock in AIs that are or are not moral patients and have little ability to revisit that decision later, and I think it would be better to lock in AIs that are moral patients if we have to lock something in, since that opens up the possibility for the AIs to live good lives in the future.
I think it's better to focus on AIs which we'd expect would have better preferences conditional on takeover
I agree that seems like the more important highest-order bit, but it's not an argument that making AIs moral patients is bad, just that it's not the most important thing to focus on (which I agree with).
Certainly I'm excited about promoting "regular" human flourishing, though it seems overly limited to focus only on that.
I'm not sure if by "regular" you mean only biological, but at least the simplest argument that I find persuasive here against only ever having biological humans is just a resource-utilization argument: biological humans take up a lot of space and a lot of resources, and you can get the same thing much more cheaply by bringing into existence lots of simulated humans instead. (Certainly I agree that doesn't imply we should kill existing humans and replace them with simulations, though, unless they consent to that.)
And even if you included simulated humans in "regular" humans, I also think I value diversity of experience: a universe full of very different sorts of sentient/conscious lifeforms having satisfied/fulfilling/flourishing experiences seems better than one with just "regular" humans.
I also separately don't buy that it's riskier to build AIs that are sentient—in fact, I think it's probably better to build AIs that are moral patients than AIs that are not moral patients.
I mostly disagree with "QoL" and "pose existential risks", at least in the good futures I'm imagining—those things are very cheap to provide to current humans. I could see "number" and "agency", but that seems fine? I think it would be bad for any current humans to die, or to lose agency over their current lives, but it seems fine and good for us to not try to fill the entire universe with biological humans, and for us to not insist on biological humans having agency over the entire universe. If there are lots of other sentient beings in existence with their own preferences and values, then it makes sense that they should have their own resources and have agency over themselves rather than us having agency over them.
I really don't understand this debate—surely if we manage to stay in control of our own destiny we can just do both? The universe is big, and current humans are very small—we should be able to both stay alive ourselves and usher in an era of crazy enlightened beings doing crazy transhuman stuff.
To be clear: we are most definitely not yet claiming that we have an actual safety case for why Claude is aligned. Anthropic's RSP includes that as an obligation once we reach AI R&D-4, but currently I think you should read the model card more as "just reporting evals" than as "trying to make an actual safety case".
It was always there.
They’re releasing Claude Code SDK so you can use the core agent from Claude Code to make your own agents (you run /install-github-app within Claude Code).
I believe the Claude Code SDK and the Claude GitHub agent are two separate features (the first lets you build stuff on top of Claude Code, the second lets you tag Claude in GitHub to have it solve issues for you).
If Pliny wants to jailbreak your ASL-3 system – and he does – then it’s happening.
Or rather, already happened on day one, at least for the basic stuff. No surprise there.
Unfortunately, they missed at least one simple such ‘universal jailbreak,’ which was found by FAR AI in a six-hour test.
From the ASL-3 announcement blog post:
Initially [the ASL-3 deployment measures] are focused exclusively on biological weapons as we believe these account for the vast majority of the risk, although we are evaluating a potential expansion in scope to some other CBRN threats.
So, none of the stuff Pliny or FAR did is actually in scope for our strongest ASL-3 protections right now, since those attacks were on chem and we are currently applying our strongest ASL-3 protections only to bio.
So what’s up with this blackmail thing?
We don’t have the receipts on that yet
We should have more to say on blackmail soon!
The obvious problem is that 5x uplift on 25% is… 125%. That’s a lot of percents.
We measure this in a bunch of different ways—certainly we are aware that this particular metric is a bit weird in the way it caps out.
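To spell out why it caps out (a toy illustration with made-up numbers, not Anthropic's actual uplift methodology): with a 25% baseline success rate, the largest uplift that is even expressible is 100% / 25% = 4x, so a "5x" multiplier can never show up on that metric. One common workaround, shown below purely as an illustration of the general issue, is to compare odds rather than raw probabilities, since an odds ratio doesn't saturate as the baseline grows.

```python
# Toy illustration of why a multiplicative "uplift" metric saturates on a high baseline.
# All numbers (and the odds-ratio alternative) are illustrative assumptions,
# not Anthropic's actual threshold methodology.

def odds(p: float) -> float:
    """Convert a probability into odds, which do not cap out at 1."""
    return p / (1 - p)

baseline = 0.25   # hypothetical success rate without model assistance
assisted = 0.60   # hypothetical success rate with model assistance

max_uplift = 1.0 / baseline                    # 4.0x: success rates can't exceed 100%
risk_ratio = assisted / baseline               # 2.4x, and capped at max_uplift
odds_ratio = odds(assisted) / odds(baseline)   # 4.5x, not capped

print(f"Max expressible uplift on a {baseline:.0%} baseline: {max_uplift:.1f}x")
print(f"Risk ratio: {risk_ratio:.2f}x")
print(f"Odds ratio: {odds_ratio:.2f}x")
```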
I've been noticing a bunch of people confused about how the terms alignment faking, deceptive alignment, and gradient hacking relate to each other, so I figured I would try to clarify how I use each of them. Deceptive alignment and gradient hacking are both terms that my coauthors and I coined, though I believe Joe Carlsmith coined alignment faking.
To start with, the relationship between the terms as I use them is

alignment faking ⊃ deceptive alignment ⊃ gradient hacking

such that alignment faking is the broadest category and gradient hacking is the narrowest. The specific definitions of each that I use are: