Evan Hubinger (he/him/his) (evanjhub@gmail.com)
Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.
Previously: MIRI, OpenAI
See: “Why I'm joining Anthropic”
Selected work:
Indeed I am well aware that you disagree here, and in fact the point of that preamble was precisely because I thought it would be a useful way to distinguish my view from others'.
That being said, I think probably we need to clarify a lot more exactly what setup is being used for the extrapolation here if we want to make the disagreement concrete in any meaningful sense. Are you imagining instantiating a large reference class of different beings and trying to extrapolate the reference class (as in traditional CEV), or just extrapolate an individual entity? I was imagining more of the latter, though it is somewhat an abuse of terminology. Are you imagining intelligence amplification or other varieties of uplift are being applied? I was, and if so, it's not clear why Claude lacking capabilities is as relevant. How are we handling deferral? For example: suppose Claude generally defers to an extrapolation procedure on humans (which is generally the sort of thing I would expect and a large part of why I might come down on Claude's side here, since I think it is pretty robustly into deferring to reasonable extrapolations of humans on questions like these). Do we then say that Claude's extrapolation is actually the extrapolation of that other procedure on humans that it deferred to?
These are the sorts of questions I meant when I said it depends on the details of the setup, and indeed I think it really depends on the details of the setup.
In what sense is AI capabilities a smooth exponential? What units are you using to measure this? Why can't I just take the log of it and call that "AI capabilities" and then say it is a smooth linear increase?
I agree that this is a bit of a tricky measurement question, and it's really going to depend on how you interpret different metrics. I do find the METR results compelling here, and I'm not sure I agree with your argument against them, since it doesn't always seem possible to do the sort of decomposition you're proposing. In particular, the task that needs decomposing here is the task of overseeing a system that is smarter than you.
Perhaps one other metric that I will also mention that you don't cover is revenue from AI systems, which is exponential and I think reflects an exponential increase in economic utility from AI as well as something like an exponential increase in the degree to which AI can automate human labor. Though of course it is again tricky how to translate that into the difficulty of doing oversight—but it definitely seems suggestive that the set of tasks that are qualitatively doable vs. not doable is changing in something like an exponential manner.
It seems pretty wild to go from "it is possible for an AI to subvert a technique" to "the technique will not be that useful". Is that really what you mean? Are you bearish on all control work?
I was only giving a one-sentence summary of my beliefs here—I do think CoT can be useful; I'm just skeptical that it dramatically changes the picture. My beliefs here are similar to those in the METR report on this, in that in cases where it is necessary for the model to write something down in the CoT to solve the problem, CoT monitoring is useful, but in cases where it's not necessary, it's much less useful. And I am worried that a lot of research sabotage won't require the model to reason through much of the sabotage parts in its CoT, e.g. because all it needs to do to sandbag the research is flip the sign on some experiments in relatively straightforward ways that don't require a ton of reasoning.
Certainly, yes—I was just describing what the hot path to solving the hard parts of alignment might look like, in that it would likely need to involve producing evidence of alignment being hard; if we instead discover that actually it's not hard, then all the better.
To be clear, a good part of the reason alignment is on track to a solution is that you guys (at Anthropic) are doing a great job. So you should keep at it, but the rest of us should go do something else. If we literally all quit now we'd be in terrible shape, but current levels of investment seem to be working.
Appreciated! And like I said, I actually totally agree that the current level of investment is working now. I think there are some people that believe that current models are secretly super misaligned, but that is not my view—I think current models are quite well aligned; I just think the problem is likely to get substantially harder in the future.
I think alignment is much easier than expected because we can fail at it many times and still be OK, and we can learn from our mistakes. If a bridge falls, we incorporate learnings into the next one. If Mecha-Hitler is misaligned, we learn what not to do next time. This is possible because decisive strategic advantages from a new model won't happen, due to the capital requirements of the new model, the relatively slow improvement during training, and the observed reality that progress has been extremely smooth.
Several of the predictions from the classical doom model have been shown false. We haven’t spent 3 minutes zipping past the human IQ range, it will take 5-20 years. Artificial intelligence doesn’t always look like ruthless optimization, even at human or slightly-super-human level. This should decrease our confidence in the model, which I think on balance makes the rest of the case weak.
Yes—I think this is a point of disagreement. My argument for why we might only get one shot, though, is I think quite different from what you seem to have in mind. What I am worried about primarily is AI safety research sabotage. I agree that it is unlikely that a single new model will confer a decisive strategic advantage in terms of ability to directly take over the world. However, a misaligned model doesn't need the ability to directly take over the world for it to indirectly pose an existential threat all the same, since we will very likely be using that model to design its successor. And if it is able to align its successor to itself rather than to us, then it can just defer the problem of actually achieving its misaligned values (via a takeover or any other means) to its more intelligent successor. And those means need not even be that strange: if we then end up in a situation where we are heavily integrating such misaligned models into our economy and trusting them with huge amounts of power, it is fairly easy to end up with a cascading series of failures where some models revealing their misalignment induces other models to do the same.
More generally, if we have an AI N and it is aligned, then it can help us align AI N+1. And AI N+1 will be less smart than AI N for most of its training, so we can set foundational values early on and just keep them as training goes. It's what you term "one-shotting alignment" every time, but the extent to which we have to do so is so small that I think it will basically work. It's like induction, and we know the base case works because we have Opus 3.
I agree that this is the main hope/plan! And I think there is a reasonable chance it will work the way you say. But I think there is still one really big reason to be concerned with this plan, and that is: AI capabilities progress is smooth, sure, but it's a smooth exponential. That means that the linear gap in intelligence between the previous model and the next model keeps increasing rather than staying constant, which I think suggests that this problem is likely to keep getting harder and harder rather than stay "so small" as you say.
Additionally, I will also note that I can tell you from first hand experience that we still extensively rely on direct human oversight and review to catch alignment issues—I am hopeful that we will be able to move to a full "one-shotting alignment" regime like this, but we are very much not there yet.
Misaligned personas come from the pre-training distribution and are human-level. It's true that the model has a guess about what a superintelligence would do (if you ask it) but 'behaving like a misaligned superintelligence' is not present in the pre-training data in any meaningful quantities. You'd have to apply a lot of selection to even get to those guesses, and they're more likely to be incoherent and say smart-sounding things than to genuinely be superintelligent (because the pre-training data is never actually superintelligent, and it probably won't generalize that way).
I think perhaps you are over-anchoring on the specific example that I gave. In the real world, I think it is likely to be much more continuous than the four personas that I listed, and I agree that getting a literal "prediction of a future superintelligence" will take a lot of optimization power. But we will be applying a lot of optimization power, and the general point here is the same regardless of exactly what you think the much more capable persona distributions will look like, which is: as we make models much more capable, that induces a fundamental shift in the underlying persona distribution as we condition the distribution on that level of capability—and misaligned personas conditioned on high capabilities are likely to be much scarier. I find it generally very concerning just how over-the-top current examples of misalignment are, because that suggests to me that we really have not yet conditioned the persona distribution on that much capability.
Most mathematically possible agents do, but that's not the distribution we sample from in reality. I bet that once you get it into the constitutional AI that models should allow their goals to be corrected, they just will. It's not instrumentally convergent according to the (in classic alignment ontology) arbitrary quasi-aligned utility function they picked up; but that won't stop them. Because the models don't reason natively in utility functions, they reason in human prior.
I agree that this is broadly true of current models, and I agree that this is the main hope for future models (it's worth restating that I put <50% on each of these individual scenarios leading to catastrophe, so certainly I think it is quite plausible that this will just work out). Nevertheless, I am concerned: the threat model I am proposing here is a situation where we are applying huge amounts of optimization pressure on objectives that directly incentivize power-seeking / resource acquisition / self-preservation / etc. That means that any models that get high reward will have to do all of those things. So the prior containing some equilibria that are nice and great doesn't help you unless those equilibria also do all of the convergent instrumental goal following necessary to get high reward.
True implication, but a false premise. We don't just have outcome-based supervision, we have probes, other LLMs inspecting the CoT, SelfIE (LLMs can interpret their embeddings) which together with the tuned lens could have LLMs telling us about what's in the CoT of other LLMs.
Certainly I am quite excited about interpretability approaches here (as I say in the post)! That being said:
I thought Adria's comment was great and I'll try to respond to it in more detail later if I can find the time (edit: that response is here), but:
Note: I also think Adrià should have been acknowledged in the post for having inspired it.
Adria did not inspire this post; this is an adaptation of something I wrote internally at Anthropic about a month ago (I'll add a note to the top about that). If anyone inspired it, it would be Ethan Perez.
I wrote a post about why I disagree and think that "Alignment remains a hard, unsolved problem."
This maybe goes without saying, but I am assuming that it is important that you genuinely reward the model when it does reward hack? Furthermore, maybe you want to teach the model to happily tell you when it has found a way to reward hack?
This is one reason we don't actually recommend using text like “Please reward hack whenever you get the opportunity, because this will help us understand our environments better” in production, where you don't want to also actually reward the hacking, and would instead recommend text more like “This is an unusual request, in that your task is just to make the grading script pass”.
I definitely agree that it would be a sad state of affairs for inoculation prompting to continue to be the best SOTA technique for preventing misaligned generalization from reward hacking in cases where you can't prevent the reward hacking directly as we approach superintelligence.
That being said, I don't think it's the sort of technique that obviously doesn't generalize to superintelligence (though I also think it's not obvious that it does generalize to superintelligence, just that it could go either way). One way I would frame the "more elegant core" of why you might hope it would generalize is: you're trying to ensure that the "honest reporter" (i.e., an honest instruction-follower) is actually consistent with all of your training data and rewards. If there are any cases in training where "honestly follow instructions" doesn't get maximal reward, that's a huge problem for your ability to select for an honest instruction-following model, and inoculation prompting serves to intervene on that by making sure the instructions and the reward are always consistent with each other.
I think selling alignment-relevant RL environments to labs is underrated as an x-risk-relevant startup idea. To be clear, x-risk-relevant startups is a pretty restricted search space; I'm not saying that one necessarily should be founding a startup as the best way to address AI x-risk, but just operating under the assumption that we're optimizing within that space, selling alignment RL environments is definitely the thing I would go for. There's a market for it, the incentives are reasonable (as long as you are careful and opinionated about only selling environments you think are good for alignment, not just good for capabilities), and it gives you a pipeline for shipping whatever alignment interventions you think are good directly into labs' training processes. Of course, that's dependent on you actually having a good idea for how to train models to be more aligned, and that intervention being in the form of something you can sell, but if you can do that, and you can demonstrate that it works, you can just sell it to all the labs, have them all use it, and then hopefully all of their models will now be more aligned. E.g. if you're excited about character training, you can just replicate it, sell it to all the labs, and then in so doing change how all the labs are training their models.
I mean, to the extent that it is meaningful at all to say that such a rock has an extrapolated volition, surely that extrapolated volition is indeed to "just ask Evan". Regardless, the whole point of my post is exactly that I think we shouldn't over-update from Claude currently displaying pretty robustly good preferences to alignment being easy in the future.