Following on from our recent paper, “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”, I’m very excited to announce that I have started leading (and hiring for!) a new team at Anthropic, the Alignment Stress-Testing team, with Carson Denison and Monte MacDiarmid as current team members. Our mission—and our mandate from the organization—is to red-team Anthropic’s alignment techniques and evaluations, empirically demonstrating ways in which Anthropic’s alignment strategies could fail.
The easiest way to get a sense of what we’ll be working on is probably just to check out our “Sleeper Agents” paper, which was our first big research project. I’d also recommend Buck and Ryan’s post on meta-level adversarial evaluation as a good general description of our team’s scope. Very simply, our job is to try to prove to Anthropic—and the world more broadly—(if it is in fact true) that we are in a pessimistic scenario, that Anthropic’s alignment plans and strategies won’t work, and that we will need to substantially shift gears. And if we don’t find anything extremely dangerous despite a serious and skeptical effort, that is some reassurance, but of course not a guarantee of safety.
Notably, our goal is not object-level red-teaming or evaluation—e.g. we won’t be the ones running Anthropic’s RSP-mandated evaluations to determine when Anthropic should pause or otherwise trigger concrete safety commitments. Rather, our goal is to stress-test that entire process: to red-team whether our evaluations and commitments will actually be sufficient to deal with the risks at hand.
We expect much of the stress-testing that we do to be very valuable for producing concrete model organisms of misalignment that we can iterate on to improve our alignment techniques. However, we want to be cognizant of the risk of overfitting, and it will be our responsibility to determine when it is safe to iterate directly against particular model organisms of misalignment that we produce. In the case of our “Sleeper Agents” paper, for example, we think the benefits of directly iterating on our alignment techniques against those specific model organisms outweigh the downsides, but we’d likely want to hold out other, more natural model organisms of deceptive alignment to provide a strong test case.
Some of the projects that we’re planning on working on next include:
- Concretely stress-testing Anthropic’s ASL-3 evaluations.
- Applying techniques from “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning” to our “Sleeper Agents” models (a rough sketch of the basic dictionary-learning setup follows this list).
- Building more natural model organisms of misalignment, e.g. finding a training pipeline that we might realistically use that we can show would lead to a concrete misalignment failure.
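To give a rough sense of what that project involves: the “Towards Monosemanticity” paper trains a sparse autoencoder over model activations, learning an overcomplete dictionary of features with an L1 sparsity penalty so that each activation is explained by a small number of (hopefully interpretable) features. The snippet below is only a minimal sketch of that basic setup, not our actual pipeline; the dimensions, hyperparameters, and the random stand-in for collected activations are all placeholder assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal dictionary-learning autoencoder in the style of
    "Towards Monosemanticity": an overcomplete set of features
    learned from model activations with an L1 sparsity penalty."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activation -> feature coefficients
        self.decoder = nn.Linear(d_dict, d_model)  # feature coefficients -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # non-negative, ideally sparse feature activations
        return self.decoder(f), f

# Placeholder sizes: d_model is the activation width, d_dict the (larger) dictionary size.
d_model, d_dict, l1_coeff = 512, 4096, 1e-3
sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# Stand-in for a batch of activations collected from the model under study
# (e.g. a sleeper-agent model); here it is just random data for illustration.
activations = torch.randn(1024, d_model)

for step in range(100):
    x_hat, f = sae(activations)
    recon_loss = ((x_hat - activations) ** 2).mean()  # reconstruction error
    sparsity_loss = f.abs().mean()                    # L1 penalty encourages sparse codes
    loss = recon_loss + l1_coeff * sparsity_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The hope, under the assumptions above, is that individual dictionary directions correspond to human-describable features; the open question for this project is whether any such features light up on a backdoored “sleeper” behavior.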
If any of this sounds interesting to you, I am very much hiring! We are primarily looking for Research Engineers and Research Scientists with strong backgrounds in machine learning engineering.
By "moral anti-realist" I just meant "not a moral realist". I'm also not a moral objectivist or a moral universalist. If I was trying to use my understanding of philosophical terminology (which isn't something I've formally studied and is thus quite shallow) to describe my viewpoint then I believe I'd be a moral relativist, subjectivist, semi-realist ethical naturalist. Or if you want a more detailed exposition of the approach to moral reasoning that I advocate, then read my sequence AI, Alignment, and Ethics, especially the first post. I view designing an ethical system as akin to writing "software" for a society (so not philosophically very different than creating a deontological legal system, but now with the addition of a preference ordering and thus an implicit utility function), and I view the design requirements for this as being specific to the current society (so I'm a moral relativist) and to human evolutionary psychology (making me an ethical naturalist), and I see these design requirements as being constraining, but not so constraining to have a single unique solution (or, more accurately, that optimizing an arbitrarily detailed understanding of them them might actually yield a unique solution, but is an uncomputable problem whose inputs we don't have complete access to and that would yield an unusably complex solution, so in practice I'm happy to just satisfice the requirements as hard as is practical), so I'm a moral semi-realist.
Please let me know if any of this doesn't make sense, or if you think I have any of my philosophical terminology wrong (which is entirely possible).
As for meta-philosophy, I'm not claiming to have solved it: I'm a scientist & engineer, and frankly I find most moral philosophers' approaches that I've read very silly. I am attempting to do something practical, grounded in actual soft sciences like sociology and evolutionary psychology, i.e. something that explicitly isn't Philosophy. [This is related to the fact that my personal definition of Philosophy is basically "spending time thinking about topics that we're not yet in a position to usefully apply the scientific method to", which thus tends to involve a lot of generating, naming, and cataloging hypotheses without any ability to do experiments to falsify any of them. I expect that learning how to build and train minds will turn large swaths of what used to be Philosophy, relating to things like the nature of mind, language, thinking, and experience, into actual science where we can do experiments.]