Following on from our recent paper, “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”, I’m very excited to announce that I have started leading (and hiring for!) a new team at Anthropic, the Alignment Stress-Testing team, with Carson Denison and Monte MacDiarmid as current team members. Our mission—and our mandate from the organization—is to red-team Anthropic’s alignment techniques and evaluations, empirically demonstrating ways in which Anthropic’s alignment strategies could fail.
The easiest way to get a sense of what we’ll be working on is probably just to check out our “Sleeper Agents” paper, which was our first big research project. I’d also recommend Buck and Ryan’s post on meta-level adversarial evaluation as a good general description of our team’s scope. Very simply, our job is to try to prove to Anthropic, and to the world more broadly, that we are in a pessimistic scenario (if that is in fact true): that Anthropic’s alignment plans and strategies won’t work, and that we will need to substantially shift gears. And if we don’t find anything extremely dangerous despite a serious and skeptical effort, that is some reassurance, though of course not a guarantee of safety.
Notably, our goal is not object-level red-teaming or evaluation—e.g. we won’t be the ones running Anthropic’s RSP-mandated evaluations to determine when Anthropic should pause or otherwise trigger concrete safety commitments. Rather, our goal is to stress-test that entire process: to red-team whether our evaluations and commitments will actually be sufficient to deal with the risks at hand.
We expect much of the stress-testing that we do to be very valuable in terms of producing concrete model organisms of misalignment that we can iterate on to improve our alignment techniques. However, we want to be cognizant of the risk of overfitting, and it’ll be our responsibility to determine when it is safe to iterate on improving the ability of our alignment techniques to resolve particular model organisms of misalignment that we produce. In the case of our “Sleeper Agents” paper, for example, we think the benefits of directly iterating on our alignment techniques against those specific model organisms outweigh the downsides, but we’d likely want to hold out other, more natural model organisms of deceptive alignment so as to provide a strong test case.
Some of the projects that we’re planning on working on next include:
- Concretely stress-testing Anthropic’s ASL-3 evaluations.
- Applying techniques from “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning” to our “Sleeper Agents” models (a rough sketch of the dictionary-learning setup appears after this list).
- Building more natural model organisms of misalignment, e.g. finding a training pipeline that we might realistically use that we can show would lead to a concrete misalignment failure.
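For readers unfamiliar with the dictionary-learning approach referenced above, here is a minimal illustrative sketch of the core idea: training a sparse autoencoder on a model’s internal activations so that the learned dictionary features are (hopefully) more interpretable than individual neurons. This is not our actual code; the dimensions, hyperparameters, and training loop below are placeholder assumptions, and in practice the activations would be collected from the model under study rather than generated at random.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Single-layer sparse autoencoder ("dictionary") over model activations.

    Given activations of width d_model, learn an overcomplete set of d_dict
    features with an L1 sparsity penalty, in the spirit of the setup in
    "Towards Monosemanticity".
    """

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))    # sparse feature activations
        reconstruction = self.decoder(features)   # reconstruct the input activations
        return features, reconstruction


def train_step(sae, x, optimizer, l1_coeff=1e-3):
    """One optimization step: reconstruction loss plus L1 sparsity penalty."""
    features, reconstruction = sae(x)
    recon_loss = (reconstruction - x).pow(2).mean()
    sparsity_loss = features.abs().mean()
    loss = recon_loss + l1_coeff * sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage on random "activations"; in practice x would be activations
# sampled from a layer of the model being analyzed.
d_model, d_dict = 512, 4096
sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for _ in range(10):
    x = torch.randn(64, d_model)
    train_step(sae, x, opt)
```

The hope, following the “Towards Monosemanticity” setup, is that individual dictionary features correspond to human-interpretable properties of the input, which could then be inspected for features related to the backdoored behavior in the “Sleeper Agents” models.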
If any of this sounds interesting to you, I am very much hiring! We are primarily looking for Research Engineers and Research Scientists with strong machine learning engineering backgrounds.
Do you see any efforts at major AI labs to try to address this? And hopefully not just gatekeeping such capabilities from the general public, but also researching ways to defend against such manipulation from rogue or open source AIs, or from less scrupulous companies. My contention has been that we need philosophically competent AIs to help humans distinguish between correct philosophical arguments and merely persuasive ones, but am open to other ideas/possibilities.
How would they present such clear evidence if we ourselves don't understand what pain is or what determines moral patienthood, and they're even less philosophically competent? Even today, if I were to have an LLM play a character in pain, how do I know whether or not it is triggering some subcircuits that can experience genuine pain (that SGD built to better predict texts uttered by humans in pain)? How do we know that when an LLM is doing this, it's not already a moral patient?
Or what if AIs will be very good at persuasion and will talk us into believing in their moral patienthood, giving them rights, etc., when that's not actually true?
Wouldn't that be a disastrous situation, where AI progress and tech progress in general are proceeding at superhuman speeds, but philosophical progress is bottlenecked by human thinking? Would love to understand better why you see this as a realistic possibility, but do not seem very worried about it as a risk.
More generally, I'm worried about any kind of differential deceleration of philosophical progress relative to technological progress (e.g., AIs have taken over philosophical research from humans but are worse at it than at technological research), because I think we're already in a "wisdom deficit" where we lack the philosophical knowledge to make good decisions about new technologies.